# Multiprocessing-Pool

Multiprocessing is a Python package that supports process-based parallelism, allowing the user to fully leverage multiple processors on a given machine (great for CPU-bound processing). Because it uses subprocesses instead of threads, it can side-step the Python Global Interpreter Lock (GIL), which allows only one thread to execute Python bytecode at any point in time. With multiprocessing, each process has a separate memory space. This helps speed up code; however, it also means the code will have a larger memory footprint for larger iterables.

Multiprocessing provides the Pool and Process classes and can be used to implement MapReduce-like models.

## MapReduce

MapReduce is a programming model used to process big data through parallel, distributed algorithms. As the name describes, this model carries out map and reduce procedures, i.e. split-and-combine methods. Mapping out the data includes filtering and sorting, taking an input and producing key/value pairs. Values associated with the same key are grouped together. These are then passed to the reduce function where these grouped results are combined together. In general, map is the process of splitting inputs among machines/processes and reduce takes these results and aggregates them.

In relation to multiprocessing, Pool splits the execution of tasks among "workers" for parallel processing and then combines the results into a list (explained further below).

## The Pool Class

The Pool class enables data parallelism by executing a function across multiple input values, spreading the computation across multiple CPU cores. With Pool, you use a "pool" of worker processes. For most CPU-bound tasks, the number of processes does not usually need to exceed the number of cores available; however, this is not always the case. The number of "workers" corresponds to the number of child processes (a.k.a. subprocesses or subtasks) forked from the parent process. A single parent process can have more than one child process running concurrently.

To find the number of cores in the virtual environment, you can use the following (note: this number does not always correspond to the number of processes you can actually use):

```python
os.cpu_count()
```

To determine the number of CPUs available to the current process (i.e., the number of usable CPUs, such as the number of cores requested on a cluster), use:

```python
len(os.sched_getaffinity(0))    # 0 is the current process
```
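Note that `os.sched_getaffinity` is only available on some platforms (notably Linux). A portable way to pick a worker count might look like the sketch below; the fallback to `os.cpu_count()` is a choice made here, not something prescribed by the text above:

```python
import os

# Prefer the CPUs actually usable by this process (respects cluster
# allocations and CPU affinity masks); fall back to the total core
# count on platforms without sched_getaffinity (e.g., macOS, Windows).
try:
    n_workers = len(os.sched_getaffinity(0))
except AttributeError:
    n_workers = os.cpu_count()

print(n_workers)
```

The resulting `n_workers` can then be passed as the number of processes when creating a Pool.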

In code, you start with a process/program. Multiprocessing supports three different ways to start a process: spawn, fork, and forkserver, where spawn is the default on Windows and macOS and fork is the default on Unix. Using the Unix default, a parent process forks the Python interpreter to create child processes that are identical to the parent process. A parent process can have multiple child processes, but a child process can only have one parent. Child processes are created when there is a need to perform more work simultaneously. The number of processes does not depend entirely on the number of cores available, but the number of cores still affects the efficiency and how well the child processes will perform. Thus, the best number of child processes to pass to Pool varies, and it is best to test which number is most efficient for your code. Once the processes finish, it is vital to call pool.close() and pool.join().

pool.close() ensures that no new tasks are accepted. In other words, you call this when there is no more work to be submitted to the Pool (it must be called before join). pool.join() then waits for all the worker processes to exit. If these are not called, there is a risk of leaking resources such as worker processes and memory.

To better portray how Pool operates and forks processes, see the figure below. In this diagram, 5 cores were requested and Pool was created with 10 processes, producing 10 child processes; 2 child processes are then mapped to run concurrently on each core.

Child processes are represented by 'c'.

## Map()

The Pool class is used in conjunction with the map() function. map() applies a function you have written to each item of an iterable, and using map() with Pool runs those calls in parallel across the worker processes. The basic syntax is as follows:

```python
pool.map(function, iterable)
```

This returns a list containing the result of the function applied to each item in the iterable.

## Setting Up an Environment

When using multiprocessing in an environment, a pickling error can occur when all the packages are installed together. For reasons that are unclear, the error is avoided when each package is installed individually. For example, to run multiprocessing with Swiftest, the following packages were installed one by one: xarray, astroquery, swiftest, netCDF4, and their dependencies python (3.8 or later), numpy (1.18 or later), packaging (20.0 or later), and pandas (1.1 or later).

For a reference on how to make an environment, see the section in the Google Doc on navigating the Cluster: https://docs.google.com/document/d/1fCHuFuEf9qb5JuljPHDtfAYNUNVZzELZSuAY8kSd0LA/edit?usp=sharing

## Preliminary Results for Multiprocessing Pool Memory Usage and Runtime

These results evaluate the Swiftest simulation using sim103 and sim110 for the Hungaria runs.

Sim110:

| Clones | Number of Processes | Average Memory Usage (GB) | Average Runtime (minutes) |
|--------|---------------------|---------------------------|---------------------------|
| 96     | 24                  | 16.37                     | 9.12                      |
| 96     | 48                  | 21.06                     | 5.12                      |
| 96     | 72                  | 24.81                     | 5.23                      |
| 96     | 96                  | 28.33                     | 3.64                      |

Sim103 (lower output cadence runs):

| Clones | Number of Processes | Average Memory Usage (GB) | Average Runtime (minutes) |
|--------|---------------------|---------------------------|---------------------------|
| 96     | 24                  | 4.71                      | 0.35                      |
| 96     | 48                  | 9.14                      | 0.33                      |
| 96     | 72                  | 14.54                     | 0.49                      |
| 96     | 96                  | 17.14                     | 0.35                      |
