Parallelization and profiling#
If you’re one of those people whose scripts always run in a second or less, you can probably skip this tutorial. But if you have time to make yourself a cup of tea while your code is running, you might want to read on. This tutorial covers how to run code in parallel, and how to check its performance to look for improvements.
Parallelization#
Parallelization in Python#
Scary stories of Python’s “global interpreter lock” aside, parallelization is actually fairly simple in Python. However, it’s not particularly intuitive or flexible. We can do vanilla parallelization in Python via something like this:
[1]:
import multiprocessing as mp

# Define a function
def my_func(x):
    return x**2

# Run it in parallel
with mp.Pool() as pool:
    results = pool.map(my_func, [1,2,3])

print(results)
[1, 4, 9]
So far so good. But what if we have something more complicated? What if we want to run our function with a different keyword argument, for example? It starts getting kind of crazy:
[2]:
from functools import partial

# Define a (slightly) more complex function
def complex_func(x, arg1=2, arg2=4):
    return x**2 + (arg1 * arg2)

# Make a new function with a different default argument 😱
new_func = partial(complex_func, arg2=10)

# Run it in parallel
with mp.Pool() as pool:
    results = pool.map(new_func, [1,2,3])

print(results)
[21, 24, 29]
This works, but that sure was a lot of work just to set a single keyword argument!
Parallelization in Sciris#
With Sciris, you can do it all with one line:
[3]:
import sciris as sc
results = sc.parallelize(complex_func, [1,2,3], arg2=10)
print(results)
[21, 24, 29]
What’s happening here? sc.parallelize() lets you pass keyword arguments directly to the function you’re calling. You can also iterate over multiple arguments rather than just one:
[4]:
args = dict(x=[1,2,3], arg2=[10,20,30])
results = sc.parallelize(complex_func, iterkwargs=args)
print(results)
[21, 44, 69]
(Of course you can do this with vanilla Python too, but you’ll need to define a list of tuples, and you can only assign by position, not by keyword.)
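For reference, here’s a minimal sketch of the closest vanilla equivalent of the cell above, using Pool.starmap() and reusing complex_func: every argument has to be spelled out by position, including arg1’s default.

import multiprocessing as mp

# Each tuple is (x, arg1, arg2); arg1's default must be repeated by hand
arglist = [(1, 2, 10), (2, 2, 20), (3, 2, 30)]
with mp.Pool() as pool:
    results = pool.starmap(complex_func, arglist)
print(results)  # -> [21, 44, 69], matching the sc.parallelize() version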
Depending on what you want to run, your inputs might come in one of several forms: you can supply a list of values (including tuples of values), a list of dicts, or a dict of lists. An example will probably help:
[5]:
def mult(x,y):
    return x*y
r1 = sc.parallelize(mult, iterarg=[(1,2),(2,3),(3,4)])
r2 = sc.parallelize(mult, iterkwargs={'x':[1,2,3], 'y':[2,3,4]})
r3 = sc.parallelize(mult, iterkwargs=[{'x':1, 'y':2}, {'x':2, 'y':3}, {'x':3, 'y':4}])
print(f'{r1 = }')
print(f'{r2 = }')
print(f'{r3 = }')
r1 = [2, 6, 12]
r2 = [2, 6, 12]
r3 = [2, 6, 12]
All of these are equivalent: choose whichever makes you happy.
Advanced usage#
There are lots and lots of options with parallelization, but we’ll only cover a couple here. For example, if you want to start 200 jobs on your laptop with 8 cores, you probably don’t want them to eat up all your CPU or memory and make your computer unusable. You can set maxcpu and maxmem limits to handle that:
[6]:
import numpy as np
import matplotlib.pyplot as plt

# Define the function
def rand2d(i, x, y):
    np.random.seed()
    xy = [x+i*np.random.randn(100), y+i*np.random.randn(100)]
    return (i,xy)

# Run in parallel
xy = sc.parallelize(
    func     = rand2d,   # The function to parallelize
    iterarg  = range(5), # Values for first argument
    maxcpu   = 0.8,      # CPU limit (1 = no limit)
    maxmem   = 0.9,      # Memory limit (1 = no limit)
    interval = 0.2,      # How often to re-check the limits (in seconds)
    x = 3, y = 8,        # Keyword arguments for the function
)

# Plot
plt.figure()
colors = sc.gridcolors(len(xy))
for i,(x,y) in reversed(xy): # Reverse order to plot the most widely spaced dots first
    plt.scatter(x, y, c=[colors[i]], alpha=0.7, label=f'Scale={i}')
plt.legend();
CPU ✓ (0.00<0.80), memory ✓ (0.23<0.90): starting process 0 after 1 tries
CPU ✓ (0.00<0.80), memory ✓ (0.23<0.90): starting process 1 after 1 tries
CPU ✓ (0.00<0.80), memory ✓ (0.23<0.90): starting process 2 after 1 tries
CPU ✓ (0.00<0.80), memory ✓ (0.23<0.90): starting process 3 after 1 tries
CPU ✓ (0.00<0.80), memory ✓ (0.23<0.90): starting process 4 after 1 tries
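Two other options you’ll likely reach for are ncpus, to cap the number of worker processes directly, and serial, to run the jobs in an ordinary loop for easier debugging. A quick sketch (we believe both keywords are supported, but treat them as assumptions and check the sc.parallelize() docstring):

# Cap the number of worker processes directly (assumed kwarg: ncpus)
xy = sc.parallelize(rand2d, iterarg=range(5), ncpus=2, x=3, y=8)

# Run the same jobs in serial, e.g. for readable tracebacks (assumed kwarg: serial)
xy = sc.parallelize(rand2d, iterarg=range(5), serial=True, x=3, y=8)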
So far, we’ve used sc.parallelize() as a function. But you can also use the underlying sc.Parallel class directly, which gives you more flexibility and control over which jobs are run, and more information if any of them fail:
[7]:
def slow_func(i=1):
    sc.randsleep(seed=i)
    if i == 4:
        raise Exception("I don't like seed 4")
    return i**2

# Create the parallelizer object
P = sc.Parallel(
    func = slow_func,
    iterarg = range(10),
    parallelizer = 'multiprocess-async', # Run asynchronously
    die = False, # Keep going if a job crashes
)

# Actually run
P.run_async()

# Monitor progress
P.monitor()

# Get results
P.finalize()

# See how long things took
print(P.times)
Job 4/10 (2.3 s) ••••••••••••—————————————————— 40%
/home/docs/checkouts/readthedocs.org/user_builds/sciris/envs/latest/lib/python3.11/site-packages/multiprocess/pool.py:48: RuntimeWarning: sc.parallelize(): Task 4 failed, but die=False so continuing.
Traceback (most recent call last):
  File "/home/docs/checkouts/readthedocs.org/user_builds/sciris/envs/latest/lib/python3.11/site-packages/sciris/sc_parallel.py", line 832, in _task
    result = func(*args, **kwargs) # Call the function!
             ^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/ipykernel_1857/2785706684.py", line 4, in slow_func
    raise Exception("I don't like seed 4")
Exception: I don't like seed 4

  return list(map(*args))
#0. 'started': datetime.datetime(2024, 9, 24, 22, 22, 7, 60623)
#1. 'finished': datetime.datetime(2024, 9, 24, 22, 22, 13, 669639)
#2. 'elapsed': 6.609016
#3. 'jobs': [1.2763750553131104, 1.0244176387786865, 0.5254466533660889,
0.17209696769714355, 1.8918840885162354, 1.6109259128570557, 1.0778048038482666,
1.250934362411499, 0.6549654006958008, 1.7412195205688477]
/home/docs/checkouts/readthedocs.org/user_builds/sciris/envs/latest/lib/python3.11/site-packages/sciris/sc_parallel.py:539: RuntimeWarning: Only 9 of 10 jobs succeeded; see exceptions attribute for details
self.process_results()
You can see it raised some warnings. These are stored in the Parallel object, so we can check back and see what happened:
[8]:
print(f'{P.success = }')
print(f'{P.exceptions = }')
print(f'{P.results = }')
P.success = [True, True, True, True, False, True, True, True, True, True]
P.exceptions = [None, None, None, None, Exception("I don't like seed 4"), None, None, None, None, None]
P.results = [0, 1, 4, 9, None, 25, 36, 49, 64, 81]
Hopefully, you will never need to run a function as poorly written as slow_func()!
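Since success and exceptions are ordinary lists, it’s easy to pull out just the failed jobs, for example to log or re-run them. A small sketch (this helper loop is ours, not part of Sciris):

# Report which jobs failed and why, using plain Python over the Parallel attributes
failed = [i for i, ok in enumerate(P.success) if not ok]
for i in failed:
    print(f'Job {i} failed: {P.exceptions[i]}')
# -> Job 4 failed: I don't like seed 4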
Profiling#
Even parallelization can’t save you if your code is just really slow. Sciris provides a variety of tools to help with this.
Benchmarking#
First off, we can check if our computer is performing as we expect, or if we want to compare across computers:
[9]:
bm = sc.benchmark() # Check CPU performance, in units of MOPS (million operations per second)
ml = sc.memload() # Check total memory load
ram = sc.checkram() # Check RAM used by this Python instance
print('CPU performance: ', dict(bm))
print('System memory load', ml)
print('Python RAM usage', ram)
CPU performance: {'python': np.float64(5.011328770105726), 'numpy': np.float64(137.56078522689816)}
System memory load 0.23800000000000002
Python RAM usage 154.10 MB
We can see that NumPy performance is much higher than Python’s: hundreds of MOPS† rather than single digits. This makes sense, and it’s exactly why we use NumPy for array operations!
† What counts as a single “operation” is a little loose, so these “MOPS” are fine for relative comparisons, but aren’t directly comparable to, say, published processor speeds.
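If you’d rather see the gap directly than trust the MOPS numbers, a quick sketch with the standard library’s time.perf_counter() makes the same point:

import time
import numpy as np

n = 1_000_000
arr = np.random.rand(n)

t0 = time.perf_counter()
py_total = sum(arr)    # Pure Python: one interpreter-level addition per element
t1 = time.perf_counter()
np_total = arr.sum()   # NumPy: a single vectorized call implemented in C
t2 = time.perf_counter()

print(f'Python: {t1-t0:.4f} s; NumPy: {t2-t1:.4f} s')  # Typically orders of magnitude apart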
Line profiling#
If you want to do serious profiling of your code, take a look at Austin. But if you just want a quick sense of where things might be slow, you can use sc.profile(). Applying it to our lousy slow_func() from before:
[10]:
sc.profile(slow_func)
Profiling 1 function(s):
<function slow_func at 0x7fa39d4145e0>
Timer unit: 1e-09 s
Total time: 1.02393 s
File: /tmp/ipykernel_1857/2785706684.py
Function: slow_func at line 1
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           def slow_func(i=1):
     2         1  1023928908.0    1e+09    100.0      sc.randsleep(seed=i)
     3         1        1872.0   1872.0      0.0      if i == 4:
     4                                                    raise Exception("I don't like seed 4")
     5         1        1404.0   1404.0      0.0      return i**2
Done.
[10]:
<line_profiler.line_profiler.LineProfiler at 0x7fa39d6de0b0>
We can see that essentially 100% (well, 99.9997%) of the time was taken by the sleep function. This is not surprising, but it’s good to have it confirmed!
For a slightly more realistic example:
[11]:
def func():
    n = 1000

    # Do some NumPy
    v1 = np.random.rand(n,n)
    v2 = np.random.rand(n,n)
    v3 = v1*v2

    # Do some Python
    means = []
    for i in range(n):
        means.append(sum(v3[i])/n)

sc.profile(func)
Profiling 1 function(s):
<function func at 0x7fa39d6be0c0>
Timer unit: 1e-09 s
Total time: 0.103874 s
File: /tmp/ipykernel_1857/701805461.py
Function: func at line 1
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           def func():
     2         1        508.0    508.0      0.0      n = 1000
     3
     4                                               # Do some NumPy
     5         1    9812817.0    1e+07      9.4      v1 = np.random.rand(n,n)
     6         1    9214957.0    9e+06      8.9      v2 = np.random.rand(n,n)
     7         1    3831367.0    4e+06      3.7      v3 = v1*v2
     8
     9                                               # Do some Python
    10         1        500.0    500.0      0.0      means = []
    11      1001     224724.0    224.5      0.2      for i in range(n):
    12      1000   80789086.0  80789.1     77.8          means.append(sum(v3[i])/n)
Done.
[11]:
<line_profiler.line_profiler.LineProfiler at 0x7fa39d6df930>
We can see (from the “% Time” column) that, again not surprisingly, the Python loop is much slower than the NumPy operations.
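The fix is to keep the arithmetic in NumPy. As a sketch, v3.mean(axis=1) computes the same per-row means as the loop, in a single vectorized call:

def fast_func():
    n = 1000
    v1 = np.random.rand(n,n)
    v2 = np.random.rand(n,n)
    v3 = v1*v2
    means = v3.mean(axis=1)  # Equivalent to the loop of sum(v3[i])/n, but vectorized
    return means

sc.profile(fast_func)  # Re-profiling should show the loop's ~78% cost is gone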