Analyzing Data 170,000x Faster with Python
The article, Analyzing Data 180,000x Faster with Rust, first presents some unoptimized Python code, and then shows the process of rewriting and optimizing the code in Rust, resulting in a 180,000x speed-up. The author notes:
There are lots of ways we could make the Python code faster, but the point of this post isn’t to compare highly-optimized Python to highly-optimized Rust. The point is to compare “standard-Jupyter-notebook” Python to highly-optimized Rust.
The question arises: if we were to stick with Python, what kind of speed-ups could we achieve?
In this post, we will go through a journey of profiling and iteratively speeding up the code, in Python.
Replicating the original benchmarks
The times in this post are comparable to the times reported in the original article. Using a similar computer (M1 Macbook Pro), I measure:
- 35 ms average iteration time for the original unoptimized code, measured over 1,000 iterations. The original article reports 36 ms.
- 180,081x speedup, for the fully optimized Rust code, measured over 5,000,000 iterations. The original article reports 182,450x.
Python Baseline
Here is a replication of the baseline, unoptimized Python code, from the article.
from itertools import combinations
import pandas as pd
from pandas import IndexSlice as islice
def k_corrset(data, K):
all_qs = data.question.unique()
q_to_score = data.set_index(['question', 'user'])
all_grand_totals = data.groupby('user').score.sum().rename('grand_total')
# Inner loop
corrs = []
for qs in combinations(all_qs, K):
qs_data = q_to_score.loc[islice[qs,:],:].swaplevel()
answered_all = qs_data.groupby(level=[0]).size() == K
answered_all = answered_all[answered_all].index
qs_totals = qs_data.loc[islice[answered_all,:]] \
.groupby(level=[0]).sum().rename(columns={'score': 'qs'})
r = qs_totals.join(all_grand_totals).corr().qs.grand_total
corrs.append({'qs': qs, 'r': r})
corrs = pd.DataFrame(corrs)
return corrs.sort_values('r', ascending=False).iloc[0].qs
data = pd.read_json('scores.json')
print(k_corrset(data, K=5))
And here are the first two rows of the dataframe, data
.
user | question | score |
---|---|---|
e213cc2b-387e-4d7d-983c-8abc19a586b1 | d3bdb068-7245-4521-ae57-d0e9692cb627 | 1 |
951ffaee-6e17-4599-a8c0-9dfd00470cd9 | d3bdb068-7245-4521-ae57-d0e9692cb627 | 0 |
We can use the output from the original code to test the correctness of our optimized code.
Since we are trying to optimize the the inner loop, let’s put the inner loop into its own function, to profile it using line_profiler.
Avg time per iteration: 35 ms
Speedup over baseline: 1.0x
% Time Line Contents
=====================
def compute_corrs(
qs_iter: Iterable, q_to_score: pd.DataFrame, grand_totals: pd.DataFrame
):
0.0 result = []
0.0 for qs in qs_iter:
13.5 qs_data = q_to_score.loc[islice[qs, :], :].swaplevel()
70.1 answered_all = qs_data.groupby(level=[0]).size() == K
0.4 answered_all = answered_all[answered_all].index
0.0 qs_total = (
6.7 qs_data.loc[islice[answered_all, :]]
1.1 .groupby(level=[0])
0.6 .sum()
0.3 .rename(columns={"score": "qs"})
)
7.4 r = qs_total.join(grand_totals).corr().qs.grand_total
0.0 result.append({"qs": qs, "r": r})
0.0 return result
We can see the value we are trying to optimize, (the average iteration time / speedup), as well as the proportion of time spent on each line.
This lends itself to the following workflow for optimizing the code:
- Run the profiler
- Identify the slowest lines
- Try make to the slower lines faster
- Repeat
If there are just a few lines taking up the majority of the time, we know what to focus on, and from the above we see that there’s a particularly slow line, taking up ~70% of the time.
However, there is another vital step to include:
- Test the output for correctness
- Run the profiler
- Identify the slowest lines
- Try make to the slower lines faster
- Repeat
The tests help one to experiment, to try out different methods, libraries, etc. while knowing that any accidental changes to what is being computed will be caught.
Optimization 1 - dictionary of sets of users who answered questions, users_who_answered_q
The baseline carries out various heavy Pandas operations, to find out which users answered the current set of questions, qs
. In particular, it checks every row of the dataframe to find out which users answered the questions. For the first optimization, instead of using the full dataframe, we can use a dictionary of sets. This lets us quickly look up which users answered each question in qs
, and use Python’s set intersection to find out which users anwered all of the questions.
Avg time per iteration: 10.0 ms
Speedup over baseline: 3.5x
% Time Line Contents
=====================
def compute_corrs(qs_iter, users_who_answered_q, q_to_score, grand_totals):
0.0 result = []
0.0 for qs in qs_iter:
0.0 user_sets_for_qs = [users_who_answered_q[q] for q in qs]
3.6 answered_all = set.intersection(*user_sets_for_qs)
40.8 qs_data = q_to_score.loc[islice[qs, :], :].swaplevel()
0.0 qs_total = (
22.1 qs_data.loc[islice[list(answered_all), :]]
3.7 .groupby(level=[0])
1.9 .sum()
1.1 .rename(columns={"score": "qs"})
)
26.8 r = qs_total.join(grand_totals).corr().qs.grand_total
0.0 result.append({"qs": qs, "r": r})
0.0 return result
This significantly speeds up the lines that compute, answered_all
, which have gone from taking up 70% of the time, to 4%, and we are already over 3x faster than the baseline.
Optimization 2 - score_dict dictionary
If we add up the amount of time spent on each line that contributes to computing qs_total
, (including the qs_data
line), it comes to ~65%; so the next thing to optimize is clear. We can again switch out heavy operations on the full dataset, (indexing, grouping, etc.) with fast dictionary look ups. We introduce score_dict
, a dictionary that lets us look up the score for a given question and user pair.
Avg time per iteration: 690 μs
Speedup over baseline: 50.8x
% Time Line Contents
=====================
def compute_corrs(qs_iter, users_who_answered_q, score_dict, grand_totals):
0.0 result = []
0.0 for qs in qs_iter:
0.1 user_sets_for_qs = [users_who_answered_q[q] for q in qs]
35.9 answered_all = set.intersection(*user_sets_for_qs)
3.4 qs_total = {u: sum(score_dict[q, u] for q in qs) for u in answered_all}
8.6 qs_total = pd.DataFrame.from_dict(qs_total, orient="index", columns=["qs"])
0.1 qs_total.index.name = "user"
51.8 r = qs_total.join(grand_totals).corr().qs.grand_total
0.0 result.append({"qs": qs, "r": r})
0.0 return result
This gives us a nice 50x speed up.
Optimization 3 - grand_totals dictionary, and np.corrcoef
The slowest line above does multiple things, it does a Pandas join, to combine the grand_totals
, with the qs_total
, and then it computes the correlation coefficient for this. Again, we can speed this up by using a dictionary lookup instead of a join, and since we no longer have Pandas objects, we use np.corrcoef
instead of Pandas corr
.
Avg time per iteration: 380 μs
Speedup over baseline: 91.6x
% Time Line Contents
=====================
def compute_corrs(qs_iter, users_who_answered_q, score_dict, grand_totals):
0.0 result = []
0.0 for qs in qs_iter:
0.2 user_sets_for_qs = [users_who_answered_q[q] for q in qs]
83.9 answered_all = set.intersection(*user_sets_for_qs)
7.2 qs_total = [sum(score_dict[q, u] for q in qs) for u in answered_all]
0.5 user_grand_total = [grand_totals[u] for u in answered_all]
8.1 r = np.corrcoef(qs_total, user_grand_total)[0, 1]
0.1 result.append({"qs": qs, "r": r})
0.0 return result
This gives us a ~90x speedup.
Optimization 4 - uuid strings to ints
The next optimization doesn’t alter the code in the inner loop at all. But it does speed up some of the operations. We replace the long user/question uuids, (e.g. e213cc2b-387e-4d7d-983c-8abc19a586b1
), with, much shorter, ints. How it’s done:
data.user = data.user.map({u: i for i, u in enumerate(data.user.unique())})
data.question = data.question.map(
{q: i for i, q in enumerate(data.question.unique())}
)
And we measure:
Avg time per iteration: 210 μs
Speedup over baseline: 168.5x
% Time Line Contents
=====================
def compute_corrs(qs_iter, users_who_answered_q, score_dict, grand_totals):
0.0 result = []
0.1 for qs in qs_iter:
0.4 user_sets_for_qs = [users_who_answered_q[q] for q in qs]
71.6 answered_all = set.intersection(*user_sets_for_qs)
13.1 qs_total = [sum(score_dict[q, u] for q in qs) for u in answered_all]
0.9 user_grand_total = [grand_totals[u] for u in answered_all]
13.9 r = np.corrcoef(qs_total, user_grand_total)[0, 1]
0.1 result.append({"qs": qs, "r": r})
0.0 return result
Optimization 5 - np.bool_ array instead of sets of users
We can see that the set operation above is still the slowest line. Instead of using sets of ints, we switch to using a np.bool_
array of users, and use np.logical_and.reduce
to find the users that answered all of the questions in qs
. (Note that np.bool_
uses a whole byte for each element, but np.logical_and.reduce
is still pretty fast.) This gives a signicant speedup:
Avg time per iteration: 75 μs
Speedup over baseline: 466.7x
% Time Line Contents
=====================
def compute_corrs(qs_iter, users_who_answered_q, score_dict, grand_totals):
0.0 result = []
0.1 for qs in qs_iter:
12.0 user_sets_for_qs = users_who_answered_q[qs, :] # numpy indexing
9.9 answered_all = np.logical_and.reduce(user_sets_for_qs)
10.7 answered_all = np.where(answered_all)[0]
33.7 qs_total = [sum(score_dict[q, u] for q in qs) for u in answered_all]
2.6 user_grand_total = [grand_totals[u] for u in answered_all]
30.6 r = np.corrcoef(qs_total, user_grand_total)[0, 1]
0.2 result.append({"qs": qs, "r": r})
0.0 return result
Optimization 6 - score_matrix instead of dict
The slowest line above is now the computation of qs_total
. Following the example of the original article, we switch to using a dense np.array to look up the scores, instead of a dictionary, and use fast NumPy indexing to get the scores.
Avg time per iteration: 56 μs
Speedup over baseline: 623.7x
% Time Line Contents
=====================
def compute_corrs(qs_iter, users_who_answered_q, score_matrix, grand_totals):
0.0 result = []
0.2 for qs in qs_iter:
16.6 user_sets_for_qs = users_who_answered_q[qs, :]
14.0 answered_all = np.logical_and.reduce(user_sets_for_qs)
14.6 answered_all = np.where(answered_all)[0]
7.6 qs_total = score_matrix[answered_all, :][:, qs].sum(axis=1)
3.9 user_grand_total = [grand_totals[u] for u in answered_all]
42.7 r = np.corrcoef(qs_total, user_grand_total)[0, 1]
0.4 result.append({"qs": qs, "r": r})
0.0 return result
Optimization 7 - custom corrcoef
The slowest line above is np.corrcoef
… We will do what it takes to optimize our code, so here’s our own corrcoef implementation, that’s twice as fast for this use case:
def corrcoef(a: list[float], b: list[float]) -> float | None:
"""same as np.corrcoef(a, b)[0, 1]"""
n = len(a)
sum_a = sum(a)
sum_b = sum(b)
sum_ab = sum(a_i * b_i for a_i, b_i in zip(a, b))
sum_a_sq = sum(a_i**2 for a_i in a)
sum_b_sq = sum(b_i**2 for b_i in b)
num = n * sum_ab - sum_a * sum_b
den = sqrt(n * sum_a_sq - sum_a**2) * sqrt(n * sum_b_sq - sum_b**2)
if den == 0:
return None
return num / den
And we get a decent speed up:
Avg time per iteration: 43 μs
Speedup over baseline: 814.6x
% Time Line Contents
=====================
def compute_corrs(qs_iter, users_who_answered_q, score_matrix, grand_totals):
0.0 result = []
0.2 for qs in qs_iter:
21.5 user_sets_for_qs = users_who_answered_q[qs, :] # numpy indexing
18.7 answered_all = np.logical_and.reduce(user_sets_for_qs)
19.7 answered_all = np.where(answered_all)[0]
10.0 qs_total = score_matrix[answered_all, :][:, qs].sum(axis=1)
5.3 user_grand_total = [grand_totals[u] for u in answered_all]
24.1 r = corrcoef(qs_total, user_grand_total)
0.5 result.append({"qs": qs, "r": r})
0.0 return result
Optimization 8 - Premature introduction of Numba
We haven’t finished optimizing the data structures in the code above, but let’s see what would happen if we were to introduce Numba at this stage. Numba is a library in the Python ecosystem that “translates a subset of Python and NumPy code into fast machine code”.
In order to be able to use Numba, we make two changes:
Modification 1: Pass qs_combinations as numpy array, instead of qs_iter
Numba doesn’t play well with itertools
or generators, so we turn qs_iter
into a NumPy array in advance, to give to the function. The impact of this change on the time, (before adding Numba), is shown below.
Avg time per iteration: 42 μs
Speedup over baseline: 829.2x
Modification 2: Result array instead of list
Rather than appending to a list, we initialise an array, and put the results in it. The impact of this change on the time, (before adding Numba), is shown below.
Avg time per iteration: 42 μs
Speedup over baseline: 833.8x
The code ends up looking like this:
import numba
@numba.njit(parallel=False)
def compute_corrs(qs_combinations, users_who_answered_q, score_matrix, grand_totals):
result = np.empty(len(qs_combinations), dtype=np.float64)
for i in numba.prange(len(qs_combinations)):
qs = qs_combinations[i]
user_sets_for_qs = users_who_answered_q[qs, :]
# numba doesn't support np.logical_and.reduce
answered_all = user_sets_for_qs[0]
for j in range(1, len(user_sets_for_qs)):
answered_all *= user_sets_for_qs[j]
answered_all = np.where(answered_all)[0]
qs_total = score_matrix[answered_all, :][:, qs].sum(axis=1)
user_grand_total = grand_totals[answered_all]
result[i] = corrcoef_numba(qs_total, user_grand_total)
return result
(Note that we also decorated corrcoef
with Numba, because the functions called within a Numba function also need to have been compiled.)
Results, with parallel=False
Avg time per iteration: 47 μs
Speedup over baseline: 742.2x
Results, with parallel=True
Avg time per iteration: 8.5 μs
Speedup over baseline: 4142.0x
We see that with parallel=False
the Numba code is slightly slower than the previous Python code, but when we turn on the parallelism, we start making use of all of our CPU cores (10 on the machine running the benchmarks), which gives a good speed multiplier.
However, we lose the ability to use line_profiler, on the JIT compiled code; (we might want to start looking at the generated LLVM IR / assembly).
Optimization 9 - Bitsets, no Numba
Let’s put Numba aside for now. The original article uses bitsets to quickly compute the users who answered the current qs
, so let’s see if that will work for us. We can use NumPy arrays of np.int64
, and np.bitwise_and.reduce
, to implement bitsets. This is different from the np.bool_
array we used before, because we are now using the individual bits within a byte, to represent the entities within a set. Note that we might need multiple bytes for a given bitset, depending on the max number of elements that we need. We can use fast bitwise_and on the bytes of each question in qs
to find the set intersection, and therefore the number of users who answered all the qs
.
Here are the bitset
functions we’ll use:
def bitset_create(size):
"""Initialise an empty bitset"""
size_in_int64 = int(np.ceil(size / 64))
return np.zeros(size_in_int64, dtype=np.int64)
def bitset_add(arr, pos):
"""Add an element to a bitset"""
int64_idx = pos // 64
pos_in_int64 = pos % 64
arr[int64_idx] |= np.int64(1) << np.int64(pos_in_int64)
def bitset_to_list(arr):
"""Convert a bitset back into a list of ints"""
result = []
for idx in range(arr.shape[0]):
if arr[idx] == 0:
continue
for pos in range(64):
if (arr[idx] & (np.int64(1) << np.int64(pos))) != 0:
result.append(idx * 64 + pos)
return np.array(result)
And we can initialize the bitsets as follows:
users_who_answered_q = np.array(
[bitset_create(data.user.nunique()) for _ in range(data.question.nunique())]
)
for q, u in data[["question", "user"]].values:
bitset_add(users_who_answered_q[q], u)
Let’s see the speedup we get:
Avg time per iteration: 550 μs
Speedup over baseline: 64.2x
% Time Line Contents
=====================
def compute_corrs(qs_combinations, users_who_answered_q, score_matrix, grand_totals):
0.0 num_qs = qs_combinations.shape[0]
0.0 bitset_size = users_who_answered_q[0].shape[0]
0.0 result = np.empty(qs_combinations.shape[0], dtype=np.float64)
0.0 for i in range(num_qs):
0.0 qs = qs_combinations[i]
0.3 user_sets_for_qs = users_who_answered_q[qs_combinations[i]]
0.4 answered_all = np.bitwise_and.reduce(user_sets_for_qs)
96.7 answered_all = bitset_to_list(answered_all)
0.6 qs_total = score_matrix[answered_all, :][:, qs].sum(axis=1)
0.0 user_grand_total = grand_totals[answered_all]
1.9 result[i] = corrcoef(qs_total, user_grand_total)
0.0 return result
It looks like we’ve regressed somewhat, with the bitset_to_list
operation taking up a lot of time.
Optimization 10 - Numba on bitset_to_list
Let’s convert bitset_to_list
into compiled code. To do this we can add a Numba decorator:
@numba.njit
def bitset_to_list(arr):
result = []
for idx in range(arr.shape[0]):
if arr[idx] == 0:
continue
for pos in range(64):
if (arr[idx] & (np.int64(1) << np.int64(pos))) != 0:
result.append(idx * 64 + pos)
return np.array(result)
And let’s measure this:
Benchmark #14: bitsets, with numba on bitset_to_list
Using 1000 iterations...
Avg time per iteration: 19 μs
Speedup over baseline: 1801.2x
% Time Line Contents
=====================
def compute_corrs(qs_combinations, users_who_answered_q, score_matrix, grand_totals):
0.0 num_qs = qs_combinations.shape[0]
0.0 bitset_size = users_who_answered_q[0].shape[0]
0.0 result = np.empty(qs_combinations.shape[0], dtype=np.float64)
0.3 for i in range(num_qs):
0.6 qs = qs_combinations[i]
8.1 user_sets_for_qs = users_who_answered_q[qs_combinations[i]]
11.8 answered_all = np.bitwise_and.reduce(user_sets_for_qs)
7.7 answered_all = bitset_to_list(answered_all)
16.2 qs_total = score_matrix[answered_all, :][:, qs].sum(axis=1)
1.1 user_grand_total = grand_totals[answered_all]
54.1 result[i] = corrcoef(qs_total, user_grand_total)
0.0 return result
We’ve got an 1,800x speed up over the original code. Recall that optimization 7, before Numba was introduced, got 814x. (Optimization 8 got 4142x, but that was with parallel=True
on the inner loop, so it’s not comparible to the above.)
Optimization 11 - Numba on corrcoef
The corrcoef line is again standing out as slow above. Let’s use corrcoef
decorated with Numba.
@numba.njit
def corrcoef_numba(a, b):
"""same as np.corrcoef(a, b)[0, 1]"""
n = len(a)
sum_a = sum(a)
sum_b = sum(b)
sum_ab = sum(a * b)
sum_a_sq = sum(a * a)
sum_b_sq = sum(b * b)
num = n * sum_ab - sum_a * sum_b
den = math.sqrt(n * sum_a_sq - sum_a**2) * math.sqrt(n * sum_b_sq - sum_b**2)
return np.nan if den == 0 else num / den
And benchmark:
Avg time per iteration: 11 μs
Speedup over baseline: 3218.9x
% Time Line Contents
=====================
def compute_corrs(qs_combinations, users_who_answered_q, score_matrix, grand_totals):
0.0 num_qs = qs_combinations.shape[0]
0.0 bitset_size = users_who_answered_q[0].shape[0]
0.0 result = np.empty(qs_combinations.shape[0], dtype=np.float64)
0.7 for i in range(num_qs):
1.5 qs = qs_combinations[i]
15.9 user_sets_for_qs = users_who_answered_q[qs_combinations[i]]
26.1 answered_all = np.bitwise_and.reduce(user_sets_for_qs)
16.1 answered_all = bitset_to_list(answered_all)
33.3 qs_total = score_matrix[answered_all, :][:, qs].sum(axis=1)
2.0 user_grand_total = grand_totals[answered_all]
4.5 result[i] = corrcoef_numba(qs_total, user_grand_total)
0.0 return result
Nice, another big speedup.
Optimization 12 - Numba on bitset_and
Instead of using np.bitwise_and.reduce
, we introduce bitwise_and
, and jit compile it.
@numba.njit
def bitset_and(arrays):
result = arrays[0].copy()
for i in range(1, len(arrays)):
result &= arrays[i]
return result
Benchmark #16: numba also on bitset_and
Using 1000 iterations...
Avg time per iteration: 8.9 μs
Speedup over baseline: 3956.7x
% Time Line Contents
=====================
def compute_corrs(qs_combinations, users_who_answered_q, score_matrix, grand_totals):
0.1 num_qs = qs_combinations.shape[0]
0.0 bitset_size = users_who_answered_q[0].shape[0]
0.1 result = np.empty(qs_combinations.shape[0], dtype=np.float64)
1.0 for i in range(num_qs):
1.5 qs = qs_combinations[i]
18.4 user_sets_for_qs = users_who_answered_q[qs_combinations[i]]
16.1 answered_all = bitset_and(user_sets_for_qs)
17.9 answered_all = bitset_to_list(answered_all)
37.8 qs_total = score_matrix[answered_all, :][:, qs].sum(axis=1)
2.4 user_grand_total = grand_totals[answered_all]
4.8 result[i] = corrcoef_numba(qs_total, user_grand_total)
0.0 return result
Optimization 13 - Numba on the whole function
The above is now considerably faster than the original code, with the computation spread fairly evenly out among a few lines in the loop. In fact, it looks like the slowest line is carrying out NumPy indexing, which is already pretty fast. So, let’s compile the whole function with Numba.
@numba.njit(parallel=False)
def compute_corrs(qs_combinations, users_who_answered_q, score_matrix, grand_totals):
result = np.empty(len(qs_combinations), dtype=np.float64)
for i in numba.prange(len(qs_combinations)):
qs = qs_combinations[i]
user_sets_for_qs = users_who_answered_q[qs, :]
answered_all = user_sets_for_qs[0]
# numba doesn't support np.logical_and.reduce
for j in range(1, len(user_sets_for_qs)):
answered_all *= user_sets_for_qs[j]
answered_all = np.where(answered_all)[0]
qs_total = score_matrix[answered_all, :][:, qs].sum(axis=1)
user_grand_total = grand_totals[answered_all]
result[i] = corrcoef_numba(qs_total, user_grand_total)
return result
Avg time per iteration: 4.2 μs
Speedup over baseline: 8353.2x
And now with parallel=True
:
Avg time per iteration: 960 ns
Speedup over baseline: 36721.4x
Ok, nice we are 36,000 times faster than the original code.
Optimization 14 - Numba, inline with accumulation instead of arrays
Where do we go from here?… Well, in the code above there’s still a fair amount of putting values into arrays, and then passing them around. Since we are are making the effort to optimize this code, we can look at the way corrcoef is computed, and realise that we don’t need to build up the arrays answered_all
, and user_grand_total
, we can instead accumulate the values, as we loop.
And here’s the code (we’ve also enabled some compiler optimizations, like disabling boundschecking
of arrays, and enabling fastmath
).
@numba.njit(boundscheck=False, fastmath=True, parallel=False, nogil=True)
def compute_corrs(qs_combinations, users_who_answered_q, score_matrix, grand_totals):
num_qs = qs_combinations.shape[0]
bitset_size = users_who_answered_q[0].shape[0]
corrs = np.empty(qs_combinations.shape[0], dtype=np.float64)
for i in numba.prange(num_qs):
# bitset will contain users who answered all questions in qs_array[i]
bitset = users_who_answered_q[qs_combinations[i, 0]].copy()
for q in qs_combinations[i, 1:]:
bitset &= users_who_answered_q[q]
# retrieve stats for the users to compute correlation
n = 0.0
sum_a = 0.0
sum_b = 0.0
sum_ab = 0.0
sum_a_sq = 0.0
sum_b_sq = 0.0
for idx in range(bitset_size):
if bitset[idx] != 0:
for pos in range(64):
if (bitset[idx] & (np.int64(1) << np.int64(pos))) != 0:
user_idx = idx * 64 + pos
score_for_qs = 0.0
for q in qs_combinations[i]:
score_for_qs += score_matrix[user_idx, q]
score_for_user = grand_totals[user_idx]
n += 1.0
sum_a += score_for_qs
sum_b += score_for_user
sum_ab += score_for_qs * score_for_user
sum_a_sq += score_for_qs * score_for_qs
sum_b_sq += score_for_user * score_for_user
num = n * sum_ab - sum_a * sum_b
den = np.sqrt(n * sum_a_sq - sum_a**2) * np.sqrt(n * sum_b_sq - sum_b**2)
corrs[i] = np.nan if den == 0 else num / den
return corrs
We start with parallel=False
.
Avg time per iteration: 1.7 μs
Speedup over baseline: 20850.5x
This should be compared to optimization 12 with parallel=False
, which measured as 8353x.
Now, with parallel=True
.
Avg time per iteration: 210 ns
Speedup over baseline: 170476.3x
Nice, we’ve got to 170,000x the speed of the Python baseline.
Conclusion
We’ve been able to get most of the things that made the optimized Rust code fast, notably, bitsets, SIMD, and loop-level parallelism, thanks to Numba and NumPy. First, we made the original Python code considerably faster, with a few helper functions JIT compiled, but in the end we JITed the whole thing, and optimized the code for that. We took a trial and improvement approach, using profiling to focus our efforts on the slowest lines of code. We showed that we can use Numba to gradually mix JIT compiled code into our Python codebase. We can drop this code into our existing Python codebase immediately. However, we didn’t get to the 180,000x speed up of the optimized Rust code, and we rolled our own correlation and bitsets implementation, whereas the Rust code was able to use libraries for these, while remaining fast.
This was a fun exercise, that hopefully shows off some useful tools in the Python ecosystem.
Would I recommend one approach over the other? No, it depends on the situation.
Notes
The full code is here, on GitHub.