Python for quick statistical computing

February 21, 2023

Python has a lot of base packages that can be used to perform most of our statistical computing tasks. There’s a good review of what’s available when you install Python in Serious Python, by Julien Danjou (No Starch Press, 2018), and often times you don’t really need extra stuff.

Here are two small illustrations of my wanderings with Python for basic statistical stuff. For a project at work, I needed a quick implementation of bootstrapping. This was just to ensure that the statistic I was computing on the full dataset was reliable enough. I was not interested in drawing confidence intervals or anything more fancy from the bootstrapped statistics. Initially, I tried using the asyncio module, but finally I ended up using multiprocessing (see Multi-blast and parallel processing for another use of this module).

Here’s my code:

import multiprocessing

B = 500
boot = []
p = multiprocessing.Pool(processes=multiprocessing.cpu_count() - 2)
tic = time.time()
data = numpy.loadtxt(f)
n = data.shape[0]
idx = numpy.random.choice(n, (B, n))
boot.append({foo}, (data[i, :] for i in idx.tolist())))
toc = time.time()
print(f">>> Elapsed time: {toc - tic:.2f}s")

Replace {foo} with the function that returns a statistic of interest.

Of course, for serious work you would need to compute 95 or 99% percentiles as well as correct for bias and skewness of the bootstrap distribution (BCA correction).

Here is another case: a toy implementation of k-fold cross-validation. I know various libraries, including the scikit-learn, offers much more elaborated solution to the design of cross-validated statistics, but this time I was in a hurry and I just needed a quick way to rely on training/test datasets. The basic idea was just to create a partition of my dataset using numpy and a reducing function to rearrange the train and test datasets. At a glance, it looks like this:

from functools import reduce
from statistics import mean

import numpy as np

x = np.random.normal(0, 1, 100)
partition = np.array_split(x, 10)

for k in range(len(partition)):
    curr = partition[k]
    other = reduce(
        lambda x, y: np.concatenate((x, y)),
        [x for i, x in enumerate(partition) if i != k],
    print(f"curr ({k}) = {mean(curr)}, other = {mean(other)}")

Again, there’s room for improvement since usually we would first shuffle the data indices before partitioning the array in k sets.

♪ Cinderella • Don’t Know What You Got

See Also

» Processing large data: Duckdb vs. datatable » Regression diagnostics and influence measures » Data tables in Python » A grammar of graphics for Python » Manipulating images using Numpy