Python for statistical computing

2011-02-07

Following up on my previous post on the use of Lisp for statistical computing, here are some links for statistics with Python.

Most of the packages listed below were gathered from stats.stackexchange.com and MetaOptimize.

The two core packages are obviously NumPy and SciPy, which provide the infrastructure for handling N-dimensional array objects and tools for numerical computing à la Matlab. Combined with Matplotlib, they make a complete platform for scientific computing. SciPy already includes some common routines for statistical analysis, but see also the Cookbook, which collates worked examples of common tasks.
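To give a feel for the NumPy and SciPy combination, here is a minimal sketch. It assumes reasonably recent `numpy` and `scipy` installs (the APIs shown postdate this post), and the two samples are made up for illustration.

```python
import numpy as np
from scipy import stats

# Made-up data: two small samples stored as NumPy arrays.
a = np.array([5.1, 4.9, 5.6, 5.8, 6.0, 5.4])
b = np.array([4.2, 4.8, 5.0, 4.5, 4.9, 4.4])

# Vectorized descriptive statistics; ddof=1 gives the sample standard deviation.
mean_a = a.mean()
sd_a = a.std(ddof=1)

# Welch's two-sample t-test (does not assume equal variances).
t_stat, p_value = stats.ttest_ind(a, b, equal_var=False)

print(f"mean={mean_a:.3f} sd={sd_a:.3f} t={t_stat:.3f} p={p_value:.4f}")
```

The arrays carry the data, and `scipy.stats` supplies the test; nothing here depends on R.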

The cool thing is that we can benefit from R's built-in commands just by using RPy; see e.g. Using Python (and R) to calculate Linear Regressions.
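RPy needs a working R installation, so it is not shown here. For comparison, a simple least-squares line, roughly what R's `lm(y ~ x)` would fit, can also be obtained with NumPy alone. This is only a sketch with made-up data, not the approach used in the linked post.

```python
import numpy as np

# Made-up data, roughly following y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])

# Fit a degree-1 polynomial by least squares; coefficients come back
# highest degree first, i.e. (slope, intercept).
slope, intercept = np.polyfit(x, y, 1)

# Fitted values and residuals, as one would inspect after lm() in R.
fitted = slope * x + intercept
residuals = y - fitted

print(f"slope={slope:.3f} intercept={intercept:.3f}")
```

What RPy adds on top of this is access to R's model summaries and diagnostics, which plain NumPy does not provide.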

Other packages of interest:

  • pandas (Pythonic cross-section, time series, and statistical analysis) is a set of fast NumPy-based data structures optimized for panel, time series, and cross-sectional data analysis, with an emphasis on econometric applications.
  • larry provides labelled arrays plus additional functions such as movingsum, ranking, merge, shuffle, zscore, demean, ...
  • python-statlib provides basic descriptive statistics for the Python programming language.
  • scikits.statsmodels implements common statistical models (OLS/GLS, GLM, M-estimators, etc.). I really like this package: the syntax is clean and it feels like we never left R.
  • scikits.learn is a full-featured toolbox for machine learning (under active development) with pretty nice documentation; it includes common techniques for supervised learning (classification and regression) with penalization/regularization methods.
  • PyMC is useful for Bayesian estimation using MCMC methods and offers nice plotting utilities, in addition to its core set of algorithms for estimating Bayesian models.
  • Theano is a library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently.
  • mlpy is another ML library, for preprocessing, clustering, predictive classification, regression and feature selection.
  • Bolt (Bolt Online Learning Toolbox) features discriminative learning of linear predictors (e.g. SVM or logistic regression) using fast online learning algorithms. Bolt is aimed at large-scale, high-dimensional and sparse machine-learning problems, in particular those encountered in information retrieval and natural language processing.
  • MDP (Modular toolkit for Data Processing) provides a library of widely used data processing algorithms that can be combined according to a pipeline analogy to build more complex data processing software. Implemented algorithms include Principal Component Analysis (PCA), Independent Component Analysis (ICA), Slow Feature Analysis (SFA), and many more.
  • PyMVPA is another package for ML and pattern classification analyses of large datasets, applied to neuroimaging studies.
  • PyML is yet another ML library.
  • NLTK provides a set of utilities for dealing with linguistic data, along with documentation for research and development in natural language processing and text analytics.
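Several of the helpers named in the list (demean, zscore, movingsum) are easy to picture in plain NumPy. The functions below are hypothetical stand-ins written for illustration; they borrow the names only and are not larry's actual API, which works on labelled arrays.

```python
import numpy as np

def demean(x):
    """Subtract the mean so the result averages to zero."""
    return x - x.mean()

def zscore(x):
    """Center and scale to unit sample standard deviation (ddof=1)."""
    return (x - x.mean()) / x.std(ddof=1)

def movingsum(x, window):
    """Sum over each full window of length `window`, via cumulative sums."""
    c = np.concatenate(([0.0], np.cumsum(x)))
    return c[window:] - c[:-window]

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
print(demean(x))
print(zscore(x))
print(movingsum(x, 2))
```

A labelled-array package would keep row and column labels attached to such results; this sketch only shows the arithmetic.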

Of course, there are also some full-featured applications, like Orange, which aims to provide an interface comparable to Weka's for machine learning and data mining. Another application that I discovered two years ago is VisTrails, which is

an open-source scientific workflow and provenance management system developed at the University of Utah that provides support for data exploration and visualization. Whereas workflows have been traditionally used to automate repetitive tasks, for applications that are exploratory in nature, such as simulations, data analysis and visualization, very little is repeated---change is the norm. As an engineer or scientist generates and evaluates hypotheses about data under study, a series of different, albeit related, workflows are created while a workflow is adjusted in an interactive process. VisTrails was designed to manage these rapidly-evolving workflows.

Finally, Mayavi is great for data visualization, especially in 3D. It relies on VTK and is included in the Enthought distribution of Python, together with Chaco for 2D plotting. To get an idea, see Travis Vaught's nice screencast, Multidimensional Data Visualization in Python - Mixing Chaco and Mayavi.

Useful enhanced shells for Python include IPython, IEP, and Spyder. And if you like syntax highlighting in your console, then bpython is just fine (and it works like a charm on OS X).
