Data tables in Python

January 31, 2023

Although Pandas is a great package to manipulate rectangular data structures, I often find myself a bit lost with its syntax (the use of loc or iloc, for instance). I learned about the datatable package a while ago, then as usual forgot about it. Here is a short overview of how I generally use it for biostatistical stuff I happen to do during my free time.

Let’s get some data first. We will consider the ToothGrowth dataset from R builtin’s, which deals with the effect of vitamin C on tooth growth in guinea pigs:

The response is the length of odontoblasts (cells responsible for tooth growth) in 60 guinea pigs. Each animal received one of three dose levels of vitamin C (0.5, 1, and 2 mg/day) by one of two delivery methods, orange juice or ascorbic acid (a form of vitamin C and coded as VC).

Fortunately, it is available in Python thanks to the rdatasets package:

from rdatasets import data, descr
from plotnine import *
import datatable as dt

print(descr("ToothGrowth"))
dataset = data("ToothGrowth")
dataset.head()
##     len supp  dose
## 0   4.2   VC   0.5
## 1  11.5   VC   0.5
## 2   7.3   VC   0.5
## 3   5.8   VC   0.5
## 4   6.4   VC   0.5

For those familiar with the R data.table package, the syntax will be easy to grasp since it is 95% the same in the most use cases. For example, if we want to aggregate data using group means, we can proceed as follows:

d = dt.Frame(dataset)
d[:, dt.mean(dt.f.len), dt.by("supp", "dose")]
##    | supp      dose      len
##    | str32  float64  float64
## -- + -----  -------  -------
##  0 | OJ         0.5    13.23
##  1 | OJ         1      22.7
##  2 | OJ         2      26.06
##  3 | VC         0.5     7.98
##  4 | VC         1      16.77
##  5 | VC         2      26.14
## [6 rows x 3 columns]

Other variations are possible, e.g. count the number of unique or missing values, summarize the data table using mean, min or max, etc. Here are tow further examples:

d.mean()
##    |     len     supp     dose
##    | float64  float64  float64
## -- + -------  -------  -------
##  0 | 18.8133       NA  1.16667
## [1 row x 3 columns]

d[:, dt.count(), dt.by("supp")]
##    | supp   count
##    | str32  int64
## -- + -----  -----
##  0 | OJ        30
##  1 | VC        30
## [2 rows x 2 columns]

Like the R data.table package, there’s an over-optimized fread function which can handle CSV, Excel, and many more formats, while automagically detecting the appropriate type of variable. Moreover it is supposed to handle large data files.¹

The H2O.ai team discusses the basic features of the datatable package on their blog: Introducing DatatableTon – Python Datatable Tutorials & Exercises. You may also like An Overview of Python’s Datatable package.

As a sequel to my previous post on seaborn and plotnine, let’s see how we can reproduce the following plot (left panel), which was made in R:

R code is shown below:²

r <- aggregate(len ~ dose + supp, data = ToothGrowth, mean)
p <- ggplot(data = ToothGrowth, aes(x = dose, y = len, color = supp)) +
       geom_point(position = position_jitterdodge(jitter.width = .1, dodge.width = 0.25)) +
       geom_line(data = r, aes(x = dose, y = len, color = supp)) +
       scale_color_manual(values = c("steelblue", "orange")) +
       guides(color = FALSE) +
       geom_dl(aes(label = supp), method = list("smart.grid", cex = .8)) +
       labs(x = "Dose (mg/day)", y = "Length (oc. unit)")
p

And here is the Python version:

r = d[:, dt.mean(dt.f.len), dt.by("supp", "dose")]
(ggplot(d, aes(x="dose", y="len", color="supp"))
  + geom_point(position=position_jitterdodge(jitter_width=.1, dodge_width=0.25))
  + geom_line(r, aes(x="dose", y="len", color="supp"))
  + scale_color_manual(values=["steelblue", "orange"], guide=False)
  + labs(x="Dose (mg/day)", y="Length (oc. unit)")
  + theme_minimal())

Not much of a difference either way. See the final result on the right panel of the previous figure.

♪ Badflower • Ghost

It would be interesting to benchamrk datatable.fread against R data.table or pandas.read_csv, like I did in my review of Exploratory Desktop. ↩︎
Shorter version for the impatient: lattice::xyplot(len ∼ dose, data = ToothGrowth, groups = supp, type = c ("p", "a")). ↩︎

aliquote.org

Data tables in Python

See Also