
Processing large data: DuckDB vs. datatable

February 17, 2023

I read Rasmus Bååth’s latest post, Three strategies to tackle Big Data in R and Python, and I thought I would run my own benchmark of DuckDB on the data discussed in my review of Exploratory Desktop some years ago.1 I’m also interested in trying out the fread function from the Python datatable package.

[2023-04-06]
See also In-process, in-memory databases for a more detailed comparison of DuckDB vs. SQLite.

DuckDB has long been a database backend I wanted to try, along with TileDB (I followed Dirk Eddelbuettel’s work on its R frontend, but go check out his website for more information; I also warmly recommend his online course on Data Science Programming Methods).

Let’s benchmark loading some data then. I fetched data for year 2021 (1.6 GB), which includes 3,669,928 records according to the codebook.2 For the sake of comparison, loading the same data file into SQLite took less than 1.5 minutes, with little RAM caching. Simple queries are a breeze as expected, although I forgot to enable proper timing (.timer on):

sqlite> .mode csv
sqlite> .import nat2021us.csv nat2021us
sqlite> SELECT dob_mm, COUNT(*) AS count FROM nat2021us GROUP BY dob_mm ORDER BY CAST(dob_mm AS INTEGER) ASC;
dob_mm,count
1,277533
2,266725
3,303139
4,293630
5,301343
6,314024
7,326611
8,330740
9,326280
10,315909
11,302309
12,311685
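
Since I forgot to turn the shell timer on, here is a minimal sketch of timing that same query from Python with the standard sqlite3 module, assuming the import above went into a database file named natality.db (a name I made up for the example):

import sqlite3
import time

con = sqlite3.connect("natality.db")  # hypothetical file name for the imported database
tic = time.time()
rows = con.execute(
    "SELECT dob_mm, COUNT(*) AS count FROM nat2021us "
    "GROUP BY dob_mm ORDER BY CAST(dob_mm AS INTEGER)"
).fetchall()
toc = time.time()
con.close()
print(f"Elapsed time: {(toc-tic) * 10**3:.2f} ms")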

Now using DuckDB:

import duckdb
import time

tic = time.time()
# connect to a persistent, on-disk database file
con = duckdb.connect(database='duckdb_python_db')
con.execute('drop table if exists nat2021us')
# materialize the CSV as a table, letting DuckDB infer the schema
# (single quotes: double-quoted strings denote identifiers in DuckDB SQL)
con.execute("""
  create table nat2021us as
  select * from read_csv_auto('nat2021us.csv')
""")
toc = time.time()
con.close()
print(f"Elapsed time: {(toc-tic) * 10**3:.2f} ms")

The progress bar quickly reached 99%, but then we had to wait about one minute for the data to be available at the IPython prompt:

100% ▕████████████████████████████████████████████████████████████▏
Elapsed time: 62689.09 ms
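
For completeness, the monthly counts can be reproduced from the DuckDB connection itself; a quick sketch, reusing the database file created above:

import duckdb

con = duckdb.connect(database='duckdb_python_db')
# same aggregation as in the SQLite session; it returns instantly once the table is materialized
res = con.execute("""
  select dob_mm, count(*) as count
  from nat2021us
  group by dob_mm
  order by cast(dob_mm as integer)
""").fetchall()
con.close()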

And here is the datatable approach:

import datatable as dt
import time

tic = time.time()
# fread parses the CSV in parallel and infers column types on the fly
nat2021us = dt.fread("nat2021us.csv")
toc = time.time()
print(f"Elapsed time: {(toc-tic) * 10**3:.2f} ms")

I was a little bit surprised at first: it took less than 5 seconds (!), without much more RAM caching:

100% |██████████████████████████████████████████████████| Reading data [done]
Elapsed time: 4136.24 ms

Then again, results are computed at the speed of light:

In [4]: nat2021us[:, dt.count(), dt.by(dt.f.dob_mm)]
Out[4]:
   | dob_mm   count
   |  int32   int64
-- + ------  ------
 0 |      1  277533
 1 |      2  266725
 2 |      3  303139
 3 |      4  293630
 4 |      5  301343
 5 |      6  314024
 6 |      7  326611
 7 |      8  330740
 8 |      9  326280
 9 |     10  315909
10 |     11  302309
11 |     12  311685
[12 rows x 2 columns]

Clearly, the datatable package is the winner here. I’ll keep these quick results in mind the next time I have to load large datasets in Python.
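
If downstream processing calls for pandas, fread also makes a handy loader since a datatable Frame converts directly; a minimal sketch, assuming pandas is installed:

# hand the Frame over to pandas for further analysis
df = nat2021us.to_pandas()
df["dob_mm"].value_counts().sort_index()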

♪ Latarnik • Marianna


  1. See also Processing large CSV files.

  2. National Center for Health Statistics (2021). Data File Documentations, U.S. Natality Data Files, 2021 (machine readable data file and documentation, CD-ROM Series), National Center for Health Statistics, Hyattsville, Maryland.

See Also

» Processing large CSV files
» Regression diagnostics and influence measures
» Data tables in Python
» A grammar of graphics for Python
» Manipulating images using Numpy