As a sequel of one of my last benchmarking post, here’s a rough estimate of Polars vs. Datatable performance when reading a 34 Mb file (NYC Flights Dataset, available from avrious sources; see e.g., this post):
import time
import datatable
import polars
tic = time.time()
flights = polars.read_csv("nycflights.csv", null_values="NA")
toc = time.time()
print(f"Elapsed time: {(toc-tic) * 10**3:.2f} ms")
## => Elapsed time: 54.61 ms
tic = time.time()
flights = datatable.fread("nycflights.csv")
toc = time.time()
print(f"Elapsed time: {(toc-tic) * 10**3:.2f} ms")
## => Elapsed time: 57.60 ms
Polars also offers a lazy CSV reader using scan_csv
, which is way faster (1.22 ms). #python