Back in 2008 I started looking at what’s available in Python for statistical analysis on OS X. A few years later, there was already a growing stack of packages. However, I never felt comfortable using Python for data processing other than basic web scraping or bioinformatics stuff. I always found R much more suitable for structured data. Meanwhile, I’ve been following the progress of the h2o machine learning library for a long time, and I used to rely on it a little as a replacement for the caret package, now part of (or replaced with) tidymodels. Everything must be tidy, right? The documentation is actually quite good (PDF version), and there are many tutorials available on GitHub. I know there are bindings for Scala and Java as well, but I’m not really interested in those languages. Let’s see how it goes when using Python for the next few months, and I shall write another blog post of this kind.
What’s the deal after all? Basically, I don’t really like using the scikit-learn package. This has nothing to do with the authors or the package itself; it is just me not being well versed in using a mix of functions and methods in an interpreted OO language. H2o and weka are great alternatives in my view, and they are available for R too. Being able to work with such great libraries from both R and Python is a huge win, IMHO. Likewise, I don’t really like Pandas, but there’s Spark, which also handles data frame-like objects.
Basically, Spark is the successor of Hadoop, with built-in modules for processing streaming or SQL-like data, machine learning or even graph processing. It runs in the console through a Scala or Python shell. If you use Python, don’t forget to set the version of Python you want to use in your shell config, e.g.,
export PYSPARK_PYTHON=python3 if you’re using Bash or Zsh. If you are an R and/or RStudio user, you probably already know the sparklyr package. If you’re into Python, the corresponding standalone package is pyspark (beware that it’s a large download). While data frame objects are available in Spark, the core data structure is the Resilient Distributed Dataset (RDD), an immutable (i.e., read-only) partitioned collection of records. Thanks to in-memory processing, it avoids much of the time spent on disk reads and writes that is commonly encountered with Hadoop. The usual file format for data import/export is Parquet, while Spark SQL acts as the main query language to access those columnar data structures. However, you can still rely on data frames. So far, so good.
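To make the "immutable partitioned collection of records" idea a bit more concrete, here is a toy sketch in plain Python. It is not Spark's API: ToyRDD, from_records and the round-robin split are names and choices I made up for illustration, and unlike Spark nothing here is lazy or distributed. It only shows that a transformation returns a new collection and leaves the original partitions untouched.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Tuple


@dataclass(frozen=True)
class ToyRDD:
    """Hypothetical stand-in for an RDD: a read-only tuple of partitions."""

    partitions: Tuple[tuple, ...]

    @classmethod
    def from_records(cls, records: Iterable, num_partitions: int = 2) -> "ToyRDD":
        records = tuple(records)
        # Round-robin split of the records across partitions (an arbitrary choice).
        parts = tuple(records[i::num_partitions] for i in range(num_partitions))
        return cls(parts)

    def map(self, func: Callable) -> "ToyRDD":
        # Apply func partition by partition and return a *new* ToyRDD;
        # the current object is never modified.
        return ToyRDD(tuple(tuple(func(x) for x in part) for part in self.partitions))

    def collect(self) -> list:
        # Gather the records of all partitions into a single list.
        return [x for part in self.partitions for x in part]


numbers = ToyRDD.from_records(range(6), num_partitions=3)
squares = numbers.map(lambda x: x * x)

print(sorted(numbers.collect()))
print(sorted(squares.collect()))
```

Because the class is a frozen dataclass holding tuples, trying to reassign numbers.partitions raises an error, which is roughly the read-only guarantee described above. In real Spark, transformations such as map are also evaluated lazily, which this toy version ignores.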
Finally, plotly is a handy replacement for matplotlib or seaborn, and I can imagine that dash could easily replace most of Shiny’s functionality at some point. All that to say that I am looking for strong and reliable libraries available cross-platform, in R or Python. Plotly is more a matter of convenience than a real preference, to be honest. While I greatly appreciate the quality of D3 and Vega rendering, I don’t like the syntax at all, let alone the verbose way of inputting and preprocessing data. Having some convenient wrappers to use the plotly backend from R or Python is, however, fine with me. The fact that you can build a custom ggplot graphical object and just use
ggplotly() to get a plotly graph displayed in your browser or a Shiny app is clearly a win.
Since I’m doing quite a lot of stuff in Python these days, why not take a closer look at the data mining stack available there? I will probably write separate posts to highlight what features I find particularly useful (or not) and some of the workflows built upon these tools.