Here are some notes on useR 2014 (no, it's not one of the anonymous posters on Stack Exchange!). Its GitHub homepage can be found at https://github.com/user2014.
I wish I could attend the conference, but I am a bit short of time at the moment, so I'm just following its progress on Twitter. I have just started reading the material available on the conference site, especially the tutorials that were made available online.
Here is a collection of useful links I found on Twitter:
And here are some additional links by Max Kuhn and Joseph Rickert.
As someone once said, it is good from time to time to re-educate people who started learning R long ago. I recently followed one of Coursera's data science tracks, Getting and Cleaning Data by Jeff Leek, and I learned a lot about data munging, although I think I already have a fair knowledge of R. Up to now, I've always been a great fan of base R, using few extra packages besides Hmisc, rms, psych, and caret. I like the formula interface, and I always found it very convenient that almost the same formula can be used with aggregate(), {g}lm(), and lattice graphics, while subset() provides most of what I need to filter observations and variables in a working data frame. Since Hmisc and rms enhance those functionalities, I was quite happy.
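To make this concrete, here is a minimal sketch with the built-in mtcars data (just an illustration of the point, not taken from any package's documentation) showing how the same formula-style interface carries over from aggregation to modelling to lattice graphics, with subset() doing the filtering:

## Minimal sketch with the built-in mtcars data: the same formula reappears
## in aggregation, modelling, and lattice graphics.
library(lattice)

d <- subset(mtcars, hp > 100, select = c(mpg, wt, cyl))      ## filter rows and columns
aggregate(mpg ~ cyl, data = d, FUN = mean)                   ## group means
fm <- lm(mpg ~ wt + cyl, data = d)                           ## linear model
xyplot(mpg ~ wt | factor(cyl), data = d, type = c("p", "r")) ## conditioned scatterplots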
It turns out that Hadley Wickham started developing new tools long ago (it all started with ggplot, now ggplot2), and we now have a plethora of new packages that help with tidying and processing data. There's even an online textbook on Advanced R, which I hope will remain in open access. I've never spent much time with plyr or ggplot2 because at some point I felt they were too much at odds with base R. I like a programming language to exhibit some kind of internal consistency, and these new packages introduced too much new vocabulary. Well, OK, internal consistency is not the best way to think about R contributed packages (I'm not speaking of core packages), and as discussed by Max Kuhn, many packages do not follow R conventions. So what's the point? Either I'm too old to update my knowledge of R, or I'm just reluctant to switch to another paradigm for data science with R. I guess I just have to try the new kids on the block, namely data.table, dplyr, tidyr, and magrittr, and see how I can integrate them into my own workflow.
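As a first taste of what that paradigm shift looks like, here is a small sketch on the built-in mtcars data (not my actual workflow, and assuming a dplyr version that re-exports magrittr's %>% pipe), contrasting a base R idiom with the equivalent dplyr pipeline:

## Base R idiom vs. dplyr pipeline on the built-in mtcars data.
library(dplyr)

## base R
aggregate(mpg ~ cyl, data = subset(mtcars, hp > 100), FUN = mean)

## dplyr + magrittr pipe
mtcars %>%
  filter(hp > 100) %>%
  group_by(cyl) %>%
  summarise(mpg = mean(mpg))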
One thing that I have always found quite annoying is typing the name of a data frame and seeing all of its data scroll by in the R console (often hitting the max.print limit, which defaults to 99999!). Of course, head(d) is a better option, but we now have two default print methods that simply show the first entries after reading a data file, say a CSV file: dplyr::tbl_df and data.table::fread().
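For the record, here is a quick sketch of the difference in printing behaviour, using the built-in iris data rather than a CSV file (max.print can of course also be lowered by hand):

## Printing a plain data frame floods the console (up to getOption("max.print")),
## whereas a tbl_df shows only the first rows instead of the whole table.
options(max.print = 1000)   ## one way to tame base R printing
library(dplyr)

iris                        ## prints all 150 rows
head(iris)                  ## the usual workaround
tbl_df(iris)                ## prints only the first 10 rows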
By the way, data.table::fread() is really fast. For instance, using Hadley Wickham's dplyr tutorial, I got the following timings:
> system.time(d <- fread("flights.csv"))
   user  system elapsed
  0.265   0.013   0.282
> system.time(d <- tbl_df(read.csv("flights.csv", stringsAsFactors = FALSE)))
   user  system elapsed
  5.825   0.085   6.031
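For completeness, the transcript above assumes that data.table and dplyr are loaded and that flights.csv (the file used in the dplyr tutorial) sits in the working directory; a self-contained version of the comparison would look roughly like this:

## Rough, self-contained version of the timing comparison shown above.
## The file name flights.csv comes from the dplyr tutorial and is assumed
## to be present in the current working directory.
library(data.table)
library(dplyr)

system.time(d1 <- fread("flights.csv"))
system.time(d2 <- tbl_df(read.csv("flights.csv", stringsAsFactors = FALSE)))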
Yet another thing that has bothered me is the advance of browser-based computing, e.g. the IPython Notebook, or the use of JavaScript to produce graphics as in Vega, D3.js, etc. While I appreciate the merits of such approaches, I have always found it more convenient to stay in a console or in an Emacs buffer to process data. I also find it annoying that Julia has no built-in graphics device and must rely on Winston or Gadfly, for example. What Julia is missing above all is interactive or dynamic data visualization, which to me is a key component of exploratory data analysis. So I'm happy to see that ggvis produces SVG graphics but can still be used through RStudio.1
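Below is the minimal ggvis example (essentially the canonical scatterplot from the package's introductory material), which renders as SVG in the RStudio viewer or in a browser:

## Basic ggvis scatterplot, rendered as SVG in RStudio's viewer or a browser.
library(ggvis)

mtcars %>%
  ggvis(~wt, ~mpg) %>%
  layer_points()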
Finally, what about the new RMarkdown? Well, that’s just really promising and I look forward to using it for my courses.
For interactive visualization, be sure to install ggvis from GitHub instead of CRAN, since the CRAN version will likely crash with RStudio. ↩︎
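Presumably something along these lines, assuming the development version lives in the rstudio/ggvis repository:

## Install the development version of ggvis from GitHub
## (repository name rstudio/ggvis assumed).
install.packages("devtools")
devtools::install_github("rstudio/ggvis")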