Here are some notes on user2014 (no, it's not one of the anonymous posters on Stack Exchange!). The GitHub homepage can be found at https://github.com/user2014.
I wish I could attend the conference, but I am a bit short of time at the moment, so I'm just following its progress on Twitter. I have just started reading the material available on the conference site, especially the tutorials that were made available online.
Here is a collection of useful links I found on Twitter:
As someone used to say, it is good from time to time to re-educate people who started learning R long ago. Recently, I followed one of the Coursera data science courses, Getting and Cleaning Data by Jeff Leek, and I learned a lot about data munging, even though I think I have a fair knowledge of R. Up to now, I've always been a great fan of base R, using few extra packages besides Hmisc, rms, psych, and caret. I like the formula interface, and I always found it very convenient that we could use almost the same formula with lattice graphics, while subset() provides me with most of what I need to filter observations and variables in a working data frame. Since rms enhances those functionalities, I was quite happy.
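A minimal sketch of that base R workflow, assuming the built-in mtcars data set as a stand-in (it is not part of the conference material): subset() filters rows and columns, and a single formula object is reused for a model fit and a lattice plot.

```r
# Sketch only: mtcars and the variable choices are illustrative assumptions.
library(lattice)  # recommended package shipped with standard R installations

# subset() filters observations and selects variables in one call.
d <- subset(mtcars, cyl %in% c(4, 6), select = c(mpg, wt, cyl))

# The same formula object drives both the model and the plot.
f <- mpg ~ wt
fit <- lm(f, data = d)
p <- xyplot(f, data = d, groups = cyl, auto.key = TRUE)

coef(fit)  # intercept and slope for wt
# Printing p (for instance by typing p at the console) draws the plot.
```

Storing the formula in f is only a convenience here; writing mpg ~ wt directly in both calls works just as well.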
It turns out that Hadley Wickham started developing new tools long ago (it all started with ggplot, now ggplot2), and we now have a plethora of new packages that help with tidying and processing data. There's even an online textbook on Advanced R, which I hope will remain open access. I've never spent much time with ggplot2 because at some point I felt that it was too much at odds with base R. I like a programming language to exhibit some kind of internal consistency, and these new packages introduced too much 'new vocabulary'. Well, ok, internal consistency is not the best way to think about R contributed packages (I'm not speaking of core packages), and as discussed by Max Kuhn, many packages do not follow R conventions. So what's the point? Either I'm too old to update my knowledge of R, or I'm just reluctant to switch to another paradigm for data science with R. I guess I just have to try the new kids on the block: data.table, dplyr, tidyr, and magrittr, and see how I can integrate them into my own workflow.
One thing that I always found quite annoying is typing the name of a data frame and seeing all of its data in the R console (often reaching max.print values, which amounts to 99999 actually!). Of course, head(d) is a better option, but we now have two defaults S4
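For reference, here is a small base R sketch of the printing behaviour mentioned above. The iris data frame is an assumed stand-in for d, and lowering max.print is just one possible workaround, not something taken from the conference material.

```r
d <- iris  # built-in data frame, used here as a stand-in

mp <- getOption("max.print")  # the session's current print limit
head(d)                       # first six rows only
str(d)                        # compact overview of columns and types

# Temporarily lower the limit so that typing d no longer floods the console.
old <- options(max.print = 50)
print(d)      # output is cut off once the limit is reached
options(old)  # restore the previous setting
```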
By the way, data.table::fread() is really fast. For instance, using the flights data from Hadley Wickham's dplyr tutorial, I got the following timings:
> system.time(d <- fread("flights.csv"))
   user  system elapsed
  0.265   0.013   0.282
> system.time(d <- tbl_df(read.csv("flights.csv", stringsAsFactors = FALSE)))
   user  system elapsed
  5.825   0.085   6.031
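Since flights.csv may not be at hand, the comparison can be approximated with a throwaway file. This is only a sketch: the generated columns (id, x, g) and the file size are arbitrary assumptions, the fread() timing is skipped when data.table is not installed, and the numbers will differ from those reported above.

```r
# Write a temporary CSV so the comparison is self-contained.
set.seed(1)
n <- 1e5
tmp <- tempfile(fileext = ".csv")
write.csv(
  data.frame(id = seq_len(n), x = rnorm(n), g = sample(letters, n, TRUE)),
  tmp, row.names = FALSE
)

# Base R reader.
t_base <- system.time(d1 <- read.csv(tmp, stringsAsFactors = FALSE))

# data.table reader, only if the package is available.
if (requireNamespace("data.table", quietly = TRUE)) {
  t_fread <- system.time(d2 <- data.table::fread(tmp))
  print(rbind(read.csv = t_base[1:3], fread = t_fread[1:3]))
} else {
  print(t_base)
}

unlink(tmp)  # clean up the temporary file
```

A single run on a small file is noisy, so repeating the timings a few times gives a fairer picture.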
Finally, what about the new RMarkdown? Well, that's just really promising and I look forward to using it for my courses.