UseR 2014

July 4, 2014

Here are some notes on user2014 (no it’s not one of the anonymous poster on Stack Exchange!). The GitHub homepage can be found at https://github.com/user2014.

I wish I could attend the conference but I am a bit short of time at the moment, so I’m just following its progress on Twitter. I just started reading material available on the conference site, especially the tutorials that were made available on line.

Here is a collection of useful links I found on Twitter:

sgrifter: Using random forests in #rstats? @JohnEhrlinger has a nice application of ggplot2 https://t.co/SIoMT3EbHq #useR2014
ledell: Presenting my new R package for ensemble learning, subsemble, at #user2014 in 1 hr. Now available on CRAN: http://t.co/GBJMlra67a #rstats
TrestleJeff: Couldn’t make it to #user2014? Here’s a preview of the slides from my rmarkdown talk tomorrow morning. http://t.co/X9xYyWPJZA #rstats
revodavid: Slides from my #user2014 talk, “R and reproducibility - a proposal” http://t.co/72bLVsk1uF #rstats
MMaechler: http://t.co/VDc1Mszf2b slides of my talk at #useR2014… Yes, I should find a more modern publication venue!
_inundata: This is cool → statsTeachR open-access online repo of modular lesson plans for teaching statistics using R http://t.co/3Vl7PVmc85 #useR2014
seandavis12: #diigo: useR 2014: Fostering the next generation of Open Science with R http://t.co/oNFf61WhEt
ramnath_vaidya: Interactive slides from my Interactive Visualization presentation at #user2014. http://t.co/sLLUmy1he5 #rstats http://t.co/OMS1TFkzAM
revodavid: Getting a demo of example apps built in OpenCPU from @JeroenOoms . Try them here: http://t.co/smLnPQ2yA4 #rstats #user2014
hspter: RCloud looks awesome! https://t.co/SMqgF8JZsR #user2014
DataJunkie: Some great examples of rcharts visualizations: http://t.co/1FfPGtBFu3 #Rstats #useR2014

And here are some additional links by Max Kuhn and Joseph Rickert.

As someone used to say, it is good from time to time to re-educate people who started learning R long ago. Recently, I followed one of Coursera data science tracks, Getting and Cleaning Data by Jeff Leek, and I learned a lot about data munging, although I think I have a fair knowledge of R. Up to now, I’ve always been a great fan of base R, using few extra packages beside Hmisc, rms and psych, and caret. I like the formula interface, and I always found it very convenient that we could use almost the same formula with aggregate(), {g}lm(), and lattice graphics, while subset() provides me with most of what I need to filter observations and variables in a working data frame. Since Hmisc and rms enhance those functionalities, I was quite happy.

It turns out that Hadley Wickham started developing new tools long ago (it all started with ggplot, now ggplot2), and we now have a plethora of new packages that help tidying and processing data. There’s even an on-line textbook on Advanced R, which I hope will remain in open access. I’ve never spent too much time in plyr or ggplot2 because at some point I felt that it was too much at an angle with base R. I like that a programming language exhibits some kind of internal consistency, and these new packages introduced too much ‘new vocabularies’. Well, ok, internal consistency is not the best way to think about R contributed packages (I’m not speaking of core packages), and as discussed by Max Kuhn many packages do not follow R conventions. So what’s the point? Either I’m too old to update my knowledge of R, or I’m just reluctant to switch to another paradigm for data science with R. I guess I just have to try the new kids on the blocks: data.table, dplyr, tidyr, and magrittr, and see how I can integrate them in my own workflow.

One thing that I always found quite annoying is typing the name of a data frame and seeing all of its data in the R console (often reaching max.print values, which amounts to 99999 actually!). Of course, head(d) is a better option, but we now have two defaults S4 print methods that will simply print the first entries when reading data file, say a CSV file: dplyr::tbl_df and data.table::fread().

By the way, data.table::fread() is really fast. For instance, using Hadley Wickham’s dplyr tutorial, I got the following timings:

> system.time(d <- fread("flights.csv"))
utilisateur     système      écoulé 
      0.265       0.013       0.282 
> system.time(d <- tbl_df(read.csv("flights.csv", stringsAsFactors = FALSE)))
utilisateur     système      écoulé 
      5.825       0.085       6.031

Yet another think that has bothered me is the advance of browser-based computation, e.g. IPython Notebook or using javascript to produce graphics as in Vega, D3.js, etc. While I appreciate the merits of such approaches, I always found it more convenient to stay in a console or in an Emacs buffer to process data. I just find annoying that Julia hasn’t any built-in graphical devices and must rely on Winston or Gadfly, for example. What Julia is missing above all is interactive or dynamic data visualization, which to me appears a key component of exploratory data analysis. So, I’m happy to see that ggvis provides SVG graphics but can be used through RStudio.¹

Finally, what about the new RMarkdown? Well, that’s just really promising and I look forward to using it for my courses.

For interactive visualization, be sure to install ggvis from GitHub instead of CRAN since it will likely crash with Rstudio. ↩︎

aliquote.org

UseR 2014

See Also