Here are some notes on user2014 (no it’s not one of the anonymous poster on Stack Exchange!). The GitHub homepage can be found at https://github.com/user2014.
I wish I could attend the conference but I am a bit short of time at the moment, so I’m just following its progress on Twitter. I just started reading material available on the conference site, especially the tutorials that were made available on line.
Here is a collection of useful links I found on Twitter:
- sgrifter: Using random forests in #rstats? @JohnEhrlinger has a nice application of ggplot2 https://t.co/SIoMT3EbHq #useR2014
- ledell: Presenting my new R package for ensemble learning, subsemble, at #user2014 in 1 hr. Now available on CRAN: http://t.co/GBJMlra67a #rstats
- TrestleJeff: Couldn’t make it to #user2014? Here’s a preview of the slides from my rmarkdown talk tomorrow morning. http://t.co/X9xYyWPJZA #rstats
- revodavid: Slides from my #user2014 talk, “R and reproducibility - a proposal” http://t.co/72bLVsk1uF #rstats
- MMaechler: http://t.co/VDc1Mszf2b slides of my talk at #useR2014… Yes, I should find a more modern publication venue!
- _inundata: This is cool → statsTeachR open-access online repo of modular lesson plans for teaching statistics using R http://t.co/3Vl7PVmc85 #useR2014
- seandavis12: #diigo: useR 2014: Fostering the next generation of Open Science with R http://t.co/oNFf61WhEt
- ramnath_vaidya: Interactive slides from my Interactive Visualization presentation at #user2014. http://t.co/sLLUmy1he5 #rstats http://t.co/OMS1TFkzAM
- revodavid: Getting a demo of example apps built in OpenCPU from @JeroenOoms . Try them here: http://t.co/smLnPQ2yA4 #rstats #user2014
- hspter: RCloud looks awesome! https://t.co/SMqgF8JZsR #user2014
- DataJunkie: Some great examples of rcharts visualizations: http://t.co/1FfPGtBFu3 #Rstats #useR2014
As someone used to say, it is good from time to time to re-educate people who started learning R long ago. Recently, I followed one of Coursera data science tracks, Getting and Cleaning Data by Jeff Leek, and I learned a lot about data munging, although I think I have a fair knowledge of R. Up to now, I’ve always been a great fan of base R, using few extra packages beside Hmisc, rms and psych, and caret. I like the formula interface, and I always found it very convenient that we could use almost the same formula with
lattice graphics, while
subset() provides me with most of what I need to filter observations and variables in a working data frame. Since
rms enhance those functionalities, I was quite happy.
It turns out that Hadley Wickham started developing new tools long ago (it all started with ggplot, now ggplot2), and we now have a plethora of new packages that help tidying and processing data. There’s even an on-line textbook on Advanced R, which I hope will remain in open access. I’ve never spent too much time in
ggplot2 because at some point I felt that it was too much at an angle with base R. I like that a programming language exhibits some kind of internal consistency, and these new packages introduced too much ‘new vocabularies’. Well, ok, internal consistency is not the best way to think about R contributed packages (I’m not speaking of core packages), and as discussed by Max Kuhn many packages do not follow R conventions. So what’s the point? Either I’m too old to update my knowledge of R, or I’m just reluctant to switch to another paradigm for data science with R. I guess I just have to try the new kids on the blocks: data.table, dplyr, tidyr, and magrittr, and see how I can integrate them in my own workflow.
One thing that I always found quite annoying is typing the name of a data frame and seeing all of its data in the R console (often reaching
max.print values, which amounts to 99999 actually!). Of course,
head(d) is a better option, but we now have two defaults S4
By the way,
data.table::fread() is really fast. For instance, using Hadley Wickham’s dplyr tutorial, I got the following timings:
> system.time(d <- fread("flights.csv")) utilisateur système écoulé 0.265 0.013 0.282 > system.time(d <- tbl_df(read.csv("flights.csv", stringsAsFactors = FALSE))) utilisateur système écoulé 5.825 0.085 6.031
Finally, what about the new RMarkdown? Well, that’s just really promising and I look forward to using it for my courses.