user2014

2014-07-04

Here are some notes on user2014 (no, it's not one of the anonymous posters on Stack Exchange!). The GitHub homepage can be found at https://github.com/user2014.

I wish I could attend the conference, but I am a bit short of time at the moment, so I'm just following its progress on Twitter. I have just started reading the material available on the conference site, especially the tutorials that were made available online.

Here is a collection of useful links I found on Twitter:

  • sgrifter: Using random forests in #rstats? @JohnEhrlinger has a nice application of ggplot2 https://t.co/SIoMT3EbHq #useR2014
  • ledell: Presenting my new R package for ensemble learning, subsemble, at #user2014 in 1 hr. Now available on CRAN: http://t.co/GBJMlra67a #rstats
  • TrestleJeff: Couldn't make it to #user2014? Here's a preview of the slides from my rmarkdown talk tomorrow morning. http://t.co/X9xYyWPJZA #rstats
  • revodavid: Slides from my #user2014 talk, “R and reproducibility - a proposal” http://t.co/72bLVsk1uF #rstats
  • MMaechler: http://t.co/VDc1Mszf2b slides of my talk at #useR2014... Yes, I should find a more modern publication venue!
  • _inundata: This is cool → statsTeachR open-access online repo of modular lesson plans for teaching statistics using R http://t.co/3Vl7PVmc85 #useR2014
  • seandavis12: #diigo: useR 2014: Fostering the next generation of Open Science with R http://t.co/oNFf61WhEt
  • ramnath_vaidya: Interactive slides from my Interactive Visualization presentation at #user2014. http://t.co/sLLUmy1he5 #rstats http://t.co/OMS1TFkzAM
  • revodavid: Getting a demo of example apps built in OpenCPU from @JeroenOoms . Try them here: http://t.co/smLnPQ2yA4 #rstats #user2014
  • hspter: RCloud looks awesome! https://t.co/SMqgF8JZsR #user2014
  • DataJunkie: Some great examples of rcharts visualizations: http://t.co/1FfPGtBFu3 #Rstats #useR2014

And here are some additional links by Max Kuhn and Joseph Rickert.

As someone used to say, it is good from time to time to re-educate people who started learning R long ago. Recently, I followed one of the Coursera data science tracks, Getting and Cleaning Data by Jeff Leek, and I learned a lot about data munging, even though I thought I had a fair knowledge of R. Up to now, I've always been a great fan of base R, using few extra packages besides Hmisc, rms, psych, and caret. I like the formula interface, and I always found it very convenient that we could use almost the same formula with aggregate(), {g}lm(), and lattice graphics, while subset() provides me with most of what I need to filter observations and variables in a working data frame. Since Hmisc and rms enhance those functionalities, I was quite happy.
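To illustrate the point about the formula interface, here is a minimal sketch that reuses a single formula with subset(), aggregate(), lm(), and lattice. The built-in mtcars data and the hp < 200 filter are assumptions chosen for illustration; they do not come from the course or the conference material.

```r
# One formula reused across base R and lattice (built-in mtcars data).
library(lattice)  # recommended package, shipped with R

f <- mpg ~ cyl

# subset() filters observations and selects variables in one call.
d <- subset(mtcars, hp < 200, select = c(mpg, cyl, hp))

# Mean mpg for each number of cylinders.
means <- aggregate(f, data = d, FUN = mean)

# Linear regression of mpg on cyl, using the same formula.
fit <- lm(f, data = d)

# Lattice scatterplot; the trellis object is only drawn when printed.
p <- xyplot(f, data = d)

print(means)
print(coef(fit))
```

Calling print(p) in an interactive session would draw the scatterplot.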

It turns out that Hadley Wickham started developing new tools long ago (it all started with ggplot, now ggplot2), and we now have a plethora of new packages that help with tidying and processing data. There's even an online textbook on Advanced R, which I hope will remain in open access. I've never spent too much time with plyr or ggplot2 because at some point I felt that they were too much at odds with base R. I like a programming language to exhibit some kind of internal consistency, and these new packages introduced too much new vocabulary. Well, ok, internal consistency is not the best way to think about R contributed packages (I'm not speaking of core packages), and as discussed by Max Kuhn many packages do not follow R conventions. So what's the point? Either I'm too old to update my knowledge of R, or I'm just reluctant to switch to another paradigm for data science with R. I guess I just have to try the new kids on the block: data.table, dplyr, tidyr, and magrittr,(a) and see how I can integrate them in my own workflow.
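For comparison, here is a rough sketch of the same kind of summary written once with dplyr and the magrittr pipe, and once with base R. It assumes that dplyr is installed and again uses the built-in mtcars data with an arbitrary hp < 200 filter, neither of which appears in the original notes.

```r
# Grouped means: dplyr pipeline versus base R (built-in mtcars data).
library(dplyr)  # also makes the %>% pipe from magrittr available

# "New vocabulary": filter rows, group them, then summarise each group.
by_cyl <- mtcars %>%
  filter(hp < 200) %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg))

# Base R equivalent with subset() and the formula interface.
base_by_cyl <- aggregate(mpg ~ cyl, data = subset(mtcars, hp < 200), FUN = mean)

print(by_cyl)
print(base_by_cyl)
```

Both approaches return one row per value of cyl; the difference lies mostly in how the steps are spelled out.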

Yet another thing that has bothered me is the advance of browser-based computation, e.g. IPython Notebook or using JavaScript to produce graphics as in Vega, D3.js, etc. While I appreciate the merits of such approaches, I have always found it more convenient to stay in a console or in an Emacs buffer to process data. I just find it annoying that Julia doesn't have any built-in graphical device and must rely on Winston or Gadfly, for example. What Julia is missing above all is interactive or dynamic data visualization, which to me appears to be a key component of exploratory data analysis. So, I'm happy to see that ggvis provides SVG graphics yet can still be used through RStudio.(b)

Finally, what about the new RMarkdown? Well, that's just really promising and I look forward to using it for my courses.

Notes

(a) One thing that I always found quite annoying is typing the name of a data frame and seeing all of its data in the R console (often hitting the max.print limit, which actually amounts to 99999 entries!). Of course, head(d) is a better option, but we now have two default print methods that will simply print the first entries after reading a data file, say a CSV file: the one for dplyr::tbl_df and the one for the data.table returned by data.table::fread().
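As a small base R illustration of the printing issue, the sketch below builds a simulated data frame (an assumption made for illustration, not the flights data) and inspects it without dumping every row to the console.

```r
# Inspecting a large data frame without printing all of it (base R only).
set.seed(42)
d <- data.frame(id = seq_len(1e5), x = rnorm(1e5))

# Limit on the number of entries printed; the documented default is 99999.
print(getOption("max.print"))

# First six rows only.
first_rows <- head(d)
print(first_rows)

# Compact overview of column types and dimensions.
str(d)
```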

By the way, data.table::fread() is really fast. E.g., using Hadley Wickham's dplyr tutorial,

> system.time(d <- fread("flights.csv"))
   user  system elapsed 
  0.265   0.013   0.282 
> system.time(d <- tbl_df(read.csv("flights.csv", stringsAsFactors = FALSE)))
   user  system elapsed 
  5.825   0.085   6.031
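For readers without flights.csv at hand, here is a self-contained sketch of the same comparison. It assumes data.table is installed and uses a simulated CSV file written to a temporary location; the file size and column names are arbitrary, and timings will differ from those above and from one machine to another.

```r
# Timing fread() against read.csv() on a simulated CSV file.
library(data.table)

set.seed(1)
n <- 2e5
tmp <- tempfile(fileext = ".csv")
sim <- data.frame(id = seq_len(n),
                  x  = rnorm(n),
                  g  = sample(letters, n, replace = TRUE))
write.csv(sim, tmp, row.names = FALSE)

# Read the same file with both functions and record the timings.
t_fread <- system.time(d1 <- fread(tmp))
t_base  <- system.time(d2 <- read.csv(tmp, stringsAsFactors = FALSE))

print(t_fread)
print(t_base)

# Remove the temporary file.
unlink(tmp)
```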

(b) For interactive visualization, be sure to install ggvis from GitHub instead of CRAN, since the CRAN version will likely crash with RStudio.
