Rstats

Interactive Data Visualization With Cranvas

One of the advantages of R over other popular statistical packages is that it now has “natural” support for interactive and dynamic data visualization. This is, for instance, something that is lacking in the Python ecosystem for scientific computing (Mayavi or Enthought Chaco are just too complex for what I have in mind). Some time ago, I started drafting some tutorials on interactive graphics with R. The idea was merely to give an overview of existing packages for interactive and dynamic plotting, and it was supposed to be a three-part document: the first part presents basic capabilities like rgl, aplpack, and iplot (aka Acinonyx), and actually ended up as a very coarse draft; the second part should present ggobi and its R interface; the third and last part would be about the Qt interface, with qtpaint and cranvas.

Back from the BoRdeaux conference

Here is a quick wrap-up of the BoRdeaux conference. I won’t detail the conference program itself, but just say a few words about the packages that were presented together with their applications (in various fields: epidemiology, social sciences, teaching, high dimensional data, chemometrics). Multivariate data analysis Stéphanie Bougeard talked about two new functions in the ade4 package aiming at the analysis of K+1 tables (several blocks of explanatory variables and a block of response variables).

Easy creation of videos with R

While preparing a talk due in three days or so, I thought it would be good to show a live demonstration of regularization techniques in regression with ggplot2. It sounds like a lot of people start with splines or polynomial regression to demonstrate overfitting. I believe this has something to do with Bishop’s book on Pattern Recognition and Machine Learning, see e.g. Shane Conway’s recap of Stanford ML 5.2: Regularization.

Easier literate programming with R

I have been using Sweave over the past 5 or 6 years for processing my R documents, and I have been quite happy with this program. However, with the recent release of knitr (already adopted on UCLA Stat Computing and on Vanderbilt Biostatistics Wiki) and all of its nice enhancements, I really need to get more familiar with it. In fact, there are a lot of goodies in Yihui Xie’s knitr, including the automatic processing of graphics (no need to call print() to display a lattice object), local or global control of height/width for any figure, removal of R’s prompt (R’s output being nicely prefixed with comments), tidying and highlighting facilities, image cropping, and the use of framed or listings for embedding code chunks.

Playing with TwitteR

Some months ago, I played with Unix command-line tools to parse my tweets fetched from BackupMyTweets. Here is a more elegant way to do so with R. Well, the code is rather simple and most of what we need is already available through the twitteR package. library(twitteR) library(stringr) my.tweets <- userTimeline("chlalanne", n=1000) Suppose I want to display the frequency of tags I use in my messages: find.tag <- function(x) unlist(str_extract_all(x$getText(), "#[A-Za-z0-9]")) # a little test to see whether it works or not # for (i in 1:20) cat(i, ":", find.
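For readers without Twitter access, here is a minimal base-R sketch of the same idea, counting hashtags in a character vector. The `tweets` vector and the `find_tags` helper are made up for illustration and are not taken from the original post. The pattern also uses a `+` quantifier so that a whole tag is captured rather than the `#` and a single character, which is my assumption about what is wanted.

```r
# Hypothetical messages standing in for the text of fetched tweets
tweets <- c("Reading about #rstats and #ggplot2",
            "More #rstats fun",
            "No tag here",
            "#knitr and #rstats again")

# Extract every hashtag (a '#' followed by one or more alphanumerics)
find_tags <- function(txt) {
  unlist(regmatches(txt, gregexpr("#[A-Za-z0-9]+", txt)))
}

# Frequency of each tag, most frequent first
tag_freq <- sort(table(tolower(find_tags(tweets))), decreasing = TRUE)
print(tag_freq)
```

With status objects returned by twitteR, the same helper could presumably be applied to the text extracted from each status.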

Design of experiment in R

When I started writing my companion textbook for Montgomery’s Design and Analysis of Experiments, there were not many dedicated packages available on CRAN. Now, I realize that there are a lot of very handy packages on CRAN. Most of them were released in 2010 and are listed in the corresponding Task View, ExperimentalDesign. In the White Book (Statistical Models in S, Chambers & Hastie, 1992), section 5.2.3 pp. 169-175 is dedicated to full- and fractional factorial designs, with fac.

Using bootstrap in cluster analysis

Recently, I again came across a nice poster from Tom Nichols that was presented at the OHBM2010 conference. This was about Finding Distinct Genetic Factors that Influence Cortical Thickness, and the clustering method that was used was interesting. Apart from choosing the right metric, parametric bootstrap was used to assess the stability compactness and separation of clusters as measured with the silhouette width. A validation procedure is needed to address the following two issues in cluster analysis: (1) bias toward particular cluster properties, and (2) non-significance of results when there are no true clusters.
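To make issue (2) concrete, here is a rough sketch of a parametric bootstrap of the average silhouette width under a "no cluster" null model. Everything in it is an assumption of mine and not the procedure used in the poster: the toy data, the use of k-means, the single-Gaussian null, and the hand-rolled `avg_silhouette` helper.

```r
set.seed(101)

# Average silhouette width for data matrix x and cluster labels cl
avg_silhouette <- function(x, cl) {
  d <- as.matrix(dist(x))
  groups <- sort(unique(cl))
  s <- sapply(seq_len(nrow(x)), function(i) {
    own <- cl == cl[i]
    if (sum(own) == 1) return(0)              # singleton cluster: width set to 0
    a <- sum(d[i, own]) / (sum(own) - 1)      # mean distance within own cluster
    b <- min(sapply(groups[groups != cl[i]],  # mean distance to nearest other cluster
                    function(g) mean(d[i, cl == g])))
    (b - a) / max(a, b)
  })
  mean(s)
}

# Toy data: two well-separated Gaussian groups (illustration only)
x <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
           matrix(rnorm(60, mean = 4), ncol = 2))
k <- 2
obs <- avg_silhouette(x, kmeans(x, centers = k, nstart = 10)$cluster)

# Null model: a single Gaussian with the sample mean and covariance
B <- 200
mu <- colMeans(x)
R <- chol(cov(x))
null_sil <- replicate(B, {
  z <- matrix(rnorm(length(x)), ncol = ncol(x)) %*% R
  z <- sweep(z, 2, mu, "+")
  avg_silhouette(z, kmeans(z, centers = k, nstart = 10)$cluster)
})

# Share of null replicates with a silhouette width at least as large as observed
p_value <- (1 + sum(null_sil >= obs)) / (B + 1)
c(observed = obs, p = p_value)
```

A small p-value here only says that the observed partition is tighter than what k-means typically finds in unstructured Gaussian data with the same mean and covariance; other null models would be just as defensible.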

Recursive feature elimination coupled to SVM in R

I recently wondered whether it is possible to apply Recursive Feature Elimination with SVM in R. RFE+SVM is well described in the literature. It was originally proposed by Guyon and coworkers for gene selection, in Gene Selection for Cancer Classification using Support Vector Machines (Machine Learning (2002) 46(1-3), 389-422). Here are two more recent papers that show applications of RFE+SVM in a similar context and discuss its performance as a wrapper method for feature selection:

Pretty printing statistical distribution tables

Before using statistical software, we were taught to use standard tables for finding the quantile of a standard normal or a Student’s t distribution, given a type I risk. Here is a way to quickly print them on a PDF file using R and $\LaTeX$. The R part I’m considering this Table as an example of what I want to achieve. pp <- c(.10, .05, .025, .01, .005, .
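As a self-contained illustration of the general idea, the sketch below builds a small table of upper-tail Student t quantiles (with the standard normal as the limiting row) and writes it out as a LaTeX tabular. The risk levels in `alpha`, the degrees of freedom in `df`, and the layout are my own illustrative choices and not the table from the post.

```r
alpha <- c(0.10, 0.05, 0.025, 0.01)   # one-sided type I risks (illustrative)
df <- c(1:5, 10, 30)                  # degrees of freedom (illustrative)

# Upper-tail t quantiles, one row per df, one column per risk level
tab <- outer(df, alpha, function(d, a) qt(1 - a, d))
# Last row: standard normal quantiles, the limit as df grows
tab <- rbind(tab, qnorm(1 - alpha))
rownames(tab) <- c(df, "Inf")
colnames(tab) <- alpha

# Format the cells and assemble a LaTeX tabular, line by line
cells <- formatC(tab, digits = 3, format = "f")
rows <- apply(cells, 1, paste, collapse = " & ")
latex <- c(sprintf("\\begin{tabular}{r%s}", strrep("r", length(alpha))),
           paste("df &", paste(alpha, collapse = " & "), "\\\\ \\hline"),
           paste(rownames(tab), "&", rows, "\\\\"),
           "\\end{tabular}")
writeLines(latex)
```

The output can be pasted into a LaTeX document, or written to a file by passing a file name as the second argument of `writeLines()`.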

Computing intraclass correlation with R

I always found Dave Garson’s tutorial on Reliability Analysis very interesting. However, all illustrations are with SPSS. Here is a friendly R version of some of these notes, especially for computing intraclass correlation. Background There are different versions of the intraclass correlation coefficient (ICC), that reflect distinct ways of accounting for rater or item variance in overall variance, following Shrout and Fleiss (1979) (cases 1 to 3 in Table 1):
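As a rough illustration of how the three single-rater cases differ, the sketch below computes them from ANOVA mean squares in base R. The data are the small 6 targets by 4 judges table commonly reproduced from Shrout and Fleiss (1979); the variable names are mine, and the code is a hand computation for illustration, not the code from the post.

```r
# 6 targets (rows) rated by 4 judges (columns)
ratings <- matrix(c( 9, 2, 5, 8,
                     6, 1, 3, 2,
                     8, 4, 6, 8,
                     7, 1, 2, 6,
                    10, 5, 6, 9,
                     6, 2, 4, 7), ncol = 4, byrow = TRUE)
n <- nrow(ratings)   # number of targets
k <- ncol(ratings)   # number of judges

# Sums of squares for the two-way layout without replication
grand <- mean(ratings)
ss_total   <- sum((ratings - grand)^2)
ss_targets <- k * sum((rowMeans(ratings) - grand)^2)
ss_judges  <- n * sum((colMeans(ratings) - grand)^2)

bms <- ss_targets / (n - 1)                                        # between targets
jms <- ss_judges / (k - 1)                                         # between judges
ems <- (ss_total - ss_targets - ss_judges) / ((n - 1) * (k - 1))   # residual
wms <- (ss_total - ss_targets) / (n * (k - 1))                     # within targets

icc <- c(
  ICC1 = (bms - wms) / (bms + (k - 1) * wms),                        # case 1: one-way random
  ICC2 = (bms - ems) / (bms + (k - 1) * ems + k * (jms - ems) / n),  # case 2: two-way random
  ICC3 = (bms - ems) / (bms + (k - 1) * ems)                         # case 3: two-way mixed
)
round(icc, 2)
```

The gap between the first and third coefficients comes from the judge effect: once systematic differences between judges are removed from the error term, agreement on the ordering of targets looks much stronger.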