Christophe Lalanne
October 8th, 2013
The plural of anecdote is (not) data.
http://blog.revolutionanalytics.com/2011/04/the-plural-of-anecdote-is-data-after-all.html
What are data • How do we collect data • How do we store data • How to process data with R
Data come in various forms, and they are gathered from many different sources.
They almost always require preprocessing, cleaning, and recoding—and this often amounts to 60% of the whole statistical project.
In the end, we must have a random sample of statistical units with single or multiple observations on each unit. With this sample, we usually aim at inferring some properties at the level of the population from which this 'representative' (Selz, 2013) sample was drawn.
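The step from sample to population can be illustrated with a minimal simulation. Everything here is made up for the purpose of the sketch: the population, its mean and standard deviation, and the sample size.

```r
set.seed(101)                                    # make the simulation reproducible
population <- rnorm(10000, mean = 100, sd = 15)  # hypothetical population of 10,000 units
smp <- sample(population, size = 50)             # simple random sample, without replacement
mean(smp)                                        # point estimate of the population mean
ci <- t.test(smp)$conf.int                       # 95% confidence interval for that mean
ci
```

The interval quantifies how far the sample mean may plausibly be from the population mean; it is only meaningful if the sample was drawn at random.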
Better to rely on plain text files (space-, tab-, or comma-delimited) or databases (SQL family) whenever possible.
The last two points are part of what is called reproducible research (Peng, 2009).
(.mat, .sav, .dta)
Data can also be served through different files, e.g., dual files like NIfTI .hdr + .img, or PLINK .bed + .bim (which require merging).
Text-based data files are generally easy to process with many text processing programs, e.g., sed, awk, Perl (Chambers, 2008), but we will see that R offers nice facilities for that kind of task too.
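As a rough illustration of those facilities, the sketch below filters and parses a few delimited lines using base R only. The lines are made-up values held in memory rather than read from a file, and the sed/awk equivalents in the comments are approximate.

```r
# A few comma-delimited lines, held in memory instead of a file (made-up values)
lines <- c("id,score", "a01,12", "a02,15", "# comment line", "a03,9")

# Drop comment lines, much like `grep -v '^#'` would
kept <- lines[!grepl("^#", lines)]

# Parse the remaining text as a comma-delimited table
d <- read.csv(text = kept, stringsAsFactors = FALSE)

# Extract the second field of each line, roughly what `awk -F, '{print $2}'` does
second <- sapply(strsplit(kept, ","), `[`, 2)
```

With a real file, `readLines("path/to/file")` would replace the hand-typed `lines` vector.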
Observers rated, on an 11-point scale, the relatedness of pairs of images of children who were either siblings or not.
| | SimRating | sibs | agediff | gendiff | Obs | Image |
|---|---|---|---|---|---|---|
| 1 | 10.00 | 1 | 29.00 | diff | S1 | Im1 |
| 2 | 10.00 | 1 | 37.00 | diff | S1 | Im2 |
| 3 | 6.00 | 1 | 47.00 | diff | S1 | Im3 |
| 4 | 1.00 | 1 | 44.00 | diff | S1 | Im4 |
| 5 | 10.00 | 1 | 25.00 | diff | S1 | Im5 |
| 6 | 9.00 | 1 | 0.00 | same | S1 | Im6 |
Maloney, L. T., and Dal Martello, M. F. (2006). Kin recognition and the perceived facial similarity of children. Journal of Vision, 6(10):4, 1047–1056, http://journalofvision.org/6/10/4/.
Raw data generally require data cleansing.
id | centre | sex | age | session | recover |
---|---|---|---|---|---|
1017 | 1 | 1 | 23 | 8 | 0 |
1023 | 1 | 126 | 6 | 1 | |
2044 | 2 | 1 | 24 | -1 | 1 |
2086 | 2 | 2 | 27 | ? | 0 |
What does recover = 0 or sex = 1 mean? Are negative values allowed for session? Leading zeros in IDs and centre number are lost. We need a data dictionary.
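Once such questions are settled, the cleaning rules can be written down as code. The sketch below uses three of the records shown in the table; the rules themselves ("?" codes a missing value, a negative session count is an entry error) and the factor labels are assumptions that only a data dictionary could confirm.

```r
raw <- read.csv(text = "id,centre,sex,age,session,recover
1017,1,1,23,8,0
2044,2,1,24,-1,1
2086,2,2,27,?,0",
                na.strings = "?")   # assumption: '?' codes a missing value

# Assumption: a negative number of sessions is a data-entry error
raw$session[which(raw$session < 0)] <- NA

# Placeholder labels; the real meaning of the codes must come from a data dictionary
raw$sex <- factor(raw$sex, levels = c(1, 2), labels = c("code 1", "code 2"))
raw
```

Keeping these rules in a script, rather than editing the raw file by hand, leaves a record of every change made to the data.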
$ head -n 3 data/raw.csv
id,centre,sex,age,session,recover
,,,,years,
01017,01,1,23,8,0
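The raw file keeps the leading zeros and carries a second line of units, so it cannot be read with default settings. A possible way to handle it is sketched below, with the three lines shown above typed in as text; reading identifiers as character strings is one choice among several.

```r
txt <- c("id,centre,sex,age,session,recover",
         ",,,,years,",
         "01017,01,1,23,8,0")

# Variable names come from the first line
hdr <- strsplit(txt[1], ",")[[1]]

# Skip the header and units lines, and read identifiers as text
# so that leading zeros are preserved
d <- read.csv(text = txt, skip = 2, header = FALSE, col.names = hdr,
              colClasses = c(id = "character", centre = "character"))
d$id
```

With the actual file, `file = "data/raw.csv"` would replace `text = txt`.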
R is a statistical package for working with data (Ihaka & Gentleman, 1996; Venables & Ripley, 2002). It features most modern statistical tools, along with convenient functions to import, manipulate, and visualize even high-dimensional data sets. There is usually no need to switch to another program over the course of a project.
It is open-source (GPL licence), and it has a very active community of users (r-help, Stack Overflow).
Moreover, some mathematical or statistical packages (e.g., Mathematica, SPSS) feature built-in plugins to interact with R. It can also be used with Python (rpy2).
R is interactive (read-eval-print loop) and can be used as a simple calculator:
r <- 5
2 * pi * r^2
[1] 157.1
Basically, R interprets commands that are sent by the user, and returns an output (which might be nothing). The syntax of the language is close to that of any programming language: we process data associated with variables with the help of dedicated commands (functions).
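A short sketch of these ideas follows: values are stored in variables, then passed to functions, which may be built-in or user-defined. The measurements and the function name `disc_area` are hypothetical.

```r
x <- c(4.2, 5.1, 3.8, 6.0)         # a variable holding four made-up measurements
mean(x)                            # a built-in function applied to that variable

# A user-defined function: area of a disc of radius r
disc_area <- function(r) pi * r^2
a <- disc_area(5)                  # an assignment prints nothing
a                                  # typing the name displays the stored value
```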
Note that many additional packages are available on http://cran.r-project.org, and we will use a few of them.
RStudio offers a clever, non-intrusive interface to R, including specific panels for the R console, history, workspace, plots, help, and package management.
It now features project management, version control, and it has built-in support for automatic reporting through the Markdown language.
Here is a sample session.
First, we load some data into R:
bs <- read.table(file = "./data/brain_size.dat",
                 header = TRUE, na.strings = ".",
                 stringsAsFactors = TRUE)  # needed since R 4.0 to get Gender as a factor
head(bs, n = 2)
Gender FSIQ VIQ PIQ Weight Height MRI_Count
1 Female 133 132 124 118 64.5 816932
2 Male 140 150 124 NA 72.5 1001121
In a spreadsheet program, the same data would appear with variables arranged in columns and each row representing an individual (i.e., a statistical unit).
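The `na.strings = "."` argument in the call above tells R that a lone dot codes a missing value. The sketch below reproduces this on the two records displayed earlier, typed in as text; that the second weight is stored as "." in the raw file is an assumption consistent with the `NA` shown in the output.

```r
bs2 <- read.table(text = "Gender FSIQ VIQ PIQ Weight Height MRI_Count
Female 133 132 124 118 64.5 816932
Male 140 150 124 . 72.5 1001121",
                  header = TRUE, na.strings = ".")

is.na(bs2$Weight)      # flags the record whose weight is missing
colSums(is.na(bs2))    # number of missing values per variable
```

Without `na.strings`, the dot would be read as text and the whole Weight column would no longer be numeric.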
Let's look at some of the properties of the imported data set:
dim(bs)
[1] 40 7
names(bs)[1:4]
[1] "Gender" "FSIQ" "VIQ" "PIQ"
There are 40 individuals and 7 variables. Variables numbered 1 to 4 are: gender, full scale IQ, verbal IQ, and performance IQ.
Now, here is how observed values for each variable are represented internally:
str(bs, vec.len = 1)
'data.frame': 40 obs. of 7 variables:
$ Gender : Factor w/ 2 levels "Female","Male": 1 2 ...
$ FSIQ : int 133 140 ...
$ VIQ : int 132 150 ...
$ PIQ : int 124 124 ...
$ Weight : int 118 NA ...
$ Height : num 64.5 72.5 ...
$ MRI_Count: int 816932 1001121 ...
Variables are either numeric (continuous or discrete) or categorical (with ordered or unordered levels).
Note that there exist other representations for working with dates, censored variables, or strings.
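These representations can be sketched with a few hand-made vectors (all values are invented for illustration). Censored variables are left out here, since they usually rely on an add-on package.

```r
height   <- c(64.5, 72.5, 68.8)                    # numeric, continuous
n_visits <- c(2L, 0L, 5L)                          # numeric, discrete (integer)
gender   <- factor(c("Female", "Male", "Male"))    # categorical, unordered levels
grade    <- factor(c("low", "high", "medium"),
                   levels = c("low", "medium", "high"),
                   ordered = TRUE)                 # categorical, ordered levels
visit    <- as.Date(c("2013-10-08", "2013-10-15", "2013-11-01"))  # dates
city     <- c("Paris", "Lyon", "Lille")            # character strings

grade < "high"    # comparisons are meaningful for ordered factors
diff(visit)       # differences between successive dates, in days
```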
RStudio greatly facilitates the data analysis workflow: rather than typing R commands directly in the console, write them in an R script and run the code from the editor.
Chambers J (2008). Software for Data Analysis. Programming with R. Springer.
Ihaka R and Gentleman R (1996). “R: A language for data analysis and graphics.” Journal of Computational and Graphical Statistics, 5(3), pp. 299-314.
Peng R (2009). “Reproducible Research and Biostatistics.” Biostatistics, 10(3), pp. 405-408.
Selz M (2013). La représentativité en statistiques, volume 8 of Méthodes et Savoirs. Publications de l'INED.
Venables W and Ripley B (2002). Modern Applied Statistics with S, 4th edition. Springer. ISBN 0-387-95457-0.