Working with data

Christophe Lalanne

September 29, 2015

Synopsis

The plural of anecdote is (not) data.
http://blog.revolutionanalytics.com/2011/04/the-plural-of-anecdote-is-data-after-all.html

What are data • How do we collect data • How do we store data • How to process data with R

Data

The world of data

Data come in various forms, and they are gathered from many different sources:

personal daily records
observational studies
controlled experiments
historical archives
national surveys, etc.

They almost always require preprocessing, cleaning, and recoding—and this often amounts to 60% of the whole statistical project.

How to collect data

Simple random sampling: each individual from the population has the same probability of being selected; this is usually done without replacement (e.g., surveys, psychological experiment, clinical trials).
Systematic random sampling, (two-stage) cluster sampling, stratified sampling, etc.

In the end, we must have a random sample of statistical units with single or multiple observations on each unit. With this sample, we usually aim at inferring some properties at the level of the population from which this 'representative' (Selz, 2013) sample was drawn.

How to store data

Better to rely on plain text files (space, tab or comma-delimited) or database (SQL family) whenever possible:

easy access to data (in an OS agnostic way)
availability of reliable query languages
recording of each data step in text files
easy storage and sharing of data

The last two points are part of what is called reproducible research (Peng, 2009).

Data processing

full sample single CSV file (RFC4180)
binary data files (.mat, .sav, .dta)
individual data files, possibly compressed (require globbing)

Data can also be served through different files, e.g. dual files like Nifti .hdr+.img, or PLINK .bed+.bim (require merging).

Text-based data file are generally easy to process with many text processing programs, e.g., sed, awk, Perl (Chambers, 2008), but we will see that R offers nice facilities for that kind of stuff too.

A ready to use data set

Observers rated relatedness of pairs of images of children that were either siblings or not on a 11-point scale.

	SimRating	sibs	agediff	gendiff	Obs	Image
1	10.00	1	29.00	diff	S1	Im1
2	10.00	1	37.00	diff	S1	Im2
3	6.00	1	47.00	diff	S1	Im3
4	1.00	1	44.00	diff	S1	Im4
5	10.00	1	25.00	diff	S1	Im5
6	9.00	1	0.00	same	S1	Im6

Maloney, L. T., and Dal Martello, M. F. (2006). Kin recognition and the perceived facial similarity of children. Journal of Vision, 6(10):4, 1047–1056, http://journalofvision.org/6/10/4/.

Real-life data

Raw data generally require data cleansing.

	id	centre	sex	age	session	recover
2	1017.00	1.00	1.00	23	8	0.00
3	1023.00	1.00		126	6	1.00
4	2044.00	2.00	1.00	24	-1	1.00
5	2086.00	2.00	2.00	27	?	0.00

What does recover = 0 or sex = 1 mean? Are negative values allowed for session? Leading zeros in IDs and centre number are lost. We need a data dictionary.

$ head -n 3 data/raw.csv
id,centre,sex,age,session,recover
,,,,years,
01017,01,1,23,8,0

Data analysis as an iterative process

importing
manipulating
plotting
transforming
aggregating
summarizing
modelling
reporting

Getting started with R

The R software

R is a statistical package for working with data (Ihaka and Gentleman, 1996; Venables and Ripley, 2002). It features most of modern statistical tools, and convenient functions to import, manipulate and visualize even high-dimensional data sets. There is usually no need to switch to another program throughout project management.

It is open-source (GPL licence), and it has a very active community of users (r-help, Stack Overflow).

Moreover, some mathematical or statistical packages (e.g., Mathematica, SPSS) feature built-in plugins to interact with R. It can also be used with Python (rpy2).

R is not a cliquodrome

R is interactive (read-eval-print loop) and can be used as a simple calculator:

r <- 5
2 * pi * r^2

## [1] 157.0796

Basically, R interprets commands that are sent by the user, and returns an output (which might be nothing). The syntax of the language is close to that of any programming language: we process data associated to variables with the help of dedicated commands (functions).

Note that many additional packages are available on http://cran.r-project.org, and we will use few of them.

RStudio

RStudio offers a clever, non-intrusive, interface to R, including specific panels for the R console, history, workspace, plots, help, packages management.

It now features project management, version control, and it has built-in support for automatic reporting through the Markdown language.

John Verzani (2011). Getting started with RStudio. O'Reilly Media.
Mark P.J. van der Loo, Edwin de Jonge (2012). Learning RStudio for R Statistical Computing. Packt Publishing.

A sample session

Interacting with R

Here is a sample session.

First, we load some data into R:

bs <- read.table(file = "../data/brain_size.dat", 
                 header = TRUE, na.strings = ".")
head(bs, n = 2)

##   Gender FSIQ VIQ PIQ Weight Height MRI_Count
## 1 Female  133 132 124    118   64.5    816932
## 2   Male  140 150 124     NA   72.5   1001121

Rectangular data set

Here is how the data would look under a spreedsheat program:

Variables are arranged in columns, while each line represents an individual (i.e., a statistical unit).

Querying data properties with R

Let's look at some of the properties of the imported data set:

dim(bs)

## [1] 40  7

names(bs)[1:4]

## [1] "Gender" "FSIQ"   "VIQ"    "PIQ"

There are 40 individuals and 7 variables. Variables numbered 1 to 4 are: gender, full scale IQ, verbal IQ, and performance IQ.

Querying data properties with R

Now, here is how observed values for each variable are represented internally:

str(bs, vec.len = 1)

## 'data.frame':    40 obs. of  7 variables:
##  $ Gender   : Factor w/ 2 levels "Female","Male": 1 2 ...
##  $ FSIQ     : int  133 140 ...
##  $ VIQ      : int  132 150 ...
##  $ PIQ      : int  124 124 ...
##  $ Weight   : int  118 NA ...
##  $ Height   : num  64.5 72.5 ...
##  $ MRI_Count: int  816932 1001121 ...

Variables are either numeric (continuous or discrete) or categorical (with ordered or unordered levels). Note that there exist other representations for working with dates, censored variables, or strings.

Testing, organizing, versioning

RStudio greatly facilitates the data analysis workflow: Rather than writing directly R commands in the console, write them in an R script and run the code from the editor.

References

[1] J. Chambers. Software for Data Analysis. Programming with R. Springer, 2008.

[2] R. Ihaka and R. Gentleman. "R: A language for data analysis and graphics". In: Journal of Computational and Graphical Statistics 5.3 (1996), pp. 299-314.

[3] R. Peng. "Reproducible Research and Biostatistics". In: Biostatistics 10.3 (2009), pp. 405-408.

[4] M. Selz. La représentativité en statistiques. Vol. 8. Méthodes et Savoirs. Publications de l'INED, 2013.

[5] W. Venables and B. Ripley. Modern Applied Statistics with S. 4th. ISBN 0-387-95457-0. Springer, 2002.