Working with data

Christophe Lalanne

September 29, 2015

Synopsis

Data

The world of data

Data come in various forms, and they are gathered from many different sources:

  • personal daily records
  • observational studies
  • controlled experiments
  • historical archives
  • national surveys, etc.

They almost always require preprocessing, cleaning, and recoding, and this often accounts for as much as 60% of the work in a statistical project.

How to collect data

  • Simple random sampling: each individual from the population has the same probability of being selected; this is usually done without replacement (e.g., surveys, psychological experiments, clinical trials).
  • Systematic random sampling, (two-stage) cluster sampling, stratified sampling, etc.

In the end, we must have a random sample of statistical units with single or multiple observations on each unit. With this sample, we usually aim at inferring some properties at the level of the population from which this 'representative' (Selz, 2013) sample was drawn.
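
As a rough illustration of the first and last sampling schemes listed above, here is a minimal sketch in base R. The population, its size, the stratum labels, and the 5% sampling fraction are all made up for the example; only sample() and split() are standard R functions.

```r
set.seed(101)  # arbitrary seed, only to make the draws repeatable

# Hypothetical population of 1000 units split into two strata
population <- data.frame(id = 1:1000,
                         stratum = rep(c("A", "B"), times = c(700, 300)))

# Simple random sampling without replacement: every unit has the
# same probability of being selected
srs <- population[sample(nrow(population), size = 50), ]

# Stratified sampling: draw 5% of the units within each stratum
strat <- do.call(rbind, lapply(split(population, population$stratum),
                               function(d) {
                                 n <- round(0.05 * nrow(d))
                                 d[sample(nrow(d), size = n), ]
                               }))
table(strat$stratum)
```

By default sample() draws without replacement, which matches the usual practice mentioned above; passing replace = TRUE would switch to sampling with replacement.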

How to store data

It is better to rely on plain-text files (space-, tab- or comma-delimited) or on databases (SQL family) whenever possible:

  • easy access to data (in an OS agnostic way)
  • availability of reliable query languages
  • recording of each data step in text files
  • easy storage and sharing of data

The last two points are part of what is called reproducible research (Peng, 2009).
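
To make the plain-text option concrete, the following sketch writes a small data frame to comma- and tab-delimited files and reads it back with base R. The data frame and its values are invented for the example, and temporary files stand in for real project files.

```r
# Toy data frame; the values are made up for illustration
d <- data.frame(subject = c("s01", "s02", "s03"),
                score   = c(12.5, NA, 9.0),
                stringsAsFactors = FALSE)

csv_file <- tempfile(fileext = ".csv")
tab_file <- tempfile(fileext = ".txt")

# Write the same data as comma-delimited and tab-delimited text
write.csv(d, file = csv_file, row.names = FALSE)
write.table(d, file = tab_file, sep = "\t", row.names = FALSE, quote = FALSE)

# Read both files back and check that nothing was lost on the way
d_csv <- read.csv(csv_file, stringsAsFactors = FALSE)
d_tab <- read.delim(tab_file, stringsAsFactors = FALSE)
all.equal(d, d_csv)
all.equal(d, d_tab)
```

Because both files are plain text, they can be inspected with any editor and opened on any operating system.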

Data processing

  • a single CSV file holding the full sample (RFC 4180)
  • binary data files (.mat, .sav, .dta)
  • individual data files, possibly compressed (these require globbing)

Data can also be split across several files, e.g. dual files like NIfTI .hdr+.img, or PLINK .bed+.bim (these require merging).

Text-based data files are generally easy to process with many text processing programs, e.g., sed, awk, Perl (Chambers, 2008), but we will see that R offers nice facilities for this kind of task too.
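
For the case of individual data files, globbing can be done from within R. The sketch below is only an illustration: the folder, the file names (subject01.csv, ...) and their contents are hypothetical and are created on the fly so that the example is self-contained.

```r
# Create three small per-subject files in a temporary folder
data_dir <- file.path(tempdir(), "subjects")  # hypothetical location
dir.create(data_dir, showWarnings = FALSE)
for (i in 1:3) {
  d <- data.frame(subject = i, trial = 1:2)
  write.csv(d, file.path(data_dir, sprintf("subject%02d.csv", i)),
            row.names = FALSE)
}

# Globbing: list every file matching the wildcard pattern
files <- list.files(data_dir, pattern = glob2rx("subject*.csv"),
                    full.names = TRUE)

# Read each file and stack the results into a single data frame
all_data <- do.call(rbind, lapply(files, read.csv))
dim(all_data)
```

Sys.glob(file.path(data_dir, "subject*.csv")) would be an alternative way to obtain the same list of files.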

A ready to use data set

Observers rated relatedness of pairs of images of children that were either siblings or not on an 11-point scale.

  SimRating sibs agediff gendiff Obs Image
1     10.00    1   29.00    diff  S1   Im1
2     10.00    1   37.00    diff  S1   Im2
3      6.00    1   47.00    diff  S1   Im3
4      1.00    1   44.00    diff  S1   Im4
5     10.00    1   25.00    diff  S1   Im5
6      9.00    1    0.00    same  S1   Im6


Maloney, L. T., and Dal Martello, M. F. (2006). Kin recognition and the perceived facial similarity of children. Journal of Vision, 6(10):4, 1047–1056, http://journalofvision.org/6/10/4/.

Real-life data

Raw data generally require data cleansing.

       id centre  sex age session recover
2 1017.00   1.00 1.00  23       8    0.00
3 1023.00   1.00      126       6    1.00
4 2044.00   2.00 1.00  24      -1    1.00
5 2086.00   2.00 2.00  27       ?    0.00


What does recover = 0 or sex = 1 mean? Are negative values allowed for session? Leading zeros in IDs and centre numbers have been lost. We need a data dictionary.

$ head -n 3 data/raw.csv
id,centre,sex,age,session,recover
,,,,years,
01017,01,1,23,8,0
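
A file laid out like this one can still be imported cleanly if the reader is told about the extra units line and about the columns that must stay as text. The sketch below is one possible approach, not necessarily the one used for this course: it recreates the three lines shown above in a temporary file, plus one hypothetical record (its format is assumed from the table above), and treats "?" as a missing-value code.

```r
raw <- c("id,centre,sex,age,session,recover",
         ",,,,years,",
         "01017,01,1,23,8,0",
         "02086,02,2,27,?,0")  # hypothetical line, format assumed
f <- tempfile(fileext = ".csv")
writeLines(raw, f)

# Variable names come from the first line; the second line (units) is skipped
vars <- strsplit(readLines(f, n = 1), ",")[[1]]

d <- read.csv(f, skip = 2, header = FALSE, col.names = vars,
              colClasses = c(id = "character", centre = "character"),
              na.strings = c("", "?"))
str(d)
```

Reading id and centre as character strings keeps their leading zeros, and declaring "?" in na.strings lets session be read as a number with missing values.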

Data analysis as an iterative process

  • importing
  • manipulating
  • plotting
  • transforming
  • aggregating
  • summarizing
  • modelling
  • reporting

Getting started with R

The R software

R is a statistical package for working with data (Ihaka and Gentleman, 1996; Venables and Ripley, 2002). It features most modern statistical tools, and convenient functions to import, manipulate and visualize even high-dimensional data sets. There is usually no need to switch to another program over the course of a project.

It is open-source (GPL licence), and it has a very active community of users (r-help, Stack Overflow).

Moreover, some mathematical or statistical packages (e.g., Mathematica, SPSS) feature built-in plugins to interact with R. It can also be used with Python (rpy2).

R is not a point-and-click program

R is interactive (read-eval-print loop) and can be used as a simple calculator:

r <- 5
2 * pi * r^2
## [1] 157.0796

Basically, R interprets commands that are sent by the user, and returns an output (which might be nothing). The syntax of the language is close to that of most programming languages: we process data associated with variables with the help of dedicated commands (functions).
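
As a small illustration of data bound to variables and processed by functions, the sketch below defines a helper and applies it to a single value and to a vector. The name circle_area is hypothetical and introduced only for this example.

```r
# A user-defined function: takes a radius and returns the area of the disc
circle_area <- function(radius) pi * radius^2

r <- 5
circle_area(r)
## [1] 78.53982

# Functions are vectorized: they also work on several values at once
radii <- c(1, 2, 5)
areas <- circle_area(radii)
round(areas, 2)
```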

Note that many additional packages are available on http://cran.r-project.org, and we will use a few of them.

RStudio

RStudio offers a clever, non-intrusive interface to R, including specific panels for the R console, history, workspace, plots, help, and package management.

It now features project management and version control, and it has built-in support for automatic reporting through the Markdown language.

A sample session

Interacting with R

Here is a sample session.

First, we load some data into R:

bs <- read.table(file = "../data/brain_size.dat",
                 header = TRUE, na.strings = ".",
                 stringsAsFactors = TRUE)  # needed since R 4.0 to read Gender as a factor
head(bs, n = 2)
##   Gender FSIQ VIQ PIQ Weight Height MRI_Count
## 1 Female  133 132 124    118   64.5    816932
## 2   Male  140 150 124     NA   72.5   1001121

Rectangular data set

Here is how the data would look in a spreadsheet program:

Variables are arranged in columns, while each line represents an individual (i.e., a statistical unit).

Querying data properties with R

Let's look at some of the properties of the imported data set:

dim(bs)
## [1] 40  7
names(bs)[1:4]
## [1] "Gender" "FSIQ"   "VIQ"    "PIQ"

There are 40 individuals and 7 variables. Variables numbered 1 to 4 are: gender, full scale IQ, verbal IQ, and performance IQ.

Querying data properties with R

Now, here is how observed values for each variable are represented internally:

str(bs, vec.len = 1)
## 'data.frame':    40 obs. of  7 variables:
##  $ Gender   : Factor w/ 2 levels "Female","Male": 1 2 ...
##  $ FSIQ     : int  133 140 ...
##  $ VIQ      : int  132 150 ...
##  $ PIQ      : int  124 124 ...
##  $ Weight   : int  118 NA ...
##  $ Height   : num  64.5 72.5 ...
##  $ MRI_Count: int  816932 1001121 ...

Variables are either numeric (continuous or discrete) or categorical (with ordered or unordered levels). Note that there exist other representations for working with dates, censored variables, or strings.
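
The same information can be queried variable by variable. The sketch below does not read the original file; it rebuilds only the two rows displayed earlier into a small data frame (named bs_head here, a hypothetical name) so that it can be run on its own.

```r
# The first two records of the brain size data, typed in by hand
bs_head <- data.frame(
  Gender    = factor(c("Female", "Male")),
  FSIQ      = c(133L, 140L),
  VIQ       = c(132L, 150L),
  PIQ       = c(124L, 124L),
  Weight    = c(118L, NA),
  Height    = c(64.5, 72.5),
  MRI_Count = c(816932L, 1001121L)
)

sapply(bs_head, class)    # storage class of each variable
levels(bs_head$Gender)    # categories of a factor
colSums(is.na(bs_head))   # number of missing values per variable
```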

Testing, organizing, versioning

RStudio greatly facilitates the data analysis workflow: rather than typing R commands directly in the console, write them in an R script and run the code from the editor.
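
Outside of the RStudio editor, a script can also be run with source(). The following sketch writes a two-line script to a temporary file (the file and its content are made up for the example) and evaluates it in a separate environment, so that the script's variables do not clutter the workspace.

```r
# A tiny script saved to disk; in practice this would be a file
# written in the editor and kept under version control
script <- tempfile(fileext = ".R")
writeLines(c("r <- 5",
             "area <- pi * r^2"), script)

# Run the whole script and collect its variables in their own environment
env <- new.env()
source(script, local = env)
ls(env)
env$area
```

The same file could be run non-interactively from a terminal with the Rscript command.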

References

[1] J. Chambers. Software for Data Analysis. Programming with R. Springer, 2008.

[2] R. Ihaka and R. Gentleman. "R: A language for data analysis and graphics". In: Journal of Computational and Graphical Statistics 5.3 (1996), pp. 299-314.

[3] R. Peng. "Reproducible Research and Biostatistics". In: Biostatistics 10.3 (2009), pp. 405-408.

[4] M. Selz. La représentativité en statistiques. Vol. 8. Méthodes et Savoirs. Publications de l'INED, 2013.

[5] W. Venables and B. Ripley. Modern Applied Statistics with S. 4th ed. ISBN 0-387-95457-0. Springer, 2002.