Working with data

Christophe Lalanne
October 8th, 2013

Synopsis

The plural of anecdote is (not) data.
http://blog.revolutionanalytics.com/2011/04/the-plural-of-anecdote-is-data-after-all.html

What are data • How do we collect data • How do we store data • How to process data with R

The world of data

Data come in various forms, and they are gathered from many different sources:

  • personal daily records
  • observational studies
  • controlled experiments
  • historical archives
  • national surveys, etc.

They almost always require preprocessing, cleaning, and recoding, and this often accounts for about 60% of the work in a statistical project.
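As a rough illustration of what preprocessing, cleaning, and recoding can look like in R, here is a minimal sketch. The data frame `raw`, its column names, and its values are invented for the example; real projects will need their own rules for what counts as a missing or malformed entry.

```r
# Hypothetical raw records: values and column names are invented for illustration.
raw <- data.frame(
  gender = c("F", "M", "f", "M ", "F"),
  weight = c("118", "", "143", "172", "n/a"),
  stringsAsFactors = FALSE
)

# Cleaning: strip stray whitespace and harmonize case.
raw$gender <- toupper(trimws(raw$gender))

# Recoding: turn the one-letter codes into a labelled factor.
raw$gender <- factor(raw$gender, levels = c("F", "M"),
                     labels = c("Female", "Male"))

# Preprocessing: flag empty or "n/a" entries as missing, then convert to numbers.
raw$weight[raw$weight %in% c("", "n/a")] <- NA
raw$weight <- as.numeric(raw$weight)

summary(raw)
```

Keeping these steps in code rather than editing the raw file by hand means the original data stay untouched and the cleaning can be rerun at any time.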

How to collect data

  • Simple random sampling: each individual from the population has the same probability of being selected; this is usually done without replacement (e.g., surveys, psychological experiments, clinical trials).
  • Systematic random sampling, (two-stage) cluster sampling, stratified sampling, etc.
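The first and last of these schemes can be sketched with base R's `sample()`. The sampling frame `population`, its two strata, the sample size of 50, and the 5% sampling fraction are all assumptions made up for the illustration.

```r
set.seed(2013)  # arbitrary seed, only to make the draw reproducible

# Hypothetical sampling frame of 1000 individuals in two strata.
population <- data.frame(
  id = 1:1000,
  stratum = rep(c("urban", "rural"), times = c(700, 300))
)

# Simple random sampling: 50 individuals, each equally likely, no replacement.
srs_rows <- sample(nrow(population), size = 50, replace = FALSE)
srs <- population[srs_rows, ]

# Stratified sampling: draw 5% within each stratum separately.
by_stratum <- split(population, population$stratum)
strat <- do.call(rbind, lapply(by_stratum, function(d) {
  d[sample(nrow(d), size = round(0.05 * nrow(d))), ]
}))

table(srs$stratum)    # stratum counts vary from draw to draw
table(strat$stratum)  # stratum counts are fixed by design
```

With simple random sampling the share of each stratum in the sample is left to chance, whereas the stratified draw fixes it in advance.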

In the end, we must have a random sample of statistical units with single or multiple observations on each unit. With this sample, we usually aim to infer properties of the population from which this 'representative' (Selz, 2013) sample was drawn.

Now, here is how observed values for each variable are represented internally:

str(bs, vec.len = 1)
'data.frame':   40 obs. of  7 variables:
 $ Gender   : Factor w/ 2 levels "Female","Male": 1 2 ...
 $ FSIQ     : int  133 140 ...
 $ VIQ      : int  132 150 ...
 $ PIQ      : int  124 124 ...
 $ Weight   : int  118 NA ...
 $ Height   : num  64.5 72.5 ...
 $ MRI_Count: int  816932 1001121 ...

Variables are either numeric (continuous or discrete) or categorical (with ordered or unordered levels).
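These four kinds of variables map onto R's basic types roughly as follows. This is a small sketch with invented values; the names `n_visits` and `severity` are hypothetical and do not come from the data set shown above.

```r
# Hypothetical values, for illustration only.
height   <- c(64.5, 72.5, 68.8)                  # numeric, continuous (double)
n_visits <- c(2L, 0L, 5L)                        # numeric, discrete (integer)
gender   <- factor(c("Female", "Male", "Female"))  # categorical, unordered levels
severity <- factor(c("low", "high", "medium"),
                   levels = c("low", "medium", "high"),
                   ordered = TRUE)               # categorical, ordered levels

d <- data.frame(height, n_visits, gender, severity)
str(d)

# Ordered factors support comparisons between levels; unordered ones do not.
severity < "high"
```

Declaring the level order explicitly matters: by default R sorts levels alphabetically, which would place "high" before "low".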

Note that there exist other representations for working with dates, censored variables, or strings.
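For instance, dates and strings have their own classes in base R, and right-censored times are commonly handled with `Surv()` from the survival package. The sketch below uses invented values, and the censoring part runs only if that package is installed.

```r
# Dates: stored as class "Date", which supports arithmetic and formatting.
visit <- as.Date(c("2013-10-08", "2013-12-24"))
class(visit)
diff(visit)                 # time difference between the two visits
format(visit, "%d/%m/%Y")   # same dates, day/month/year layout

# Strings: stored as character vectors.
label <- c("subject-01", "subject-02")
class(label)
nchar(label)
toupper(label)

# Censored variables: a follow-up time paired with an event indicator
# (1 = event observed, 0 = censored). Assumes the survival package is available.
if (requireNamespace("survival", quietly = TRUE)) {
  s <- survival::Surv(time = c(5, 12), event = c(1, 0))
  print(s)  # censored times are printed with a trailing "+"
}
```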

Testing, organizing, versioning

RStudio greatly facilitates the data analysis workflow: rather than typing R commands directly in the console, write them in an R script and run the code from the editor.


References

Chambers J (2008). Software for Data Analysis. Programming with R. Springer.

Ihaka R and Gentleman R (1996). “R: A language for data analysis and graphics.” Journal of Computational and Graphical Statistics, 5(3), pp. 299-314.

Peng R (2009). “Reproducible Research and Biostatistics.” Biostatistics, 10(3), pp. 405-408.

Selz M (2013). La représentativité en statistiques, volume 8 of Méthodes et Savoirs. Publications de l'INED.

Venables W and Ripley B (2002). Modern Applied Statistics with S, 4th edition. Springer. ISBN 0-387-95457-0.