Data cleaning techniques

Here are some resources to learn data cleaning techniques. While browsing Leanpub yesterday, I came across this little book: Practical Data Cleaning, by Lee Baker. This is the second edition of the book, which was updated in October 2015. The author assumes that data are managed using MS Excel, and emphasizes the importance of quality control and data trails. However, I find this book a little thin, as it merely skims over the main points of data cleaning and, in particular, does not discuss technical issues (except when suggesting Excel functions like TRIM() or SUBSTITUTE()).
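For readers who do not work in Excel, here is a minimal Python sketch of what TRIM() and SUBSTITUTE() do to a text column. The helper names excel_trim and excel_substitute and the sample values are made up for illustration, and the helpers only approximate Excel's behaviour (for instance, the optional fourth argument of SUBSTITUTE() is not handled).

```python
import re


def excel_trim(text: str) -> str:
    """Approximate Excel's TRIM(): drop leading/trailing spaces and
    collapse runs of internal spaces into a single space."""
    return re.sub(r" +", " ", text).strip(" ")


def excel_substitute(text: str, old: str, new: str) -> str:
    """Approximate Excel's SUBSTITUTE(text, old, new): replace every
    case-sensitive occurrence of old with new."""
    if not old:
        # Assumption: an empty search string leaves the text unchanged.
        return text
    return text.replace(old, new)


# Hypothetical messy column, as it might come out of a spreadsheet.
raw = ["  Paris ", "New   York", "N/A", " london"]

# Trim first, then blank out the missing-value placeholder.
cleaned = [excel_substitute(excel_trim(value), "N/A", "") for value in raw]
print(cleaned)
```

Keeping the raw list untouched and writing the result to a new variable is one simple way to preserve the kind of data trail the book insists on.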

eHealth, eTools and the like

Some months ago, I attended a conference that was held at the Cité Universitaire in Paris. It was about e-tools and social networks for epidemiology. The program and slides can be found on the website. The aim of this meeting was to review current efforts and achievements in recruiting and involving as many participants as possible in a given cohort, securing adherence via social networks (Facebook, Twitter, etc.), and collecting data in an undemanding way.

ODBC drivers on Mac OS X

A brief survey of ODBC and database connectivity on Mac OS X, since I wanted to test ODBC drivers with Stata (see How do I set up an ODBC Data Source Name for Stata on Mac or Linux/Unix?). What is available on OS X? Starting with Mac OS X version 10.6 (Snow Leopard), ODBC Administrator is no longer shipped with the operating system and must be downloaded and installed separately: ODBC Administrator Tool for Mac OS X v1.

Visualizing results from SQL queries

Most statistical or dedicated software can process data stored in SQL databases and plot results from specific queries. There are also custom applications that can display query results, like DbVisualizer. I recently installed a full-featured binary package of PostgreSQL, which I find particularly convenient for working with databases on my Mac since it generally comes with an up-to-date Postgres system, as well as PostGIS and PLV8 (which I haven’t explored yet).
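To make the idea concrete, here is a small self-contained sketch that runs an aggregate query and renders the result as a crude text bar chart. It uses Python's built-in sqlite3 module with an in-memory database as a stand-in for PostgreSQL; the visits table, its columns, and the text_bars helper are all invented for the example.

```python
import sqlite3

# In-memory SQLite database standing in for a real PostgreSQL server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (site TEXT, n INTEGER)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?)",
    [("A", 3), ("B", 5), ("A", 2), ("C", 1)],
)

# Aggregate query whose result we want to look at.
rows = conn.execute(
    "SELECT site, SUM(n) AS total FROM visits GROUP BY site ORDER BY site"
).fetchall()
conn.close()


def text_bars(rows):
    """Turn (label, count) rows into one line of '#' marks per row."""
    return [f"{label:<3}{'#' * value} {value}" for label, value in rows]


chart = text_bars(rows)
print("\n".join(chart))
```

With a real plotting library the last step would be replaced by a proper bar plot; the query-then-draw structure stays the same.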

Overview of next gen database

The NoSQL paradigm isn’t a way to work with a database without having to connect to a server. It is merely a term coined to reflect new “non-relational” models for organizing data, typically through a distributed architecture (which is not mandatory, though), and it should not be confounded with the existing software NoSQL. NoSQL has been described as Next Generation Databases mostly addressing some of the points: being non-relational, distributed, open-source and horizontally scalable.

Venn diagrams and SQL joins in R

When browsing Twitter feeds yesterday, I noticed a post by J.D. Long (alias @CMastication) referring to a nice way of illustrating SQL join statements with Venn diagrams by Jeff Atwood. So I wondered how it could be reproduced in R. I initially thought of hacking the venneuler package. However, it turned out that I only needed a few things, so I just wrote a wrapper function that takes care of drawing two circles and shading the appropriate areas.
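The correspondence between joins and Venn regions can also be sketched without any drawing, by treating the join keys of each table as a set. The Python snippet below is an illustration only, not the R wrapper mentioned above; the two toy tables and the join_keys and join helpers are hypothetical, and the analogy holds only when keys are unique within each table.

```python
# Two toy "tables" keyed by a unique id (hypothetical data).
left = {1: "Ann", 2: "Bob", 3: "Cy"}
right = {2: "Paris", 3: "Rome", 4: "Oslo"}


def join_keys(left, right, how):
    """Return the sorted keys kept by each join type, i.e. the shaded
    region of the two-circle Venn diagram."""
    l, r = set(left), set(right)
    regions = {
        "inner": l & r,      # overlap of the two circles
        "left": l,           # whole left circle
        "right": r,          # whole right circle
        "full": l | r,       # both circles
        "left_anti": l - r,  # left circle minus the overlap
    }
    return sorted(regions[how])


def join(left, right, how="inner"):
    """Build joined rows; a missing side is filled with None (SQL NULL)."""
    return [(k, left.get(k), right.get(k)) for k in join_keys(left, right, how)]


for how in ("inner", "left", "full", "left_anti"):
    print(how, join(left, right, how))
```

Once keys repeat, a join multiplies matching rows, which a set picture cannot express, so the Venn view is best read as a mnemonic for which keys survive.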