Exploratory data mining and data cleaning

I just got my copy of Exploratory Data Mining and Data Cleaning, by Dasu and Johnson (Wiley, 2003). This is quite an old book, but it offers a nice overview of common techniques to gauge and enhance data quality with exploratory data analysis. I learned about DataSphere partition,(1,2) for instance. This book is, however, not about tools to perform data cleaning or data analysis, unlike Janert’s Data Analysis with Open Source Tools (O’Reilly, 2010), which presents gnuplot, Sage, R, and Python and offers a small Appendix on Working with Data.

Dicing With Death

I just finished reading a very nice book by Stephen Senn, Dicing with Death: Chance, Risk and Health (Cambridge University Press, 2003). It is very nicely written; I am no expert in English, but I found that the quality of the writing gives a special flavor to the statistical issues discussed in this book. I already have another book by Stephen Senn: Senn, S.S. (2008). Statistical Issues in Drug Development (2nd ed.

User-friendly statistical packages

Here is a newcomer in the world of “user-friendly” statistical packages: Wizard (Mac only). This software features an all-in-one approach to data analysis and visualization: you start with a spreadsheet of your raw data, and you can ask for a description of it with two clicks (select a ‘summary’ button, and then select a variable). Several graphics are offered, depending on the type of data, but sadly pie charts are the default for categorical data.

Methods for handling treatment switches

In a recent statistical seminar I attended, there was a discussion on statistical strategies to cope with treatment switching and the estimation of survival. The presentation revolved around slides by Ian White,(1) Methods for handling treatment switching: rank-preserving structural nested failure time models, inverse-probability-of-censoring weighting, and marginal structural models, which are taken from the HTMR network workshop on Methods for adjusting for treatment switches in late-stage cancer trials.

Workflow for statistical data analysis

A few days ago, I came across Oliver Kirchkamp’s Workflow of statistical data analysis, which provides essential tips and guidelines for managing not only data but the whole analysis flow (from getting raw data to publishing papers). For R users, this is a very comprehensive textbook, with illustrations in R. I already wrote a few words on this topic two years ago in How to efficiently manage a statistical analysis project. People with little knowledge of R can skip chapter 2 and go directly to chapter 3, which goes to the heart of the problem: How to create and manage a statistical project in R?

Latest reading list on medical statistics

A bunch of papers in early view from Statistics in Medicine suddenly came out in my Google feed reader. Way too many to tweet them all, so here is a brief list of papers I should read during my forthcoming week off. Some papers come from a special issue; others are ordinary research papers. Vexler, A., Tsai, W-M., Malinovsky, Y. Estimation and testing based on data subject to measurement errors: from parametric to non-parametric likelihood methods.

Fitting genetic models for twin studies

In their 2008 paper entitled “Biometrical modeling of twin and family data using standard mixed model software”,(1) Sophia Rabe-Hesketh and colleagues described a mixed-effects model approach to fitting ACE structural equation models to twin data using Stata GLLAMM. This was also discussed in a recent Stata meeting that was held in Germany. Other papers dealing with the mixed-effects approach are listed in the bibliography section (the list is far from exhaustive; I have many more papers on my hard drive).

Kiefer's Introduction to statistical inference

I just received my copy of Introduction to statistical inference, by Jack C. Kiefer (Springer, 1987).(1) After having read the first two chapters I wonder: How come I didn’t start with that book when I was studying elementary statistics? I recently had to give a short series of lectures on statistics (at a very introductory level) where I started with an illustration of a coin tossing experiment (starting at slide #7).

Ensemble Methods in Data Mining

This is a short review of Ensemble Methods in Data Mining, by G. Seni and J. Elder (Morgan & Claypool, 2010). I won’t go over the whole textbook, but rather summarize the introductory chapter, which provides a nice overview of how ensemble methods work and why they are interesting from a predictive perspective. As is well known, there is a large variety of data mining algorithms whose predictive accuracy depends on the problem at hand.

Subgroup Analysis

Here are four papers dealing with the reporting of subgroup analyses and the use of baseline data (and their pitfalls). There might be plenty of other papers on this topic available through Google, but these focus on RCTs and biomedical research. Below is a brief recap of the critical points raised in these papers. A brief overview: Wang et al.(1) provide general guidelines for reporting subgroup analysis. Based on their insert p.