Here are twenty canonical questions that arise when using “learning machines”, according to Malley and coauthors (Malley, JD, Malley, KG, and Pajevic, S, Statistical Learning for Biomedical Data, Cambridge University Press, 2011):
1. Are there any over-arching insights as to why (or when) one learning machine or method might work, or when it might not?
2. How do we conduct robust selection of the most predictive features, and how do we compare different lists of important features?
3. How do we derive a simple, smaller, more interpretable model from a much larger one that fits the data very well? That is, how do we move from an inscrutable but highly predictive learning machine to a small, familiar kind of model, from an efficient black box to a simple open book?
4. Why does using a very large number of features, say 500K gene expression values or a million SNPs, very likely lead to massive overfitting? What exactly is overfitting, and why is it a bad thing? How is it possible to do any data analysis at all when we are given 550,009 features (or 1.2 million features) and only 132 subjects? (This is sketched in R right after the list.)
5. Related to (4), how do we deal with the fact that a very large feature list (500K SNPs) may also nicely predict entirely random (permuted) outcomes of the original data? How can we evaluate any feature selection in this situation?
6. How do we interpret or define interactions in a model? What are the unforeseen or unwanted consequences of introducing interactions?
7. How should we do prediction error analysis, and get means or variances of those error estimates, for any single machine?
8. Related to (7), given that the predictions made by a pair of machines on each subject are correlated, how do we construct error estimate comparisons for the two machines?
9. Since unbalanced groups are routine in biology, for example 25 patients versus 250 controls, how should a learning machine be trained and evaluated on such data? (See the accuracy sketch after the list.)
10. Can data consisting of a circle of red dots (one group) inside an outer circle of green dots (a second group) ever derive from a biologically relevant event? What does this say about simulated data? (Such data are simulated after the list.)
11. How much data do we need in order to state that we have a good model and error estimates? Can classical sample-size calculations help here?
12. It is common that features are quite entangled, having high-order correlation structure among themselves, and dropping features that are correlated with each other can be quite mistaken. Why is that? How can individually very weak predictors still be highly predictive when acting together? (See the XOR sketch after the list.)
13. Given that several models can look very different but lead to nearly identical predictions, does it matter which model is chosen in the end, or is it even necessary to choose any single model?
14. Closely related to (13), distinct learning machines, and statistical models in general, can be combined into a single, larger, and often better model; which combining methods are mathematically known to be better than others?
15. How do we estimate the probability of any single subject being in one group or another, rather than making a pure (0, 1) prediction: “Doctor, you say I will probably survive five years, but do I have a 58% chance of survival or an 85% chance?”
16. What should we do with missing data occurring in 3% of a 500K SNP dataset? In 15.4% of the data? Must we always discard cases having missing bits here and there? Can we fill in the missing bits and still arrive at sound inference about the underlying biology?
17. How can a learning machine be used to detect, or define, biologically or clinically important subsets?
18. Suppose we want to make predictions for continuous outcomes, like temperature. Can learning machines help here, too?
19. How is it that using parts of a good composite clinical score can be more predictive than the score itself?
20. What are the really clever methods that work really well on all data?
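To make questions 4 and 5 concrete, here is a minimal R sketch (the dimensions and seed are arbitrary choices of mine): with far more features than subjects, even a purely random, permuted outcome can be “predicted” almost perfectly in-sample.

```r
## Questions 4-5: with p >> n, even a random outcome looks predictable in-sample.
set.seed(42)
n <- 50                                  # subjects
p <- 5000                                # pure-noise features
x <- matrix(rnorm(n * p), nrow = n)
y <- sample(0:1, n, replace = TRUE)      # random (permuted) outcome

## Univariate screening: hundreds of features pass p < 0.05 by chance alone
pvals <- apply(x, 2, function(f) t.test(f ~ y)$p.value)
sum(pvals < 0.05)                        # around 250 "significant" features

## A logistic model on the 10 best-looking features fits the noise beautifully
best <- order(pvals)[1:10]
fit  <- glm(y ~ ., data = data.frame(y, x[, best]), family = binomial)
mean((predict(fit, type = "response") > 0.5) == y)  # training accuracy near 1
```

Nothing real was learned here: repeating the screening on fresh data would select an entirely different set of features, which is exactly why feature selection has to be validated on held-out or permuted data.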
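Question 9 is just as easy to demonstrate: with 25 patients and 250 controls, raw accuracy rewards a classifier that never detects a single patient.

```r
## Question 9: accuracy is misleading for unbalanced groups.
y    <- factor(rep(c("patient", "control"), times = c(25, 250)))
yhat <- factor(rep("control", length(y)), levels = levels(y))  # always "control"
mean(yhat == y)                       # ~0.91 accuracy, yet zero sensitivity
table(predicted = yhat, true = y)     # all 25 patients are missed
```

Hence the usual remedies: balanced accuracy, sensitivity/specificity, stratified resampling, or down-/up-sampling of the classes.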
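Question 10’s red-inside-green configuration is also trivial to simulate, and whatever its biological plausibility, it is the textbook case where a linear rule fails and a kernel machine succeeds. A sketch, assuming the e1071 package (any kernel SVM implementation would do):

```r
## Question 10: an inner disc of one class inside an outer ring of the other.
library(e1071)
set.seed(7)
n  <- 200
th <- runif(2 * n, 0, 2 * pi)                       # random angles
r  <- c(runif(n, 0, 1), runif(n, 2, 3))             # inner radii, then outer radii
d  <- data.frame(x1 = r * cos(th), x2 = r * sin(th),
                 class = factor(rep(c("red", "green"), each = n)))

fit <- svm(class ~ x1 + x2, data = d, kernel = "radial")  # RBF kernel
mean(predict(fit, d) == d$class)                    # essentially 1
```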
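Finally, question 12 has a classic illustration: an XOR pattern, where each feature alone carries no signal at all, yet the pair determines the class exactly. Screening features one at a time would discard both.

```r
## Question 12: individually worthless features that are jointly perfect (XOR).
set.seed(1)
n  <- 400
x1 <- rbinom(n, 1, 0.5)
x2 <- rbinom(n, 1, 0.5)
y  <- as.integer(xor(x1, x2))        # class fully determined by the pair

cor(x1, y); cor(x2, y)               # both near 0: useless one at a time
table(y, interaction(x1, x2))        # but the pair separates the classes exactly
```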
I must say I found these questions interesting, even though the authors do not fully address all of them in this textbook; in my opinion, they serve mainly as guidelines for further study of ML for biomedical research. Anyway, I think starting an applied ML course with these questions might be very stimulating for people who know nothing about large or imbalanced data sets, overfitting, SVMs, Random Forests, and the like.
I am currently reading Applied Predictive Modeling (hereafter, APM), by Max Kuhn and Kjell Johnson. This is a nice book, which serves as a good companion to their R package caret. I have been using this package for four years now, but I understand its architecture better as I progress through the APM chapters. As in Statistical Learning for Biomedical Data, the authors do not always provide an in-depth treatment of ML algorithms or their statistical properties, but you feel confident that they really know what they are talking about, and the reader is gently shown why some methods are better suited than others, and what the interest and limits of a specific predictive model are for a given research question.
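For readers who have not tried caret, here is a minimal sketch of the workflow that APM builds on (the dataset and tuning grid are arbitrary, illustrative choices of mine): train() wraps model fitting, resampling, and parameter tuning behind a single interface.

```r
## A minimal caret workflow (illustrative choices: iris data, random forest).
library(caret)

ctrl <- trainControl(method = "cv", number = 10)    # 10-fold cross-validation
set.seed(123)
fit <- train(Species ~ ., data = iris,
             method    = "rf",                      # random forest backend
             tuneGrid  = expand.grid(mtry = 1:4),   # candidate values to tune
             trControl = ctrl)

fit                                      # resampled accuracy for each mtry
predict(fit, head(iris), type = "prob")  # class probabilities (cf. question 15)
```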