What’s going on on arXiv these days?
Here is my reading list for the past couple of weeks.
arXiv:1801.00631 Deep Learning: A Critical Appraisal, Gary Marcus
This is a brief review of deep learning in light of the renewed interest in artificial intelligence that has emerged over the last two years. I don't believe it reflects the opinion of all researchers or practitioners, but there are some interesting references in there, and as the author says, it is deliberately oriented toward AI research rather than machine learning or data science.
Yesterday was my last day of academic teaching.
Although I will keep doing in-house and company training (statistical computing only), I'm done with my regular lectures for this year. It has been a long journey since October, as I was in charge of 4 courses (around 90 hours in total), plus extra training for a private company. During the same period, I managed to work on two textbooks (closely related to these courses) while working hard to get things done at my day job.
Here are some resources to learn data cleaning techniques.
While browsing Leanpub yesterday, I came across this little book: Practical Data Cleaning, by Lee Baker. This is the second edition of a book that was updated in October 2015. The author assumes that data are managed using MS Excel, and emphasizes the importance of quality control and data trails. However, I find this book a little thin, as it merely flies over the main points of data cleaning and, in particular, does not discuss technical issues (except when suggesting some Excel functions like TRIM() or SUBSTITUTE()).
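For readers who do not work in Excel, the two functions mentioned above translate directly into one-liners in most languages. A hypothetical sketch in Python (the book itself sticks to Excel):

```python
# Python equivalents of the two Excel functions mentioned above.
raw = "  patient   name  "

# Excel TRIM(): strip leading/trailing spaces and collapse internal runs to one.
trimmed = " ".join(raw.split())

# Excel SUBSTITUTE(): replace every occurrence of a substring.
substituted = trimmed.replace("patient", "subject")

print(trimmed)      # patient name
print(substituted)  # subject name
```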
I spent the last month working hard to finish my book on biomedical statistics using R.
It has been a pleasant experience, since I wrote this book using Emacs, R, and git. In fact, I spent several days and nights in front of a screen looking like the one shown below (this is Homebrew Emacs in fullscreen mode):
Of course there are many things to fix and I’m not totally happy with the book as it stands.
I just finished reading the Bad Data Handbook, edited by Q. Ethan McCallum (O’Reilly, 2013). This is a nice book with interesting chapters on data munging and data verification.
Among the authors, you will find a long list of well-known statisticians and data hackers: Paul Murrell (author of the grid package, he also wrote Introduction to Data Technologies), Richard Cotton (4D Pie Charts), and Philipp K. Janert, who wrote Data Analysis with Open Source Tools (O'Reilly, 2010), another great handbook on applied statistics using GNU software.
Statistical questions in evidence-based medicine was written long ago by Martin Bland and Janet Peacock (Oxford University Press, 2000), but it is still full of good recommendations. It is a series of case-based questions and answers.
It is supposed to serve as a companion to An introduction to Medical Statistics. I personally bought the third edition of Indrayan’s Medical Biostatistics, which I find very good, and I just ordered Medical Statistics in Clinical and Epidemiological Research, by Marit B.
Here are twenty canonical questions when using “learning machines”, according to Malley and co-authors: Malley, JD, Malley, KG, and Pajevic, S, Statistical Learning for Biomedical Data, Cambridge University Press (2011).
Are there any over-arching insights as to why (or when) one learning machine or method might work, or when it might not? How do we conduct robust selection of the most predictive features, and how do we compare different lists of important features?
A nice quote from one of the latest AMSTAT News columns.
A strategy I discourage is “develop theory/model/method, seek application.” Developing theory, a model, or a method suggests you have done some context-free research; already a bad start. The existence of proof (Is there a problem?) hasn’t been given. If you then seek an application, you don’t ask, “What is a reasonable way to answer this question, given this data, in this context?”
In the context of a statistical project, “sanity checking” refers to the verification of raw data: whether they make sense, if there are any coding errors that are apparent from the range of data values, or if some data should be recoded or set as missing (Baum, CF, An introduction to Stata programming, Stata Press, 2009, p. 79).
This should be recorded in an audit trail. For a general overview in the IT domain, see Rajeev Gopalakrishna’s tutorial.
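A minimal sketch of such a sanity check, with Python used for illustration (Baum's examples are in Stata; the field names and the plausible range below are hypothetical): out-of-range values are flagged for the audit trail and recoded as missing.

```python
# Hypothetical raw records with an age field to be sanity-checked.
records = [
    {"id": 1, "age": 34},
    {"id": 2, "age": -1},   # coding error: negative age
    {"id": 3, "age": 203},  # outside any plausible range
]

def sanity_check_age(rows, lo=0, hi=120):
    """Flag out-of-range ages and recode them as missing (None)."""
    flagged = []
    for row in rows:
        if not (lo <= row["age"] <= hi):
            flagged.append(row["id"])  # keep the id for the audit trail
            row["age"] = None          # set as missing
    return flagged

bad_ids = sanity_check_age(records)
print(bad_ids)  # [2, 3]
```

The list of flagged ids is exactly what would go into the audit trail, alongside the rule that triggered the recoding.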
Here are some notes on cross-over trials and within-patient titration in indirect assays.
Most of my literature review started with Senn’s textbook, Statistical Issues in Drug Development (Wiley, 2007, pp. 317-336). In contrast to direct assays, where we know what we want to achieve (say, a given response to treatment) and adjust the dose until this goal is reached, in indirect assays the individual response is studied as a function of the dose, and a ‘useful dose’ is decided upon afterwards.