Data Science from Scratch


Yet another quick review of a book I purchased recently on O'Reilly: Data Science from Scratch, by Joel Grus. I will take advantage of this post to briefly review other related books.

Data Science from Scratch covers a number of statistical models (chapters 12 to 18), including k-NN, naive Bayes, generalized linear models, decision trees, and neural networks; additional chapters cover unsupervised learning (k-means and hierarchical clustering), natural language processing (text processing/classification and topic modeling), network analysis, and recommender systems. Two extra chapters cover the basics of SQL and MapReduce. Overall, I find that the emphasis is on applications and Python coding rather than on using Python to solve data science problems. A crash course in Python is offered in Chapter 2, and the author suggests installing Anaconda, as so many do; I am, however, happy with the Python distribution that shipped with my MacBook (that may be just me, but Anaconda creates a mess with various libraries when using Homebrew and R). I don't think this chapter would be enough for a beginner in Python, though. Likewise, the chapters on linear algebra and statistics are rather sparse, even if chapter 7 ('Hypothesis and Inference') discusses key elements of statistical inference. The graphics aren't great because the author uses Matplotlib with all its horrible default settings (fortunately, there are now bokeh and seaborn!).

Overall, this book is for people who know Python and want a quick introduction to data science, without all the details. The author is certainly very proficient in Python, and he surely knows what he is doing when analyzing data, but I was surprised to find so little discussion of statistical theory or of data science as a whole. As it stands, this book looks more like a collection of recipes for data scientists interested in using Python.

Other titles of interest that I should probably describe in more detail in future posts:

  • Data Science for Business. This book covers the core elements of predictive modeling, with an emphasis on "business problems" (decision-making problems that combine data-driven approaches with cost/utility analysis): tree-based predictive modeling, classification and regression models, including non-linear models (SVM, NNs), a complete chapter on overfitting and cross-validation, similarity and distance-based clustering, assessing model performance with graphical displays and numbers, text mining, and some applications of decision-analytic techniques. The authors suggest the following four principles: (1) "Extracting useful knowledge from data to solve business problems can be treated systematically by following a process with reasonably well-defined stages."; (2) "From a large mass of data, information technology can be used to find informative descriptive attributes of entities of interest"; (3) "If you look too hard at a set of data, you will find something—but it might not generalize beyond the data you’re looking at"; and (4) "Formulating data mining solutions and evaluating the results involves thinking carefully about the context in which they will be used". Additional material can be found on GitHub: Practical Data Science Fall 2014 (the 2013 session is dead); see also Josh Attenberg's page on Tools of the Data Scientist.
  • Doing Data Science. It was reviewed by Joseph Rickert on the Revolutions blog, and I find it a fair review. I would add that the applications include social network analysis, fraud detection, recommendation engines, and data engineering (e.g., MapReduce). There's even a chapter dedicated to epidemiological studies, next to the one on controlled experiments (as in, e.g., clinical studies or A/B testing). It comes with an example code repository on GitHub, and a blog that tracked course progress and hosted additional discussion from the live course. The good thing is that it was written by people with a background in statistics, and the authors do not speak of 'data science' as a simple toolbox (data munging, machine learning, data mining, etc.), but as a discipline; this is probably what Joseph Rickert meant when he wrote:

    Every once in a while a single book comes to crystallize a new discipline.

