What’s up on the internet in January?
Learning with Privacy at Scale
This one is taken from the Apple ML Journal. Basically, the article deals with local differential privacy, which refers to the anonymization of personal data directly on the local device rather than on the server (i.e., after the data have been uploaded). I was pleased to learn that everything is done to ensure that this does not impact the device bandwidth (I already knew that we can opt in or not).
Here are some interesting links I kept open in my web browser for a while in December.
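To make the "local" part concrete, here is a minimal sketch of randomized response, the classic textbook mechanism for local differential privacy. This is not the algorithm described in the Apple article; the function names and the `p_truth` parameter are my own assumptions, chosen only to illustrate that the noise is added on the device before anything is sent.

```python
import random


def randomized_response(truth, p_truth=0.75, rng=random):
    """Privatize one boolean on the device (hypothetical helper).

    With probability p_truth the true value is reported; otherwise a
    fair coin flip is reported instead.
    """
    if rng.random() < p_truth:
        return truth
    return rng.random() < 0.5


def estimate_rate(reports, p_truth=0.75):
    """Server-side estimate of the true proportion of True values.

    Since E[observed] = p_truth * rate + (1 - p_truth) * 0.5,
    we solve that equation for rate.
    """
    observed = sum(reports) / len(reports)
    return (observed - (1 - p_truth) * 0.5) / p_truth


if __name__ == "__main__":
    rng = random.Random(42)
    # Simulated population: 30% of users have the sensitive attribute.
    truths = [i < 3000 for i in range(10000)]
    reports = [randomized_response(t, rng=rng) for t in truths]
    print(round(estimate_rate(reports), 3))
```

The server never sees an individual's true answer, yet the aggregate proportion can still be recovered approximately, and the estimate gets better as the number of reports grows.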
How to Write a Git Commit Message
This is a well-known article regarding the annotation of your current work or contribution in a version control system. As I tend to work mostly alone, I don’t have a strong need for Git, except that it helps me keep a trace of my workflow and it is quite useful for archiving different versions or deliverables of my work.
Yet another quick review of a book I purchased recently from O’Reilly: Data Science from Scratch, by Joel Grus. I will take advantage of this post to briefly review other related books.
Data Science from Scratch covers a number of statistical models (chapters 12 to 18), including: k-NN, naive Bayes, generalized linear models, decision trees, neural networks; additional chapters cover unsupervised learning (k-means and hierarchical clustering), natural language processing (text processing/classification and topic modeling), network analysis and recommender systems.
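To give an idea of what "from scratch" means for the first model in that list, here is a minimal k-nearest-neighbours classifier using only the standard library. It is my own sketch, not code taken from the book; the names `euclidean` and `knn_classify` and the toy data are assumptions made for illustration.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two points given as equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_classify(k, labeled_points, new_point):
    """Return the majority label among the k points nearest to new_point.

    labeled_points is a list of (point, label) pairs. If several labels
    are tied, the label of the nearest tied point wins, because Counter
    keeps insertion order for equal counts.
    """
    by_distance = sorted(
        labeled_points, key=lambda pair: euclidean(pair[0], new_point)
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


if __name__ == "__main__":
    # Toy data: two well-separated groups in the plane.
    points = [
        ((0, 0), "a"), ((0, 1), "a"),
        ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b"),
    ]
    print(knn_classify(3, points, (0.2, 0.1)))
    print(knn_classify(3, points, (5.5, 5.5)))
```

There is no training step here: the whole data set is kept and sorted at prediction time, which is fine for a toy example but slow for large data.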
Here is a quick review of the fourth edition of An Introduction to Stata for Health Researchers, by Svend Juul and Morten Frydenberg.
I have several books from Stata Press (and I’d love to get more of them). I have always found them very well written, and they generally offer a good balance between theory and application. There is a companion website (other than material provided by Stata Press) as well. There are also some comments on the exercises for the present book.
I just finished reading the R Graphs Cookbook (2nd ed.), by Jaynal Abedin and Hrishi V. Mittal, published by Packt Publishing.
Not to be confused with the R Graphics Cookbook, or its companion website, Cookbook for R.
This is a basic introductory text on R graphics. Beyond building basic graphics, such as scatterplots, bar charts, or histograms, the authors show how to customise various elements of a statistical graphic using R base graphics (axis limits, axis labels, legend, etc.).
I just finished reading the Bad Data Handbook, edited by Q. Ethan McCallum (O’Reilly, 2013). This is a nice book with interesting chapters on data munging and data verification.
Among the authors, you will find a long list of well-known statisticians and data hackers: Paul Murrell (besides R extensions to the grid package, he authored Introduction to Data Technologies), Richard Cotton (4D Pie Charts), or Philipp K. Janert, who wrote Data Analysis with Open Source Tools (O’Reilly, 2010), another great handbook on applied statistics using GNU software.
Here is a short review of Data Science at the Command Line, by Jeroen Janssens (O’Reilly, 2014).
First of all, I must say that I really enjoyed reading this book, although I did not have much time to spend playing with the various examples. However, when you have a bit of knowledge of the command line, sed/awk, GNU coreutils, and the UNIX philosophy, it’s not hard to envision the power of a command-line approach to data analysis.
Here are some notes on user2014 (no, it’s not one of the anonymous posters on Stack Exchange!). The GitHub homepage can be found at https://github.com/user2014.
I wish I could attend the conference but I am a bit short of time at the moment, so I’m just following its progress on Twitter. I just started reading the material available on the conference site, especially the tutorials that were made available online.
I just finished reading two recent books in the R Series from Chapman & Hall: Reproducible Research with R and RStudio (Christopher Gandrud), and Dynamic Documents with R and knitr (Yihui Xie).
Following my post on a good Workflow for statistical data analysis, I decided to take a look at the state of the art regarding the R statistical software. In fact, I’ve been using Sweave and knitr for a while now, and I tend to use knitr for everything but simple R scripts that can be self-contained.