Web review #1


Here are some interesting links I kept open in my web browser for a while in December.

  • How to Write a Git Commit Message This is a well-known article on how to annotate your current work or contribution in a version control system. As I tend to work mostly alone, I don't have a strong need for Git, except that it helps me keep a trace of my workflow and it is quite useful for archiving different versions or deliverables of my work. I guess I could simply rely on a separate Changelog file, which is so easy to manage under Emacs (C-x 4 a). So, how to write a great commit message? Here are the seven rules that are discussed at length in this article:

    1. Separate subject from body with a blank line
    2. Limit the subject line to 50 characters
    3. Capitalize the subject line
    4. Do not end the subject line with a period
    5. Use the imperative mood in the subject line
    6. Wrap the body at 72 characters
    7. Use the body to explain what and why vs. how

    I believe the most important points are 1, 5 and 7. The remaining points are really formatting issues and probably relevant only for command-line users.
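
    The formatting rules (1, 2, 3, 4 and 6) are mechanical enough to be checked by a script. Here is a minimal sketch in Python; the function name check_commit_message is hypothetical and not taken from the article. Rules 5 and 7 (imperative mood, explaining what and why) need a human reader and are deliberately left out.

```python
def check_commit_message(message):
    """Return a list of rule violations (empty if the message passes)."""
    lines = message.splitlines()
    if not lines or not lines[0].strip():
        return ["subject line is missing"]

    subject = lines[0]
    problems = []

    # Rule 1: separate subject from body with a blank line.
    if len(lines) > 1 and lines[1].strip():
        problems.append("rule 1: no blank line between subject and body")
    # Rule 2: limit the subject line to 50 characters.
    if len(subject) > 50:
        problems.append("rule 2: subject is longer than 50 characters")
    # Rule 3: capitalize the subject line.
    if not subject[0].isupper():
        problems.append("rule 3: subject is not capitalized")
    # Rule 4: do not end the subject line with a period.
    if subject.endswith("."):
        problems.append("rule 4: subject ends with a period")
    # Rule 6: wrap the body at 72 characters.
    if any(len(line) > 72 for line in lines[1:]):
        problems.append("rule 6: body line is longer than 72 characters")

    return problems


good = (
    "Fix off-by-one error in pagination\n"
    "\n"
    "The last page was dropped because the page count was rounded down.\n"
)
bad = "fixed the bug.\nno blank line here"

print(check_commit_message(good))
print(check_commit_message(bad))
```

    Something like this could be called from a Git commit-msg hook, but that wiring is left out here.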

  • Exploring a Data Set in SQL I found this article while looking for a database to illustrate R's capabilities when working with remote or local database systems. Incidentally, I read one of the author's blog posts a few days ago, when he released his book on PostgreSQL. I always felt that I should spend more time investigating PostgreSQL and try to carry out basic statistical data munging using SQL from psql directly. The dplyr package is interesting because it lets the SQL system manage the operations whenever possible, and we just get back or "collect" the results as a tibble in R. However, I guess a more universal approach (i.e., independent of the statistical package) would be to process data and output aggregated data using SQL only.
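
    To illustrate the "aggregate in SQL, collect the result" idea, here is a small sketch that uses SQLite through Python's standard sqlite3 module as a stand-in for PostgreSQL. The table and column names (measurements, grp, value) are made up for the example and do not come from the article.

```python
import sqlite3

# In-memory database standing in for a real PostgreSQL server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (grp TEXT, value REAL)")
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?)",
    [("a", 1.0), ("a", 3.0), ("b", 2.0), ("b", 4.0), ("b", 6.0)],
)

# The database does the grouping and aggregation; the client only
# receives one summary row per group.
summary = conn.execute(
    """
    SELECT grp, COUNT(*), AVG(value), MIN(value), MAX(value)
    FROM measurements
    GROUP BY grp
    ORDER BY grp
    """
).fetchall()
conn.close()

for row in summary:
    print(row)
```

    The same query could be sent from R, Stata or psql unchanged, which is the point of keeping the aggregation on the SQL side.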

  • Introduction à R et au tidyverse This is a French book(down) about the tidyverse ecosystem, which seems to have become the all-in-one approach to data processing in R these days--and I know this sparked heated debates on Stack Overflow. Personally, I still use the not so old-fashioned base R, with some nifty additions from data.table. Anyway, as someone who unfortunately had to spend a fair amount of time away from his computer, I appreciate that people write dedicated tutorials on the "tidyverse" so that I can educate myself quickly. Yet, this is still a work in progress and I will probably have to check back later. The Datacamp team also offers an introductory course on the "tidyverse", by the way. And, finally, here is an overview of tidy tools, published by Zev Ross and co-authors on PeerJ Preprints.

  • anthonyohare/SwiftScientificLibrary This looks like a handy library for numerical computing using the Swift language. It is yet another port of the GNU Scientific Library to a general-purpose language (e.g., C, Common Lisp, Scheme, Python), although one can generally find a good native counterpart. In fact, I decided to spend more time from now on using Swift, Python, Clojure & Scheme, and Mathematica. I plan to write some teaching material on those languages for data mining in the coming months. I don't know how far I can go with each language, but time will tell.

  • arunsrinivasan/user2017-data.table-tutorial A nicely illustrated keynote and several practicals on using R data.table, with a strong emphasis on subset/select/group by operations on data tables. This is actually my preferred approach to data processing in R with moderate to large data files because it remains truly compatible with existing methods that work exclusively on data frames, and I never had to learn a lot of new functions or change my own workflow, as is often the case with the "tidyverse" way of doing this or that, IMO.

  • Computational Linear Algebra for Coders These are some lecture notes (with videos available on YouTube) on efficient matrix computations using Python. They cover a wide range of applications, including non-negative matrix factorization and stochastic gradient descent, robust PCA, lasso and ridge regression, Cholesky factorization, and many more. See also the blog announcement here: New fast.ai course: Computational Linear Algebra.
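
    As a taste of one of the listed topics, here is a textbook Cholesky factorization written in plain Python. This is only a sketch for small matrices and is not the course's code; the function name cholesky is my own choice, and a numerical library would be the sensible option for real work.

```python
import math


def cholesky(a):
    """Return the lower-triangular L with L * L^T == a.

    `a` is a symmetric positive-definite matrix given as a list of rows.
    """
    n = len(a)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            # Contribution of the columns already computed.
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                d = a[i][i] - s
                if d <= 0:
                    raise ValueError("matrix is not positive definite")
                lower[i][j] = math.sqrt(d)
            else:
                lower[i][j] = (a[i][j] - s) / lower[j][j]
    return lower


A = [[4, 12, -16], [12, 37, -43], [-16, -43, 98]]
L = cholesky(A)
for row in L:
    print(row)
```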

  • Python Itertools Recipes For those who seek a functional approach to dealing with iterator building blocks in Python.
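
    A short sketch of what these building blocks look like in practice. The three helpers below are written in the spirit of the recipes (take, pairwise, flatten); they are small enough to copy into a project rather than import.

```python
from itertools import chain, count, islice, tee


def take(n, iterable):
    """Return the first n items of the iterable as a list."""
    return list(islice(iterable, n))


def pairwise(iterable):
    """Yield overlapping pairs: s -> (s0, s1), (s1, s2), ..."""
    a, b = tee(iterable)
    next(b, None)  # advance the second copy by one item
    return zip(a, b)


def flatten(list_of_lists):
    """Flatten one level of nesting, lazily."""
    return chain.from_iterable(list_of_lists)


# An infinite generator is fine: take() only pulls what it needs.
squares = (n * n for n in count(1))
first_squares = take(5, squares)
pairs = list(pairwise("abcd"))
flat = list(flatten([[1, 2], [3], []]))

print(first_squares, pairs, flat)
```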

  • Programming, meh… Let’s Teach How to Write Computational Essays Instead A follow-up post to Wolfram's latest article on writing computational essays, with emphasis on Jupyter notebooks for literate programming.

    A computational essay is in effect an intellectual story told through a collaboration between a human author and a computer. …
    The ordinary text gives context and motivation. The computer input gives a precise specification of what’s being talked about. And then the computer output delivers facts and results, often in graphical form. It’s a powerful form of exposition that combines computational thinking on the part of the human author with computational knowledge and computational processing from the computer.

  • Articulate Common Lisp A new website for getting started with Common Lisp. At the time of this writing, it merely consists of a series of links to external resources and some snippets of Lisp code to perform common tasks. I hope this will evolve toward more opinionated discussion around Lisp and, hopefully, numerical computing.

  • Introduction to Data Science Yet another book(down) on R for Data Science, a course originally taught on HarvardX, by Rafael Irizarry who already published a book on Leanpub.

  • IBM Plex A nice-looking OTF font, for desktop and web, which comes in Mono, Sans and Serif versions.

  • Learning probabilistic programming I found this series of exercises while searching for probabilistic programming on Google. It is based on the Anglican package, which in turn relies on Clojure. I haven't explored it much at the moment but I hope to do so in the near future.

  • Natural Language Understanding A Stanford course on NLP and language understanding, featuring topics like sentiment analysis, relation extraction, language semantics and representations, language parsing, or pragmatic agents. Material is available as Jupyter notebooks, and the authors make extensive use of NLTK and scikit-learn (http://scikit-learn.org/stable/).

  • One LEGO at a time: Explaining the Math of How Neural Networks Learn I really love visual explanations and this article offers a nice discussion of backpropagation in neural networks, which, as everyone probably knows, have become popular again. Beyond static pictures, the article also includes dynamic visualizations--much like on https://distill.pub, and Python code is available on GitHub.
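
    For readers who prefer code to pictures, here is a minimal backpropagation sketch in plain Python. It is not the article's code: it assumes a tiny network with one hidden layer, sigmoid activations, no bias terms and a squared-error loss on a single example, and all names are mine. The finite-difference helper is only there to check that the analytical gradient is right.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(w1, w2, x):
    """w1: one row of weights per hidden unit; w2: one weight per hidden unit."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w1]
    y = sigmoid(sum(w * hi for w, hi in zip(w2, h)))
    return h, y


def loss(w1, w2, x, target):
    _, y = forward(w1, w2, x)
    return 0.5 * (y - target) ** 2


def backprop(w1, w2, x, target):
    """Return the gradients of the loss with respect to w1 and w2."""
    h, y = forward(w1, w2, x)
    # Error signal at the output unit (chain rule through the sigmoid).
    delta_out = (y - target) * y * (1.0 - y)
    grad_w2 = [delta_out * hi for hi in h]
    # Propagate the error signal back to each hidden unit.
    delta_hidden = [delta_out * w * hi * (1.0 - hi) for w, hi in zip(w2, h)]
    grad_w1 = [[d * xi for xi in x] for d in delta_hidden]
    return grad_w1, grad_w2


def numerical_grad_w2(w1, w2, x, target, j, eps=1e-5):
    """Central finite-difference estimate of d(loss)/d(w2[j])."""
    up, down = w2[:], w2[:]
    up[j] += eps
    down[j] -= eps
    return (loss(w1, up, x, target) - loss(w1, down, x, target)) / (2 * eps)


w1 = [[0.1, -0.2], [0.4, 0.3]]
w2 = [0.5, -0.6]
x, target = [1.0, 2.0], 1.0

grad_w1, grad_w2 = backprop(w1, w2, x, target)
print(grad_w2[0], numerical_grad_w2(w1, w2, x, target, 0))
```

    The two printed numbers should agree to several decimal places; a gradient descent step would then subtract a small multiple of each gradient from the corresponding weight.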

  • apple/turicreate Some time ago, I followed an ML course on Coursera, which was really great even though it was interrupted before the 5th course and final project. It was run by Emily Fox and Carlos Guestrin, from the University of Washington. They used Python and GraphLab to handle the large datasets that were used to illustrate collaborative filtering or gradient descent algorithms. It looks like Turi Create also relies on GraphLab, and it aims to make it easy to fit complex ML models and expose them to Core ML, making them available on iOS devices. More information on Core ML applications can be found on the following website: Quantum Tunnel Website.

Interesting websites: Michael Betancourt, Robert Grant, Audiograph

