(I know, I know, it’s a bit late for a January review.)

The Box-Cox transformation is one of the many tools one can use to ease traditional linear modeling of data. Semi-parametric regression, which I never really used for real work, is also an option. Flexible regression methods, as emphasized in Harrell’s RMS textbook^{1}, or the use of generalized additive models^{2} are another. I’ve been doing some research on orthogonal polynomials, restricted cubic splines, and other spline fitting procedures, and then I found this paper. It introduces different frameworks (Truncated Power, B-spline, and P-Spline) and a quick application to a time-series dataset.
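As a reminder of what the Box-Cox transformation does, here is a minimal sketch in plain Python. The function name `box_cox` and its arguments are my own choices for illustration; it applies the usual one-parameter form, (x^λ − 1)/λ for λ ≠ 0 and log(x) for λ = 0, and assumes strictly positive data. It does not estimate λ, which in practice is usually picked by maximum likelihood.

```python
import math


def box_cox(values, lam):
    """Apply the one-parameter Box-Cox transform to strictly positive values.

    `box_cox` and `lam` are illustrative names, not taken from any library.
    """
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        # Limiting case of (x**lam - 1) / lam as lam tends to 0
        return [math.log(v) for v in values]
    return [(v ** lam - 1.0) / lam for v in values]


# lam = 1 only shifts the data; lam = 0 is the log transform
shifted = box_cox([1.0, 2.0, 3.0], 1)
logged = box_cox([1.0, 10.0, 100.0], 0)
```

The point of the λ parameter is that a single family covers the log, square-root, and reciprocal transformations commonly tried before fitting a linear model.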

Looks interesting at first sight, but I haven’t had time to read it. Maybe you will be interested, though.

Unlike the parametrizations classically used in Machine Learning, the parametrizations introduced here are all bijective and are even diffeomorphisms, thus allowing to keep the important properties from a statistical inference point of view, first of all identifiability

This is a 300-page tutorial on probabilistic programming written by one of the authors of Anglican, which seems like a stalled project at this time. Still, it is a very valuable account of probabilistic programming at large and a peculiar implementation targeting the JVM. See also these slides.

This is a rather long monograph on RDDs. It covers local randomization (which assumes that the assignment of units above or below the cutoff was as if random inside a given window and that in this window the potential outcomes are unrelated to the score), fuzzy RDD (which accounts for non-compliance), and discrete and multidimensional designs. There’s a preliminary monograph from the same authors that only deals with the classical RDD, as well as a practical for medical applications.
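To make the local randomization idea concrete, here is a rough sketch of my own (not taken from the monograph): inside a symmetric window around the cutoff, units are treated as if randomly assigned, so a plain difference in mean outcomes is the natural estimate. The function name, the `half_width` argument, and the convention that units with a score at or above the cutoff are treated are all assumptions made for the example. Choosing the window and doing inference within it are the hard parts and are not covered here.

```python
def local_randomization_estimate(scores, outcomes, cutoff, half_width):
    """Difference in mean outcomes within [cutoff - half_width, cutoff + half_width].

    Hypothetical helper: assumes a sharp design in which units with
    score >= cutoff are treated, and a window chosen beforehand.
    """
    pairs = list(zip(scores, outcomes))
    treated = [y for s, y in pairs if cutoff <= s <= cutoff + half_width]
    control = [y for s, y in pairs if cutoff - half_width <= s < cutoff]
    if not treated or not control:
        raise ValueError("window must contain units on both sides of the cutoff")
    return sum(treated) / len(treated) - sum(control) / len(control)


# Toy data: the two extreme units fall outside the window and are ignored
scores = [-2.0, -0.5, -0.2, 0.1, 0.4, 3.0]
outcomes = [10.0, 1.0, 3.0, 5.0, 7.0, 100.0]
estimate = local_randomization_estimate(scores, outcomes, cutoff=0.0, half_width=1.0)
```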

Programming is ubiquitous in applied biostatistics; adopting software engineering skills will help biostatisticians do a better job.

A few months ago, I read on a blog that statisticians were very bad at programming because they lack skills in software engineering. Let me suggest two additional remarks: applied statisticians must provide results, quickly whenever possible, and when they can use agile programming they do, most of the time. This does not mean they will deliver a product of industrial grade, but if they can reproduce their own results six months later, that is already good news. I guess many software engineers have never looked at statlib or established algorithms: you can use TDD, CI, regression tests and the like, but if your algorithm is far from the standards, it won’t do any good.

Sometimes, I feel like I’m reading the same story every year: “biomarkers”, measurement error, practical significant difference, etc. Ioannidis warned us a long time ago.

The author suggests “probabilistic top lists as a novel type of prediction in classification, which bridges the gap between single-class predictions and predictive distributions.” His conclusion is that the Brier score is a good candidate. The problem with discrimination indices is that they perform poorly when the observed prevalence of the outcome is very low (which happens pretty often in medical applications) despite highly significant variables.
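For reference, the Brier score in its simplest binary form is just the mean squared difference between the predicted probability and the observed 0/1 outcome. The sketch below is my own simplification for illustration; the paper deals with multi-class top lists, which this does not cover, and `brier_score` is a name I chose.

```python
def brier_score(probabilities, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes.

    Binary simplification for illustration only; lower is better.
    """
    if not probabilities or len(probabilities) != len(outcomes):
        raise ValueError("need two non-empty sequences of equal length")
    squared_errors = [(p - y) ** 2 for p, y in zip(probabilities, outcomes)]
    return sum(squared_errors) / len(squared_errors)


# Perfect forecasts score 0; an uninformative 0.5 forecast scores 0.25
perfect = brier_score([0.0, 1.0, 1.0], [0, 1, 1])
uninformative = brier_score([0.5, 0.5], [0, 1])
```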

I learned about Dex on Darren Wilkinson’s blog. Later on, I learned about Futhark, but see A comparison of Futhark and Dex. I have never used either of these DSLs, but I warmly recommend reading what Darren Wilkinson has to say about those frameworks.

♪ Train • *Drops of Jupiter*