ArXiv

This is a brief review of deep learning in the light of the “renewed interest” in artificial intelligence that emerged during the last two years. I believe it does not reflect the opinion of all researchers or practitioners, but there are some interesting references in there and, as the author said, it is deliberately oriented toward AI research, not machine learning or data science. The author summarizes ten concerns regarding deep learning, which essentially revolve around the generalizability issue. Here are some of them: data hungriness and no capacity to learn abstractions, limited capacity to transfer knowledge (beyond the scope that was used to learn the model), no way to deal with hierarchical structure or open-ended inference, black-box approach, poor or no integration of prior knowledge, unresolved causation versus correlation issue, and assumption of a stable world. In the last section, the author suggests some ideas to go forward: combined use of deep learning and unsupervised learning, use of hybrid models to integrate symbolic information, and gaining more insights from cognitive psychology.

There was also a hot debate during NIPS 2017 on the interpretability issue in machine learning; see, e.g., Yann LeCun’s reply on Reddit.

arXiv:1712.09988 Orthogonal ML for Demand Estimation: High Dimensional Causal Inference in Dynamic Panels, Victor Chernozhukov • Matt Goldman • Vira Semenova • Matt Taddy

The authors discuss Machine Learning (ML) methods that can be used to facilitate model selection and the estimation of treatment effects in economic policy analysis. In particular, they emphasize the double ML framework, which they previously demonstrated to be useful for partialling out the influence of potential controls. Basically, this amounts to estimating log price and sales conditional on lagged values of those variables and then applying some penalization (e.g., Lasso) in a second stage, where we want to explain heterogeneous demand elasticities by regressing residuals of the first model on indicators for location, channel, product or product type.
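
To make the partialling-out idea concrete, here is a minimal sketch on simulated data. It is not the authors' procedure: plain OLS stands in for the ML learners of the first stage, there is no sample splitting and no Lasso in the second stage, and all names and values (`W`, `true_elasticity`, the sample size) are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 5000, 5

# Hypothetical data: controls W drive both log price and log sales.
W = rng.normal(size=(n, p))
log_price = W @ rng.normal(size=p) + rng.normal(size=n)
true_elasticity = -1.5
log_sales = true_elasticity * log_price + W @ rng.normal(size=p) + rng.normal(size=n)


def residualize(y, X):
    """Return the residuals of y after an OLS fit on X (with intercept)."""
    X1 = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return y - X1 @ beta


# Stage 1: partial the controls out of both the outcome and the treatment.
sales_res = residualize(log_sales, W)
price_res = residualize(log_price, W)

# Stage 2: regress residual on residual to recover the elasticity.
elasticity_hat = (price_res @ sales_res) / (price_res @ price_res)
print(round(elasticity_hat, 2))
```

The estimate should land close to the simulated elasticity, because the residual-on-residual slope no longer picks up the variation shared with the controls.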

The study of heterogeneous treatment effects is not limited to econometrics but applies to any observational data that have to deal with causal inference and the idea of “potential outcomes”. It can be shown that the naive estimator of the average treatment effect (ATE) can be decomposed as the sum of E[δ_i] and pre-treatment and treatment-effect heterogeneity biases (see Heterogeneous Treatment Effect Analysis, by Jann, or Estimating Heterogeneous Treatment Effects with Observational Data, by Xie et al.). It is then essential to eliminate those biases, especially the first one.
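
For reference, the decomposition can be written as follows. The notation is my own assumption, not copied from the cited papers: Y^1 and Y^0 are the potential outcomes, D is the treatment indicator, δ_i = Y_i^1 − Y_i^0 is the unit-level effect, and π = P(D = 1) is the proportion of treated units.

```latex
\underbrace{E[Y^1 \mid D=1] - E[Y^0 \mid D=0]}_{\text{naive estimator}}
= \underbrace{E[\delta_i]}_{\text{ATE}}
+ \underbrace{E[Y^0 \mid D=1] - E[Y^0 \mid D=0]}_{\text{pre-treatment heterogeneity bias}}
+ \underbrace{(1-\pi)\left(E[\delta_i \mid D=1] - E[\delta_i \mid D=0]\right)}_{\text{treatment-effect heterogeneity bias}}
```

The second term vanishes when treated and untreated units would have had the same outcome without treatment, and the third vanishes when both groups would respond to the treatment in the same way.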

arXiv:1703.00864 The Unreasonable Effectiveness of Structured Random Orthogonal Embeddings, Krzysztof Choromanski • Mark Rowland • Adrian Weller

This article deals with embedding methods in Machine Learning, but not really the ones I was thinking about (i.e., embedded methods that perform variable selection during the learning phase of an ML algorithm) when I downloaded the paper. Like many NIPS papers, it is rather technical and I had a very hard time following it. The core idea is that random ortho-matrices, that is, structured Gaussian or SD-product random matrices with orthogonal rows, can be used for dimensionality reduction (via the orthogonal Johnson-Lindenstrauss Transform) or for improving the mean squared error of prediction (via the angular kernel, which I had never heard about before). I really have no idea what I could potentially do with this theoretical contribution, but it was a fun read.
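
To get a feel for what a projection with orthogonal rows does, here is a small NumPy sketch. It is only my simplified reading of the idea, not the paper's construction: I take rows of the Q factor of a Gaussian matrix and rescale them, and the dimensions `n` and `m` are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 256, 32  # original and reduced dimensions (illustrative values)

# Orthonormal rows: QR-decompose a Gaussian matrix and keep m rows of Q.
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))
P = np.sqrt(n / m) * Q[:m, :]  # rescaled so that squared norms are roughly preserved

# Compare pairwise distances before and after projection.
X = rng.normal(size=(20, n))
Z = X @ P.T


def pairwise_dist(A):
    """Euclidean distance between every pair of rows of A."""
    diff = A[:, None, :] - A[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


iu = np.triu_indices(len(X), k=1)
ratio = pairwise_dist(Z)[iu] / pairwise_dist(X)[iu]
print(ratio.mean(), ratio.std())
```

The ratio of projected to original distances should hover around 1, which is the Johnson-Lindenstrauss property in a nutshell; the paper's contribution is to show when orthogonal rows do better than independent ones.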

arXiv:1801.00371 Data Science vs. Statistics: Two Cultures?, Iain Carmichael • J.S. Marron

Personally, I never liked the term “data scientist” and I was quite happy with what Karl Broman put on his blog post:

In the present article, the authors define data science as the union of the following domains, after David Donoho’s article, 50 Years of Data Science: (1) data gathering, preparation, and exploration, (2) data representation and transformation, (3) computing with data, (4) data modeling, (5) data visualization and presentation, and (6) science about data science.

Many blog posts and online discussions are referenced throughout the article, as well as more scholarly published work. The authors emphasize the need for effective teaching of scientific reasoning, data visualization and statistical computing, and the importance of reproducible research, including literate programming, unit testing, or versioning, in applied research. Differing paradigms are also highlighted in this text: predictive versus explanatory modeling, exploratory versus explanatory approaches. I learned about the 80/20 rule (this is not what I initially thought, that 80% of our time is spent in data munging and data exploration): “The basic idea is that the first reasonable thing you can do to a set of data often is 80% of the way to the optimal solution. Everything after that is working on getting the last 20%.” The rest of the article is a rejoinder to Nolan and Lang’s paper, Computing in the statistics curricula. You might also be interested in reading this article by Hardin et al.: Data Science in Statistics Curricula: Preparing Students to “Think with Data”.

A final note: while I was looking for a certain reference via Google, I found the following PeerJ paper: Teaching stats for data science, written by Daniel T Kaplan. The ten building blocks (data tables, data graphics, model functions, model training, effect size and covariates, displays of distributions, bootstrap replication, prediction error, comparing models, generalization and causality) that the author recommends to start teaching “data science” are merely related to statistical computing and statistical inference, using the R statistical package, as far as I can tell, and so it is likely that the apparent lack of a consensual definition of data science still remains an open issue in 2018.

arXiv:1801.00364 Estimation and Inference of Treatment Effects with L₂-Boosting in High-Dimensional Settings, Ye Luo • Martin Spindler

Per the paper abstract, the authors “give results for the valid inference of a treatment effect after selecting from among very many control variables and the estimation of instrumental variables with potentially very many instruments when post- or orthogonal L₂-Boosting is used for the variable selection.” There are some connections with the arXiv:1712.09988 paper discussed above, in particular the estimation of treatment effect in high-dimensional settings relies on orthogonalized moment conditions, which are moment conditions imposed by a statistical model to estimate unknown parameters from a model using the generalized method of moments.

arXiv:1801.00753 Probabilistic supervised learning, Frithjof Gressmann • Franz J. Király • Bilal Mateen • Harald Oberhauser

This is a very big paper (105 pp.) about probabilistic predictive modeling using supervised methods implemented in a Python library, skpro. This statistical library relies on sklearn and PyMC3 and it basically allows (per the doc):

arXiv:1404.4178v6 Speeding Up MCMC by Efficient Data Subsampling, Matias Quiroz • Robert Kohn • Mattias Villani • Minh-Ngoc Tran

This is a slightly revised version of a paper submitted almost 4 years ago that deals with efficient sampling of posterior distribution using MCMC. So-called scalable MCMC relies on either the MapReduce scheme or subsampling approach, which the authors used in their article with the addition of pseudo-marginal MCMC in the case of intractable likelihood: “at each iteration the log-likelihood from n observations is estimated unbiasedly from a random subset with m ≪ n observations, and the resulting likelihood estimate is then bias corrected to obtain an approximately unbiased estimate of the likelihood.”
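
The subsampling step quoted above can be sketched in a few lines. This is a toy version under my own assumptions (a N(θ, 1) model, simple random sampling with replacement, made-up values for `n`, `m` and `theta`); the paper relies on more refined estimators to keep the variance under control, which I do not attempt here.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 10_000, 500          # full sample size and subsample size (m << n)
y = rng.normal(size=n)      # hypothetical data
theta = 0.1                 # parameter value at which the likelihood is evaluated

# Per-observation log-likelihood contributions under a N(theta, 1) model.
# (Computed for all n here only so that the estimate can be checked.)
loglik_i = -0.5 * np.log(2 * np.pi) - 0.5 * (y - theta) ** 2
full_loglik = loglik_i.sum()


def subsample_loglik(loglik_i, m, rng):
    """Unbiased estimate of the full log-likelihood, with its variance estimate."""
    n = len(loglik_i)
    idx = rng.integers(0, n, size=m)   # simple random sampling with replacement
    sample = loglik_i[idx]
    est = n * sample.mean()
    var = n ** 2 * sample.var(ddof=1) / m
    return est, var


estimates = np.array([subsample_loglik(loglik_i, m, rng)[0] for _ in range(2000)])
print(full_loglik, estimates.mean())

# Bias correction on the likelihood scale: use exp(est - var / 2) instead of exp(est).
est, var = subsample_loglik(loglik_i, m, rng)
log_lik_hat_corrected = est - var / 2
```

The average of the subsample estimates should sit close to the full log-likelihood. The correction term comes from the fact that exponentiating an unbiased, roughly normal estimate of the log-likelihood overshoots the likelihood by a factor exp(var / 2).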

arXiv:1608.04481 Lecture notes on randomized numerical linear algebra, Michael W. Mahoney

ArXiving on March 2018

Let us celebrate this new Spring with some papers fetched from arXiv, not necessarily from the last weeks but at least in reverse chronological order in my Twitter timeline.

Crisan, A., Munzner, T., & Gardy, J. L. (2018). Adjutant: an R-based tool to support topic discovery for systematic and literature review

A short paper describing a Shiny application that fetches data from PubMed and applies topic clustering to the results using t-SNE followed by the hdbscan algorithm (R package). Importantly, data can be saved locally if the user wants to perform additional analyses.

Nuijten, M. B., Van Assen, M. A. L. M., Augusteijn, H. E. M., Crompvoets, E. A. V., & Wicherts, J. M. (2018). Effect sizes, power, and biases in intelligence research: a meta-meta-analysis

Michèle B. Nuijten is the co-author of the statcheck R package, which checks the p-values reported in scientific articles. It was created to assist in meta-analyzing a bunch of papers in the psychological literature; see Nuijten, M. B., Hartgerink, C. H. J., van Assen, M. A. L. M., Epskamp, S., & Wicherts, J. M. (2016). The prevalence of statistical reporting errors in psychology (1985-2013). Behavior Research Methods, 48(4), 1205-1226.

In the present paper, the authors assessed effect size and power reported in 131 meta-analyses in intelligence research. As expected from this type of research field where psychometric scales are used, Pearson’s correlation was often used to report effect size (42 studies), even if it might be wiser to use a disattenuated correlation measure, as recommended in the psychometric literature, or Cohen’s d (for comparison of pairs of means). All effect size measures were expressed as Pearson’s r. The results can be checked online. Correlations were found to be on average around 0.25, with an achieved statistical power just below 50% (only one third of all 2,439 primary studies reach the 80% power threshold), with worse results in behavioral genetics studies.
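
To put these numbers in perspective, here is a back-of-the-envelope power calculation for a correlation of 0.25. It uses the Fisher z approximation with a two-sided 5% test, which is my choice and not necessarily what the authors did; the sample size of 60 is an arbitrary example.

```python
from math import atanh, ceil, sqrt
from statistics import NormalDist

norm = NormalDist()


def power_for_r(r, n, alpha=0.05):
    """Approximate power of a two-sided test of rho = 0 (Fisher z approximation)."""
    z_crit = norm.inv_cdf(1 - alpha / 2)
    shift = atanh(r) * sqrt(n - 3)
    return norm.cdf(shift - z_crit) + norm.cdf(-shift - z_crit)


def n_for_power(r, power=0.80, alpha=0.05):
    """Approximate sample size needed to detect a correlation r."""
    z_crit = norm.inv_cdf(1 - alpha / 2)
    z_pow = norm.inv_cdf(power)
    return ceil(((z_crit + z_pow) / atanh(r)) ** 2 + 3)


print(n_for_power(0.25))                 # sample size for 80% power
print(round(power_for_r(0.25, 60), 2))   # power achieved with 60 participants
```

Under this approximation a study needs on the order of 120 participants to reach 80% power for r = 0.25, so a power around 50% suggests that typical primary studies are roughly half that size.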

Lehman, J., Clune, J., Misevic, D., Adami, C., Beaulieu, J., Bentley, P. J., Bernard, S., … (2018). The surprising creativity of digital evolution: a collection of anecdotes from the evolutionary computation and artificial life research communities

This is a big discussion of evolutionary algorithms which reminds me of The Nature of Code, by Daniel Shiffman (available on GitHub as well). I enjoyed reading his book two years ago, and now I am watching Daniel Shiffman on the coding train on Youtube from time to time.

In the present review, the authors discuss several lessons learned from bugs or failed attempts at demonstrating an expected outcome in scientific studies, which turned out to bring interesting new ideas. They are presented as a sort of “greatest hits” collection of anecdotes, according to the authors. For instance, here is an example that people involved in randomized clinical trials would love: in an experiment involving reinforcement learning using neural networks, the authors forgot to randomize the order of the stimuli, such that in the end the neural network just learned to exploit the regularity of the pattern rather than to tune its connections to the edibility of each item.

Mehta, P., Bukov, M., Wang, C., Day, A. G. R., Richardson, C., Fisher, C. K., & Schwab, D. J. (2018). A high-bias, low-variance introduction to machine learning for physicists.

Sometimes you come across some big work on arXiv. This is it! Here is a fresh new introduction to Machine Learning techniques, in about a hundred pages, for physicists this time. The companion website has everything to get started with IPython notebooks and the scikit-learn library. The applications cover basic algorithms like gradient descent and popular statistical methods: (regularized) linear and logistic regression, bagging and boosting, deep neural networks using Keras, TensorFlow or PyTorch, and even variational autoencoders using Keras. I should note that the authors introduce ML as a subfield of AI without further ado (why is this not considered a subfield of statistics?) and that they “focus on prediction problems and refer the reader to one of many excellent textbooks on classical statistics for more information on estimation” (other than prediction, I have a hard time believing that black-box models like bagging and boosting can offer any more than predictive performance over well-tuned regression models).

Drovandi, C. C., Grazian, C., Mengersen, K., & Robert, C. (2018). Approximating the likelihood in approximate bayesian computation

This paper presents alternative approaches to approximate Bayesian computation (ABC), which allows one to compute a non-parametric estimate of the likelihood of a summary statistic based on computer simulation. This covers a Bayesian version of the synthetic likelihood method proposed by Wood (2010) and the non-parametric empirical likelihood developed by Mengersen et al. (2013). The latter allows for the use of Bayes factors in model comparison, and it also avoids computational simulation under certain circumstances.

Morris, T. P., White, I. R., & Crowther, M. J. (2017). Using simulation studies to evaluate statistical methods

This is a nice review of computational approaches to statistical inference with an emphasis on providing guidance for newcomers and experienced practitioners in biomedical research. Key concepts are detailed in the article but here are the main steps: planning and aims of the simulation study, methods, target of analysis, performance measures, coding, analysis, reporting. Regarding performance measures, it is interesting to note that among 100 papers published in Statistics in Medicine, bias and coverage of (1-\alpha) confidence intervals were the most frequently studied indicators, while type I error rate was only assessed in one-third of the studies. A working example based on the proportional hazards model (§7) is used to illustrate the various steps of a simulation study.
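
Here is what the two most popular performance measures look like in a toy simulation study. This is a sketch of my own, far simpler than the proportional hazards example of the paper: the estimand is a mean, the data are Gaussian, the interval is a normal-approximation one, and all numeric settings are arbitrary.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(3)
n_sim, n_obs = 2000, 100          # number of repetitions and sample size per data set
true_mean, sd = 1.0, 2.0          # data-generating mechanism (illustrative values)
z = NormalDist().inv_cdf(0.975)   # for nominal 95% confidence intervals

estimates = np.empty(n_sim)
covered = np.empty(n_sim, dtype=bool)
for i in range(n_sim):
    x = rng.normal(true_mean, sd, size=n_obs)
    est = x.mean()
    se = x.std(ddof=1) / np.sqrt(n_obs)
    estimates[i] = est
    covered[i] = (est - z * se <= true_mean) and (true_mean <= est + z * se)

bias = estimates.mean() - true_mean
coverage = covered.mean()
# Monte Carlo standard errors quantify the uncertainty of the simulation itself.
mcse_bias = estimates.std(ddof=1) / np.sqrt(n_sim)
mcse_coverage = np.sqrt(coverage * (1 - coverage) / n_sim)
print(bias, mcse_bias, coverage, mcse_coverage)
```

Reporting the Monte Carlo standard errors next to bias and coverage makes it clear whether a departure from the nominal 95% reflects the method or just too few repetitions.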

McShane, B. B., Gal, D., Gelman, A., Robert, C., & Tackett, J. L. (2017). Abandon Statistical Significance

Kim, Y., & Li, J. (2018). Serverless data analytics with flint

Flint depends on AWS (S3 bucket) and Lambda. It allows you to use Spark without having to worry about installing Spark on your local machine, while Flintrock is a command-line utility that allows you to launch and manage a cluster on EC2.

AWS Lambda is one of the next-generation services from Amazon. It lets you run any code in a serverless environment according to a pay-as-you-go cost model: Amazon takes care of allocating resources and managing parallel execution of your code. Moreover, the execution can be triggered by an external event, like in Travis CI. As an example, consider an image processing pipeline where each time a user uploads an image to an S3 bucket, a Lambda function that generates an image thumbnail is triggered. For a standalone application, take a look at Seth Brown’s BeerAI recommender system.
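
The thumbnail example boils down to a handler function that receives the upload notification. Below is a minimal sketch of the event-parsing part only, assuming the usual layout of S3 notifications; the `thumbnails/` naming scheme, the bucket name and the `thumbnail_key` helper are hypothetical, and the actual download, resize and upload steps are left out.

```python
from urllib.parse import unquote_plus


def thumbnail_key(key, prefix="thumbnails/"):
    """Hypothetical naming scheme for the generated thumbnail."""
    return prefix + key.rsplit("/", 1)[-1]


def handler(event, context):
    """Entry point invoked by Lambda for each S3 upload notification."""
    jobs = []
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])  # keys arrive URL-encoded
        # A real function would download the object, resize it, and upload
        # the result here (e.g. with boto3 and Pillow); omitted in this sketch.
        jobs.append({"bucket": bucket, "source": key, "target": thumbnail_key(key)})
    return jobs


# Local dry run with a hand-written event that mimics the S3 notification layout.
fake_event = {"Records": [{"s3": {"bucket": {"name": "my-photos"},
                                  "object": {"key": "uploads/cat+1.png"}}}]}
print(handler(fake_event, None))
```

Keeping the handler free of side effects until the very last step makes it easy to test locally with a fake event, as done on the last lines.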

Aaronson, S. (2011). Why Philosophers Should Care About Computational Complexity

Bacry, E., Bompaire, M., Gaïffas, Stéphane, & Poulsen, S. (2018). Tick: a python library for statistical learning, with a particular emphasis on time-dependent modelling

Tick is a new Python package for Machine Learning of time-dependent events, including Hawkes processes (I know nothing about those processes but I like the name :smile:; for more details, see this review). It also features penalized linear and logistic regression and the paper includes a small benchmark of those models against their scikit-learn counterparts.

ArXiving on May 2018

Here is a quick overview of some papers from the ArXiv. (And yes it was supposed to be published in May, and yes it is partly incomplete.)

David M. Blei, Alp Kucukelbir & Jon D. McAuliffe (2016). Variational Inference: A Review for Statisticians

When one wants to approximate a complex (posterior) probability density, Variational Inference is a great technique to find the closest probability density as measured by Kullback-Leibler divergence. Here is a short introduction on Princeton for those who want it short (see also this tutorial). Otherwise, I highly recommend spending some time with this review which deals with Bayesian mixture of Gaussians and mean-field variational family.
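
In symbols, and with notation that I am assuming here (latent variables z, observations x, a family Q of candidate densities), the problem and the quantity that is actually optimized read:

```latex
q^{*}(z) = \arg\min_{q \in \mathcal{Q}} \mathrm{KL}\left(q(z) \,\|\, p(z \mid x)\right),
\qquad
\log p(x) = \underbrace{E_{q}[\log p(x, z)] - E_{q}[\log q(z)]}_{\mathrm{ELBO}(q)}
          + \mathrm{KL}\left(q(z) \,\|\, p(z \mid x)\right),
\qquad
q(z) = \prod_{j=1}^{m} q_j(z_j) \quad \text{(mean-field family)}.
```

Since log p(x) does not depend on q, maximizing the ELBO is the same as minimizing the KL divergence, and it only requires the joint density p(x, z) rather than the intractable posterior.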

Hanqing Zhao & Yuehan Luo (2018). An O(N) Sorting Algorithm: Machine Learning Sorting

In this research paper, the authors present a probability-based algorithm with a complexity of \mathcal{O}(N\cdot M), where M is a small constant reflecting the number of neurons in the hidden layers of a Neural Network. Note that radix sort and counting sort are also \mathcal{O}(N) (up to a constant k, whether it be the number of digits to sort or simply their range). Note also that Don Knuth wrote a fabulous book on sorting algorithms. Now, this is the era of big data and machine learning, so if you care about those hot topics, check for yourself.
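
As a reminder of why linear-time sorting is nothing new, here is counting sort in a few lines of Python. This is the textbook algorithm, not the method from the paper, and it assumes non-negative integer keys smaller than a known bound k.

```python
def counting_sort(values, k):
    """Sort non-negative integers smaller than k in O(N + k) time."""
    counts = [0] * k
    for v in values:                # one pass to tally each value
        counts[v] += 1
    out = []
    for v, c in enumerate(counts):  # one pass over the range of values
        out.extend([v] * c)
    return out


data = [5, 3, 9, 3, 0, 7, 5, 1]
print(counting_sort(data, k=10))  # [0, 1, 3, 3, 5, 5, 7, 9]
```

No comparison between elements is ever made, which is how the algorithm escapes the N log N lower bound of comparison-based sorting, at the price of knowing the range of the keys beforehand.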

Gavin L Simpson (2018). Modelling palaeoecological time series using generalized additive models

I read this preprint because it is written by a very bright guy I have followed on Stack Exchange and on his personal website for a long time now. I heard about that paper on Twitter. As I know he is quite proficient with R and well versed in the Generalized Additive Model (GAM), I was happy to take some time to review that paper. If you are interested in this kind of stuff, you should really take a look at his blog. This paper deals with statistical analyses of trends in palaeoenvironmental time series using GAMs. This family of models allows one not only to provide smooth and cross-validated estimates of trends in time, but also to assess model uncertainty and, to some extent, to formally test hypotheses between alternative models. This paper serves both as an introduction to using GAMs in a correct and sensible way and as a hands-on tutorial with real datasets.

Thomas Hofmann, Bernhard Schölkopf & Alexander J. Smola (2008). Kernel Methods in Machine Learning

This is an old publication that I found while looking at Bernhard Schölkopf’s webpage at MPI. In a few words,

Overall, I found that this article provides a thorough overview of kernel methods, and if you want an extended version, go check the definitive book written a few years before by the same authors:

Victor Dibia & Çağatay Demiralp (2018). Data2Vis: Automatic Generation of Data Visualizations Using Sequence to Sequence Recurrent Neural Networks

This is a recent paper on automating the data visualization process, not from IDL this time. I always take a look at what the IDL lab from the University of Washington recently released, in case there is something new and as big as Data Voyager (paper). In my view, designing effective visualizations and guiding users in visual data exploration are separate aspects of automatic data visualization, although both rely on recommender systems to some extent. Several data viz packages rely on a declarative specification, in the spirit of Wilkinson’s Grammar of Graphics; see, e.g., the R ggplot package and the Vega JavaScript implementation. In the present paper, the authors used Deep Neural Networks (DNN) to build a generative model relying on the idea of a sequence to sequence translation problem. In fact, the training dataset they used was based on Vega-Lite examples for a range of basic charts and aggregating operations. Data2Vis is available online, but I wasn’t able to run the demo version due to server errors at the time of this writing.

ArXiving on July 2019

Here is a quick wrap up of my latest readings on arXiv. In no particular order, as usual.

SciPy 1.0—Fundamental Algorithms for Scientific Computing in Python (Virtanen et al., 2019) arXiv:1907.10121v1

Everyone doing scientific computing in Python knows SciPy, which reached its v1 a few months ago, after years of development. Of course, this milestone was more of a new stage in the continuous work that has driven the SciPy package since its inception. In this paper, the authors explain the motivation behind SciPy, its relation with the numerical and graphical ecosystem already available in Python or under development (NumPy, IPython), what its 16 submodules contain, and why open-source development for scientific computing using a general-purpose language really matters. Like R, a significant part of the SciPy package features Fortran and C code. Note that this paper does not discuss the scikit-* packages, which surely contributed over the years to enhance the data science stack available in Python.

Techniques for Automated Machine Learning (Chen et al., 2019) arXiv:1907.08908v1

While I first learned about automated ML and related techniques via H2O.ai and a few other packages, it looks like the rise of AI-related research (e.g., deep learning), the need to combine several classifiers (like in ensemble methods) or to apply complex preprocessing or feature engineering steps before building up a reliable model, and probably a bit of black-boxing all call for new methods of managing ML models. In short, automated ML allows one to build flexible pipelines that can perform automatic tuning of feature engineering, model hyperparameters, or even the deep learning machinery itself. Reinforcement learning and evolutionary algorithms (the same techniques we were hearing about back in the late era of AI in the 90’s) are popular techniques for automated ML, especially in the case of feature engineering. This review remains quite technical, and assumes you are already a good ML practitioner, IMHO.

∂P : A Differentiable Programming System to Bridge Machine Learning and Scientific Computing (Innes et al., 2019) arXiv:1907.07587v2

Julia is a nice language, but I have never really been able to get involved in it. I mean, there are so many things that changed between version 0.4, 0.5, then 0.6, and now 0.7/1.0 that I didn’t really have time to follow or relearn what I was doing at the beginning (i.e., when everything had to be installed manually from GitHub). Anyway, it is a good thing that there are some good contenders to R or Python on the market. In this paper, the authors describe ∂P, which is a system built on top of Julia to perform automatic differentiation (you know, that little thing that is used to transform code for one function into code for the derivative of that function). I am not particularly involved in AD, but I always find it interesting to read about the potential of a PL in a particular field of applications, even if it is not my business; call this curiosity, or procrastination, or whatever. Some parts are very technical, either on the math side or on the Julia-related data structures, but the examples cover neural networks, computer vision, financial derivatives, and even quantum ML (never heard of it before reading that paper), and they are quite illustrative of how the ∂P system actually works.
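
For those who have never seen automatic differentiation at work, here is a toy forward-mode version based on dual numbers. To be clear, this is not how ∂P works (the paper describes a much more general system built on Julia), and the `Dual` class only supports addition and multiplication; it merely shows that derivatives can be carried along with ordinary code.

```python
from dataclasses import dataclass


@dataclass
class Dual:
    """A value together with its derivative with respect to the input."""
    val: float
    der: float = 0.0

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.der + other.der)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.val * other.val, self.der * other.val + self.val * other.der)

    __rmul__ = __mul__


def derivative(f, x):
    """Evaluate f'(x) by seeding the input with derivative 1."""
    return f(Dual(x, 1.0)).der


def f(x):
    return x * x + 3 * x + 1


print(derivative(f, 2.0))  # 7.0
```

The function `f` is written as plain Python and never mentions derivatives: the result is exact (2x + 3 at x = 2), with neither symbolic manipulation nor finite differences.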

Generalised linear models for prognosis and intervention: Theory, practice, and implications for machine learning (Arnold et al., 2019) arXiv:1906.01461

As I said in my previous post, the all-in-one blind approach to ML has been challenged by Frank Harrell and ‘traditional’ statisticians on numerous occasions. Maybe, as some used to say, Breiman’s “two cultures” paper hasn’t aged so well after all. This article deals with the application of GLMs in prognosis research, where statisticians are more concerned with prediction than explanation.

This is indeed making a strong case for ML in health research, and I haven’t heard of any significant result in this direction so far. But again, go check Frank Harrell’s discussion of this hot topic. The authors, however, chose to discuss the subtle issue of conflating prediction with causation, which even in the case of regression models is often overlooked: GLMs remain blind to the causal structure of the data, and so do ML models. The authors then provide a thorough exposition of how to select and parameterize covariates in a predictive or causal model, and how each model may be assessed and interpreted in turn. It turns out that ML research should also take into account this fine distinction, for “much of the promise of ML is predicated on prediction which prioritises the strength of (undirected) associations over the causal structures that give rise to them, and subsequently prioritises data-driven over theory-driven model selection.”

Please Stop Permuting Features: An Explanation and Alternatives (Hooker et al., 2019) arXiv:1905.03151v1

TLDR; When features in the training set exhibit statistical dependence, permutation methods can be highly misleading when applied to the original model.

Permutation-based measures of variable importance, partial dependence plots, and individual conditional expectation plots are cheap and parameter-free techniques to assess the predictive support of variables, or features, in a given ML algorithm. However, they are known to exhibit serious flaws when there exist intercorrelations between features. I illustrated some of the problems of partial dependence in a previous talk. Alternatives do exist, though, e.g., permutation measures of variable importance based on the conditional distribution, or the leave-one-covariate-out technique, but they stick to the permute-and-repredict framework, while retraining a model after permuting the feature of interest might provide an interesting way to reduce the importance of correlated features, albeit at a larger computational cost.
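
The permute-and-repredict recipe itself fits in a few lines. The sketch below is mine, with simulated data and a plain linear model standing in for the ML algorithm, so it shows the mechanics rather than the failure cases documented in the paper, which involve more flexible learners.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 2000

# Hypothetical data: x2 is strongly correlated with x1 but has no effect on y.
x1 = rng.normal(size=n)
x2 = x1 + 0.5 * rng.normal(size=n)
X = np.column_stack([x1, x2])
y = x1 + 0.5 * rng.normal(size=n)

# Fit once (OLS with intercept); the model is never retrained below.
X1 = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(X1, y, rcond=None)


def predict(X):
    return beta[0] + X @ beta[1:]


def mse(y, yhat):
    return np.mean((y - yhat) ** 2)


baseline = mse(y, predict(X))
importance = []
for j in range(X.shape[1]):
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])  # break the link between feature j and y
    importance.append(mse(y, predict(Xp)) - baseline)

print(importance)
```

Note that shuffling x1 alone produces (x1, x2) pairs that never occur in the training data, since the two features normally move together. The model is therefore evaluated where it has to extrapolate, which is the root of the problem raised by the authors.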

ArXiving on March 2020

The Book of Why: The New Science of Cause and Effect arXiv:2003.11635v1

This is a fair review of the Book of Why, which I started reading a while back before forgetting about it because I’m no longer involved in causal analysis, and not curious enough to keep up with everything that’s going on around here. Anyway, graphical causal representations under the umbrella of directed acyclic graphs (DAGs) are supposed to be the new starting point for causal analysis. The reviewers remain, however, critical of Judea Pearl’s optimism, since his approach mostly relies on the assumption that the underlying causal structure is known.

I would say that the same criticism applies to all regression models, insofar as the (frequentist) interpretation is valid (and meaningful) only conditional on the model being correctly specified. Another criticism concerns the way the book plays down the usefulness of the experimental approach, like randomized clinical trials or experimental design at large. Finally, DAGs are not the only approach to causal modeling, according to this review, for several alternative causal models, like nonparametric structural equation models or design-based approaches, exist and may be able to answer the specifics of causal discovery.

An Introduction to Probabilistic Programming arXiv:1809.10756

I’ve read several papers or arXiv reviews on probabilistic programming (PP) lately. I mentioned this one already. Briefly, it provides an extensive review of the core ideas of PP, namely the importance of conditioning in probabilistic machine learning and the use of dynamic computation graphs. Interestingly, this uses the Clojure dialect as the main implementation. This should come as no surprise since some of the authors are actively involved in the Anglican toolbox.

A Survey of Deep Learning for Scientific Discovery arXiv:2003.11755v1

This review is structured along two ideas: to provide an overview of many widely used deep learning models, spanning visual (CNNs), sequential (RNNs) and graph structured data, associated tasks (image segmentation, super-resolution, sequence to sequence mappings and many others) and different training methods, and to highlight techniques that use deep learning (DL) with less data (e.g., self-supervision or semi-supervised learning) and better interpret these complex models. In my own view, DL is a useful technique when some kind of regularities can be exploited in the data (including bias, if any), and when the training size is large. This generally holds in the case of pattern recognition tasks like NLP or image processing, but I remain sceptical about the application of this type of model in tasks for which more traditional supervised methods (e.g., logistic regression) would be just as suitable. There are also a lot of links to tutorials and software that are not necessarily limited to DL.

Reinforcement Learning in Economics and Finance arXiv:2003.10014v1

I have no competency in finance, and I barely have notions in economics, but I always enjoy reading one of Arthur Charpentier’s reviews, for they are always very picky and highly anchored in historical considerations. The last one I read was about Machine Learning, and it was really enlightening. I am in the middle of the present review, and I already learned a lot on the parallel between traditional ML and RL with respect to loss (regret) minimization, bandits, dynamic programming, or even the Thompson sampling strategy.
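
Since Thompson sampling was new to me, here is a tiny Bernoulli bandit to fix ideas. It is a textbook Beta-Bernoulli sketch of my own, not an example from the review, and the success rates of the three arms are invented.

```python
import random

random.seed(5)
true_rates = [0.2, 0.5, 0.8]  # hypothetical success probability of each arm
wins = [0] * len(true_rates)
losses = [0] * len(true_rates)
pulls = [0] * len(true_rates)

for _ in range(2000):
    # Draw one plausible rate per arm from its Beta posterior (uniform prior).
    draws = [random.betavariate(wins[a] + 1, losses[a] + 1)
             for a in range(len(true_rates))]
    arm = draws.index(max(draws))  # play the arm that looks best under this draw
    reward = random.random() < true_rates[arm]
    wins[arm] += reward
    losses[arm] += not reward
    pulls[arm] += 1

print(pulls)
```

Early on the posteriors are wide, so every arm gets a chance; as evidence accumulates, the draws concentrate and the best arm ends up being played most of the time, which is how regret is kept low.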

ArXiving on June 2021

Here are some papers that I read this week, in the CS and Stat categories, plus random stuff that was mentioned on IRC or Hacker News.

Scalable Econometrics on Big Data – The Logistic Regression on Spark (https://arxiv.org/abs/2106.10341)

I managed to try some Spark and H2O examples a long time ago, but this was really for toy projects. I’m usually happy with R or Stata for statistical modeling. However, this article brings interesting insights into large scale data processing using Spark, together with an empirical study of Python+Spark ML versus R in the case of logistic regression. Apparently, the new Spark ML library exposes two different versions of the logistic regression, one direct method (ml.classification.LogisticRegression) and the generalized linear model approach (ml.regression.GeneralizedLinearRegression).

maars: Tidy Inference under the ‘Models as Approximations’ Framework in R (https://arxiv.org/abs/2106.11188)

Another article dealing with statistical modeling, this time under the OLS framework. Precisely, this article is about a new R package, maars, which offers resampling-based variance estimators and pretty printing of summary results, along with tidy syntax. I’ve never been a great fan of the tidy family of foo’s, but your mileage may vary. The visual diagnostic addons (“focal slope” and “nonlinearity detection”) are probably the best aspects of this package.

Choosing the Estimand When Matching or Weighting in Observational Studies (https://arxiv.org/abs/2106.10577)

With a background in experimental design and behavioral measurement, it took me a long time to get familiar with econometric and epidemiological vocabulary when I was working as a biostatistician over the past 15 years. That being said, this article offers an overview of the different types of estimands available, and how to decide which one to use depending on the question at hand. For instance, “if the question concerns a policy of withholding treatment from those who would currently receive it, this suggests the ATT is of interest. If the question concerns a policy of expanding treatment to those who would not currently receive it, the ATU may be of interest. If the question concerns a policy that would require all patients to be treated or to be untreated, the ATE may be of interest. If the question concerns a policy of prescribing treatment for patients under uncertainty, the ATO may be of interest.”
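
When weighting is used, each of these estimands corresponds to a different weight built from the propensity score e (the probability of being treated given the covariates). The helper below, `balancing_weight`, is a hypothetical function of mine that encodes the usual textbook formulas; it is not code from the article.

```python
def balancing_weight(treated, e, estimand="ATE"):
    """Weight for one unit given its treatment status and propensity score e."""
    if not 0.0 < e < 1.0:
        raise ValueError("propensity score must lie strictly between 0 and 1")
    if estimand == "ATE":   # target: the whole population
        return 1 / e if treated else 1 / (1 - e)
    if estimand == "ATT":   # target: the treated
        return 1.0 if treated else e / (1 - e)
    if estimand == "ATU":   # target: the untreated
        return (1 - e) / e if treated else 1.0
    if estimand == "ATO":   # target: the overlap population
        return 1 - e if treated else e
    raise ValueError(f"unknown estimand: {estimand}")


for name in ("ATE", "ATT", "ATU", "ATO"):
    print(name, balancing_weight(True, 0.25, name), balancing_weight(False, 0.25, name))
```

The ATO weights are the only ones that stay bounded when the propensity score gets close to 0 or 1, which is one reason this estimand is attractive when the two groups overlap poorly.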

A Bibliography of Combinators (https://arxiv.org/abs/2106.11729)

Stephen Wolfram published lengthy posts on his “personal blog”, and here is a bibliography memento on combinators, lambda calculus & Co. available on arXiv. See also Combinators: A Centennial View.

Robust Regression Revisited: Acceleration and Improved Estimation Rates (https://arxiv.org/abs/2106.11938)

This article presents the theoretical implications of contaminated data points in computing maximum-likelihood estimates for parameters of a regression model. So-called highly robust regression is then used as it proved to be a “flexible model of corruption and can be used to study both truly adversarial data poisoning attacks (where e.g. part of the dataset is sourced from malicious respondents), as well as model misspecification, where the generative D_{X_y} does not exactly satisfy our distributional assumptions, but is close in total variation to a distribution that does.” The article is rather technical, but I learned, for instance, that there exists a nearly-linear time algorithm for robust linear regression with residual variance bounded by \sigma^2.

Inference in High-dimensional Linear Regression (https://arxiv.org/abs/2106.12001)

In the n \ll p setting, parameter estimation in a regression model becomes complicated. There are tons of approaches to perform variable selection or to reduce the dimensionality of the dataset, and in this case the authors discuss a marginalization approach to this problem: each regression parameter is estimated while treating all other parameters as nuisance factors, much like when residualizing predictors of little interest (e.g., a set of common covariates). The interesting part of this approach is that the regression coefficients remain on their original scale, unlike regression coefficients estimated using Lasso or Ridge penalties.

ArXiving on August 2021

Here are some papers that I read this week, in the CS and Stat category, plus random stuff that was mentioned on IRC or Hacker News.

Distributions.jl: Definition and Modeling of Probability Distributions in the JuliaStats Ecosystem (https://arxiv.org/abs/1907.08611)

I reinstalled Julia (v1.6.1) a few months ago. Yet I haven’t had enough time to learn the language again. I saw there’s a native plotting backend, which is really great, and that there’s some sort of JIT with compile-on-first-load capabilities. It is supposed to improve later execution of scripts or programs, but really on first launch it is sometimes a pain. However, the language looks much better now, although I keep thinking of the old Julia 0.4 that I used several years ago. This article describes the Distributions.jl package, which is part of the JuliaStats ecosystem. Note that this package allows one to define, and not only to manipulate, pre-defined statistical distributions. Specifically, “a Distribution is an abstract type that takes two parameters: a VariateForm type which describes the dimensionality, and ValueSupport type which describes the discreteness or continuity of the support.” I really need to give Julia another try, especially now that I can pretend to be a stat dilettante. And I would like to see how Julia’s Distributions package compares to Boost’s own routines.

Principled, practical, flexible, fast: a new approach to phylogenetic factor analysis (https://arxiv.org/abs/2107.01246)

Yet another attempt at modeling uncertainty using a latent variable approach. This is a dense article about phylogenetic analysis of high-dimensional phenotypes using BEAST. Usually, some form of data reduction is used to decrease the space and time complexity of model fitting on such datasets. As in other statistical domains, the two main approaches consist in using either PCA-based or FA-based techniques. In the latter case, we talk about phylogenetic factor analysis (PFA), and the authors discuss two alternative approaches to the existing methods, which scale quadratically with the number of taxa and remain intractable for large trees. As I said, it is dense, and I am still reading it, especially since I’m looking into the BEAST documentation at the same time. I only used it quickly two years ago before settling on the ETE toolkit.

Preserving Diversity when Partitioning: A Geometric Approach (https://arxiv.org/abs/2107.04674)

This article deals with the idea of partitioning a sample of individuals having different characteristics (which define their diversity as a whole) into subgroups exhibiting a similar diversity index. The diversity index considered here was introduced by Simpson, and it corresponds to the inverse of the probability that two individuals are of the same type when taken uniformly at random, with replacement, from the community of interest. The paper is rather challenging, since it covers both the mathematical aspects of optimal partitioning and algorithm-related approaches. The authors conclude that a perfect partition exists only when the number of parts k divides the gcd of b. They also design a polynomial time algorithm for the case of r = 2 types.
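To make the index concrete, here is a minimal sketch of the inverse Simpson index as described above (sampling with replacement). The function names and the toy community are mine, not the paper’s, and this only computes the index; it does not attempt the partitioning problem itself.

```python
from collections import Counter


def same_type_probability(types):
    """Probability that two individuals drawn uniformly at random,
    with replacement, are of the same type."""
    counts = Counter(types)
    n = sum(counts.values())
    return sum((c / n) ** 2 for c in counts.values())


def inverse_simpson(types):
    """Inverse Simpson diversity index (larger means more diverse)."""
    return 1.0 / same_type_probability(types)


# Hypothetical community: two types in equal proportion.
community = ["a", "a", "b", "b"]
print(inverse_simpson(community))  # 2.0, i.e. two equally common types
```

With this toy community, splitting it into the two parts ["a", "b"] and ["a", "b"] gives each part the same index as the whole, which is the kind of diversity-preserving partition the paper is after.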

Item Response Thresholds Models (https://arxiv.org/abs/2106.12784)

I’m no longer into psychometrics (the last workshop I organized was in 2016, and my last blog post dates from around 2013), but this article resonated with me since I worked on IRT models for polytomous items for quite a long time. Problems such as items with threshold reversals, response saturation, and the like were common pitfalls at that time. Go take a look if you are versed in modern psychometrics.

Comparison of Canonical Correlation and Partial Least Squares analyses of simulated and empirical data (https://arxiv.org/abs/2107.06867)

I haven’t heard about multi-block methods for a long time, and this paper discusses the results of extensive simulations using incremental subsampling to assess the sensitivity and reliability of those two techniques. From the abstract, the take-away message reads as follows: First, for data with reasonable within and between block structure, CCA and PLS give comparable results, with equivalent sensitivity and false positive rate. Second, if there are high correlations within either block, this can compromise the reliability of CCA results. This known issue of CCA can be remedied with PCA before cross-block calculation. However, this assumes that the PCA structure is stable for a given sample. Third, statistical significance by null hypothesis testing does not guarantee that the results are reproducible, even with large sample sizes. This final outcome suggests that researchers should routinely assess both statistical significance and reproducibility for their data. I’m not surprised by these conclusions, since the within- and cross-block correlation structures play an important role in both models, and often help in deciding which method to choose. Further, my own past investigations were quite in agreement with the idea that the stability of the results varies depending on these correlation structures and the sample size itself.

Calibrating the scan statistic with size-dependent critical values: heuristics, methodology and computation (https://arxiv.org/abs/2107.08296)

JAGS, NIMBLE, Stan: a detailed comparison among Bayesian MCMC software (https://arxiv.org/abs/2107.09357)

It’s been a long time since I last used JAGS, and I didn’t install it on my last Macbook. I had never heard of NIMBLE before reading this review. Apparently it supports Bayesian models written in BUGS (or it can compile models written using R syntax to C++). The Stan documentation is among the best free textbooks available nowadays. Apparently, the compile time for Stan and NIMBLE is of the same magnitude (2-3 minutes), but NIMBLE appears faster at runtime compared to the two other candidates. Finally, NIMBLE seems to perform quite well when it comes to estimating parameters for mixture models.

On matching-adjusted indirect comparison and calibration estimation (https://arxiv.org/abs/2107.11687)

The idea is that matching-adjusted indirect comparison, which is commonly used in RCTs to compare pools of patients that are not directly comparable, shares some resemblance with calibration estimation as used in survey sampling or epidemiological studies. This technique allows one to use weights with minimal variance to balance covariates between the sample and target populations. In the case of RCTs, if a common treatment is available between the different samples, then it can be used as an anchor to analyze within-study differences, although this is not the main approach discussed in this article. The author provides extensive simulation results (R code available), and concludes that among the numerous approaches that can be used to construct appropriate weights, empirical likelihood weights make it possible to derive likelihood-based confidence intervals for the treatment effect that are often more reliable than asymptotic ones.

ArXiving on October 2021

Here are some papers that I read this week, in the CS and Stat category, plus random stuff that was mentioned on IRC or Hacker News.

Accuracy, precision, and agreement statistical tests for Bland-Altman method (https://arxiv.org/abs/2108.12937)

I used Bland-Altman plots a lot when I was working on diagnostic tests and psychometric measurement. Interestingly, few medics really understood how to interpret this kind of graphical display, or how it is used in method comparison studies. I suspect they also dislike this approach since it does not provide formal tests of hypothesis. Here we go, with three tests for accuracy, precision and agreement.
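For readers who never met the method, here is a minimal sketch of the classical Bland-Altman summary (bias and 95% limits of agreement), not the three tests proposed in the paper. The function name and the measurements are made up for illustration, and the 1.96 multiplier assumes approximately normal differences.

```python
from statistics import mean, stdev


def bland_altman(method_a, method_b):
    """Return (bias, lower limit, upper limit) for paired measurements.

    The bias is the mean of the paired differences; the limits of
    agreement are bias +/- 1.96 standard deviations of the differences.
    """
    diffs = [a - b for a, b in zip(method_a, method_b)]
    bias = mean(diffs)
    sd = stdev(diffs)  # sample standard deviation (n - 1 denominator)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


# Hypothetical paired measurements from two methods.
a = [10.0, 12.0, 14.0, 16.0]
b = [9.0, 12.0, 13.0, 16.0]
bias, lower, upper = bland_altman(a, b)
print(bias, lower, upper)
```

The plot itself is simply the differences against the pairwise means, with horizontal lines drawn at these three values.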

Generalized nearly isotonic regression (https://arxiv.org/abs/2108.13010)

Lagged couplings diagnose Markov chain Monte Carlo phylogenetic inference (https://arxiv.org/abs/2108.13328)

A Q-Q plot aids interpretation of the False Discovery Rate (https://arxiv.org/abs/2109.02118)

A simple yet efficient graphical method to assess significant p-values computed using the Benjamini-Hochberg approach.
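As a reminder of what is being plotted, here is a minimal sketch of the Benjamini-Hochberg step-up adjustment itself. It is a plain-Python illustration with a function name of my own choosing, and it does not reproduce the Q-Q display from the paper.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted in increasing order.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values from four tests.
pvals = [0.01, 0.04, 0.03, 0.5]
adjusted = benjamini_hochberg(pvals)
rejected = [q <= 0.10 for q in adjusted]  # FDR controlled at 10%
print(adjusted, rejected)
```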

Dynamic Network regression (https://arxiv.org/abs/2109.02981)

This article deals with Fréchet regression on network data, using the corresponding space of graph Laplacians (with no restriction on their rank). It has two applications (NY taxi trips after COVID-19 and dynamic networks in aging brains) which I found quite interesting. Such an approach might prove superior to classical data mining on spatio-temporal data structures since we can capture both the structure and weight information of a network given relevant covariates, and assess the quality of adjustment for projected trends.

High Performance Implementation of the Hierarchical Likelihood for Generalized Linear Mixed Models. An Application to estimate the potassium reference range in massive Electronic Health Records datasets (https://arxiv.org/abs/1910.08179)

Not of much interest to me, except that it deals with potassium reference levels, a subject I have been familiar with (for personal reasons, unfortunately) for several years now. This remains, however, an interesting simulation study for mixed-effects models and big data aficionados.

Distcomp: Comparing distributions (https://arxiv.org/abs/2110.02327)

This paper introduces a Stata command for comparing empirical CDFs. Contrary to the Kolmogorov-Smirnov approach, distcomp tests the equality of the distribution functions point by point, using appropriate FWER control, and displays the ranges of values in which the difference between the distributions is statistically significant.

ArXiving on January 2022

Here are a few papers that I read over the past two months, in the CS and Stat category as usual.

At the time of this writing you can replace the x in arxiv URL to get a pretty HTML rendering of the article. See below for the second article discussed above.

Type Stability in Julia: Avoiding Performance Pathologies in JIT Compilation (https://arxiv.org/abs/2109.01950)

I still haven’t really gotten into the Julia language. I follow the news here and there and I’m glad to know that there is now a native graphical backend, not just web output. The last time I tested Julia (v1.6), I found the runtime of a first script to be quite substantial, although I understand why that is the case. The language server is slow too, because of the initial indexing of the symbols contained in all the installed packages. Nevertheless, I think Julia has a future and I hope there will soon be a consolidated set of tools for statistical modeling, much like Frank Harrell’s rms package. As I said in a previous post, for the moment I want to investigate Racket and Lisp-like languages, or maybe Haskell, to see if they could provide a good fit for my own workflow (even if I rarely perform statistical analyses nowadays). Julia remains, however, an interesting approach to statistical computing nowadays.

An Introduction to Quantum Computing for Statisticians (https://arxiv.org/abs/2112.06587)

Just out of curiosity, and because I was reading Introduction to Classical and Quantum Computing (PDF) by Thomas G. Wong, I fetched this paper on quantum computing for statisticians. Well, I haven’t actually read it, but it looks interesting as far as I can tell.

Aitchison’s Compositional Data Analysis 40 Years On: A Reappraisal (https://arxiv.org/abs/2201.05197)

It’s been a long time since I last read a paper written by Michael Greenacre. The last time was probably when I was working on multi-block methods back in 2016. This paper discusses the original approach to compositional data analysis, which is usually associated with a lot of zeros in the dataset, and suggests that, contrary to the classical use of isometric logratios, soft constraints may yield a correct and almost identical interpretation of the data in most cases. I have only skimmed the article but I will probably come back to it if I need to go deeper into the subject and the relationship between this type of approach and the techniques based on correspondence analysis.

How ISO C became unusable for operating systems development (https://arxiv.org/abs/2201.07845)

The take-away message is that the primary reason why ISO C is poorly designed for OS development relates to a “design approach in the ISO standard that has given priority to certain kinds of optimization over both correctness and the ‘high-level assembler’ intentions of C, even while the latter remain enshrined in the rationale.” Among other things, pointers (casting from and aliasing to) remain problematic in several edge cases.

I still use C because it is one of the first languages I learned when I was in grad school (after Turbo Pascal, you name it). I like it because of its syntax and its simplicity, but I usually only write toy programs so I don’t have to bother with all its “pitfalls” regarding low-level and hard-core programming. I don’t use C++, but I’m rather interested in learning Rust because I like the tooling as well as rust-analyzer. Yet, I know it’s a hard path.

ArXiving on July 2022

Here are a few papers that I read over the past months, in the CS and Stat category as usual.

Marginal Effects for Non-Linear Prediction Functions (https://arxiv.org/abs/2201.08837)

For everything related to marginal effects, I usually trust Stata’s margins command. By the way, there is a dedicated book on this very useful command, written by Michael N. Mitchell. The present article deals with estimating marginal effects in the case of non-linear models, which usually involves derivatives, as discussed in an old post. Here the authors argue in favor of forward differences, hence the approach coined “forward marginal effects.”
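The idea of a forward difference is easy to sketch: instead of a derivative at a point, we look at the change in prediction when one feature is moved by a chosen step. The logistic prediction function, its coefficients, and the helper name below are hypothetical stand-ins for a fitted model, not anything taken from the paper.

```python
import math


def predict(x1, x2):
    """Hypothetical non-linear prediction function (logistic link)."""
    return 1.0 / (1.0 + math.exp(-(-1.0 + 0.8 * x1 + 0.5 * x2)))


def forward_marginal_effect(f, point, feature, step):
    """Change in prediction when `feature` is increased by `step`,
    all other features being held at their observed values."""
    shifted = dict(point)
    shifted[feature] += step
    return f(**shifted) - f(**point)


observation = {"x1": 0.5, "x2": 1.0}
effect = forward_marginal_effect(predict, observation, "x1", step=1.0)
print(effect)  # change in predicted probability for a one-unit increase in x1
```

Unlike a derivative, the result depends on the step size and is read directly on the scale of the prediction, which is the point made in favor of this approach.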

Better to be in agreement than in bad company: a critical analysis of many kappa-like tests assessing one-million 2x2 contingency tables (https://arxiv.org/abs/2203.09628)

I once gave a brief overview of recommended techniques for the cross tabulation of two categorical variables. This article goes a lot deeper by discussing agreement measures in 2x2 tables, including the cases of extreme tables where there’s a high level of agreement or a large imbalance between cells. The authors suggest that Holley and Guilford’s G and Gwet’s AC1 are the best candidates, among many other agreement measures (Cohen’s kappa, Pearson’s r, Yule’s Q and Y, Scott’s \pi, Shankar and Bangdiwala’s B, Dice’s F1 and McNemar’s \chi^2).
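To see why extreme tables matter, here is a minimal sketch computing Cohen’s kappa, Holley and Guilford’s G and Gwet’s AC1 from the four cells of a 2x2 table, using their textbook definitions for two raters and two categories. The function name and the two example tables are made up, and the code assumes the table is not degenerate (chance agreement strictly below 1).

```python
def agreement_measures(a, b, c, d):
    """Agreement indices for a 2x2 table [[a, b], [c, d]], where a and d
    count the cases on which both raters agree."""
    n = a + b + c + d
    po = (a + d) / n               # observed agreement
    p_row = (a + b) / n            # rater 1, first category
    p_col = (a + c) / n            # rater 2, first category
    # Cohen's kappa: chance agreement from the product of the margins.
    pe_kappa = p_row * p_col + (1 - p_row) * (1 - p_col)
    kappa = (po - pe_kappa) / (1 - pe_kappa)
    # Holley and Guilford's G: chance agreement fixed at 1/2.
    g = 2 * po - 1
    # Gwet's AC1: chance agreement from the average marginal proportion.
    q = (p_row + p_col) / 2
    pe_ac1 = 2 * q * (1 - q)
    ac1 = (po - pe_ac1) / (1 - pe_ac1)
    return {"kappa": kappa, "G": g, "AC1": ac1}


print(agreement_measures(40, 10, 10, 40))  # balanced table
print(agreement_measures(90, 5, 5, 0))     # extreme, imbalanced table
```

On the balanced table the three indices coincide, whereas on the imbalanced one kappa drops below zero despite 90% observed agreement, while G and AC1 stay high.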

Selective inference for k-means clustering (https://arxiv.org/abs/2203.15267)

Unsupervised learning does not lead naturally to statistical inference, except maybe in the case of assessing clustering stability or choosing the optimal number of clusters. I discussed resampling-based approaches in another post. However, with added constraints like in Mclust, we can work out the likelihood and optimize the number of clusters, for instance. Another inferential issue is that of testing whether cluster members differ on average on some attributes. Usually, we don’t test attributes that were used to build the partitioning: it is often meaningless since we are usually trying to maximize the separation between clusters based on those very attributes, and in any case this would lead to an inflated Type I error rate. Most of the time, we establish and eventually test clustering profiles based on held-out variables. In this paper, the authors propose a finite-sample p-value that controls the selective Type I error for a test of the difference in means between a pair of clusters obtained using k-means clustering.

A tutorial for using propensity score weighting for moderation analysis: an application to smoking disparities among LGB adults (https://arxiv.org/abs/2204.03345)

This article is in fact mostly a tutorial on moderation analysis using PS weighting, and illustrations in Stata (inline code) and R (external package) are provided.

Latent Trait Item Response Models for Continuous Responses (https://arxiv.org/abs/2204.03841)

Usually, latent trait models apply to categorical response variables (dichotomous or polytomous items, or rating scales). The objective is to build a continuous score reflecting the position of an individual on the latent trait being measured. For a more detailed overview, see my answer on Cross Validated. This article deals with restricted continuous responses (e.g. positive only or interval-based responses), and extends response time models.

Uniformly Valid Inference Based on the Lasso in Linear Mixed Models (https://arxiv.org/abs/2204.03887)

The authors use restricted maximum likelihood (REML) estimators to separate the estimation of the regression coefficients and covariance parameters in order to estimate fixed-effects with L1 penalization in Gaussian linear mixed models.

The replication of non-inferiority and equivalence studies (https://arxiv.org/abs/2204.06960)

The authors use the TOST technique to analyze replication equivalence studies. A “success interval” for the relative effect size is used to compare the replicate to the original study effect.
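For reference, here is a minimal sketch of the generic two one-sided tests (TOST) procedure for an effect estimate with a known standard error. It relies on a normal approximation and a symmetric equivalence margin, both assumptions of mine; it is not the replication “success interval” developed in the paper, and the function name is hypothetical.

```python
from statistics import NormalDist


def tost_pvalue(estimate, std_error, margin):
    """TOST p-value for equivalence within (-margin, +margin),
    using a normal approximation for the estimate."""
    z = NormalDist()  # standard normal distribution
    # H0: true effect <= -margin, rejected when the estimate is well above it.
    p_lower = 1.0 - z.cdf((estimate + margin) / std_error)
    # H0: true effect >= +margin, rejected when the estimate is well below it.
    p_upper = z.cdf((estimate - margin) / std_error)
    # Equivalence is declared only if both one-sided tests reject.
    return max(p_lower, p_upper)


print(tost_pvalue(estimate=0.1, std_error=0.2, margin=0.5))
```

Taking the maximum of the two one-sided p-values is what makes the procedure conservative: the estimate has to sit comfortably inside the margin on both sides.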

Flexible Marginal Models for Dependent Data (https://arxiv.org/abs/2204.07188)

An interesting paper to complement a companion textbook I worked on a long time ago.

ArXiving on August 2022

Here are a few papers that I read over the past weeks, in the CS and Stat category as usual.

Inverse Probability Weighting: the Missing Link between Survey Sampling and Evidence Estimation (https://arxiv.org/abs/2204.14121)

Inverse probability weighting is used in survey and epidemiological analysis. I only used this technique once, and it was in a quick comparison with propensity score analysis, where the key idea was in both cases to achieve balance between treated and control subjects in order to reduce confounding. See Chan, K. C. G., et al. Globally efficient non-parametric inference of average treatment effects by empirical balancing calibration weighting (Journal of the Royal Statistical Society: Series B 78(3), 2016) for recent work. In this article, the authors discuss the use of binning and smoothing, or, as an alternative, hierarchical modeling. Their conclusion is that those Bayesian estimators perform better (i.e., they have lower mean squared errors) than simple IPW estimators.
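To fix ideas about the treated versus control setting I mentioned, here is a minimal sketch of a simple (normalized, Hajek-type) IPW estimate of the average treatment effect. It assumes the propensity scores are already known or estimated elsewhere, the data are made up, and it has nothing to do with the binning, smoothing or hierarchical estimators discussed in the paper.

```python
def ipw_ate(outcomes, treated, propensity):
    """Normalized IPW estimate of the average treatment effect.

    outcomes   -- observed outcome for each unit
    treated    -- 1 if the unit was treated, 0 otherwise
    propensity -- probability of treatment for each unit, strictly in (0, 1)
    """
    w_treated = [t / e for t, e in zip(treated, propensity)]
    w_control = [(1 - t) / (1 - e) for t, e in zip(treated, propensity)]
    mean_treated = sum(w * y for w, y in zip(w_treated, outcomes)) / sum(w_treated)
    mean_control = sum(w * y for w, y in zip(w_control, outcomes)) / sum(w_control)
    return mean_treated - mean_control


# Hypothetical data: with a constant propensity of 0.5 the estimate
# reduces to the plain difference in group means.
y = [3.0, 5.0, 1.0, 1.0]
t = [1, 1, 0, 0]
e = [0.5, 0.5, 0.5, 0.5]
print(ipw_ate(y, t, e))
```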

Redefining Populations of Inference for Generalizations from Small Studies (https://arxiv.org/abs/2204.14156)

This article deals with the inference population that small sample studies usually focus on, and how it can be improved to foster valid generalizations (of the population average treatment effect, or PATE). This is based on the results from a cluster randomized trial in educational assessment. The two approaches considered here are coined “quantitatively optimal” and “policy-relevant” approaches. In the former case, the analysis relies on propensity scoring with varying cutoffs or the distributions of covariates. In the latter case, the idea is to identify subgroups that may benefit from the intervention. To improve precision, both approaches can be undertaken, while for bias reduction only the quantitatively optimal approaches seem relevant.

Practical considerations for specifying a super learner (https://arxiv.org/abs/2204.06139)

Along with classical parametric regression models or ensemble methods in machine learning, the super learner algorithm makes it possible to learn a prediction function without needing to settle on a fixed modeling approach: as an entirely pre-specified and adaptive strategy, it allows the modeler to consider several predictive models at once. Here’s the big picture:

For the practitioner, the core decisions are to (1) specify the performance metric for the discrete super learner, (2) derive the effective sample size, (3) define the V-fold cross-validation scheme, (4) form a library of candidate learners, and (5) use the discrete super learner to select the candidate with the best cross-validated predictive performance.
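The steps above can be sketched with a toy discrete super learner. This is an illustration under my own assumptions: mean squared error as the metric, a library of two deliberately simple learners (a constant mean and a one-predictor least-squares line), and no effective sample size calculation. All names are hypothetical, and real applications would rely on a dedicated package.

```python
import random
from statistics import mean


def fit_mean(xs, ys):
    """Candidate 1: always predict the training mean."""
    m = mean(ys)
    return lambda x: m


def fit_line(xs, ys):
    """Candidate 2: simple least-squares regression line."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)


def cv_risk(fit, xs, ys, folds):
    """Cross-validated mean squared error of one candidate learner."""
    errors = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in range(len(xs)) if i not in held_out]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        errors.extend((ys[i] - model(xs[i])) ** 2 for i in fold)
    return mean(errors)


def discrete_super_learner(library, xs, ys, v=5, seed=0):
    """Select the candidate with the lowest V-fold cross-validated risk."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    folds = [indices[k::v] for k in range(v)]  # V roughly equal folds
    risks = {name: cv_risk(fit, xs, ys, folds) for name, fit in library.items()}
    return min(risks, key=risks.get), risks


# Hypothetical data following an exact linear trend.
xs = [float(i) for i in range(20)]
ys = [2.0 * x + 1.0 for x in xs]
best, risks = discrete_super_learner({"mean": fit_mean, "line": fit_line}, xs, ys)
print(best, risks)
```

The same folds are reused for every candidate, so the comparison of cross-validated risks is made on identical held-out observations.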

PaC-trees: Supporting Parallel and Compressed Purely-Functional Collections (https://arxiv.org/abs/2204.06077)

Correction of overfitting bias in regression models (https://arxiv.org/abs/2204.05827)

This article describes solutions to correct for the bias arising from overfitting datasets where p/n < 1. Besides regularization techniques, the asymptotic characteristic function of the ML estimators can be used to estimate the overfitting bias using only a small set of non-linear equations. The maths are dense, and I will probably need to reread the hard part.

ArXiving on October 2022

Approximating Persistent Homology for Large Datasets (https://arxiv.org/abs/2204.09155)

At first sight, it sounded really interesting, especially since the authors used bootstrap, and perhaps with bioinformatics applications in mind:

Then, I realized that it’s a hard paper (for me at least). I’ll keep it under my hat for later, when I’ll have clearer ideas.

A new preferential model with homophily for recommender systems (https://arxiv.org/abs/2204.11819)

I’ve been teaching the basics of recommender systems in an engineering school for two years. I liked it, especially when discussing matrix factorization algorithms and Co. Applications speak for themselves (everybody knows Amazon or Spotify, right?), and I missed Seth Brown’s beer recommendation engine. This paper is more about social networks and preferential attachment (rich-get-richer and homophily). The authors show that the number of vertices in their K-groups preferential attachment model follows a power law, and they discuss two applications on real web data.

Locality Sensitive Hashing for Structured Data: A Survey (https://arxiv.org/abs/2204.11209)

Locality sensitive hashing (LSH) is used to hash similar items into the same buckets with high probability, which means it can be used as a data reduction or clustering method, bypassing any supervised learning process. However, this usually does not apply to structured items, like trees, graphs or sequences. The authors survey relevant techniques for such hierarchical LSH algorithms.

Marginal log-linear models and mediation analysis (https://arxiv.org/abs/2204.08538)

Yet another paper on log-linear analysis, this time in the context of mediation analysis (a hot topic in psychiatric and psychological studies in the last ten years when people realized that causal analysis may or should not be that easy).

Simultaneous confidence intervals for the interpretation of primary and secondary effects in factorial designs without a pre-test on interaction (https://arxiv.org/abs/2204.08336)

Here, the author discusses the use of parametric or non-parametric simultaneous confidence intervals between the levels of a primary factor for both separate and joint secondary factor levels in an unbalanced factorial design with 2 or 3 factors (where one of the factors has more than 2 levels).

A Note on Simulation-Based Inference by Matching Random Features (https://arxiv.org/abs/2111.09220)

ArXiving on January 2023

Polynomial spline regression: Theory and Application (https://arxiv.org/abs/2212.14777)

The Box-Cox transformation is one of the many tools one can use to ease traditional linear modeling of data. Semi-parametric regression, which I never really used for real work, is also an option. Flexible regression methods, as emphasized in Harrell’s RMS textbook¹ or the use of generalized additive models² are another. I’ve been doing some research on orthogonal polynomials, restricted cubic splines, and other spline fitting procedures, and then I found this paper. It introduces different frameworks (Truncated Power, B-spline, and P-Spline), and a quick application on a time-series dataset.

Parametrization Cookbook: A set of Bijective Parametrizations for using Machine Learning methods in Statistical Inference (https://arxiv.org/abs/2301.08297)

Looks interesting at first sight, but I haven’t had time to read it. Maybe you will be interested, though.

An Introduction to Probabilistic Programming (https://arxiv.org/abs/1809.10756)

This is a 300-page tutorial on probabilistic programming written by one of the authors of Anglican, which seems like a stalled project at this time. Still, it is a very valuable account of probabilistic programming at large and of a peculiar implementation targeting the JVM. See also these slides.

A Practical Introduction to Regression Discontinuity Designs: Extensions (https://arxiv.org/abs/2301.08958)

This is a rather long monograph on RDDs. It covers local randomization (which assumes that the assignment of units above or below the cutoff was as if random inside a given window and that in this window the potential outcomes are unrelated to the score), fuzzy RDD (which accounts for non-compliance), and discrete and multidimensional designs. There’s a preliminary monograph from the same authors that only deals with the classical RDD, as well as a practical for medical applications.

Improving Software Engineering in Biostatistics: Challenges and Opportunities (https://arxiv.org/abs/2301.11791)

A few months ago, I read on a blog that statisticians were very bad at programming because they lack skills in software engineering. Let me suggest two additional remarks: applied statisticians must provide results, quickly whenever possible, and when they can use agile programming they do, most of the time. This does not mean they will deliver a product of industrial grade, but if they can reproduce their own results six months later, this already is good news. I guess many software engineers never looked at statlib or established algorithms; you can use TDD, CI, regression tests and the like; if your algorithm is far from the standards, it won’t do any good.

A statistical framework for planning and analysing test-retest studies for repeatability of quantitative biomarker measurements (https://arxiv.org/abs/2301.11690)

Sometimes, I feel like I’m reading the same story every year: “biomarkers”, measurement error, practical significant difference, etc. Ioannidis warned us a long time ago.

From Classification Accuracy to Proper Scoring Rules: Elicitability of Probabilistic Top List Predictions (https://arxiv.org/abs/2301.11797)

The author hereby suggests “probabilistic top lists as a novel type of prediction in classification, which bridges the gap between single-class predictions and predictive distributions.” His conclusion is that the Brier score is a good candidate. The problem with discrimination indices is that they perform poorly when the observed prevalence of the outcome is very low (which happens pretty often in medical applications) despite highly significant variables.

Getting to the Point. Index Sets and Parallelism-Preserving Autodiff for Pointful Array Programming (https://arxiv.org/abs/2104.05372)

I learned about Dex on Darren Wilkinson’s blog. Later on, I learned about Futhark, but see A comparison of Futhark and Dex. I never used any one of these DSLs, but I warmly recommend reading what Darren Wilkinson has to say about those frameworks.

ArXiving on September 2023

Unidimensionality in Rasch Models: Efficient Item Selection and Hierarchical Clustering Methods Based on Marginal Estimates (https://arxiv.org/abs/2309.00553)

Since I am no longer involved in psychometrics or this kind of stuff (even on Cross Validated), I rarely read this kind of paper nowadays. Selecting misfitting items on a scale has a long history, especially for Rasch modellers. This paper presents a sort of clustering method that makes it possible to flag items, based on the variance of a mixing distribution, that do not belong with a set of items sharing a common trait.

Optimal Scaling transformations to model non-linear relations in GLMs with ordered and unordered predictors (https://arxiv.org/abs/2309.00419)

Still a memory of my previous job, where I used optimal scaling a lot, but in the context of projection methods in multivariate exploratory data analysis. Here, Meulman and collaborators consider optimal scaling as an alternative framework to GAM to allow non-linearity in discrete predictors. In particular, this alleviates the need for dummy-coded variable levels, via quantization with monotonicity constraints, which facilitates the interpretation of the resulting output. There will probably be an R package available at some point. See also ROS Regression: Integrating Regularization with Optimal Scaling Regression.

An introduction to graph theory (https://arxiv.org/abs/2308.04512)

A complete course on graph theory, including network flows, for graduate students. It is rather extensive (422 pp.) and there are a lot of illustrations.

Permutation testing in high-dimensional linear models: an empirical investigation (https://arxiv.org/abs/2001.01466)

This is an extension of the permutation testing framework to the case where the number of predictors exceeds the sample size. In classical settings (p < n), the strategy amounts to computing the test statistics under random permutation of the residuals.³ When n \ll p, however, regularization methods like the elastic net approach must be used. The authors recall that for minimizing prediction error, ridge regression is often preferable to Lasso, principal components regression, variable subset selection and partial least squares. Moreover, ridge regression is close to the Freedman-Lane approach, which is based on semi-partial correlations. The authors finally suggest using double residualization, inspired by the Kennedy method, which residualizes both Y and X and proceeds to permute the Y-residuals,⁴ but in this paper the authors replace the least squares regression by ridge regression.

Tutorial: a priori estimation of sample size, effect size, and statistical power for cluster analysis, latent class analysis, and multivariate mixture models (https://arxiv.org/abs/2309.00866)

This is a short review of power analysis for k-means, Ward agglomerative hierarchical clustering, c-means fuzzy clustering, latent class analysis, latent profile analysis, and Gaussian mixture modelling. Results based on simulated datasets are summarized in Table 1 of the paper, reproduced below.

Supervised dimensionality reduction for multiple imputation by chained equations (https://arxiv.org/abs/2309.01608)

In this paper, the authors discuss the use of principal component, principal covariates, and partial least squares regression, instead of unsupervised PCA, with MICE. Results show that supervised approaches perform better, and that supervised principal component regression has smaller bias and better confidence interval coverage for a wider range of retained components, independent of the number of latent variables.

ArXiving on October 2023

Fisher’s Pioneering work on Discriminant Analysis and its Impact on AI (https://arxiv.org/abs/2309.04774v1)

It’s a little bit funny to see the name of Ronald Fisher associated with AI in the title of a scientific paper, especially since AI now means LLMs and other brilliant foundations for the future millennium. Anyway, the iris dataset is used to illustrate the use of a discriminant function (as the ratio of squared mean and variance) to demonstrate whether “when we use the linear compound of the four measurements most appropriate for discriminating three such species, the mean value for I. versicolor takes an intermediate value and, if so, whether it differs twice as much from I. setosa as from I. virginica, as might be expected, if the effect of genes are simply additive, in a hybrid between a diploid and a tetraploid species.” ⁵ Hence, a genetic hypothesis, which can be exploited to find the optimal linear combination of population means, \mu_1 - 3\mu_2 + 2\mu_3 = 0 (respectively, setosa, versicolor and virginica), was at the heart of Fisher’s seminal paper, like many of his other contributions. Sadly, the iris dataset is mostly used in the context of unsupervised classification (k-means, for instance), as if people forgot that it relies at its heart on a well-defined hypothesis (much like people forgot or still ignore that Student (Gosset’s nom de plume) developed his test while working in a brewery). It seems the point of the authors is the notion of a discriminating rule, which, when applied to AI problems, amounts to a supervised approach that aims to find a “rule to separate distinct populations, and then to use that rule (or a different classification scheme) to allocate a new observation to the given populations.”
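The linear constraint on the means follows directly from the “twice as much” statement in the quote. As a short derivation, writing \mu_1, \mu_2 and \mu_3 for the mean of the linear compound in setosa, versicolor and virginica, and assuming versicolor lies between the two other species:

```latex
\begin{align*}
\mu_2 - \mu_1 &= 2\,(\mu_3 - \mu_2)
  && \text{versicolor is twice as far from setosa as from virginica} \\
3\mu_2 &= \mu_1 + 2\mu_3 \\
\mu_1 - 3\mu_2 + 2\mu_3 &= 0
\end{align*}
```

In other words, under the additive hypothesis the versicolor mean sits two thirds of the way from setosa to virginica.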

A compendium of data sources for data science, machine learning, and artificial intelligence (https://arxiv.org/abs/2309.05682)

This article summarizes a long list of data repositories spanning a dozen or so domains (including satellite data or sports, for instance). I will keep it on hand just in case I need some dataset one day.

Demystifying Statistical Matching Algorithms for Big Data (https://arxiv.org/abs/2309.05859)

Propensity scoring is widely used in the absence of a controlled experiment, e.g. in observational studies, where, as the authors say, matching is used to pair treated and control units with similar values of potential confounders (aka, the propensity score). Usually, matching is done with replacement, that is, a control is paired to several treated units (1-to-n matching, with replacement). There are other estimators (e.g., inverse probability weighting or doubly robust methods), though. Oversampling and replacement strategies may alter the balance between bias and variance. In the case of matching without replacement, greedy approaches are used; in Stata, there’s a command calipmatch which does the job for you, while in R the de facto packages are probably Matching and MatchIt. In this paper, the authors review alternative approaches, namely optimal statistical matching (linear balanced and unbalanced assignment problem, maximum cardinality matching, and minimal cost matching).
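
The difference between greedy and optimal matching without replacement is easy to see on a toy example. The sketch below uses made-up propensity scores and my own function names; the "optimal" version is a brute-force search over all assignments, which is only viable for a handful of units and is not the algorithm reviewed in the paper.

```python
from itertools import permutations

# Made-up propensity scores (assumption, for illustration only)
treated = [0.50, 0.55]
controls = [0.54, 0.40, 0.80]

def greedy_match(treated, controls):
    # Each treated unit, in order, takes the closest control still available.
    available = set(range(len(controls)))
    pairs = []
    for i, t in enumerate(treated):
        j = min(available, key=lambda k: abs(t - controls[k]))
        available.remove(j)
        pairs.append((i, j))
    return pairs

def optimal_match(treated, controls):
    # Exhaustive search for the assignment with minimal total distance.
    best = min(
        permutations(range(len(controls)), len(treated)),
        key=lambda p: sum(abs(t - controls[j]) for t, j in zip(treated, p)),
    )
    return list(enumerate(best))

def total_distance(pairs):
    return sum(abs(treated[i] - controls[j]) for i, j in pairs)

greedy = greedy_match(treated, controls)
optimal = optimal_match(treated, controls)
print(greedy, round(total_distance(greedy), 2))
print(optimal, round(total_distance(optimal), 2))
```

Here the greedy pass lets the first treated unit grab the control that the second one needed most, so the total distance ends up larger than with the optimal assignment.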

Subgroup detection in linear growth curve models with generalized linear mixed model (GLMM) trees (https://arxiv.org/abs/2309.05862)

Here is another paper by Achim Zeileis on regression trees. This time, it deals with “extended” GLMMs for identifying subgroups in growth curve modeling. It’s been a long time since I first used partykit for tree-structured regression and classification models, and this was in a benchmark with classical CART, logic regression, and ensemble methods (random forests and boosting); this didn’t end up in a paper, but it was a fun exercise. Since then, a lot of work has been put into this framework and I’m not really up to date. Unlike other methods for partitioning latent growth curves, this modeling framework sets up a mixed approach: “GLMM trees do not fit a full parametric model in each of the subgroups defined by the terminal nodes of the tree. Instead, fixed-effects parameters are estimated locally, using the observations within a terminal node, and the random-effects parameters are estimated globally, using all observations.”

A Change-Point Approach to Estimating the Proportion of False Null Hypotheses in Multiple Testing (https://arxiv.org/abs/2309.10017)

I have used FDR correction for multiple comparisons here and there, as an alternative to sequential methods like Holm’s approach (which is the default in R’s p.adjust, btw). The authors offer a data-driven alternative to Storey’s classical method (i.e., a family of estimators of the proportion of true null hypotheses, and hence of false ones), which is used when estimating the FDR. In this case, the sorted p-values are approximated using a piecewise linear function with one change point (a change in slope). This is somewhat akin to the knee/elbow method or scree plots for determining the number of principal components or factors to retain in PCA or FA, which is discussed in section C.1.
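
To get a feeling for the change-point idea, here is a rough sketch that fits two separate least-squares lines to the sorted p-values and keeps the split with the smallest residual sum of squares. This is a generic broken-line fit of my own, not the authors' estimator; the p-values are an idealised toy set, and reading the share of ranks before the break as the proportion of false nulls is my assumption.

```python
def line_sse(xs, ys):
    # Residual sum of squares of a simple least-squares line through (xs, ys)
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    intercept = my - slope * mx
    return sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))

def change_point(pvals, min_seg=5):
    # Try every split of the sorted p-values into two segments and
    # return the fraction of ranks that falls before the best split.
    p = sorted(pvals)
    m = len(p)
    ranks = [(i + 1) / m for i in range(m)]
    best_k, best_sse = None, float("inf")
    for k in range(min_seg, m - min_seg + 1):
        sse = line_sse(ranks[:k], p[:k]) + line_sse(ranks[k:], p[k:])
        if sse < best_sse:
            best_k, best_sse = k, sse
    return best_k / m

# Idealised toy data: 60 tiny p-values (false nulls) and 140 evenly
# spread p-values (true nulls), i.e. 30% of false nulls.
false_nulls = [0.001 * i / 60 for i in range(1, 61)]
true_nulls = [(j - 0.5) / 140 for j in range(1, 141)]
estimate = change_point(false_nulls + true_nulls)
print(estimate)
```

With real p-values the break is of course much noisier, which is presumably where the paper's more careful treatment comes in.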

A New Bootstrap Goodness-of-Fit Test for Normal Linear Regression Models (https://arxiv.org/abs/2309.10614)

Yet another article discussing the benefit of using bootstrap to assess model goodness-of-fit. Rather than relying on visual inspection of so-called residual plots (residuals vs. fitted values), especially in multivariate regression, or on either the White or Breusch-Pagan test, the test presented here allows one to test all assumptions of normal linear regression:

For test level \alpha, candidate normal linear model \mathcal{M}, bootstrap iterations B, sample size n, and a null hypothesis that \mathcal{M} is adequately specified:

Unveiling Challenges in Mendelian Randomization for Gene-Environment Interaction (https://arxiv.org/abs/2309.12152)

As a sequel to this 10-year-old post, here is a paper that uses genetic variants as instrumental variables (2-stage predictor substitution and 2-stage residual inclusion) to estimate causal effects in GWAS on lifestyle and environmental factors.

ArXiving on November 2023

The 4+1 Model of Data Science (https://arxiv.org/abs/2311.07631)

I’m still not a big fan of the term data science (as a discipline), and I much prefer statistician or data engineer, but let’s forget about that for a moment. In this article, Rafael Alvarado offers to extend the consensual view according to which data science lies at the intersection between three main domains of expertise: computer science and technology, mathematics and statistics, and domain knowledge, which I found even more confusing. Specifically, he considers the following four main domains: value, design, systems, and analytics, to which a fifth domain may be added, namely practice. In a related article, his review led him to conclude that the three driving factors of all the buzz around the sexiest job of the 21st century are “(1) the long-standing opposition between data analysts and data miners that continues to animate the field, (2) an established definition of the term as the practice of managing and processing scientific data that has been occluded by recent usage, and (3) the phenomenon of data impedance.” Much ado about nothing in my opinion, unless you consider that the volume and nature of data being processed these days imply a major change in terminology. But is it really that important if it’s just a question of terminology? Because the job remains fundamentally the same, after all.

Resampling-free bootstrap inference for quantiles (https://arxiv.org/abs/2202.10992)

A paper I read last year, thanks to Evan Miller, but go read his own blog post which covers pretty much all the math. While the authors originally proposed a Poisson model for bootstrapping differences in quantiles, whereby each observation in the original sample is included in each bootstrap sample according to a Poisson distribution with parameter 1, Evan Miller suggested an approximation which involves a Bessel function.
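
For reference, here is what the brute-force version of that Poisson scheme looks like when it is actually simulated: each observation gets a Poisson(1) count in every replicate and the quantile is recomputed from the weighted sample. The data, the function names and the weighted-quantile convention are my own assumptions; the whole point of the paper, as its title says, is to avoid running such a loop.

```python
import math
import random

def poisson1(rng):
    # Knuth's multiplication method for a Poisson(1) draw
    limit, k, prod = math.exp(-1.0), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def weighted_quantile(values, weights, q):
    # Smallest value whose cumulative weight reaches q times the total weight
    pairs = sorted(zip(values, weights))
    total = sum(weights)
    if total == 0:
        return None
    cum = 0
    for v, w in pairs:
        cum += w
        if cum >= q * total:
            return v
    return pairs[-1][0]

def poisson_bootstrap_quantile(values, q, n_boot=500, seed=1):
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        counts = [poisson1(rng) for _ in values]
        est = weighted_quantile(values, counts, q)
        if est is not None:  # skip the (very unlikely) all-zero replicate
            estimates.append(est)
    return sorted(estimates)

# Made-up sample (assumption, for illustration only)
sample = [2.1, 3.4, 1.9, 5.6, 4.2, 3.3, 2.8, 6.1, 3.9, 4.8,
          2.5, 3.1, 5.0, 4.4, 3.7, 2.2, 4.9, 3.6, 5.3, 4.1]
boot = poisson_bootstrap_quantile(sample, 0.5)
lower = boot[int(0.025 * len(boot))]
upper = boot[int(0.975 * len(boot)) - 1]
print(lower, upper)
```

The difference in quantiles between two samples is handled the same way, by drawing independent counts for each sample and subtracting the two weighted quantiles in every replicate.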

Frame to frame interpolation for high-dimensional data visualisation using the woylier package (https://arxiv.org/abs/2311.08181)

A new paper by Dianne Cook, whose contributions I have always greatly appreciated since I discovered ggobi (the website is still alive apparently). The “grand tour” (and other derivatives) is a method that allows one to view data in more than three dimensions using linear projection, essentially by using rotations of a lower-dimensional projection in high-dimensional space. I’ve only seen this implemented efficiently in ggobi, but there may have been other attempts. Frames are stacked together with small interpolation to create a smooth motion. However, projections are usually rotationally invariant (when using a geodesic interpolation algorithm), which may not be a desirable property when, e.g., we are looking for non-linear association in a high-dimensional dataset. The present package works alongside the tourr package.

Visualizing adverse events in clinical trials using correspondence analysis with R-package visae (https://arxiv.org/abs/2101.03454)

Although I’m no longer involved in medical statistics, I keep an eye open on data visualization and applied methodology in the field. This article drew my attention a few weeks ago because it relies on multivariate projection methods, in this case correspondence analysis and biplots. There’s an associated GitHub repository where you will find code (shiny and ggplot) and illustrations. The advantage of biplots as summary displays is that they help highlight complex relationships (or, in some cases, interactions). For instance, in the associated BMC paper the authors published alongside this preprint, they used biplots to highlight discrepancies between single agent treatments and their combinations with oxaliplatin at all levels of AE classes, which explain most of the variability among treatments.

Identifiable and interpretable nonparametric factor analysis (https://arxiv.org/abs/2311.08254)

Contrary to Gaussian models or deep neural networks, which show poor performance in terms of identifiability, interpretability, and reproducibility, the authors propose to use the “NIFTY framework, which parsimoniously transforms uniform latent variables using one-dimensional nonlinear mappings and then applies a linear generative model.” Briefly stated, such an approach allows one to work with a standard linear latent factor model, which is easier to interpret than the aforementioned models and which amounts to a Bayesian independent component analysis under certain conditions.

Generating Signed Permutations by Twisting Two-Sided Ribbons (https://arxiv.org/abs/2311.06974)

An interesting review of algorithms available to generate permutations, and specifically signed permutations, which are permutations of [n] objects with a \pm sign. In the case of permutations, there are n! solutions, while for signed permutations there are 2^n\cdot n! solutions. Not something I am really interested in at the moment, but I’ll keep it in mind in case I need it. There are a lot of nice illustrations, BTW.
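
A quick way to convince oneself of the 2^n\cdot n! count is to enumerate signed permutations naively. The sketch below simply crosses every permutation with every sign pattern; it has nothing to do with the ribbon-twisting order described in the paper, and the function name is mine.

```python
from itertools import permutations, product
from math import factorial

def signed_permutations(n):
    # Every ordering of 1..n combined with every choice of signs
    for perm in permutations(range(1, n + 1)):
        for signs in product((1, -1), repeat=n):
            yield tuple(s * v for s, v in zip(signs, perm))

n = 3
all_signed = list(signed_permutations(n))
print(len(all_signed), 2 ** n * factorial(n))
```

For n = 3 both numbers are 48, and the generator being lazy means larger n can be streamed rather than stored.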

ArXiving on February 2024

Looking Back at Postgres (https://arxiv.org/abs/1901.01973)

I rarely use Postgres for my personal hobbies (I usually find that SQLite provides me with a convenient cost-effectiveness tradeoff for pet projects), but I have always appreciated it, as well as its extensions, for long-term projects. This paper is a recollection of the early days of Postgres in the 80s.

Prediction of causal genes at GWAS loci with pleiotropic gene regulatory effects using sets of correlated instrumental variables (https://arxiv.org/abs/2401.06261)

I wrote about Mendelian randomization a while ago. This paper is about its multivariate extension in the context of causal inference using instrumental variables in a study of genome-wide significant GWAS loci. When the instrumental set condition is satisfied, direct causal effects of all exposures in the model on the outcome variable of interest can be identified.

Asymptotic Online FWER Control for Dependent Test Statistics (https://arxiv.org/abs/2401.09559)

If you ever applied Bonferroni correction (or any stepwise procedure) for multiple comparisons, you know that we assume that tests are independent, and in the case of Bonferroni it may be far too conservative. This is even more complicated in the case of online testing. An alternative approach is proposed in this article, namely an adaptive version of the alpha-spending algorithm used, e.g., in interim analyses, ⁶ where we update the p-values depending on whether they are believed to have arisen under the null hypothesis or the alternative.

A Computationally Efficient Approach to False Discovery Rate Control and Power Maximisation via Randomisation and Mirror Statistic (https://arxiv.org/abs/2401.12697) is yet another paper on FDR rather than FWER control. See also Gridsemble: Selective Ensembling for False Discovery Rates for related work on estimating local false discovery rates in large-scale multiple hypothesis testing.

Likelihood-ratio inference on differences in quantiles (https://arxiv.org/pdf/2401.10233)

I barely used quantile regression in my previous life (aka when I was working as a full-time statistician), except in a few cases when dealing with growth curves and newborn data. I came across this paper because I’ve been a long-time follower of Evan Miller’s posts. He’s actually discussing how to implement a two-sample difference-in-quantile hypothesis test and how to construct confidence intervals based on a likelihood ratio test.

On the visualisation of the correlation matrix (https://arxiv.org/abs/2401.12730)

I like biplot graphical displays, despite their subtleties regarding how rows and/or columns of a correlation or distance matrix are adjusted. In this paper, the author suggests “a weighted alternating least squares algorithm is used, with either a single scalar adjustment, or a column-only adjustment with symmetric factorization; these choices form a compromise between the numerical accuracy of the approximation and the comprehensibility of the obtained correlation biplots.”

Pretraining and the Lasso (https://arxiv.org/abs/2401.12911)

ArXiving on July 2025

Multiple Hypothesis Testing in Genomics (https://arxiv.org/abs/2506.23428)

An overview of different approaches to control the false discovery rate in multiple testing: Benjamini-Hochberg (BH), Benjamini-Yekutieli (BY), and Storey’s method. The BH method is already available in R’s p.adjust(). The BY method is just an extension of the BH method that accounts for dependence among test statistics, while Storey’s technique estimates the proportion of true null hypotheses \pi_0 among the tested CDS. Whether you rely on a nominal \alpha level or q-value cutoffs, as in proteomics analyses, all those methods work equally well. The author’s conclusion is as follows:

Note that although this targets RNA-Seq analysis, where Multiple Hypothesis Testing is one of the many challenges raised by the big data omics era, this should apply to other contexts as well. It is worth noting that the author included a simulation model for count data, using a Negative Binomial Model, which is what is usually found in pipelines like DESeq2.

I appreciated that the author provided a sidenote on statistical significance and biological relevance: “It is also crucial to recognize that statistical significance does not always imply biological importance, nor does a lack of significance disprove a biological hypothesis. In highly variable, small-sample contexts, meaningful changes may evade detection, reinforcing the need for well-powered experiments. Similarly, even a modest difference can be flagged as significant in extremely large datasets. These nuances highlight that DE detection should be interpreted within the broader scope of effect sizes, reproducibility, and biological validation.”
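
Since BH and BY come up again and again in these notes, here is a small sketch of the adjusted p-values in plain Python, written to follow the step-up logic (scale each sorted p-value by m/rank, then enforce monotonicity from the largest down). The function name and the toy p-values are mine; in practice R's p.adjust() does the same job.

```python
def fdr_adjust(pvals, method="BH"):
    # Step-up adjustment: BH scales each sorted p-value by m / rank;
    # BY additionally multiplies by the harmonic sum 1 + 1/2 + ... + 1/m.
    m = len(pvals)
    factor = sum(1.0 / i for i in range(1, m + 1)) if method == "BY" else 1.0
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, factor * pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Toy p-values (assumption, for illustration only)
pvals = [0.01, 0.04, 0.03, 0.005]
bh = fdr_adjust(pvals, "BH")
by = fdr_adjust(pvals, "BY")
print([round(x, 4) for x in bh])
print([round(x, 4) for x in by])
```

The BY values are simply the BH ones inflated by the harmonic factor (capped at 1), which is the price paid for not assuming anything about the dependence structure.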

Controlling the false discovery rate under a non-parametric graphical (https://arxiv.org/abs/2506.24126)

Yet another paper on controlling the false discovery rate. As mentioned above, when there’s some correlation between tested markers, modification of the default FDR correction (Benjamini–Hochberg) is in order. The trick the authors used in this work is to take into account a dependency graph for p-values, which allows one to assume independence of p-values that are not within each other’s neighborhoods, and otherwise leave the dependence unspecified.

clustra: A multi-platform k-means clustering algorithm for analysis of longitudinal trajectories in large electronic health records data (https://arxiv.org/abs/2507.00962)

It’s been years since I was last involved in health analytics and patient-reported outcomes, but this one reminded me of a post on Cross Validated about clustering of longitudinal data. Although I barely remember those days, I do remember one answer of mine which took up my entire morning and into which I put a lot of effort and attention.

Targeted tuning of random forests for quantile estimation and prediction intervals (https://arxiv.org/abs/2507.01430)

Again, it’s been a long time since I last ran an RF algorithm on real data points. It looks like the RF framework has been extended far and wide, even to quantile regression. In this regard, the out-of-bag sample is used to estimate the bias of the marginal quantile coverage probability estimate.

Tensor-product interactions in Markov-switching models (https://arxiv.org/abs/2507.01555)

This is a novel approach for semiparametric inference in hidden-state models, which relies on incorporating tensor-product interactions into Markov-switching models, also known as hidden Markov models, a mix of two stochastic processes (one observed and the other latent). It’s dense and I need to read this when I have more time. Now, let’s go back to holidays, FWIW.

ArXiving on August 2025

Statistical methods: Basic concepts, interpretations, and cautions (https://arxiv.org/abs/2508.10168)

This is a chapter from the 3rd edition of the Handbook of Epidemiology (Pigeot and Ahrens, eds.). It’s been a long time since I last read papers or textbooks on epidemiology, but I had this one on my bookshelf at some point in my life. This is all about the study of associations and causal relationships outside the field of randomized clinical trials, which are discussed upfront:

But the simplifying assumptions are a recurring theme in the rest of the chapter, as well as that of a fictional world in which all assumptions hold. There is a very enlightening discussion on p-values, how they are often misinterpreted, and how they should be regarded in light of observed data (see pp.19 ff.). And basically, the whole paper is about statistical significance and how not to fall into the trap of significant or non-significant p-values. Of note, Bayesian statistics are briefly discussed. I must admit I never encountered a single epidemiological paper using Bayesian inference.

A Comprehensive Comparison of the Wald, Wilson, and adjusted Wilson Confidence Intervals for Proportions (https://arxiv.org/abs/2508.10223)

While two-group comparisons with a continuous outcome are generally easily understood and carried out using a t or Wilcoxon test (although, in the latter case, with bad assumptions), estimating the precision of proportion estimates always relies on Wald confidence intervals, aka the “normal approximation”. In this case, the confidence interval for the proportion of successes is computed as \hat p \pm z_{\alpha}\sqrt{\frac{\hat p(1 - \hat p)}{n}}, where z_{\alpha} is the 1-\alpha/2 quantile of the standard normal distribution (z=1.96 for a 95% confidence interval). An alternative option is to rely on Wilson’s CI, and the well-known Agresti-Coull interval provides a very good approximation to it. You may recall that in the case of a proportion, the standard error may be computed from the estimated proportion, \hat p, which yields Wald CIs, or from the proportion under H_0, p_0, which is the score test statistic (also called z-test in most statistical packages).⁷ The Wilson CIs are more tedious to compute (see Formula 2 in the paper and this blog post). The authors describe adjusted Wilson intervals for 90, 95, and 99% confidence levels. The color scheme retained for these analyses remains, however, quite surprising.
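
To see how much the two intervals can differ, here is a small sketch of the Wald and (unadjusted) Wilson intervals in plain Python, using the standard textbook form of the Wilson score interval. The function names are mine and the adjusted Wilson variants discussed in the paper are not implemented.

```python
from math import sqrt
from statistics import NormalDist

def wald_ci(x, n, level=0.95):
    # Normal approximation with the standard error evaluated at p-hat
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    p = x / n
    half = z * sqrt(p * (1 - p) / n)
    return p - half, p + half

def wilson_ci(x, n, level=0.95):
    # Score interval: recentred estimate and adjusted half-width
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    p = x / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half, centre + half

for x, n in [(0, 10), (5, 10), (2, 50)]:
    print(x, n, wald_ci(x, n), wilson_ci(x, n))
```

With 0 successes out of 10, the Wald interval collapses to a single point at zero, whereas the Wilson interval still gives a sensible upper bound, and with small counts the Wald lower bound can even go negative.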

Better bootstrap t confidence intervals for the mean (https://arxiv.org/abs/2508.10083)

For small samples and/or skewed data, the Bayesian t-test is a good option (see this unfinished post of mine). BCa correction for a bootstrap t-test usually undercovers the mean, and other problems appear when we consider the studentized rather than the arithmetic mean. In this paper the authors consider various techniques to overcome limitations arising from small samples and they show that the power law bootstrap t and the beta bootstrap t (described in §2.5) provide good results, the latter being second order accurate.

Dissecting Microbial Community Structure and Heterogeneity via Multivariate Covariate-Adjusted Clustering (https://arxiv.org/abs/2508.11036)

I had to help a student two years ago with a microbial analysis, and she found an arcane R package on GitHub which relied on Spearman correlation of a table of counts, IIRC, and made it possible to build a network of co-occurring species, but without taking into account co-factors of interest. Unfortunately, the experimental design was flawed, so the analysis ended early. It is interesting to see that covariates can be taken into account in the analysis of such large datasets.

What makes a study design quasi-experimental? The case of difference-in-differences (https://arxiv.org/abs/2508.13945)

This paper provides a use case of difference-in-differences, and its relation with the parallel trends assumption, which usually holds if the exposure is randomly assigned. Random treatment allocation, however, does not mean we need to use a DID estimator, and post-treatment comparison should be preferred.
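
As a reminder of what the two estimators compute in the simplest two-group, two-period setting, here is a tiny sketch with made-up outcomes. The numbers and function names are mine and have nothing to do with the paper's use case.

```python
from statistics import mean

def did(treated_pre, treated_post, control_pre, control_post):
    # Change in the treated group minus change in the control group
    return (mean(treated_post) - mean(treated_pre)) - (
        mean(control_post) - mean(control_pre)
    )

def post_only(treated_post, control_post):
    # Simple post-treatment comparison of the two groups
    return mean(treated_post) - mean(control_post)

# Made-up outcomes (assumption, for illustration only)
treated_pre, treated_post = [10.0, 12.0, 11.0], [15.0, 17.0, 16.0]
control_pre, control_post = [9.0, 11.0, 10.0], [11.0, 13.0, 12.0]

print(did(treated_pre, treated_post, control_pre, control_post))
print(post_only(treated_post, control_post))
```

The two estimates differ exactly by the baseline gap between groups (here 1 unit), which DID removes at the cost of relying on parallel trends; under random allocation that gap is zero on average, which is why the plain post-treatment comparison is already unbiased.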

Perspective: An outlook on fluorescence tracking (https://arxiv.org/abs/2508.13668)

For someone like me who deals with data coming from cellular biology, this provides an excellent overview of fluorescence microscopy.

See also Coin tossing experiment: Score and Wald tests.↩︎
Kennedy, F. E. Randomization tests in econometrics. Journal of Business & Economic Statistics, 13 (1):85–94, 1995.↩︎

Recent readings on arXiv

arXiv:1801.00631 Deep Learning: A Critical Appraisal, Gary Marcus

arXiv:1712.09988 Orthogonal ML for Demand Estimation: High Dimensional Causal Inference in Dynamic Panels, Victor Chernozhukov • Matt Goldman • Vira Semenova • Matt Taddy

arXiv:1703.00864 The Unreasonable Effectiveness of Structured Random Orthogonal Embeddings, Krzysztof Choromanski • Mark Rowland • Adrian Weller

arXiv:1801.00371 Data Science vs. Statistics: Two Cultures?, Iain Carmichael • J.S. Marron

arXiv:1801.00364 Estimation and Inference of Treatment Effects with L2-Boosting in High-Dimensional Settings, Ye Luo • Martin Spindler

arXiv:1801.00753 Probabilistic supervised learning, Frithjof Gressmann • Franz J. Király • Bilal Mateen • Harald Oberhauser

arXiv:1404.4178v6 Speeding Up MCMC by Efficient Data Subsampling, Matias Quiroz • Robert Kohn • Mattias Villani • Minh-Ngoc Tran

arXiv:1608.04481 Lecture notes on randomized numerical linear algebra, Michael W. Mahoney

ArXiving on March 2018

Crisan, A., Munzner, T., & Gardy, J. L. (2018). Adjutant: an R-based tool to support topic discovery for systematic and literature review

Nuijten, M. B., Van Assen, M. A. L. M., Augusteijn, H. E. M., Crompvoets, E. A. V., & Wicherts, J. M. (2018). Effect sizes, power, and biases in intelligence research: a meta-meta-analysis

Lehman, J., Clune, J., Misevic, D., Adami, C., Beaulieu, J., Bentley, P. J., Bernard, S., … (2018). The surprising creativity of digital evolution: a collection of anecdotes from the evolutionary computation and artificial life research communities

Mehta, P., Bukov, M., Wang, C., Day, A. G. R., Richardson, C., Fisher, C. K., & Schwab, D. J. (2018). A high-bias, low-variance introduction to machine learning for physicists.

Drovandi, C. C., Grazian, C., Mengersen, K., & Robert, C. (2018). Approximating the likelihood in approximate bayesian computation

Morris, T. P., White, I. R., & Crowther, M. J. (2017). Using simulation studies to evaluate statistical methods

McShane, B. B., Gal, D., Gelman, A., Robert, C., & Tackett, J. L. (2017). Abandon Statistical Significance

Kim, Y., & Li, J. (2018). Serverless data analytics with flint

Aaronson, S. (2011). Why Philosophers Should Care About Computational Complexity

Bacry, E., Bompaire, M., Gaïffas, Stéphane, & Poulsen, S. (2018). Tick: a python library for statistical learning, with a particular emphasis on time-dependent modelling

ArXiving on May 2018

David M. Blei, Alp Kucukelbir & Jon D. McAuliffe (2016). Variational Inference: A Review for Statisticians

Hanqing Zhao & Yuehan Luo (2018). An O(N) Sorting Algorithm: Machine Learning Sorting

Gavin L Simpson (2018). Modelling palaeoecological time series using generalized additive models

Thomas Hofmann, Bernhard Schölkopf & Alexander J. Smola (2008). Kernel Methods in Machine Learning

Victor Dibia & Çağatay Demiralp (2018). Data2Vis: Automatic Generation of Data Visualizations Using Sequence to Sequence Recurrent Neural Networks

ArXiving on July 2019

SciPy 1.0—Fundamental Algorithms for Scientific Computing in Python (Virtanen et al., 2019) arXiv:1907.10121v1

Techniques for Automated Machine Learning (Chen et al., 2019) arXiv:1907.08908v1

∂P : A Differentiable Programming System to Bridge Machine Learning and Scientific Computing (Innes et al., 2019) arXiv:1907.07587v2

Generalised linear models for prognosis and intervention: Theory, practice, and implications for machine learning (Arnold et al., 2019) arXiv:1906.01461

Please Stop Permuting Features: An Explanation and Alternatives (Hooker et al., 2019) arXiv:1905.03151v1

ArXiving on March 2020

The Book of Why: The New Science of Cause and Effect arXiv:2003.11635v1

An Introduction to Probabilistic Programming arXiv:1809.10756

A Survey of Deep Learning for Scientific Discovery arXiv:2003.11755v1

Reinforcement Learning in Economics and Finance arXiv:2003.10014v1

ArXiving on June 2021

Scalable Econometrics on Big Data – The Logistic Regression on Spark (https://arxiv.org/abs/2106.10341)

maars: Tidy Inference under the ‘Models as Approximations’ Framework in R (https://arxiv.org/abs/2106.11188)

Choosing the Estimand When Matching or Weighting in Observational Studies (https://arxiv.org/abs/2106.10577)

A Bibliography of Combinators (https://arxiv.org/abs/2106.11729)

Robust Regression Revisited: Acceleration and Improved Estimation Rates (https://arxiv.org/abs/2106.11938)

Inference in High-dimensional Linear Regression (https://arxiv.org/abs/2106.12001)

ArXiving on August 2021

Distributions.jl: Definition and Modeling of Probability Distributions in the JuliaStats Ecosystem (https://arxiv.org/abs/1907.08611)

Principled, practical, flexible, fast: a new approach to phylogenetic factor analysis (https://arxiv.org/abs/2107.01246)

Preserving Diversity when Partitioning: A Geometric Approach (https://arxiv.org/abs/2107.04674)

Item Response Thresholds Models (https://arxiv.org/abs/2106.12784)

Comparison of Canonical Correlation and Partial Least Squares analyses of simulated and empirical data (https://arxiv.org/abs/2107.06867)

Calibrating the scan statistic with size-dependent critical values: heuristics, methodology and computation (https://arxiv.org/abs/2107.08296)

JAGS, NIMBLE, Stan: a detailed comparison among Bayesian MCMC software (https://arxiv.org/abs/2107.09357)

On matching-adjusted indirect comparison and calibration estimation (https://arxiv.org/abs/2107.11687)

ArXiving on October 2021

Accuracy, precision, and agreement statistical tests for Bland-Altman method (https://arxiv.org/abs/2108.12937)

Generalized nearly isotonic regression (https://arxiv.org/abs/2108.13010)

Lagged couplings diagnose Markov chain Monte Carlo phylogenetic inference (https://arxiv.org/abs/2108.13328)

A Q-Q plot aids interpretation of the False Discovery Rate (https://arxiv.org/abs/2109.02118)

Dynamic Network regression (https://arxiv.org/abs/2109.02981)

High Performance Implementation of the Hierarchical Likelihood for Generalized Linear Mixed Models. An Application to estimate the potassium reference range in massive Electronic Health Records datasets (https://arxiv.org/abs/1910.08179)

Distcomp: Comparing distributions (https://arxiv.org/abs/2110.02327)

ArXiving on January 2022

Type Stability in Julia: Avoiding Performance Pathologies in JIT Compilation (https://arxiv.org/abs/2109.01950)

An Introduction to Quantum Computing for Statisticians (https://arxiv.org/abs/2112.06587)

Aitchison’s Compositional Data Analysis 40 Years On: A Reappraisal (https://arxiv.org/abs/2201.05197)

How ISO C became unusable for operating systems development (https://arxiv.org/abs/2201.07845)

ArXiving on July 2022

Marginal Effects for Non-Linear Prediction Functions (https://arxiv.org/abs/2201.08837)

Better to be in agreement than in bad company: a critical analysis of many kappa-like tests assessing one-million 2x2 contingency tables (https://arxiv.org/abs/2203.09628)

Selective inference for k-means clustering (https://arxiv.org/abs/2203.15267)

A tutorial for using propensity score weighting for moderation analysis: an application to smoking disparities among LGB adults (https://arxiv.org/abs/2204.03345)

Latent Trait Item Response Models for Continuous Responses (https://arxiv.org/abs/2204.03841)

Uniformly Valid Inference Based on the Lasso in Linear Mixed Models (https://arxiv.org/abs/2204.03887)

The replication of non-inferiority and equivalence studies (https://arxiv.org/abs/2204.06960)

Flexible Marginal Models for Dependent Data (https://arxiv.org/abs/2204.07188)

ArXiving on August 2022

Inverse Probability Weighting: the Missing Link between Survey Sampling and Evidence Estimation (https://arxiv.org/abs/2204.14121)

Redefining Populations of Inference for Generalizations from Small Studies (https://arxiv.org/abs/2204.14156)

Practical considerations for specifying a super learner (https://arxiv.org/abs/2204.06139)

PaC-trees: Supporting Parallel and Compressed Purely-Functional Collections (https://arxiv.org/abs/2204.06077)
