September 2025
What’s going on on arXiv these days?
Here is my reading list for the past couple of weeks.
This is a brief review of deep learning in the light of the “renewed interest” in artificial intelligence that emerged during the last two years. I believe it does not reflect the opinion of all researchers or practitioners, but there are some interesting references in there and, as the author said, it is deliberately oriented toward AI research, not machine learning or data science. The author summarizes ten concerns regarding deep learning, which essentially revolve around the generalizability issue. Here are some of them: data hungriness and no capacity to learn abstractions, limited capacity to transfer knowledge (beyond the scope that was used to learn the model), no way to deal with hierarchical structure or open-ended inference, black-box approach, no or poor integration of prior knowledge, unresolved causation versus correlation issue, and assumption of a stable world. In the last section, the author suggests some ideas to go forward: combined use of deep learning and unsupervised learning, use of hybrid models to integrate symbolic information, and gaining more insights from cognitive psychology.
There was also a hot debate during NIPS 2017 on the interpretability issue in machine learning; see, e.g., Yann LeCun’s reply on Reddit.
The authors discuss Machine Learning (ML) methods that can be used to facilitate model selection and the estimation of treatment effects in economic policy analysis. In particular, they emphasize the double ML framework, which they previously demonstrated to be useful to partial out the influence of potential controls. Basically, this amounts to estimating log price and sales conditional on lagged values of those variables, and then applying some penalization (e.g., Lasso) in a second stage where we want to explain heterogeneous demand elasticities by regressing residuals of the first model on indicators for location, channel, product or product type.
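To fix ideas about what "partialling out" means, here is a minimal sketch on simulated data. It uses plain OLS residuals (the Frisch-Waugh-Lovell idea) rather than the ML learners and sample splitting of the double ML framework, and the data-generating process, the variable names and the helper `residualize` are all my own assumptions, not something taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 2000, 5

# Hypothetical controls, "treatment" d (think log price) and outcome y (think log sales)
X = rng.normal(size=(n, p))
d = X @ np.array([1.0, 0.5, 0.0, 0.0, -0.5]) + rng.normal(size=n)
theta = -2.0  # assumed true effect of d on y
y = theta * d + X @ np.array([2.0, 0.0, 1.0, 0.0, 0.0]) + rng.normal(size=n)


def residualize(target, controls):
    """Return the residuals of an OLS fit of target on controls (with intercept)."""
    Z = np.column_stack([np.ones(len(controls)), controls])
    beta, *_ = np.linalg.lstsq(Z, target, rcond=None)
    return target - Z @ beta


# Stage 1: remove the influence of the controls from both outcome and treatment
y_res = residualize(y, X)
d_res = residualize(d, X)

# Stage 2: regress residual on residual to recover the effect of d
theta_hat = (d_res @ y_res) / (d_res @ d_res)
print(f"true effect: {theta}, partialled-out estimate: {theta_hat:.3f}")
```

In the double ML setting the first stage would be replaced by a flexible learner fitted on a separate fold, but the residual-on-residual logic of the second stage stays the same.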
The study of heterogeneous treatment effects is not limited to econometrics but applies to any observational data that have to deal with causal inference and the idea of “potential outcomes”. It can be shown that the naive estimator of the average treatment effect (ATE) can be decomposed as the sum of E[δi] and pre-treatment and treatment-effect heterogeneity biases (see Heterogeneous Treatment Effect Analysis, by Jann, or Estimating Heterogeneous Treatment Effects with Observational Data, by Xie et al.). It is then essential to eliminate those biases, especially the first one.
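The decomposition can be checked numerically. In the sketch below (simulated potential outcomes, so both y0 and y1 are known for everyone, which never happens with real data), the naive difference in means equals the ATE plus a pre-treatment (baseline) bias plus a treatment-effect heterogeneity bias, the latter written as (1 - π)(ATT - ATU) with π the proportion treated. The data-generating process and all variable names are assumptions of mine.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

u = rng.normal(size=n)                   # hypothetical confounder
y0 = 1.0 + u + rng.normal(size=n)        # potential outcome without treatment
delta = 2.0 + 0.5 * u                    # individual treatment effect
y1 = y0 + delta                          # potential outcome with treatment
d = (u + rng.normal(size=n)) > 0         # selection into treatment depends on u

pi = d.mean()                            # proportion treated
ate = delta.mean()
att = delta[d].mean()
atu = delta[~d].mean()

# What we would compute from observed data only
naive = y1[d].mean() - y0[~d].mean()

# The two bias terms
baseline_bias = y0[d].mean() - y0[~d].mean()
heterogeneity_bias = (1 - pi) * (att - atu)

print(f"naive = {naive:.3f}")
print(f"ATE + biases = {ate + baseline_bias + heterogeneity_bias:.3f}")
```

Since treated units have a higher baseline here, the first bias term dominates, which is consistent with the remark that it is the one to eliminate first.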
This article deals with embedded methods in Machine Learning, but not really the ones I was thinking about (i.e., embedded methods that perform variable selection during the learning phase of an ML algorithm) when I downloaded the paper. Like many NIPS papers, it is rather technical and I had a very hard time following it. The core idea is that random ortho-matrices, that is, structured Gaussian or SD-product random matrices with orthogonal rows, can be used for dimensionality reduction (via an orthogonal Johnson-Lindenstrauss Transform) or for improving the mean squared error of prediction (via the angular kernel, which I had never heard of before). I really have no idea what I could potentially do with this theoretical contribution, but it was a fun read.
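For what it is worth, here is how I picture the simplest case: a random matrix with orthogonal rows used as a Johnson-Lindenstrauss-style projection. This is only a sketch of the unstructured Gaussian variant obtained from a QR decomposition; the dimensions, the rescaling by sqrt(d/k) and the variable names are my assumptions, and the SD-product construction from the paper is not shown.

```python
import numpy as np

rng = np.random.default_rng(2)
d, k = 256, 64                      # original and reduced dimensions (arbitrary)

# Orthogonalize a Gaussian matrix: Q is a d x d orthogonal matrix
G = rng.normal(size=(d, d))
Q, _ = np.linalg.qr(G)

# Keep k orthogonal rows and rescale so squared norms are preserved on average
P = np.sqrt(d / k) * Q[:k]

# Compare a pairwise distance before and after projection
x = rng.normal(size=d)
y = rng.normal(size=d)
ratio = np.linalg.norm(P @ x - P @ y) / np.linalg.norm(x - y)
print(f"distance ratio after projection: {ratio:.3f}")
```

The ratio should stay reasonably close to 1, which is the whole point of a JL-type transform.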
Personally, I never liked the term “data scientist” and I was quite happy with what Karl Broman put in his blog post:
If you’re analyzing data, you’re doing statistics. You can call it data science or informatics or analytics or whatever, but it’s still statistics.
(But see also his followup post, I am a data scientist.)
In the present article, the authors define data science as the union of the following domains, after David Donoho’s article, 50 Years of Data Science: (1) data gathering, preparation, and exploration, (2) data representation and transformation, (3) computing with data, (4) data modeling, (5) data visualization and presentation, and (6) science about data science.
Many blog posts and online discussions are referenced throughout the article, as well as more scholarly published work. The authors emphasize the need for effective teaching of scientific reasoning, data visualization and statistical computing, and the importance of reproducible research, including literate programming, unit testing, or versioning, in applied research. Differing paradigms are also highlighted in this text: predictive and explanatory modeling, exploratory versus explanatory approaches. I learned about the 80/20 rule (this is not what I initially thought, that 80% of our time is spent in data munging and data exploration): “The basic idea is that the first reasonable thing you can do to a set of data often is 80% of the way to the optimal solution. Everything after that is working on getting the last 20%.” The rest of the article is a rejoinder to Nolan and Lang’s paper, Computing in the statistics curricula. You might also be interested in reading this article by Hardin et al.: Data Science in Statistics Curricula: Preparing Students to “Think with Data”.
A final note: while I was looking for a certain reference via Google, I found the following PeerJ paper: Teaching stats for data science, written by Daniel T Kaplan. The ten building blocks (data tables, data graphics, model functions, model training, effect size and covariates, displays of distributions, bootstrap replication, prediction error, comparing models, generalization and causality) that the author recommends to start teaching “data science” essentially relate to statistical computing and statistical inference, using the R statistical package, as far as I can tell, and so it is likely that the apparent lack of a consensual definition of data science still remains an open issue in 2018.
Per the paper abstract, the authors “give results for the valid inference of a treatment effect after selecting from among very many control variables and the estimation of instrumental variables with potentially very many instruments when post- or orthogonal L2-Boosting is used for the variable selection.” There are some connections with the arXiv:1712.09988 paper discussed above, in particular the estimation of treatment effect in high-dimensional settings relies on orthogonalized moment conditions, which are moment conditions imposed by a statistical model to estimate unknown parameters from a model using the generalized method of moments.
This is a very big paper (105 pp.) about probabilistic predictive modeling using supervised methods implemented in a Python library, skpro. This statistical library relies on sklearn and PyMC3 and it basically allows (per the doc):
This is a slightly revised version of a paper submitted almost 4 years ago that deals with efficient sampling from posterior distributions using MCMC. So-called scalable MCMC relies on either the MapReduce scheme or a subsampling approach, which the authors used in their article with the addition of pseudo-marginal MCMC in the case of intractable likelihood: “at each iteration the log-likelihood from n observations is estimated unbiasedly from a random subset with m ≪ n observations, and the resulting likelihood estimate is then bias corrected to obtain an approximately unbiased estimate of the likelihood.”
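The first half of that quote is easy to illustrate. The sketch below estimates a full-data Gaussian log-likelihood from a random subset of size m by rescaling the subset sum by n/m. It is only the naive estimator: the bias correction of the likelihood and any variance reduction used by the authors are left out, and the model, sample sizes and function names are assumptions on my part.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 100_000, 1_000
data = rng.normal(loc=1.0, scale=1.0, size=n)   # simulated observations


def loglik_terms(x, mu):
    """Per-observation Gaussian log-density with unit variance."""
    return -0.5 * np.log(2 * np.pi) - 0.5 * (x - mu) ** 2


def subsampled_loglik(x, mu, m, rng):
    """Estimate the full log-likelihood from m points drawn without replacement."""
    idx = rng.choice(len(x), size=m, replace=False)
    return len(x) / m * loglik_terms(x[idx], mu).sum()


full = loglik_terms(data, 1.0).sum()
estimates = np.array([subsampled_loglik(data, 1.0, m, rng) for _ in range(200)])

print(f"full log-likelihood:       {full:.1f}")
print(f"mean of subsampled values: {estimates.mean():.1f}")
print(f"sd of subsampled values:   {estimates.std(ddof=1):.1f}")
```

Each call touches only 1% of the data, at the price of a noisy estimate, which is why the pseudo-marginal machinery is needed on top.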
TL;DR
Let us celebrate this new Spring with some papers fetched from arXiv, not necessarily from the last weeks but at least in reverse chronological order in my Twitter timeline.
A short paper describing a Shiny application that fetches data from PubMed and applies topic clustering on the results using t-SNE followed by the hdbscan (R package) algorithm. Importantly, data can be saved locally if the user wants to perform additional analyses.
Michèle B. Nuijten is the co-author of the statcheck R package, which offers to verify p-values reported in scientific articles. It was created to assist in meta-analyzing a bunch of papers in the psychological literature; see Nuijten, M. B., Hartgerink, C. H. J., van Assen, M. A. L. M., Epskamp, S., & Wicherts, J. M. (2016). The prevalence of statistical reporting errors in psychology (1985-2013). Behavior Research Methods, 48(4), 1205-1226.
In the present paper, the authors assessed effect size and power reported in 131 meta-analyses in intelligence research. As expected from this type of research field where psychometric scales are used, Pearson’s correlation was often used to report effect size (42 studies) even if it might be wiser to use a disattenuated correlation measure, as recommended in the psychometric literature, or Cohen’s d (for comparison of pairs of means). All effect size measures were expressed as Pearson’s r. The results can be checked online. Correlations were found to be on average around 0.25, with an achieved statistical power just below 50% (only one third of all 2,439 primary studies reach the 80% power threshold), with worse results in behavioral genetics studies.
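To put those numbers in perspective, here is a back-of-the-envelope power calculation for a correlation coefficient, using the Fisher z approximation and nothing but the standard library. This is my own illustration, not the method used by the authors, and the function names are made up.

```python
from math import atanh, ceil, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def power_for_r(r, n, alpha=0.05):
    """Approximate two-sided power to detect a correlation r with n pairs (Fisher z)."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = atanh(r) * sqrt(n - 3)
    return STD_NORMAL.cdf(shift - z_crit) + STD_NORMAL.cdf(-shift - z_crit)


def n_for_power(r, power=0.80, alpha=0.05):
    """Approximate number of pairs needed to reach the requested power."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_beta = STD_NORMAL.inv_cdf(power)
    return ceil(((z_alpha + z_beta) / atanh(r)) ** 2 + 3)


for n in (50, 100, 150):
    print(n, round(power_for_r(0.25, n), 2))
print("pairs needed for 80% power:", n_for_power(0.25))
```

Under this approximation, a correlation of 0.25 needs about 124 pairs to reach 80% power, so a typical study with a few dozen participants is indeed well below the threshold.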
This is a big discussion of evolutionary algorithms which reminds me of The Nature of Code, by Daniel Shiffman (available on GitHub as well). I enjoyed reading his book two years ago, and now I watch Daniel Shiffman on The Coding Train on YouTube from time to time.
The process of evolution is an algorithmic process that transcends the substrate in which it occurs.
In the present review, the authors discuss several lessons learned from bugs or failed attempts at demonstrating an expected outcome in scientific studies, and which turned out to bring interesting new ideas. They are presented as a sort of “greatest hits” collection of anecdotes, according to the authors. For instance, here is an example that people involved in randomized clinical trials would love: in an experiment involving reinforcement learning using neural networks, the authors forgot to randomize the order of the stimuli, such that in the end the neural network just learned to exploit the regularity of the pattern rather than to tune its connections to the edibility of each item.
Sometimes you come across some big work on arXiv. This is it! Here is a fresh new introduction to Machine Learning techniques, in about a hundred pages, for physicists this time. The companion website has everything to get started with IPython notebooks and the scikit-learn library. The applications cover basic algorithms like gradient descent and popular statistical methods: (regularized) linear and logistic regression, bagging and boosting, deep neural networks using Keras, TensorFlow or PyTorch, and even variational autoencoders using Keras. I should note that the authors introduce ML as a subfield of AI without further ado (why is this not considered as a subfield of statistics?) and that they “focus on prediction problems and refer the reader to one of many excellent textbooks on classical statistics for more information on estimation” (other than prediction, I have a hard time believing that black-box models like bagging and boosting can offer any more than predictive performance over well tuned regression models).
This paper presents alternative approaches to approximate Bayesian computation (ABC) which allow computing a non-parametric estimate of the likelihood of a summary statistic based on computer simulation. This covers a Bayesian version of the synthetic likelihood method proposed by Wood (2010) and the non-parametric empirical likelihood developed by Mengersen et al. (2013). The latter allows for the use of Bayes factors in model comparison and it also avoids computational simulation under certain circumstances.
This is a nice review of computational approaches to statistical inference with an emphasis on providing guidance for newcomers and experienced practitioners in biomedical research. Key concepts are detailed in the article but here are the main steps: planning and aims of the simulation study, methods, target of analysis, performance measures, coding, analysis, reporting. Regarding performance measures, it is interesting to note that among 100 papers published in Statistics in Medicine, bias and coverage of (1-\alpha) confidence intervals were the most frequently studied indicators, while type I error rate was only assessed in one-third of the studies. A working example based on the proportional hazards model (§7) is used to illustrate the various steps of a simulation study.
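The two most popular performance measures, bias and coverage, can be illustrated with a toy simulation study. The sketch below is much simpler than the proportional hazards example of the paper: the estimand is just a mean, the confidence interval is a normal-approximation one, and every setting (sample size, number of replications, true value) is an arbitrary choice of mine.

```python
import numpy as np

rng = np.random.default_rng(4)
true_mean, sigma = 0.5, 1.0      # data-generating mechanism
n, n_sim = 50, 2000              # sample size and number of replications
z = 1.959964                     # 97.5% quantile of the standard normal

estimates = np.empty(n_sim)
covered = np.empty(n_sim, dtype=bool)

for i in range(n_sim):
    sample = rng.normal(true_mean, sigma, size=n)
    est = sample.mean()
    se = sample.std(ddof=1) / np.sqrt(n)
    estimates[i] = est
    # Does the 95% interval contain the true value?
    covered[i] = (est - z * se) <= true_mean <= (est + z * se)

# Performance measures, with a Monte Carlo standard error for the coverage
bias = estimates.mean() - true_mean
coverage = covered.mean()
mc_se_coverage = np.sqrt(coverage * (1 - coverage) / n_sim)

print(f"bias: {bias:.4f}")
print(f"coverage: {coverage:.3f} (Monte Carlo SE {mc_se_coverage:.3f})")
```

Reporting the Monte Carlo standard error alongside each performance measure makes it clear whether the number of replications was large enough.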
If you did not follow the buzz around statistical significance and p-value, here is a short paper for you.
(T)he status quo is that p < 0.05 is deemed as strong evidence in favor of a scientific theory and is required not only for a result to be published but even for it to be taken seriously.
Flint depends on AWS (S3 bucket) and Lambda. It allows you to use Spark without having to worry about installing Spark on your local machine, while Flintrock is a command-line utility that lets you launch and manage a cluster on EC2.
AWS Lambda is one of the next-generation services from Amazon. It lets you run any code in a serverless environment according to a pay-as-you-go cost model: Amazon takes care of allocating resources and managing parallel execution of your code. Moreover, the execution can be triggered by external events, like in Travis CI. As an example, consider an image processing pipeline where, each time a user uploads an image to an S3 bucket, a Lambda that generates an image thumbnail is triggered. For a standalone application, take a look at Seth Brown’s BeerAI recommender system.
Still on my reading list.
Tick is a new Python package for Machine Learning of time-dependent events, including Hawkes processes (I know nothing about those processes but I like the name 🙂; for more details, see this review). It also features penalized linear and logistic regression and the paper includes a small benchmark of those models against their scikit-learn counterparts.
Here is a quick overview of some papers from arXiv. (And yes, it was supposed to be published in May, and yes, it is partly incomplete.)
When one wants to approximate a complex (posterior) probability density, Variational Inference is a great technique to find the closest probability density as measured by Kullback-Leibler divergence. Here is a short introduction from Princeton for those who want it short (see also this tutorial). Otherwise, I highly recommend spending some time with this review, which deals with Bayesian mixtures of Gaussians and the mean-field variational family.
In this research paper, the authors present a probabilistic sorting algorithm with a complexity of \mathcal{O}(N\cdot M), where M is a small constant reflecting the number of neurons in the hidden layers of a Neural Network. Note that radix sort and counting sort are also \mathcal{O}(N) (up to a constant k, whether it be the number of digits to sort or simply their range). Note also that Don Knuth wrote a fabulous book on sorting algorithms. Now, this is the era of big data and machine learning, so if you care about those hot topics, check for yourself.
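As a reminder of why counting sort is linear up to the range k, here is a minimal sketch. It has nothing to do with the neural-network approach of the paper, and it assumes non-negative integers strictly smaller than k; the function name is mine.

```python
def counting_sort(values, k):
    """Sort non-negative integers smaller than k in O(N + k) time."""
    counts = [0] * k
    for v in values:              # one pass over the N inputs
        counts[v] += 1
    out = []
    for value, count in enumerate(counts):   # one pass over the k buckets
        out.extend([value] * count)
    return out


print(counting_sort([3, 1, 4, 1, 5, 9, 2, 6], k=10))
```

No comparison between elements is ever made, which is how the O(N log N) lower bound for comparison sorts is sidestepped.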
I read this preprint because it is written by a very bright guy I have followed on Stack Exchange and on his personal website for a long time now. I heard about that paper on Twitter. As I know he is quite proficient with R and well versed in the Generalized Additive Model (GAM), I was happy to take some time to review that paper. If you are interested in this kind of stuff, you should really take a look at his blog. This paper deals with statistical analyses of trends in palaeoenvironmental time series using GAMs. This family of models allows not only to provide smooth and cross-validated estimates of trend in time, but also to assess model uncertainty and, to some extent, to test formal hypotheses between alternate models. This paper serves both as an introduction to using GAMs in a correct and sensible way, and as a hands-on tutorial with real datasets.
This is an old publication that I found while looking at Bernhard Schölkopf’s webpage at MPI. In a few words,
Traditionally, theory and algorithms of machine learning and statistics has been very well developed for the linear case. Real world data analysis problems, on the other hand, often require nonlinear methods to detect the kind of dependencies that allow successful prediction of properties of interest. By using a positive definite kernel, one can sometimes have the best of both worlds. The kernel corresponds to a dot product in a (usually high-dimensional) feature space. In this space, our estimation methods are linear, but as long as we can formulate everything in terms of kernel evaluations, we never explicitly have to compute in the high-dimensional feature space.
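The claim that a kernel “corresponds to a dot product” in a feature space can be verified by hand for a small case. The sketch below uses the homogeneous polynomial kernel of degree 2 on two-dimensional inputs, for which the explicit feature map is known; the example vectors and function names are arbitrary choices of mine, not taken from the article.

```python
import numpy as np


def poly2_kernel(x, z):
    """Homogeneous polynomial kernel of degree 2, evaluated in input space."""
    return (x @ z) ** 2


def poly2_features(x):
    """Explicit feature map for 2-D inputs whose dot product equals (x . z)^2."""
    x1, x2 = x
    return np.array([x1 * x1, np.sqrt(2) * x1 * x2, x2 * x2])


x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

print("kernel evaluation:      ", poly2_kernel(x, z))
print("feature-space dot prod.:", poly2_features(x) @ poly2_features(z))
```

Both numbers agree, yet the kernel never builds the three-dimensional feature vectors; with higher degrees or Gaussian kernels the saving becomes substantial.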
Overall, I found that this article provides a thorough overview of kernel methods, and if you want an extended version, go check the definitive book written a few years before by the same authors:
This is a recent paper on automating the data visualization process, not from IDL this time. I always take a look at what the IDL lab from the University of Washington recently released, in case there is something new and as big as Data Voyager (paper). In my view, designing effective visualizations and guiding users in visual data explorations are separate aspects of automatic data visualization, although both rely on recommender systems to some extent. Several data visualization packages rely on a declarative specification, in the spirit of Wilkinson’s Grammar of Graphics; see, e.g., the R ggplot package and the Vega JavaScript implementation. In the present paper, the authors used Deep Neural Networks (DNN) to build a generative model relying on the idea of a sequence to sequence translation problem. In fact, the training dataset they used was based on Vega-Lite examples for a range of basic charts and aggregating operations. Data2Vis is available online, but I wasn’t able to run the demo version due to server errors at the time of writing.
Here is a quick wrap up of my latest readings on arXiv. In no particular order, as usual.
Everyone doing scientific computing in Python knows SciPy, which reached its v1 a few months ago, after years of development. Of course, this milestone was more of a new stage in the continuous work that has driven the SciPy package since its inception. In this paper, the authors explain the motivation behind SciPy, its relation with the numerical and graphical ecosystem already available in Python or under development (NumPy, IPython), what its 16 submodules contain, and why open-source development for scientific computing using a general-purpose language really matters. Like R, a significant part of the SciPy package features Fortran and C code. Note that this paper does not discuss the scikit-* packages, which surely contributed over the years to enhance the data science stack available in Python.
This remains a distinguishing feature of Python for science, and one of the reasons why it has been so successful in the realm of data science: instead of adding general features to a language designed for numerical and scientific computing, here scientific features are added to a general purpose language.
While I first learned about automated ML and related techniques via H2O.ai and a few other packages, it looks like the rise of AI-related research (e.g., deep learning), the need to combine several classifiers (like in ensemble methods) or to apply complex preprocessing or feature engineering steps before building up a reliable model, and probably a bit of black-boxing all call for new methods of managing ML models. In short, automated ML allows one to build flexible pipelines that can perform automatic tuning of feature engineering, model hyperparameters, or even the deep learning machinery itself. Reinforcement learning and evolutionary algorithms (the same techniques we were hearing about back in the late era of AI in the 90’s) are popular techniques for automated ML, especially in the case of feature engineering. This review remains quite technical, and assumes you are already a good ML practitioner, IMHO.
Julia is a nice language, but I have never really been able to get involved in it. I mean, there are so many things that changed between version 0.4, 0.5, then 0.6, and now 0.7/1.0 that I didn’t really have time to follow or relearn what I was doing at the beginning (i.e., when everything had to be installed manually from GitHub). Anyway, it is a good thing that there are some good contenders to R or Python on the market. In this paper, the authors describe ∂P, which is a system built on top of Julia to perform automatic differentiation (you know, that little thing that is used to transform code for one function into code for the derivative of that function). I am not particularly involved in AD, but I always find it interesting to read about the potential of a PL in a particular field of applications, even if it is not my business; call this curiosity, or procrastination, or whatever. Some parts are very technical, either on the math side or on the Julia-related data structures, but the examples cover neural networks, computer vision, financial derivatives, and even quantum ML (never heard of it before reading that paper), and they are quite illustrative of how the ∂P system actually works.
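For readers who have never seen automatic differentiation at work, here is a toy version in Python. Be warned that it is a different flavour from what the paper describes: ∂P transforms Julia code, whereas this sketch uses forward-mode AD with dual numbers and operator overloading, which is the simplest variant to write down. The `Dual` class and the helper names are mine.

```python
import math
from dataclasses import dataclass


@dataclass
class Dual:
    """A value together with its derivative with respect to the input."""
    val: float
    der: float = 0.0

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.der + other.der)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule
        return Dual(self.val * other.val,
                    self.der * other.val + self.val * other.der)

    __rmul__ = __mul__


def sin(x):
    """Sine extended to dual numbers (chain rule)."""
    return Dual(math.sin(x.val), math.cos(x.val) * x.der)


def derivative(f, x):
    """Evaluate f'(x) by seeding the input derivative with 1."""
    return f(Dual(x, 1.0)).der


def f(x):
    return 3 * x * x + sin(x)


print("AD derivative at 2.0:", derivative(f, 2.0))
print("analytic 6x + cos(x):", 6 * 2.0 + math.cos(2.0))
```

The function `f` is ordinary Python code; the derivative comes for free from the arithmetic rules carried by `Dual`, with no symbolic manipulation and no finite differences.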
As I said in my previous post, the all-in-one blind approach to ML has been challenged by Frank Harrell and ‘traditional’ statisticians on numerous occasions. Maybe, as some used to say, Breiman’s “two cultures” paper hasn’t aged so well after all. This article deals with the application of GLMs in prognosis research, where statisticians are more concerned with prediction than explanation.
In health research, ML is often hailed as the new frontier of data analytics which, combined with big data, will purportedly revolutionise delivery of healthcare (e.g. through ‘personalised medicine’), provide new and important insights into disease processes, and ultimately lead to more informed public health policy and clinical decision-making.
This is indeed making a strong case for ML in health research, and I haven’t heard of any significant result in this direction so far. But again, go check Frank Harrell’s discussion on this hot topic. The authors, however, chose to discuss the subtle issue of conflating prediction with causation, which even in the case of regression models is often overlooked: GLMs remain blind to the causal structure of the data, and so do ML models. The authors then provide a thorough exposition of how to select and parameterize covariates in a predictive or causal model, and how each model may be assessed and interpreted in turn. It turns out that ML research should also take into account this fine distinction, for “much of the promise of ML is predicated on prediction which prioritises the strength of (undirected) associations over the causal structures that give rise to them, and subsequently prioritises data-driven over theory-driven model selection.”
TL;DR: When features in the training set exhibit statistical dependence, permutation methods can be highly misleading when applied to the original model.
Permutation-based measures of variable importance, partial dependence plots, and individual conditional expectation plots are cheap and parameter-free techniques to assess the predictive support of variables, or features, in a given ML algorithm. However, they are known to exhibit serious flaws when there exist intercorrelations between features. I illustrated some of the problems of partial dependence in a previous talk. Alternatives do exist, though, e.g., permutation measures of variable importance based on the conditional distribution, or the leave-one-covariate-out technique, but they stick to the permute-and-repredict framework, while retraining a model after permuting the feature of interest might provide an interesting way to reduce the importance of correlated features, albeit at a larger computational cost.
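Here is a small experiment showing how the two strategies diverge when two features are almost copies of each other. Everything in it is a toy assumption of mine (a linear model fitted by least squares, a response that depends on the first feature only, a single permutation), so it illustrates the mechanism rather than the results of the paper.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 5000
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)      # strongly correlated with x1
y = x1 + 0.5 * rng.normal(size=n)       # only x1 enters the true model
X = np.column_stack([x1, x2])
train, test = slice(0, n // 2), slice(n // 2, n)


def fit(X, y):
    """Least-squares fit with intercept; returns the coefficient vector."""
    Z = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return beta


def mse(beta, X, y):
    """Mean squared prediction error of a fitted linear model."""
    Z = np.column_stack([np.ones(len(X)), X])
    return np.mean((y - Z @ beta) ** 2)


beta = fit(X[train], y[train])
base = mse(beta, X[test], y[test])

# Break the link between x1 and everything else
X_perm = X.copy()
X_perm[:, 0] = rng.permutation(X_perm[:, 0])

# Permute-and-repredict: keep the original model, predict on permuted data
repredict = mse(beta, X_perm[test], y[test]) - base

# Permute-and-retrain: refit on permuted data, then predict
beta_perm = fit(X_perm[train], y[train])
retrain = mse(beta_perm, X_perm[test], y[test]) - base

print(f"loss increase, permute-and-repredict: {repredict:.3f}")
print(f"loss increase, permute-and-retrain:   {retrain:.3f}")
```

The original model is asked to predict at combinations of x1 and x2 it has never seen, hence the large loss increase, whereas the retrained model simply falls back on x2 and loses almost nothing.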
Here are some papers that I read this week.
This is a fair review of the Book of Why, which I started reading a while back before forgetting about it because I’m no longer involved in causal analysis, and not curious enough to keep up with everything that’s going on around here. Anyway, graphical causal representation under the umbrella of directed acyclic graphs (DAGs) is supposed to be the new starting point for causal analysis. The reviewers remain, however, critical of Judea Pearl’s optimism since his approach mostly relies on the assumption that the underlying causal structure is known.
We believe this optimism is unfounded. The central problem that scientists face, especially in the social sciences, is not how to express or analyze causal models but how to pick one that is valid or at least reasonable. The book does not claim to solve this problem.
I would say that the same criticism applies to all regression models insofar as their (frequentist) interpretation is valid (and meaningful) only conditional on the model being correctly specified. Another criticism is aimed, this time, at the denigration of the usefulness of the experimental approach, like randomized clinical trials or experimental design at large. Finally, DAGs are not the only approach to causal modeling, according to this review, for several alternative causal models, like nonparametric structural equation models or design-based approaches, exist and may be able to answer the specifics of causal discovery.
I’ve read several papers or arXiv reviews on probabilistic programming (PP) lately. I mentioned this one already. Briefly, it provides an extensive review of the core ideas of PP, namely the importance of conditioning in probabilistic machine learning and the use of dynamic computation graphs. Interestingly, this uses the Clojure dialect as the main implementation. This should come as no surprise since some of the authors are actively involved in the Anglican toolbox.
Related threads: A tour of probabilistic programming language APIs, Probabilistic programming with Monad-Bayes.
This review is structured along two ideas: to provide an overview of many widely used deep learning models, spanning visual (CNNs), sequential (RNNs) and graph structured data, associated tasks (image segmentation, super-resolution, sequence to sequence mappings and many others) and different training methods, and to highlight techniques that use deep learning (DL) with less data (e.g., self-supervision or semi-supervised learning) and better interpret these complex models. In my own view, DL is a useful technique when some kind of regularities can be exploited in the data (including bias, if any) and when the training size is large. This generally holds in the case of pattern recognition tasks like NLP or image processing, but I remain sceptical about the application of this type of model in tasks for which more traditional supervised methods (e.g., logistic regression) would be just as suitable. There are also a lot of links to tutorials and software that are not necessarily limited to DL.
I have no competency in finance, and I barely have notions in economics, but I always enjoy reading one of Arthur Charpentier’s reviews, for they are always very picky and highly anchored in historical considerations. The last one I read was about Machine Learning, and it was really enlightening. I am in the middle of the present review, and I already learned a lot on the parallel between traditional ML and RL with respect to loss (regret) minimization, bandits, dynamic programming, or even the Thompson sampling strategy.
Here are some papers that I read this week, in the CS and Stat categories, plus random stuff that was mentioned on IRC or Hacker News.
I managed to try some Spark and H2O examples a long time ago, but this was really for toy projects. I’m usually happy with R or Stata for statistical modeling. However, this article brings interesting insights into large scale data processing using Spark, together with an empirical study of Python+Spark ML versus R in the case of logistic regression. Apparently, the new Spark ML library exposes two different versions of the logistic regression, one direct method (ml.classification.LogisticRegression) and the generalized linear model approach (ml.regression.GeneralizedLinearRegression).
Another article dealing with statistical modeling, this time under the OLS framework. Precisely, this article is about a new R package, maars, which offers resampling-based variance estimators and pretty printing of summary results, along with a tidy syntax. I’ve never been a great fan of the tidy family of foo’s, but your mileage may vary. The visual diagnostic add-ons (“focal slope” and “nonlinearity detection”) are probably the best aspects of this package.
Commonly used estimands include the average treatment effect in the treated (ATT), the average treatment effect in the untreated (ATU), the average treatment effect in the population (ATE), and the average treatment effect in the overlap (i.e., equipoise population; ATO). Each estimand has its own assumptions, interpretation, and statistical methods that can be used to estimate it. This article provides guidance on selecting and interpreting an estimand to help medical researchers correctly implement statistical methods used to estimate causal effects in observational studies and to help audiences correctly interpret the results and limitations of these studies.
With a background in experimental design and behavioral measurement, it took me a long time to get familiar with econometric and epidemiological vocabulary while working as a biostatistician over the past 15 years. That being said, this article offers an overview of the different types of estimands available, and how to decide which one to use depending on the question at hand. For instance, “if the question concerns a policy of withholding treatment from those who would currently receive it, this suggests the ATT is of interest. If the question concerns a policy of expanding treatment to those who would not currently receive it, the ATU may be of interest. If the question concerns a policy that would require all patients to be treated or to be untreated, the ATE may be of interest. If the question concerns a policy of prescribing treatment for patients under uncertainty, the ATO may be of interest.”
Stephen Wolfram published lengthy posts on his “personal blog”, and here is a bibliography memento on combinators, lambda calculus & Co. available on arXiv. See also Combinators: A Centennial View.
This article presents the theoretical implications of contaminated data points in computing maximum-likelihood estimates for parameters of a regression model. So-called highly robust regression is then used as it proved to be a “flexible model of corruption and can be used to study both truly adversarial data poisoning attacks (where e.g. part of the dataset is sourced from malicious respondents), as well as model misspecification, where the generative D_{X_y} does not exactly satisfy our distributional assumptions, but is close in total variation to a distribution that does.” The article is rather technical, but I learned, for instance, that there exists a nearly-linear time algorithm for robust linear regression with residual variance bounded by \sigma^2.
In the n\ll p setting, parameter estimation in a regression model becomes complicated. There are tons of approaches to perform variable selection or to reduce the dimensionality of the dataset, and in this case the authors discuss a marginalization approach to this problem: each regression parameter is estimated while treating all other parameters as nuisance factors, much like when residualizing predictors of little interest (e.g., a set of common covariates). The interesting part of this approach is that the regression coefficients remain on their original scale, unlike regression coefficients estimated using Lasso or Ridge penalties.
Here are some papers that I read this week, in the CS and Stat categories, plus random stuff that was mentioned on IRC or Hacker News.
I reinstalled Julia (v1.6.1) a few months ago. Yet I haven’t had
enough time to learn the language again. I saw there’s a native plotting
backend, which is really great, and that there’s some sort of JIT and on
first load compiling capabilities. It is supposed to improve later
execution of scripts or programs, but really on first launch it is
sometimes a pain. However, the language looks much better now, although
I keep thinking of the old Julia 0.4 that I used to use several years
ago now. This article describes the Distributions.jl
package, which is part of the JuliaStats ecosystem. Note that this
package allows you to define new statistical distributions, not only to
manipulate pre-defined ones. Specifically, “a Distribution is
an abstract type that takes two parameters: a VariateForm
type which describes the dimensionality, and ValueSupport
type which describes the discreteness or continuity of the support.” I
really need to give Julia another try, especially now that I can pretend
to be a stat dilettante. And I would like to see how Julia’s
Distributions package compares to Boost’s
own routines.
Yet another attempt at modeling uncertainty using a latent variable approach. This is a dense article about phylogenetic analysis of high-dimensional phenotypes using BEAST. Usually, some form of data reduction is used to decrease the space and time complexity of model fitting on such datasets. As in other statistical domains, the two main approaches consist in using either PCA-based or FA-based techniques. In the latter case, we talk about phylogenetic factor analysis (PFA), and the authors discuss two alternative approaches to the existing methods which scale quadratically with the number of taxa and remain intractable for large trees. As I said, it is dense, and it’s still being read on my side, especially since I’m looking into BEAST documentation at the same time. I only used it quickly two years ago before settling on the ETE toolkit.
This article deals with the idea of partitioning a sample of individuals having different characteristics (which define their diversity as a whole) into subgroups exhibiting a similar diversity index. The diversity index considered here was introduced by Simpson, and it corresponds to the inverse of the probability that two individuals are of the same type when taken uniformly at random, with replacement, from the community of interest. The paper is rather challenging, since it covers both the mathematical aspects of optimal partitioning and algorithm-related approaches. The authors conclude that a perfect partition exists only when the number of parts k divides the gcd of b. They also design a polynomial-time algorithm for the case of r = 2 types.
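The index itself is easy to compute. Here is a minimal Python sketch, assuming the definition given above (inverse of the probability of drawing two individuals of the same type, with replacement). The toy community and the function names are mine, not the authors’.

```python
from collections import Counter


def simpson_index(types):
    """Probability that two individuals drawn uniformly at random, with
    replacement, are of the same type: the sum of squared proportions."""
    counts = Counter(types)
    n = len(types)
    return sum((c / n) ** 2 for c in counts.values())


def inverse_simpson(types):
    """Diversity as the inverse of the Simpson index."""
    return 1.0 / simpson_index(types)


# Hypothetical community with r = 3 types, two individuals each,
# split into k = 2 parts holding one individual of each type.
community = ["a", "a", "b", "b", "c", "c"]
parts = [["a", "b", "c"], ["a", "b", "c"]]

print(inverse_simpson(community))
print([inverse_simpson(p) for p in parts])
```

Here each part keeps the type proportions of the whole community, so all three diversity values agree (up to floating-point rounding).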
I’m no longer into psychometrics (the last workshop I organized was in 2016, my last blog post around 2013), but this article resonated with me since I worked on IRT models for polytomous items for quite a long time. Problems such as items with threshold reversals, response saturation, and the like were common pitfalls at that time. Go take a look if you are versed in modern psychometrics.
I haven’t heard about multi-block methods for a long time, and this paper discusses the results of extensive simulations using incremental subsampling to assess the sensitivity and reliability of two such techniques, CCA and PLS. From the abstract, the take-away message reads as follows: First, for data with reasonable within and between block structure, CCA and PLS give comparable results, with equivalent sensitivity and false positive rate. Second, if there are high correlations within either block, this can compromise the reliability of CCA results. This known issue of CCA can be remedied with PCA before cross-block calculation. However, this assumes that the PCA structure is stable for a given sample. Third, statistical significance by null hypothesis testing does not guarantee that the results are reproducible, even with large sample sizes. This final outcome suggests that researchers should routinely assess both statistical significance and reproducibility for their data. I’m not surprised by these conclusions, since the within- and cross-block correlation structures play an important role in both models, and often help in deciding which method to choose. Further, my own past investigations were quite in agreement with the idea of varying stability of the results depending on these correlation structures and the sample size itself.
It’s been a long time since I last used JAGS, and I didn’t install it on my last Macbook. I never heard of NIMBLE before reading this review. Apparently it supports Bayesian models written in BUGS (or it can compile models written using R syntax to C++). The Stan documentation is among the best free textbooks available nowadays. Apparently, the compile time for Stan and NIMBLE is of the same magnitude (2-3 minutes), but NIMBLE appears faster at runtime than the two other candidates. Finally, NIMBLE seems to perform quite well when it comes to estimating parameters for mixture models.
The idea is that matching-adjusted indirect comparison, which is commonly used in RCTs to compare pools of patients that are not directly comparable, shares some resemblance with calibration estimation as used in survey sampling or epidemiological studies. This technique allows one to use weights with minimal variance to balance covariates between the sample and target populations. In the case of RCTs, if a common treatment is available between the different samples, then it can be used as an anchor to analyze within-study differences, although this is not the main approach discussed in this article. The author provides extensive simulation results (R code available), and concludes that among the numerous approaches that can be used to construct appropriate weights, empirical likelihood weights allow one to derive likelihood-based confidence intervals for the treatment effect that are often more reliable than asymptotic ones.
Here are some papers that I read this week, in the CS and Stat category, plus random stuff that was mentioned on IRC or Hacker News.
I’ve used Bland-Altman plots a lot when I was working on diagnostic tests and psychometric measurement. Interestingly, few medics really understood how to interpret those kinds of graphical displays, and how they are used in method comparison studies. I suspect they also dislike this approach since it does not provide formal hypothesis tests. Here we go, with three tests for accuracy, precision and agreement.
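As a reminder of what sits behind the plot, here is a minimal Python sketch of the classical quantities (bias and 95% limits of agreement). It is not the three tests proposed in the paper, the measurements are made up, and the 1.96 multiplier assumes approximately normal differences.

```python
from statistics import mean, stdev


def bland_altman(x, y):
    """Bias and 95% limits of agreement between two measurement methods."""
    diffs = [a - b for a, b in zip(x, y)]        # plotted on the vertical axis
    means = [(a + b) / 2 for a, b in zip(x, y)]  # plotted on the horizontal axis
    bias = mean(diffs)
    sd = stdev(diffs)
    return {
        "bias": bias,
        "lower": bias - 1.96 * sd,
        "upper": bias + 1.96 * sd,
        "means": means,
        "diffs": diffs,
    }


# Hypothetical paired measurements from two methods.
method_a = [10.0, 12.0, 14.0, 16.0]
method_b = [9.0, 12.5, 13.0, 16.5]
ba = bland_altman(method_a, method_b)
print(ba["bias"], ba["lower"], ba["upper"])
```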
Fond memories of Jan de Leeuw’s (and the GIFI nom de plume) own work.
A simple yet efficient graphical method to assess significant p-values computed using the Benjamini-Hochberg approach.
The FDR is a property of a set of p-values, not of any individual value.
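The quote is easier to appreciate with the step-up procedure in front of you. Below is a minimal Python sketch of the standard Benjamini-Hochberg rule (not the graphical method of the paper), and the p-values are invented for illustration.

```python
def benjamini_hochberg(pvalues, q=0.05):
    """Return a list of booleans: True where the hypothesis is rejected
    while controlling the FDR at level q (step-up procedure)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest rank k such that p_(k) <= (k / m) * q.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank / m * q:
            k_max = rank
    # Reject every hypothesis ranked at or below k_max.
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k_max:
            rejected[i] = True
    return rejected


# Hypothetical p-values.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(pvals))
```

Whether a given p-value is rejected depends on its rank among all the others, which is exactly why the FDR cannot be attached to a single value.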
This article deals with Fréchet regression on network data, using the corresponding space of graph Laplacians (with no restriction on their rank). It has two applications (NY taxi trips after COVID-19 and dynamic networks in aging brains) which I found quite interesting. Such an approach might prove superior to classical data mining on spatio-temporal data structures since we can capture both the structure and weight information of a network given relevant covariates, and assess the quality of adjustment for projected trends.
Not of much interest to me, except that it deals with potassium reference levels, a subject I have been familiar with (for personal reasons, unfortunately) for several years now. This remains, however, an interesting simulation study for mixed-effects models and big data aficionados.
This paper introduces a Stata command that allows one to compare
empirical CDFs. Contrary to the Kolmogorov-Smirnov approach,
distcomp allows one to test the equality of the distribution
functions point by point, using appropriate FWER control, and displays
ranges of values in which the distributions’ difference is statistically
significant.
Here are a few papers that I read over the past two months, in the CS and Stat category as usual.
At the time of this writing you can replace the x in arxiv URL to get a pretty HTML rendering of the article. See below for the second article discussed above.
I still haven’t really gotten into the Julia language. I follow the news here and there and I’m glad to know that there is now a native graphical backend, not just web output. The last time I tested Julia (v1.6), I found the runtime of a first script to be quite substantial, although I understand why that is the case. The language server is slow too, because of the initial indexing of the symbols contained in all the installed packages. Nevertheless, I think Julia has a future and I hope there will soon be a consolidated set of tools for statistical modeling, much like Frank Harrell’s rms package. As I said in a previous post, for the moment I want to investigate Racket and Lisp-like languages, or maybe Haskell, to see if they could provide a good fit for my own workflow (even if I rarely perform statistical analyses nowadays). Julia remains, however, an interesting approach to statistical computing nowadays.
Just out of curiosity, and because I was reading Introduction to Classical and Quantum Computing (PDF) by Thomas G. Wong, I fetched this paper on quantum computing for statisticians. Well, I haven’t read it actually, but it looks interesting as far as I can tell.
It’s been a long time since I last read a paper written by Michael Greenacre. Last time was probably when I was working on multi-block methods back in 2016. This paper discusses the original approach to compositional data analysis, which is usually associated with a lot of zeros in the dataset, and suggests that, contrary to the classical use of isometric logratios, soft constraints may yield correct and almost identical interpretation of the data in most cases. I have only skimmed the article but I will probably come back to it if I need to go deeper into the subject and the relationship between this type of approach and the techniques based on correspondence analysis.
The take-away message is that the primary reason why ISO C is poorly designed for OS development relates to a “design approach in the ISO standard that has given priority to certain kinds of optimization over both correctness and the ‘high-level assembler’ intentions of C, even while the latter remain enshrined in the rationale.” Among other things, pointers (casting from and aliasing to) remain problematic in several edge cases.
I still use C because this is one of the first languages I learned when I was in grad school (after Turbo Pascal, you name it). I like it because of its syntax and its simplicity, but I usually only write toy programs so I don’t have to bother with all its “pitfalls” regarding low-level and hard-core programming. I don’t use C++, but I’m rather interested in learning Rust because I like the tooling as well as rust-analyzer. Yet, I know it’s a hard path.
Here are a few papers that I read over the past months, in the CS and Stat category as usual.
For everything related to marginal effects, I usually trust Stata’s margins command. By the way, there is a dedicated book for this very useful command written by Michael N. Mitchell. The present article deals with estimating marginal effects in the case of non-linear models, which usually involves derivatives as discussed in an old post. Here the authors argue in favor of forward differences, hence the approach coined “forward marginal effects.”
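The contrast between the two definitions fits in a few lines. This is a minimal Python sketch for a logistic model with made-up coefficients (beta0 = -1, beta1 = 0.8 are assumptions, not estimates from the paper): the derivative-based effect is an instantaneous slope, while the forward difference is the change in prediction after moving x by a step h.

```python
import math


def predict(x, beta0=-1.0, beta1=0.8):
    """Predicted probability from a (hypothetical) fitted logistic model."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))


def derivative_me(x, beta0=-1.0, beta1=0.8):
    """Derivative-based marginal effect: d p / d x = beta1 * p * (1 - p)."""
    p = predict(x, beta0, beta1)
    return beta1 * p * (1.0 - p)


def forward_me(x, h=1.0, beta0=-1.0, beta1=0.8):
    """Forward difference: change in prediction when x increases by h."""
    return predict(x + h, beta0, beta1) - predict(x, beta0, beta1)


for x in (-2.0, 0.0, 2.0):
    print(x, derivative_me(x), forward_me(x, h=1.0))
```

For a tiny step the two coincide, but for a substantively meaningful step (say, one unit) they can differ noticeably because of the curvature of the link function.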
I once gave a brief overview of recommended techniques for the cross tabulation of two categorical variables. This article goes a lot deeper by discussing agreement measures in 2x2 tables, including the cases of extreme tables where there’s a high level of agreement or a large imbalance between cells. The authors suggest that Holley and Guilford’s G and Gwet’s AC1 are the best candidates, among many other agreement measures (Cohen’s kappa, Pearson’s r, Yule’s Q and Y, Scott’s \pi, Shankar and Bangdiwala’s B, Dice’s F1 and McNemar’s \chi^2).
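To see why extreme tables matter, here is a minimal Python sketch computing three of these measures from the four cell counts, using the textbook formulas for two raters and two categories. The table is hypothetical and chosen to be very imbalanced.

```python
def agreement_2x2(a, b, c, d):
    """Agreement measures for a 2x2 table with cells
    a = yes/yes, b = yes/no, c = no/yes, d = no/no.
    Assumes the table is not degenerate (chance agreement below 1)."""
    n = a + b + c + d
    po = (a + d) / n          # observed agreement
    p_row = (a + b) / n       # rater 1 says "yes"
    p_col = (a + c) / n       # rater 2 says "yes"

    # Cohen's kappa: chance agreement from the product of the margins.
    pe_kappa = p_row * p_col + (1 - p_row) * (1 - p_col)
    kappa = (po - pe_kappa) / (1 - pe_kappa)

    # Holley and Guilford's G: chance agreement fixed at 1/2.
    g = 2 * po - 1

    # Gwet's AC1: chance agreement 2 * pi * (1 - pi), pi = average margin.
    pi = (p_row + p_col) / 2
    pe_ac1 = 2 * pi * (1 - pi)
    ac1 = (po - pe_ac1) / (1 - pe_ac1)

    return {"po": po, "kappa": kappa, "G": g, "AC1": ac1}


# Hypothetical extreme table: 90% observed agreement, highly imbalanced cells.
print(agreement_2x2(90, 5, 5, 0))
```

With 90% observed agreement, kappa comes out slightly negative while G and AC1 stay high, the well-known “kappa paradox.”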
Unsupervised learning does not lead naturally to statistical inference, except maybe in the case of assessing clustering stability or choosing the optimal number of clusters. I discussed resampling-based approaches in another post. However, with added constraints like in Mclust, we can work out the likelihood and optimize the number of clusters, for instance. Another inferential issue is that of testing whether cluster members differ on average on some attributes. Usually, we don’t test attributes that were used to build the partitioning: it is often meaningless since we are usually trying to maximize the separation between clusters based on those very specific attributes, and in any case this would lead to an inflated Type I error rate. Most of the time, we establish and eventually test clustering profiles based on held-out variables. In this paper, the authors propose a finite-sample p-value that controls the selective Type I error for a test of the difference in means between a pair of clusters obtained using k-means clustering.
This article is in fact mostly a tutorial on moderation analysis using PS weighting, and illustrations in Stata (inline code) and R (external package) are provided.
Usually, latent trait models apply to categorical response variables (dichotomous or polytomous items, or rating scales). The objective is to build a continuous score reflecting the position of an individual on the latent trait being measured. For a more detailed overview, see my answer on Cross Validated. This article deals with restricted continuous responses (e.g. positive only or interval-based responses), and extends response time models.
The authors use restricted maximum likelihood (REML) estimators to separate the estimation of the regression coefficients and covariance parameters in order to estimate fixed-effects with L1 penalization in Gaussian linear mixed models.
The authors use the TOST technique to analyze replication equivalence studies. A “success interval” for the relative effect size is used to compare the replicate to the original study effect.
An interesting paper to complement a companion textbook I worked on a long time ago.
Here are a few papers that I read over the past weeks, in the CS and Stat category as usual.
Inverse probability weighting is used in survey and epidemiological analysis. I only used this technique once, and it was in quick comparison with propensity score analysis, where the key idea was in both cases to achieve balance between treated and control subjects in order to reduce confounding. See Chan, K. C. G., et al. Globally efficient non-parametric inference of average treatment effects by empirical balancing calibration weighting (Journal of the Royal Statistical Society: Series B 78(3), 2016) for recent work. In this article, the authors discuss the use of binning and smoothing, or, as an alternative, hierarchical modeling. Their conclusions are that those Bayesian estimators perform better (i.e., they have lower mean squared errors) compared to simple IPW estimators.
This article deals with the inference population that small sample studies usually focus on, and how it can be improved to foster valid generalizations (of the population average treatment effect, or PATE). This is based on the results from a cluster randomized trial in educational assessment. The two approaches considered here are coined “quantitatively optimal” and “policy-relevant” approaches. In the former case, the analysis relies on propensity scoring with varying cutoffs or the distributions of covariates. In the latter case, the idea is to identify subgroups that may benefit from the intervention. To improve precision, both approaches can be undertaken, while for bias reduction only the quantitatively optimal approaches seem relevant.
Along with classical parametric regression models or ensemble methods in machine learning, the super learner algorithm allows one to learn a prediction function without needing to settle on a fixed modeling approach: as an entire pre-specified and adaptive strategy, it allows the modeler to consider several predictive models at once. Here’s the big picture:
For the practitioner, the core decisions are to (1) specify the performance metric for the discrete super learner, (2) derive the effective sample size, (3) define the V-fold cross-validation scheme, (4) form a library of candidate learners, and (5) use the discrete super learner to select the candidate with the best cross-validated predictive performance.
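Steps (1), (3), (4) and (5) can be sketched in a few lines of plain Python. This is only a toy illustration under my own assumptions: mean squared error as the metric, 5-fold cross-validation, a library of three deliberately simple learners (mean, simple linear regression, 1-nearest neighbor), and simulated data. Step (2) is not covered, and none of the names come from the paper.

```python
import random
from statistics import mean


def fit_mean(xs, ys):
    m = mean(ys)
    return lambda x: m


def fit_linear(xs, ys):
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)


def fit_1nn(xs, ys):
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def cv_risk(fit, xs, ys, v=5, seed=0):
    """V-fold cross-validated mean squared error of one learner."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[k::v] for k in range(v)]
    errors = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        errors += [(ys[i] - model(xs[i])) ** 2 for i in fold]
    return mean(errors)


def discrete_super_learner(library, xs, ys, v=5):
    """Pick the candidate with the lowest cross-validated risk and
    refit it on the full data."""
    risks = {name: cv_risk(fit, xs, ys, v) for name, fit in library.items()}
    best = min(risks, key=risks.get)
    return best, risks, library[best](xs, ys)


# Simulated data: a noisy straight line (assumption, for illustration only).
rng = random.Random(1)
xs = [float(i) for i in range(30)]
ys = [1.0 + 2.0 * x + rng.gauss(0, 1) for x in xs]

library = {"mean": fit_mean, "linear": fit_linear, "1nn": fit_1nn}
best, risks, model = discrete_super_learner(library, xs, ys)
print(best, risks)
```

The same folds are used for every candidate (fixed seed), so the risks are directly comparable.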
This paper presents PaC-trees, a purely-functional data structure supporting functional interfaces for sets, maps, and sequences that provides a significant reduction in space over existing approaches. A PaC-tree is a balanced binary search tree which blocks the leaves and compresses the blocks using arrays. We provide novel techniques for compressing and uncompressing the blocks which yield practical parallel functional algorithms for a broad set of operations on PaC-trees such as union, intersection, filter, reduction, and range queries which are both theoretically and practically efficient.
See also Parallel Augmented Maps.
This article describes solutions to correct for the bias arising from overfitting datasets where p/n < 1. Besides regularization techniques, the asymptotic characteristic function of the ML estimators can be used to estimate the overfitting bias using only a small set of non-linear equations. The maths are dense, and I will probably need to reread the hard part.
At first sight, it sounded really interesting, especially since the authors used bootstrap, and perhaps with bioinformatics applications in mind:
Persistent homology (…) produces a statistical summary in the form of a persistence diagram, which captures the shape and size of the data.
Then, I realized that it’s a hard paper (for me at least). I’ll keep it under my hat for later, when I have clearer ideas.
I’ve been teaching the basics of recommender systems in an engineering school for two years. I liked it, especially when discussing matrix factorization algorithms and Co. Applications speak for themselves (everybody knows Amazon or Spotify, right?), and I missed Seth Brown’s beer recommendation engine. This paper is more about social networks and preferential attachment (rich-get-richer and homophily). The authors show that the number of vertices in their K-groups preferential attachment model follows a power law, and they discuss two applications on real web data.
Locality sensitive hashing (LSH) is used to hash similar items into the same buckets with high probability, which means it can be used as a data reduction or clustering method, bypassing any supervised learning process. However, this usually does not apply to ordered items, like trees, graphs or sequences. The authors survey relevant techniques for such hierarchical LSH algorithms.
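For plain (unordered) sets, the idea can be sketched with MinHash plus banding. The Python below is a minimal illustration under my own assumptions (SHA-1 based hash family, 16 hashes, 4 bands, invented documents); it is the classical set-based scheme, not one of the hierarchical techniques surveyed in the paper.

```python
import hashlib


def minhash_signature(items, num_hashes=16):
    """MinHash signature of a non-empty set of strings: for each seeded
    hash function, keep the smallest hash value over the set."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(
                hashlib.sha1(f"{seed}:{item}".encode()).digest()[:8], "big")
            for item in items
        ))
    return tuple(signature)


def lsh_buckets(sets, num_hashes=16, bands=4):
    """Split each signature into bands; sets sharing a whole band land in
    the same bucket and become candidate pairs."""
    rows = num_hashes // bands
    buckets = {}
    for name, items in sets.items():
        sig = minhash_signature(items, num_hashes)
        for b in range(bands):
            key = (b, sig[b * rows:(b + 1) * rows])
            buckets.setdefault(key, set()).add(name)
    return buckets


# Hypothetical documents represented as sets of words.
docs = {
    "d1": {"the", "quick", "brown", "fox"},
    "d2": {"fox", "brown", "quick", "the"},
    "d3": {"lorem", "ipsum", "dolor", "sit"},
}
buckets = lsh_buckets(docs)
candidates = [members for members in buckets.values() if len(members) > 1]
print(candidates)
```

The order of the elements plays no role here, which is precisely the limitation the survey addresses for trees, graphs and sequences.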
Yet another paper on log-linear analysis, this time in the context of mediation analysis (a hot topic in psychiatric and psychological studies in the last ten years when people realized that causal analysis may or should not be that easy).
Here, the author discusses the use of parametric or non-parametric simultaneous confidence intervals between the levels of a primary factor for both separate and joint secondary factor levels in an unbalanced factorial design with 2 or 3 factors (where one of the factors has more than 2 levels).
We can, and should, do statistical inference on simulation models by adjusting the parameters in the simulation so that the values of randomly chosen functions of the simulation output match the values of those same functions calculated on the data.
It reminds me of raking or margin adjustment in survey analysis.
(I know, I know, it’s a bit late for a January review.)
The Box-Cox transformation is one of the many tools one can use to ease traditional linear modeling of data. Semi-parametric regression, which I never really used for real work, is also an option. Flexible regression methods, as emphasized in Harrell’s RMS textbook1 or the use of generalized additive models2 are another. I’ve been doing some research on orthogonal polynomials, restricted cubic splines, and other spline fitting procedures, and then I found this paper. It introduces different frameworks (Truncated Power, B-spline, and P-Spline), and a quick application on a time-series dataset.
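As a reminder, the truncated power basis is the simplest of the three to write down: a global polynomial plus one truncated term per knot. A minimal Python sketch, with arbitrary knots chosen for illustration:

```python
def truncated_power_basis(x, knots, degree=3):
    """Evaluate the truncated power basis at x:
    1, x, ..., x^degree, then (x - k)_+^degree for each knot k."""
    polynomial = [x ** d for d in range(degree + 1)]
    truncated = [max(x - k, 0.0) ** degree for k in knots]
    return polynomial + truncated


# One row of the design matrix for x = 2, with hypothetical knots at 1 and 3.
print(truncated_power_basis(2.0, knots=[1.0, 3.0]))
```

Stacking such rows for every observation gives a design matrix that can be fed to ordinary least squares. The basis is known to be numerically ill-conditioned with many knots, which is one motivation for B-splines.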
Looks interesting at first sight, but I haven’t had time to read it. Maybe you will be interested, though.
Unlike the parametrizations classically used in Machine Learning, the parametrizations introduced here are all bijective and are even diffeomorphisms, thus allowing to keep the important properties from a statistical inference point of view, first of all identifiability
This is a 300-page long tutorial on probabilistic programming written by one of the authors of Anglican, which seems like a stalled project at this time. Still, it is a very valuable account of probabilistic programming at large and a peculiar implementation targeting the JVM. See also these slides.
This is a rather long monograph on RDDs. It covers local randomization (which assumes that the assignment of units above or below the cutoff was as if random inside a given window and that in this window the potential outcomes are unrelated to the score), fuzzy RDD (which accounts for non-compliance), and discrete and multidimensional designs. There’s a preliminary monograph from the same authors that only deals with the classical RDD, as well as a practical for medical applications.
Programming is ubiquitous in applied biostatistics; adopting software engineering skills will help biostatisticians do a better job.
A few months ago, I read on a blog that statisticians were very bad at programming because they lack skills in software engineering. Let me suggest two additional remarks: applied statisticians must provide results, quickly whenever possible, and when they can use agile programming they do, most of the time. This does not mean they will deliver a product of industrial grade, but if they can reproduce their own results six months later, this already is good news. I guess many software engineers never looked at statlib or established algorithms; you can use TDD, CI, regression tests and the like; if your algorithm is far from the standards, it won’t do any good.
Sometimes, I feel like I’m reading the same story every year: “biomarkers”, measurement error, practical significant difference, etc. Ioannidis warned us a long time ago.
The author suggests “probabilistic top lists as a novel type of prediction in classification, which bridges the gap between single-class predictions and predictive distributions.” His conclusion is that the Brier score is a good candidate. The problem with discrimination indices is that they perform poorly when the observed prevalence of the outcome is very low (which happens pretty often in medical applications) despite highly significant variables.
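For reference, here is a minimal Python sketch of the multi-class Brier score applied to a full predictive distribution. It does not implement the scoring of top lists discussed in the paper, and the class labels and probabilities are made up.

```python
def brier_score(probs, outcome):
    """Multi-class Brier score for one prediction: squared distance between
    the predicted probabilities and the one-hot encoded outcome."""
    return sum(
        (p - (1.0 if label == outcome else 0.0)) ** 2
        for label, p in probs.items()
    )


# Hypothetical predictive distribution over three classes.
prediction = {"a": 0.7, "b": 0.2, "c": 0.1}
print(brier_score(prediction, "a"))  # confident and right: small score
print(brier_score(prediction, "c"))  # confident and wrong: large score
```

Lower is better, and the score rewards calibrated probabilities rather than just the ranking of classes.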
I learned about Dex on Darren Wilkinson’s blog. Later on, I learned about Futhark, but see A comparison of Futhark and Dex. I never used any one of these DSLs, but I warmly recommend reading what Darren Wilkinson has to say about those frameworks.
Since I am no longer involved in psychometrics or this kind of stuff (even on Cross Validated), I rarely read this kind of paper nowadays. Selecting misfitting items on a scale has a long history, especially for Rasch modellers. This paper presents a sort of clustering method that allows one to flag items, based on the variance of a mixing distribution, that do not belong with a set of items sharing a common trait.
Still a memory of my previous job where I used to use optimal scaling a lot, but in the context of projection methods in multivariate exploratory data analysis. Here, Meulman and collaborators consider optimal scaling as an alternative framework to GAM to allow non-linearity in discrete predictors. In particular, this alleviates the need for dummy coded variable levels, via quantization with monotonicity constraints, which facilitates the interpretation of the resulting output. There will probably be an R package available at some point. See also ROS Regression: Integrating Regularization with Optimal Scaling Regression.
A complete course on graph theory, including network flows, for graduate students. It is rather extensive (422 pp.) and there are a lot of illustrations.
This is an extension of the permutation testing framework to the case where the number of predictors exceeds the sample size. In classical settings (p < n), the strategy amounts to computing the test statistic under random permutation of the residuals.3 When n \ll p, however, regularization methods like the elastic net approach must be used. The authors recall that for minimizing prediction error, ridge regression is often preferable to Lasso, principal components regression, variable subset selection and partial least squares. Moreover, ridge regression is close to the Freedman-Lane approach, which is based on semi-partial correlations. The authors finally suggest using double residualization, which is inspired by the Kennedy method, which residualizes both Y and X and proceeds to permute the Y-residuals, 4 but in this paper the authors replace the least squares regression by ridge regression.
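The mechanics of double residualization can be sketched in plain Python for a toy case. The assumptions here are mine: a single nuisance covariate z, a ridge penalty lam applied to its (centered) slope, the absolute correlation of residuals as test statistic, and simulated data with p < n. The authors’ high-dimensional procedure is more involved; this only shows the residualize-then-permute logic.

```python
import random
from statistics import mean


def ridge_residuals(y, z, lam=0.0):
    """Residuals of y after a ridge regression on one centered covariate z.
    With lam = 0 this is ordinary least squares."""
    mz, my = mean(z), mean(y)
    szz = sum((v - mz) ** 2 for v in z)
    szy = sum((v - mz) * (w - my) for v, w in zip(z, y))
    slope = szy / (szz + lam)
    return [w - my - slope * (v - mz) for v, w in zip(z, y)]


def corr(u, v):
    mu, mv = mean(u), mean(v)
    num = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    den = (sum((a - mu) ** 2 for a in u) * sum((b - mv) ** 2 for b in v)) ** 0.5
    return num / den


def double_residual_permutation_test(y, x, z, lam=0.0, n_perm=999, seed=1):
    """Residualize both y and x on z, then permute the y-residuals."""
    rng = random.Random(seed)
    ry = ridge_residuals(y, z, lam)
    rx = ridge_residuals(x, z, lam)
    observed = abs(corr(ry, rx))
    hits = 0
    for _ in range(n_perm):
        permuted = ry[:]
        rng.shuffle(permuted)
        if abs(corr(permuted, rx)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Simulated data (assumption): y depends on x even after adjusting for z.
gen = random.Random(0)
z = [float(i) for i in range(20)]
x = [zi + gen.gauss(0, 1) for zi in z]
y = [2.0 * xi + zi + gen.gauss(0, 0.1) for xi, zi in zip(x, z)]
p_value = double_residual_permutation_test(y, x, z, lam=1.0)
print(p_value)
```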
This is a short review of power analysis for k-means, Ward agglomerative hierarchical clustering, c-means fuzzy clustering, latent class analysis, latent profile analysis, and Gaussian mixture modelling. Results based on simulated datasets are summarized in Table 1 of the paper, reproduced below.
In this paper, the authors discuss the use of principal component, principal covariates, and partial least squares regression, instead of unsupervised PCA, with MICE. Results show that supervised approaches perform better, and that supervised principal component regression has smaller bias and better confidence interval coverage for a wider range of retained components, independent of the number of latent variables.
It’s a little bit funny to see the name of Ronald Fisher associated with AI in the title of a scientific paper, especially since AI now means LLMs and other brilliant foundations for the future millennium. Anyway, the iris dataset is used to illustrate the use of a discriminant function (as the ratio of squared mean and variance) to demonstrate whether “when we use the linear compound of the four measurements most appropriate for discriminating three such species, the mean value for I. versicolor takes an intermediate value and, if so, whether it differs twice as much from I. setosa as from I. virginica, as might be expected, if the effect of genes are simply additive, in a hybrid between a diploid and a tetraploid species.” 5 Hence, a genetic hypothesis, which can be exploited to find the optimal linear combination of population means, \mu_1 - 3\mu_2 + 2\mu_3 = 0 (respectively, setosa, versicolor and virginica), was at the heart of Fisher’s seminal paper, like many of his other contributions. Sadly, the iris dataset is mostly used in the context of unsupervised classification (k-means, for instance), as if people forgot that it relies at its heart on a well-defined hypothesis (much like people forgot or still ignore that Student (Gosset’s nom de plume) developed his test while working in a brewery). It seems the point of the authors is the notion of a discriminating rule, which, when applied to AI problems, amounts to a supervised approach that aims to find a “rule to separate distinct populations, and then to use that rule (or a different classification scheme) to allocate a new observation to the given populations.”
This article summarizes a long list of data repositories spanning a dozen or so domains (including satellite data or sports, for instance). I will keep it on hand just in case I need some dataset one day.
Propensity scoring is widely used in the absence of a controlled
experiment, e.g. in observational studies, where, as the authors say,
matching is used to pair treated and control units with similar values of
potential confounders (aka, the propensity score). Usually, matching is
done with replacement, that is, a control can be paired to several treated
units (1-to-n matching, with replacement). There are other
estimators (e.g., inverse probability weighting or doubly robust
methods), though. Oversampling
and replacement strategies may alter the balance between bias and
variance. In the case of matching without replacement, greedy approaches
are used; in Stata, there’s a command, calipmatch, which does
the job for you, while in R the de facto packages are probably
Matching and MatchIt. In this paper, the
authors review alternative approaches, namely optimal statistical
matching (linear balanced and unbalanced assignment problem, maximum
cardinality matching, and minimal cost matching).
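For comparison with the optimal approaches, here is what a greedy nearest-neighbor match without replacement looks like in plain Python. It is only a sketch under my own assumptions (treated units processed from the highest propensity score down, a caliper of 0.05, invented scores); it is not how calipmatch, Matching or MatchIt are implemented.

```python
def greedy_match(treated, controls, caliper=0.05):
    """Greedy 1-to-1 matching without replacement.
    treated / controls: dicts mapping unit id -> propensity score."""
    available = dict(controls)
    pairs = []
    # Process treated units from the highest score down (a common heuristic).
    for t_id, t_score in sorted(treated.items(), key=lambda kv: kv[1],
                                reverse=True):
        if not available:
            break
        # Closest remaining control on the propensity score.
        c_id = min(available, key=lambda c: abs(available[c] - t_score))
        if abs(available[c_id] - t_score) <= caliper:
            pairs.append((t_id, c_id))
            del available[c_id]  # without replacement: use each control once
    return pairs


# Hypothetical propensity scores.
treated = {"t1": 0.80, "t2": 0.55, "t3": 0.30}
controls = {"c1": 0.78, "c2": 0.52, "c3": 0.50, "c4": 0.10}
print(greedy_match(treated, controls))
```

Each decision is final, so an early match can take away the only acceptable control of a later treated unit. That is the weakness that the assignment-problem formulations address.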
Here is another paper by Achim Zeileis on regression trees. This time, it deals with “extended” GLMMs for identifying subgroups in growth curve modeling. It’s been a long time since I first used partykit for tree-structured regression and classification models, and this was in a benchmark with classical CART, logic regression, and ensemble methods (random forests and boosting); this didn’t end up in a paper, but it was a fun exercise. Since then, a lot of work has been put into this framework and I’m not really up to date. Unlike other methods for partitioning latent growth curves, this modeling framework sets up a mixed approach: “GLMM trees do not fit a full parametric model in each of the subgroups defined by the terminal nodes of the tree. Instead, fixed-effects parameters are estimated locally, using the observations within a terminal node, and the random-effects parameters are estimated globally, using all observations.”
I have used FDR correction for multiple comparisons here and there,
as an alternative to sequential methods like Holm’s approach (which is
the default in R’s p.adjust, btw). The authors offer an
alternative data-driven method to the classical Storey’s method (i.e., a
family of true null proportion of false null hypotheses estimators) for
estimating the FDR. In this case, the sorted p-values are approximated
using a piecewise linear function with one change point (inversion of
slope), as illustrated below. This is somewhat akin to the knee/elbow
method or scree plots for determining the number of principal components
or factors to retain in PCA or FA, which is discussed in section
C.1.
Yet another article discussing the benefit of using bootstrap to assess model goodness-of-fit, rather than relying on visual inspection of so-called residual plots (residuals vs. fitted values), especially in multivariate regression, or on either the White or Breusch-Pagan test. The test presented here allows one to test all assumptions of normal linear regression:
For test level \alpha, candidate normal linear model \mathcal{M}, bootstrap iterations B, sample size n, and a null hypothesis that \mathcal{M} is adequately specified:
As a sequel to this 10-years old post, here is a paper that uses genetic variants as instrumental variables (2-stage predictor substitution and the 2-stage residual inclusion) to estimate causal effects in GWAS on lifestyle and environmental factors.
I’m still not a big fan of the term data science (as a discipline), and I much prefer statistician or data engineer, but let’s forget about that for a moment. In this article, Rafael Alvarado offers to extend the consensual view according to which data science lies at the intersection between three main domains of expertise: computer science and technology, mathematics and statistics, and domain knowledge, which I found even more confusing. Specifically, he considers the following four main domains: value, design, systems, and analytics, to which a fifth domain may be added, namely practice. In a related article, his review led him to conclude that the three driving factors of all the buzz around the sexiest job the 21st century are “(1) the long-standing opposition between data analysts and data miners that continues to animate the field, (2) an established definition of the term as the practice of managing and processing scientific data that has been occluded by recent usage, and (3) the phenomenon of data impedance.” Much ado about nothing in my opinion, unless you consider that the volume and nature of data being processed these days implies a major change in terminology. But is it really that important if it’s just a question of terminology? Because the job remains fundamentally the same, after all.
A paper I read last year, thanks to Evan Miller, but go read his own blog post which covers pretty much all the math. While the authors originally proposed a Poisson model for bootstrapping differences in quantiles, whereby each observation in the original sample is included in each bootstrap sample according to a Poisson distribution with parameter 1, Evan Miller suggested an approximation which involves a Bessel function.
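As a rough illustration of the Poisson resampling idea only (this is neither the authors' implementation nor Evan Miller's Bessel-function approximation), here is a minimal Python sketch. The function names, the percentile interval, and the default settings are assumptions made for the example.

```python
import numpy as np

def weighted_quantile(values, weights, q):
    """Quantile of `values` when observation i is counted `weights[i]` times."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(values)
    v, w = values[order], weights[order]
    cum = np.cumsum(w)
    # First sorted value whose cumulative count reaches a fraction q of the total.
    idx = np.searchsorted(cum, q * cum[-1], side="left")
    return v[min(idx, len(v) - 1)]

def poisson_bootstrap_quantile_diff(x, y, q=0.5, n_boot=2000, seed=0):
    """Percentile interval for quantile_q(y) - quantile_q(x).

    Each observation enters a bootstrap replicate Poisson(1) times,
    so no explicit resampling of rows is needed.
    """
    rng = np.random.default_rng(seed)
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        wx = rng.poisson(1.0, size=len(x))
        wy = rng.poisson(1.0, size=len(y))
        diffs[b] = weighted_quantile(y, wy, q) - weighted_quantile(x, wx, q)
    lo, hi = np.percentile(diffs, [2.5, 97.5])
    return lo, hi
```

Because the counts are drawn independently for each observation, the replicate sample size is itself random, which is the main practical difference from the usual multinomial bootstrap; it is also what makes the scheme easy to run on streaming or distributed data.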
A new paper by Dianne Cook, whose contributions I have always greatly appreciated since I discovered ggobi (the website is still alive apparently). The “grand tour” (and other derivatives) is a method for viewing data in more than three dimensions using linear projections, essentially by rotating a lower-dimensional projection in high-dimensional space. I’ve only seen this implemented efficiently in ggobi, but there may have been other attempts. Frames are stacked together with small interpolation steps to create a smooth motion. However, projections are usually rotationally invariant (when using a geodesic interpolation algorithm), which may not be a desirable property when, e.g., we are looking for non-linear associations in a high-dimensional dataset. The present package works alongside the tourr package.
Although I’m no longer involved in medical statistics, I keep an eye on data visualization and applied methodology in the field. This article drew my attention a few weeks ago because it relies on multivariate projection methods, in this case correspondence analysis and biplots. There’s an associated GitHub repository where you will find code (shiny and ggplot) and illustrations. The advantage of biplots as summary displays is that they help highlight complex relationships (or, in some cases, interactions). For instance, in the associated BMC paper the authors published alongside this preprint, they used biplots to highlight discrepancies between single agent treatments and their combinations with oxaliplatin at all levels of AE classes, which explain most of the variability among treatments.
Contrary to Gaussian models or deep neural networks, which show poor performance in terms of identifiability, interpretability, and reproducibility, the authors propose to use the “NIFTY framework, which parsimoniously transforms uniform latent variables using one-dimensional nonlinear mappings and then applies a linear generative model.” Briefly stated, such an approach makes it possible to work with a standard linear latent factor model, which is easier to interpret than the aforementioned models and amounts to a Bayesian independent component analysis under certain conditions.
An interesting review of algorithms available to generate permutations, and specifically signed permutations, which are permutations of [n] objects with a \pm sign. In the case of permutations, there are n! solutions, while for signed permutations there are 2^n\cdot n! solutions. Not something I am really interested in at the moment, but I’ll keep it in mind in case I need it. There are a lot of nice illustrations, BTW.
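The counting argument is easy to check by brute force. The sketch below is a naive enumeration using Python's standard library, not one of the algorithms reviewed in the paper; the function name `signed_permutations` is made up for the example.

```python
from itertools import permutations, product
from math import factorial

def signed_permutations(n):
    """Yield every permutation of 1..n with an independent +/- sign on each entry."""
    for perm in permutations(range(1, n + 1)):
        for signs in product((1, -1), repeat=n):
            yield tuple(s * p for s, p in zip(signs, perm))

n = 3
all_signed = list(signed_permutations(n))
# n! orderings times 2^n sign patterns.
print(len(all_signed), 2**n * factorial(n))
```

This is only practical for small n, since the output grows as 2^n\cdot n!; the point of dedicated generation algorithms is precisely to visit these objects in a structured order with little work per step.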
Old reads…
I rarely use Postgres for my personal hobbies (I usually find that SQLite provides me with a convenient cost-effectiveness tradeoff for pet projects), but I have always appreciated it for long-term projects, as well as its extensions. This paper is a recollection of the early days of Postgres in the 80s.
Postgres was Stonebraker’s second system, and it was certainly chock full of features and ideas. Yet the system succeeded in prototyping many of the ideas, while delivering a software infrastructure that carried a number of the ideas to a successful conclusion. This was not an accident—at base, Postgres was designed for extensibility, and that design was sound. With extensibility as an architectural core, it is possible to be creative and stop worrying so much about discipline: you can try many extensions and let the strong succeed.
I wrote about Mendelian randomization a while ago. This paper is about its multivariate extension in the context of causal inference using instrumental variables in a study of genome-wide significant GWAS loci. When the instrumental set condition is satisfied, direct causal effects of all exposures in the model on the outcome variable of interest can be identified.
If you ever applied a Bonferroni correction (or any stepwise procedure) for multiple comparisons, you know that we assume that tests are independent, and in the case of Bonferroni it may be far too conservative. This is even more complicated in the case of online testing. An alternative approach is proposed in this article, namely an adaptive version of the alpha-spending algorithm used, e.g., in interim analyses, where we update the p-values depending on whether they are believed to have arisen under the null hypothesis or the alternative.
A Computationally Efficient Approach to False Discovery Rate Control and Power Maximisation via Randomisation and Mirror Statistic (<https://arxiv.org/abs/2401.12697>) is yet another paper on FDR rather than FWER control. See also Gridsemble: Selective Ensembling for False Discovery Rates for related work on estimating local false discovery rates in large-scale multiple hypothesis testing.
I barely used quantile regression in my previous life (aka when I was working as a full-time statistician), except in a few cases when dealing with growth curves and newborn data. I came across this paper because I’ve been a long-time follower of Evan Miller’s posts. He’s actually discussing how to implement a two-sample difference-in-quantile hypothesis test and how to construct confidence intervals based on a likelihood ratio test.
I like biplot graphical displays, despite their subtleties regarding how rows and/or columns of a correlation or distance matrix are adjusted. In this paper, the author suggests “a weighted alternating least squares algorithm is used, with either a single scalar adjustment, or a column-only adjustment with symmetric factorization; these choices form a compromise between the numerical accuracy of the approximation and the comprehensibility of the obtained correlation biplots.”
Are the Signs of Factor Loadings Arbitrary in Confirmatory Factor Analysis? Problems and Solutions was also mentioned in my daily arxiv digest. As I am slowly forgetting 15 years of previous work in psychometrics, I find it particularly hard to come back from time to time to such papers. Anyway, I’m happy to see that research continues in this field, especially for old questions like those. Finally, I came across this related work on PCA: Estimating the construct validity of Principal Components Analysis.
A potentially interesting paper I should read.
In this paper, we ask the question “Can pretraining help the lasso?”. We develop a framework for the lasso in which an overall model is fit to a large set of data, and then fine-tuned to a specific task on a smaller dataset. This latter dataset can be a subset of the original dataset, but does not need to be. We find that this framework has a wide variety of applications, including stratified models, multinomial targets, multi-response models, conditional average treatment estimation and even gradient boosting.
In the stratified model setting, the pretrained lasso pipeline estimates the coefficients common to all groups at the first stage, and then group-specific coefficients at the second “fine-tuning” stage. We show that under appropriate assumptions, the support recovery rate of the common coefficients is superior to that of the usual lasso trained only on individual groups. This separate identification of common and individual coefficients can also be useful for scientific understanding.
This analysis report presents an in-depth exploration of multiple hypothesis testing in the context of Genomics RNA-seq differential expression (DE) analysis, with a primary focus on techniques designed to control the false discovery rate (FDR). While RNA-seq has become a cornerstone in transcriptomic research, accurately detecting expression changes remains challenging due to the high-dimensional nature of the data. This report delves into the Benjamini-Hochberg (BH) procedure, Benjamini-Yekutieli (BY) approach, and Storey’s method, emphasizing their importance in addressing multiple testing issues and improving the reliability of results in large-scale genomic studies. We provide an overview of how these methods can be applied to control FDR while maintaining statistical power, and demonstrate their effectiveness through simulated data analysis.
The discussion highlights the significance of using adaptive methods like Storey’s q-value, particularly in high-dimensional datasets where traditional approaches may struggle. Results are presented through typical plots (e.g., Volcano, MA, PCA) and confusion matrices to visualize the impact of these techniques on gene discovery. The limitations section also touches on confounding factors like gene correlations and batch effects, which are often encountered in real-world data.
Ultimately, the analysis achieves a robust framework for handling multiple hypothesis comparisons, offering insights into how these methods can be used to interpret complex gene expression data while minimizing errors. The report encourages further validation and exploration of these techniques in future research.
An overview of different approaches to control the false discovery rate in multiple testing: Benjamini-Hochberg (BH), Benjamini-Yekutieli (BY), and Storey’s method. The BH method is already available in R’s p.adjust(). The BY method is just an extension of the BH method that remains valid under arbitrary dependence among test statistics, while Storey’s technique estimates the proportion of null hypotheses \pi_0 among the tested CDS. Whether you rely on a nominal \alpha level or q-value cutoffs, as in proteomics analyses, all those methods work equally well. The author’s conclusion is as follows:
Note that although this targets RNA-Seq analysis, where Multiple Hypothesis Testing is one of the many challenges raised by the big data omics era, this should apply to other contexts as well. It is worth noting that the authors included a simulation model for count data, using a Negative Binomial Model, which is what is usually found in pipelines like DESeq2.
I appreciated that the author provided a sidenote on statistical significance and biological relevance: “It is also crucial to recognize that statistical significance does not always imply biological importance, nor does a lack of significance disprove a biological hypothesis. In highly variable, small-sample contexts, meaningful changes may evade detection, reinforcing the need for well-powered experiments. Similarly, even a modest difference can be flagged as significant in extremely large datasets. These nuances highlight that DE detection should be interpreted within the broader scope of effect sizes, reproducibility, and biological validation.”
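For readers who want to see what the three procedures actually compute, here is a minimal NumPy sketch of BH and BY adjusted p-values, together with a fixed-\lambda version of Storey's \pi_0 estimate. It is not the report's code; the function names and the choice \lambda = 0.5 are assumptions made for the example, and real analyses would rather rely on an established implementation.

```python
import numpy as np

def adjust_pvalues(p, method="BH"):
    """BH or BY adjusted p-values (step-up, monotone, capped at 1)."""
    p = np.asarray(p, dtype=float)
    m = p.size
    order = np.argsort(p)
    ranks = np.arange(1, m + 1)
    scale = m / ranks
    if method == "BY":
        # Extra harmonic-sum factor that buys validity under arbitrary dependence.
        scale = scale * np.sum(1.0 / ranks)
    adj = p[order] * scale
    # Enforce monotonicity, working from the largest p-value downwards.
    adj = np.minimum.accumulate(adj[::-1])[::-1]
    adj = np.minimum(adj, 1.0)
    out = np.empty(m)
    out[order] = adj
    return out

def storey_pi0(p, lam=0.5):
    """Storey's estimate of the null proportion for a fixed tuning value lam."""
    p = np.asarray(p, dtype=float)
    return min(1.0, np.mean(p > lam) / (1.0 - lam))

def storey_qvalues(p, lam=0.5):
    """q-values obtained by rescaling BH adjusted p-values by the estimated pi0."""
    return storey_pi0(p, lam) * adjust_pvalues(p, "BH")

pvals = [0.001, 0.5, 0.04]
print(adjust_pvalues(pvals, "BH"))
print(adjust_pvalues(pvals, "BY"))
```

Written this way, the relationship between the methods is visible at a glance: BY is BH multiplied by a constant larger than 1 (more conservative), while Storey's q-values are BH multiplied by an estimate of \pi_0 that is at most 1 (less conservative when many hypotheses are non-null).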
We propose sufficient conditions and computationally efficient procedures for false discovery rate control in multiple testing when the p-values are related by a known dependency graph – meaning that we assume independence of p-values that are not within each other’s neighborhoods, but otherwise leave the dependence unspecified. Our methods’ rejection sets coincide with that of the Benjamini–Hochberg (BH) procedure whenever there are no edges between BH rejections, and we find in simulations and a genomics data example that their power approaches that of the BH procedure when there are few such edges, as is commonly the case. Because our methods ignore all hypotheses not in the BH rejection set, they are computationally efficient whenever that set is small. Our fastest method, the IndBH procedure, typically finishes within seconds even in simulations with up to one million hypotheses.
Yet another paper on controlling the false discovery rate. As mentioned above, when there’s some correlation between tested markers, a modification of the default FDR correction (Benjamini–Hochberg) is in order. The trick the authors used in this work is to take into account a dependency graph for p-values, which lets them assume independence of p-values that are not within each other’s neighborhoods, and otherwise leave the dependence unspecified.
Background and Objective: Variables collected over time, or longitudinally, such as biologic measurements in electronic health records data, are not simple to summarize with a single time-point, and thus can be more holistically conceptualized as trajectories over time. Cluster analysis with longitudinal data further allows for clinical representation of groups of subjects with similar trajectories and identification of unique characteristics, or phenotypes, that can be investigated as risk factors or disease outcomes. Some of the challenges in estimating these clustered trajectories lie in the handling of observations at inconsistent time intervals and the usability of algorithms across programming languages. Methods: We propose longitudinal trajectory clustering using a k-means algorithm with thin-plate regression splines, implemented across multiple platforms, the R package clustra and corresponding macros. The macros accommodate flexible clustering approaches, and also include visualization of the clusters, and silhouette plots for diagnostic evaluation of the appropriate cluster number. The R package, designed in parallel, has similar functionality, with additional multi-core processing and Rand-index-based diagnostics. Results: The package and macros achieve comparable results when applied to an example of simulated blood pressure measurements based on real data from Veterans Affairs Healthcare recipients who were initiated on anti-hypertensive medication. Conclusion: The R package clustra and the SAS macros integrate a K-means clustering algorithm for longitudinal trajectories that operates with large electronic health record data. The implementations provide comparable results in both platforms, satisfying the needs of investigators familiar with, or constrained by access to, one or the other platform.
It’s been years since I was last involved in health analytics and patient-reported outcomes, but this one reminded me of a post on Cross Validated about clustering of longitudinal data. Although I barely remember those days, I do recall one answer of mine which took up my entire morning and into which I put a lot of effort and attention.
We present a novel tuning procedure for random forests (RFs) that improves the accuracy of estimated quantiles and produces valid, relatively narrow prediction intervals. While RFs are typically used to estimate mean responses (conditional on covariates), they can also be used to estimate quantiles by estimating the full distribution of the response. However, standard approaches for building RFs often result in excessively biased quantile estimates. To reduce this bias, our proposed tuning procedure minimizes “quantile coverage loss” (QCL), which we define as the estimated bias of the marginal quantile coverage probability estimate based on the out-of-bag sample. We adapt QCL tuning to handle censored data and demonstrate its use with random survival forests. We show that QCL tuning results in quantile estimates with more accurate coverage probabilities than those achieved using default parameter values or traditional tuning (using MSPE for uncensored data and C-index for censored data), while also reducing the estimated MSE of these coverage probabilities. We discuss how the superior performance of QCL tuning is linked to its alignment with the estimation goal. Finally, we explore the validity and width of prediction intervals created using this method.
Again, it’s been a long time since I last ran an RF algorithm on real data. It looks like the RF framework has been extended far and wide, even to quantile regression. In this regard, the out-of-bag sample is used to estimate the bias of the marginal quantile coverage probability estimate.
Markov-switching models are a powerful tool for modelling time series data that are driven by underlying latent states. As such, they are widely used in behavioural ecology, where discrete states can serve as proxies for behavioural modes and enable inference on latent behaviour driving e.g. observed movement. To understand drivers of behavioural changes, it is common to link model parameters to covariates. Over the last decade, nonparametric approaches have gained traction in this context to avoid unrealistic parametric assumptions. Nonetheless, existing methods are largely limited to univariate smooth functions of covariates, based on penalised splines, while real processes are typically complex requiring consideration of interaction effects. We address this gap by incorporating tensor-product interactions into Markov-switching models, enabling flexible modelling of multidimensional effects in a computationally efficient manner. Based on the extended Fellner-Schall method, we develop an efficient automatic smoothness selection procedure that is robust and scales well with the number of smooth functions in the model. The method builds on a random effects view of the spline coefficients and yields a recursive penalised likelihood procedure. As special cases, this general framework accommodates bivariate smoothing, function-valued random effects, and space-time interactions. We demonstrate its practical utility through three ecological case studies of an African elephant, common fruitflies, and Arctic muskoxen. The methodology is implemented in the LaMa R package, providing applied ecologists with an accessible and flexible tool for semiparametric inference in hidden-state models. The approach has the potential to drastically improve the level of detail in inference, allowing to fit HMMs with hundreds of parameters, 10-20 (potentially bivariate) smooths to thousands of observations.
This is a novel approach for semiparametric inference in hidden-state models, which relies on incorporating tensor-product interactions into Markov-switching models, also known as hidden Markov models (HMMs), a mix of two stochastic processes (one observed and the other latent). It’s dense and I need to read this when I have more time. Now, let’s go back to holidays, FWIW.
The study of associations and their causal explanations is a central research activity whose methodology varies tremendously across fields. Even within specialized subfields, comparisons across textbooks and journals reveals that the basics are subject to considerable variation and controversy. This variation is often obscured by the singular viewpoints presented within textbooks and journal guidelines, which may be deceptively written as if the norms they adopt are unchallenged. Furthermore, human limitations and the vastness within fields imply that no one can have expertise across all subfields and that interpretations will be severely constrained by the limitations of studies of human populations. The present chapter outlines an approach to statistical methods that attempts to recognize these problems from the start, rather than assume they are absent as in the claims of ‘statistical significance’ and ‘confidence’ ordinarily attached to statistical tests and interval estimates. It does so by grounding models and statistics in data description, and treating inferences from them as speculations based on assumptions that cannot be fully validated or checked using the analysis data.
This is a chapter from the 3rd edition of the Handbook of Epidemiology (Pigeot and Ahrens, eds.). It’s been a long time since I last read papers or textbooks on epidemiology, but I had this one on my bookshelf at some point in my life. This is all about the study of associations and causal relationships outside the field of randomized clinical trials, which are discussed upfront:
The methods thus provide logically sound inferences only in ideal cases, rather than correct answers for our actual imperfect applications. They are however widely applied and misinterpreted in studies that fall far short of their assumptions, as when actual participation and treatment are not random. These problems are only partially accounted for by discussions of study limitations; occasionally they are further analyzed with speculative models for the non-random (systematic) aspects of the causal processes generating the data.
But the simplifying assumptions are a recurring theme in the rest of the chapter, as well as that of a fictional world in which all assumptions hold. There is a very enlightening discussion on p-values, how they are often misinterpreted, and how they should be regarded in light of observed data (see pp.19 ff.). And basically, the whole paper is about statistical significance and how not to fall into the trap of significant or non-significant p-values. Of note, Bayesian statistics are briefly discussed. I must admit I never encountered a single epidemiological paper using Bayesian inference.
The standard confidence interval for a population proportion covered in the overwhelming majority of introductory and intermediate statistics textbooks surprisingly remains the Wald confidence interval despite having a poor coverage probability, especially for small sample sizes or when the unknown population proportion is close to either 0 or 1. Using the mean coverage probability, and for some sample sizes, Agresti and Coull showed not only that the 95% Wilson confidence interval performs better, but also showed that 95% adjusted Wilson of type 4 confidence interval, obtained by simply adding four pseudo-observations, outperforms both the Wald and the Wilson confidence intervals. In this paper, we introduce a rainbow color code and pixel-color plots as ways to comprehensively compare the Wald, Wilson, and adjusted-Wilson of type \epsilon confidence intervals across all sample sizes n=1, 2, \dots, 1000, population proportion values p=0.01, 0.02, \dots, 0.99, and for the three typical confidence levels. We show not only that adding 3 (resp., 4 and 6) pseudo-observations is the best for the 90% (resp., 95% and 99%) adjusted Wilson confidence interval, but it also performs better than both the 90% (resp., 95% and 99%) Wald and Wilson confidence intervals.
While two-group comparisons with a continuous outcome are generally easily understood and carried out using a t or Wilcoxon test (although, in the latter case, with bad assumptions), the precision of proportion estimates is almost always assessed using Wald confidence intervals, aka the “normal approximation”. In this case, the interval for the proportion of successes is \hat p \pm z_{1-\alpha/2}\sqrt{\frac{\hat p(1 - \hat p)}{n}}, where z_{1-\alpha/2} is the 1-\alpha/2 quantile of the standard normal distribution (z=1.96 for a 95% confidence interval). An alternative option is to rely on Wilson’s CI, and the well-known Agresti-Coull interval provides a very good approximation to it. You may recall that in the case of a proportion, the standard error may be computed from the estimated proportion, \hat p, which yields Wald CIs, or from the proportion under H_0, p_0, which gives the score test statistic (also called z-test in most statistical packages). The Wilson CIs are more tedious to compute (see Formula 2 in the paper and this blog post). The authors describe adjusted Wilson intervals for the 90, 95, and 99% confidence levels. The color scheme retained for these analyses remains, however, quite surprising.
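To see how the three classical intervals differ in practice, here is a small standard-library Python sketch computing the Wald, Wilson, and Agresti-Coull intervals for x successes out of n trials. The function name `proportion_cis` is made up for the example, and the adjusted Wilson intervals of the paper are not reproduced here.

```python
from math import sqrt
from statistics import NormalDist

def proportion_cis(x, n, level=0.95):
    """Wald, Wilson and Agresti-Coull intervals for x successes out of n trials."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    p = x / n

    # Wald: standard error evaluated at the estimated proportion.
    half = z * sqrt(p * (1 - p) / n)
    wald = (p - half, p + half)

    # Wilson: obtained by inverting the score test.
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    wilson = (centre - half, centre + half)

    # Agresti-Coull: add z^2/2 pseudo-successes and z^2/2 pseudo-failures,
    # then apply the Wald formula to the adjusted counts.
    n_adj = n + z**2
    p_adj = (x + z**2 / 2) / n_adj
    half = z * sqrt(p_adj * (1 - p_adj) / n_adj)
    agresti_coull = (p_adj - half, p_adj + half)

    return {"wald": wald, "wilson": wilson, "agresti_coull": agresti_coull}

for name, (lo, hi) in proportion_cis(0, 10).items():
    print(f"{name:14s} [{lo:.3f}, {hi:.3f}]")
```

The extreme case of 0 successes out of 10 shows the problem with the Wald interval most clearly: it collapses to a single point, whereas the Wilson and Agresti-Coull intervals share the same centre and still give a usable upper bound.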
This article explores combinations of weighted bootstraps, like the Bayesian bootstrap, with the bootstrap t method for setting approximate confidence intervals for the mean of a random variable in small samples. For this problem the usual bootstrap t has good coverage but provides intervals with long and highly variable lengths. Those intervals can have infinite length not just for tiny n, when the data have a discrete distribution. The BC_a bootstrap produces shorter intervals but tends to severely under-cover the mean. Bootstrapping the studentized mean with weights from a Beta(1/2,3/2) distribution is shown to attain second order accuracy. It never yields infinite length intervals and the mean square bootstrap t statistic is finite when there are at least three distinct values in the data, or two distinct values appearing at least three times each. In a range of small sample settings, the beta bootstrap t intervals have closer to nominal coverage than the BC_a and shorter length than the multinomial bootstrap t. The paper includes a lengthy discussion of the difficulties in constructing a utility function to evaluate nonparametric approximate confidence intervals.
For small samples and/or skewed data, the Bayesian t-test is a good option (see this unfinished post of mine). The BC_a correction for a bootstrap t-test usually undercovers the mean, and other problems appear when we consider the studentized rather than the arithmetic mean. In this paper the authors consider various techniques to overcome limitations arising from small samples and they show that the power law bootstrap t and the beta bootstrap t (described in §2.5) provide good results, the latter being second order accurate.
See also, Hall, P. (1988). Theoretical comparisons of bootstrap confidence intervals. The Annals of Statistics, 16(3):927–953.
In microbiome studies, it is often of great interest to identify clusters or partitions of microbiome profiles within a study population and to characterize the distinctive attributes of each resulting microbial community. While raw counts or relative compositions are commonly used for such analysis, variations between clusters may be driven or distorted by subject-level covariates, reflecting underlying biological and clinical heterogeneity across individuals. Simultaneously detecting latent communities and identifying covariates that differentiate them can enhance our understanding of the microbiome and its association with health outcomes. To this end, we propose a Dirichlet-multinomial mixture regression (DMMR) model that enables joint clustering of microbiome profiles while accounting for covariates with either homogeneous or heterogeneous effects across clusters. A novel symmetric link function is introduced to facilitate covariate modeling through the compositional parameters. We develop efficient algorithms with convergence guarantees for parameter estimation and establish theoretical properties of the proposed estimators. Extensive simulation studies demonstrate the effectiveness of the method in clustering, feature selection, and heterogeneity detection. We illustrate the utility of DMMR through a comprehensive application to upper-airway microbiota data from a pediatric asthma study, uncovering distinct microbial subtypes and their associations with clinical characteristics.
I had to help a student two years ago with a microbial analysis, and she found an arcane R package on GitHub which relied on Spearman correlations computed on a table of counts, IIRC, and made it possible to build a network of co-occurring species, but without any way to take co-factors of interest into account. Unfortunately, the experimental design was flawed, so the analysis ended early. It is interesting to see that covariates can be taken into account in the analysis of such large datasets.
Study designs classified as quasi- or natural experiments are typically accorded more face validity than observational study designs more broadly. However, there is ambiguity in the literature about what qualifies as a quasi-experiment. Here, we attempt to resolve this ambiguity by distinguishing two different ways of defining this term. One definition is based on identifying assumptions being uncontroversial, and the other is based on the ability to account for unobserved sources of confounding (under assumptions). We argue that only the former deserves an additional measure of credibility for reasons of design. We use the difference-in-differences approach to illustrate our discussion.
This paper provides a use case of difference-in-differences, and its relation with the parallel trends assumption, which usually holds if the exposure is randomly assigned. Random treatment allocation, however, does not mean we need to use a DID estimator, and a post-treatment comparison should be preferred.
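For reference, the canonical two-group, two-period DID estimator boils down to a difference of two before/after changes. The sketch below uses made-up toy numbers (not data from the paper) and a hypothetical helper `did_estimate`, and it is only meaningful under the parallel trends assumption.

```python
from statistics import mean

def did_estimate(treated_pre, treated_post, control_pre, control_post):
    """2x2 difference-in-differences: change in treated minus change in control."""
    change_treated = mean(treated_post) - mean(treated_pre)
    change_control = mean(control_post) - mean(control_pre)
    return change_treated - change_control

# Toy outcomes (illustrative values only): the groups differ at baseline,
# and both drift upwards over time.
treated_pre, treated_post = [20, 22], [28, 30]
control_pre, control_post = [10, 12], [13, 15]

did = did_estimate(treated_pre, treated_post, control_pre, control_post)
post_only = mean(treated_post) - mean(control_post)
print(did, post_only)
```

In this toy setting the post-treatment comparison is dominated by the baseline gap between groups, which is exactly what DID removes. When treatment is randomly allocated there is no such systematic gap in expectation, which is why the simple post-treatment comparison is then sufficient.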
Tracking single fluorescent molecules has offered resolution into dynamic molecular processes at the single-molecule level. This perspective traces the evolution of single-molecule tracking, highlighting key developments across various methodological branches within fluorescence microscopy. We compare the strengths and limitations of each approach, ranging from conventional widefield offline tracking to real-time confocal tracking. In the final section, we explore emerging efforts to advance physics-inspired tracking techniques, a possibility for parallelization and artificial intelligence, and discuss challenges and opportunities they present toward achieving higher spatiotemporal resolution and greater computational and data efficiency in next-generation single-molecule studies.
For someone like me who deals with data coming from cellular biology, this provides an excellent overview of fluorescence microscopy.