A few years ago we discussed the Welford algorithm to compute the variance of a vector of numerical values without catastrophic cancellation. In addition to being numerically stable, the difference with Kahan-compensated two-pass computations for large data sets are very small. See also Numerically stable algorithm for computing the running mean for related approaches, and John D. Cook’s original post. The stability and numerical accuracy of one-pass algorithms are also discussed in several textbooks.1
While reviewing Newlisp statistical function in the stat module, I noticed it should get it wrong in this case:
(module "stat.lsp")
(set 'x (sequence 1 5))
(stat:var x)
;; => 2.5
(stat:var (map (lambda (x) (+ 1e10 x)) x))
;; => -16384
The author uses the formula (sub (sum-sq X) (div (mul (sum X) (sum X)) (length X)))
(later corrected for bias), which is the same that was discussed in the previous post.
Note that Racket has no problem with this cornercase.2 Let’s implement the single-pass Welford algorithm in Newlisp, as a rough translation of the R code provided in the previous post:
(define (variance xs)
(let ((mean 0.0)
(variance 0.0)
(n (length xs)))
(dotimes (i n)
(setq curr (nth i xs))
(setq tmp mean)
(setq mean (add mean (div (sub curr mean) (+ i 1))))
(setq variance (add variance (mul (sub curr mean) (sub curr tmp)))))
(div variance (- n 1))))
(variance (map (lambda (x) (+ 1e10 x)) x))
;; => 2.5
A more idiomatic way of writing a standard function in CL, which follows Racket’s two-pass approach, would be:
(defun mean (xs)
(/ (reduce #'+ xs) (length xs)))
(defun variance (xs)
(let ((mean (mean xs))
(n (length xs))
(/ (reduce #'+ (map 'list #'(lambda (x) (square (- mean x))) xs))
(1- n))))
I should not that Tom MacWright gets it right too in his simple-statistics module:
const ss = require("../dist/simple-statistics.js");
ss.sampleVariance([1, 2, 3, 4, 5].map((x) => x + 1e10));
The take away message is that you should either rely on a two-pass algorithm (i.e., compute the mean first, then the squared deviations from the mean) or the Welford algorithm, especially if you are interested in running statistics.
[2023-11-23]
As I was surveying the various statistical packages lying around on GitHub, I found many buggy implementations (e.g., basic-statistics.lisp
in LispUtils). However, cl-ana got it right:
(ql:quickload "cl-ana")
(defparameter *xs* '(1 2 3 4 5))
(cl-ana::variance *xs*)
;; => 5/2
(cl-ana::variance (mapcar (lambda (x) (+ 10000000 x)) *xs*))
;; => 5/2
♪ Tracy Chapman • Fast Car
See, e.g., N.J. Higham, Accuracy and Stability of Numerical Algorithms (2nd ed.), SIAM 2002. (section 1.9) ↩︎
Here’s the test case:
(require math)
(define x '(1 2 3 4 5))
(exact->inexact (variance x #:bias #t))
;; => 2.5
(exact->inexact (variance (map (lambda (x) (+ 1e10 x)) x) #:bias #t))
;; => 2.5