Summary statistics are commonly used to build a simple quantitative description of a set of observations. Simple descriptions include mean, variance, skewness and kurtosis, which are quantitative measures of location, spread and shape of the data distribution. However, straightforward implementations of these measures in F# do not scale to large amounts of data. There are more sophisticated methods, but imperative implementations of those methods use mutable variables. Nonetheless, mathematical definitions of those methods allow to build effective functional implementation using higherorder functions in F#.
Simple Mean and Variance
The F# implementation of mean \(\bar{x}=\sum_{i=1}^N x_i\) for a sequence of any length is simple and needs only one pass for processing of the entire sequence:
However, the straightforward implementation in F# of variance $s^2=\frac{\sum_{i=1}^N (x_i\bar{x})^2}{N1}$ requires first to compute mean and then compute variance, i.e. the sequence will be iterated two times:
There is another way to compute mean \(\bar{x}=\sum_{i=1}^N x_i\) and variance \(s^2=\frac{\sum_{i=1}^N x_i^2  (\sum_{i=1}^N x_i)^2/N}{N1}\) in one pass, but in practice the precision of the computation will be much less than the inherent precision of floatingpoint operations in general. Even above mentioned naive implementation is numerically stable only if \(N\) is small.
Advanced Implementation
There is a numerically stable algorithm for the sample variance and mean that is based on Welford’s recurrent equation ^{1}:
Through the Python implementation or any other imperative implementation is using mutable variabes for accumulating \(M_k\) and \(M_{2,k}\)
It is possible to reproduce this computation in F#, but it would be both “ugly” and “twisting” the mathematical definition. Moreover, the F# impementation of \(M_k\) and \(M_{2,k}\) recurrent equation is quite simple:
The function M
is intentionally using tuple as the first argument, which structure is
similar to the returned result. Using this definition it is easy to compute mean and variance
with Seq.fold
.
Correctness of Implementation
The implementation of the online mean and variance computation is relatively simple. Though,
it is also easy to make a mistake and use M_k_1
instead of M_k
in M_2k_1 + delta * (x  M_k)
.
The first thing that comes in mind in order to avoid bugs is to write a unit test ^{2}:
Nonetheless, many questions arise:
How many unit tests are needed to prove that the implementation is correct? 1, 2, 10, 100, 1000, … There is no answer to this question and, fortunately, it is not needed.
There is another question that have the definite answer:
What properties need to be checked in order to be sure that the implementation is correct? The answer is simple enough  mean and variance computed by the online algoritm should be equal to mean and variance computed by naive implementation of mean and variance for any valid input ^{3}:
Valid inputs are sequences of normal float numbers with a limited variance:
and a quick check shows that the required property holds for 100 randomly generated valid inputs:
Advanced Applications
A brief analysis of the function M
shows that is has bigger value than the simple computation of
mean and variance for the sequence, e.g.:
 First, it allows to compute both the variance \(\sigma^2\) of the population of a total size \(N\) and the unbiased estimate \(s^2\) for a finite number \(n\) of samples:
 Second, it can be applied to any “foldable” type:
 Third, it can be easily applied to multiple input values:
Conclusions
The F# language has advanced language features and additional tools that greatly simplify implementation and sanity checks for the summary statistics algorithms. Moreover, the resulting code is extremely flexible and can be easily reused for solving other kinds of problems.

B. P. Welford Note on a method for calculating corrected sums of squares and products // Technometrics.  1962.  4(3)  pp. 419–420. ↩

FsCheck https://fscheck.github.io/FsCheck ↩