Summary statistics are commonly used to build a simple quantitative description of a set of observations. Simple descriptions include mean, variance, skewness and kurtosis, which are quantitative measures of location, spread and shape of the data distribution. However, straightforward implementations of these measures in F# do not scale to large amounts of data. There are more sophisticated methods, but imperative implementations of those methods use mutable variables. Nonetheless, mathematical definitions of those methods allow to build effective functional implementation using higher-order functions in F#.

# Simple Mean and Variance

The F# implementation of mean $\bar{x}=\sum_{i=1}^N x_i$ for a sequence of any length is simple and needs only one pass for processing of the entire sequence:

However, the straightforward implementation in F# of variance $s^2=\frac{\sum_{i=1}^N (x_i-\bar{x})^2}{N-1}$ requires first to compute mean and then compute variance, i.e. the sequence will be iterated two times:

There is another way to compute mean $\bar{x}=\sum_{i=1}^N x_i$ and variance $s^2=\frac{\sum_{i=1}^N x_i^2 - (\sum_{i=1}^N x_i)^2/N}{N-1}$ in one pass, but in practice the precision of the computation will be much less than the inherent precision of floating-point operations in general. Even above mentioned naive implementation is numerically stable only if $N$ is small.

There is a numerically stable algorithm for the sample variance and mean that is based on Welford’s recurrent equation 1:

Through the Python implementation or any other imperative implementation is using mutable variabes for accumulating $M_k$ and $M_{2,k}$

It is possible to reproduce this computation in F#, but it would be both “ugly” and “twisting” the mathematical definition. Moreover, the F# impementation of $M_k$ and $M_{2,k}$ recurrent equation is quite simple:

The function M is intentionally using tuple as the first argument, which structure is similar to the returned result. Using this definition it is easy to compute mean and variance with Seq.fold.

# Correctness of Implementation

The implementation of the online mean and variance computation is relatively simple. Though, it is also easy to make a mistake and use M_k_1 instead of M_k in M_2k_1 + delta * (x - M_k). The first thing that comes in mind in order to avoid bugs is to write a unit test 2:

Nonetheless, many questions arise:

How many unit tests are needed to prove that the implementation is correct? 1, 2, 10, 100, 1000, … There is no answer to this question and, fortunately, it is not needed.

There is another question that have the definite answer:

What properties need to be checked in order to be sure that the implementation is correct? The answer is simple enough - mean and variance computed by the online algoritm should be equal to mean and variance computed by naive implementation of mean and variance for any valid input 3:

Valid inputs are sequences of normal float numbers with a limited variance:

and a quick check shows that the required property holds for 100 randomly generated valid inputs:

A brief analysis of the function M shows that is has bigger value than the simple computation of mean and variance for the sequence, e.g.:

• First, it allows to compute both the variance $\sigma^2$ of the population of a total size $N$ and the unbiased estimate $s^2$ for a finite number $n$ of samples:
• Second, it can be applied to any “foldable” type:
• Third, it can be easily applied to multiple input values:

# Conclusions

The F# language has advanced language features and additional tools that greatly simplify implementation and sanity checks for the summary statistics algorithms. Moreover, the resulting code is extremely flexible and can be easily reused for solving other kinds of problems.

1. B. P. Welford Note on a method for calculating corrected sums of squares and products // Technometrics. - 1962. - 4(3) - pp. 419–420.