Semagle

Optimization of F# implementation of SVM

The post “SVM Classification in F#” shows how quickly SVM classification and the Sequential Minimal Optimization (SMO) method can be implemented in F#, but it does not show how fast the F# implementation runs. Unfortunately, the performance of that code is too low for practical applications. Evidently, the .NET execution model and the Mono virtual machine have intrinsic limitations that prevent the code from reaching native-code performance, and this overhead needs to be estimated. However, the main factor is the computational complexity of the implementation, and this problem can be solved.

SVM Classification in F#

Support Vector Machines (SVMs) are a very popular machine learning method for classification, regression, distribution estimation, etc. An exceptional feature of this method is its ability to handle objects of a diverse nature as long as there is a suitable kernel function. Nonetheless, popular software libraries like LIBSVM [1] and SVMlight [2] are designed for vector data, and it is hard to adapt them to other object types. An F# implementation seems promising in terms of readability and extensibility.
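To illustrate why a kernel function is all the method needs, here is a minimal sketch of a generic kernel type and the standard SVM decision value f(x) = Σ αᵢ yᵢ K(xᵢ, x) + b. It is not the code from the post: the names `Kernel`, `rbf`, `decision` and the parameter `gamma` are assumptions chosen for illustration.

```fsharp
// A kernel is any similarity function over objects of some type 'X.
// (Hypothetical type name, used only in this sketch.)
type Kernel<'X> = 'X -> 'X -> float

// Gaussian (RBF) kernel on float arrays; gamma is an assumed width parameter.
// Array.fold2 raises an exception if the two arrays differ in length.
let rbf (gamma: float) : Kernel<float[]> =
    fun x y ->
        let d2 = Array.fold2 (fun acc a b -> acc + (a - b) * (a - b)) 0.0 x y
        exp (-gamma * d2)

// Decision value f(x) = sum_i alpha_i * y_i * K(x_i, x) + b.
// Each support vector is given as (x_i, y_i, alpha_i).
let decision (k: Kernel<'X>) (support: ('X * float * float) list) (b: float) (x: 'X) =
    b + List.sumBy (fun (xi, yi, ai) -> ai * yi * k xi x) support

// Usage: two hand-picked support vectors, one per class.
let support = [ ([| 0.0 |], 1.0, 1.0); ([| 2.0 |], -1.0, 1.0) ]
let score = decision (rbf 0.5) support 0.0 [| 0.0 |]
printfn "decision value: %f" score
```

Because `decision` only calls `k`, replacing `rbf` with a kernel over strings, trees or graphs needs no other change, which is the extensibility argument made above.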

Random Numbers in F#

Many problems in engineering, finance and statistics cannot be solved by direct methods, but a great number of them can be solved approximately using randomized algorithms. All those algorithms need flexible and efficient pseudo-random number generators (PRNGs). Implementing an efficient PRNG in F# is somewhat tricky.

Summary statistics in F#

Summary statistics are commonly used to build a simple quantitative description of a set of observations. Simple descriptions include mean, variance, skewness and kurtosis, which are quantitative measures of the location, spread and shape of the data distribution. However, straightforward implementations of these measures in F# do not scale to large amounts of data. There are more sophisticated methods, but imperative implementations of those methods use mutable variables. Nonetheless, the mathematical definitions of those methods make it possible to build efficient functional implementations using higher-order functions in F#.
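As a rough illustration of the idea, the sketch below computes the mean and sample variance in a single pass with `Seq.fold`, using Welford's update rule and no mutable variables. This is one well-known approach and not necessarily the method used in the post; the function name `meanVariance` is hypothetical.

```fsharp
// Single-pass mean and sample variance expressed as a fold (Welford's update).
// State: (count, running mean, M2), where M2 is the running sum of squared
// deviations from the current mean.
let meanVariance (xs: seq<float>) =
    let step (n, mean, m2) x =
        let n' = n + 1.0
        let delta = x - mean
        let mean' = mean + delta / n'
        (n', mean', m2 + delta * (x - mean'))
    let (n, mean, m2) = Seq.fold step (0.0, 0.0, 0.0) xs
    // The sample variance is undefined for fewer than two observations.
    let variance = if n > 1.0 then m2 / (n - 1.0) else nan
    (mean, variance)

// Usage on a small data set.
let (sampleMean, sampleVariance) =
    meanVariance [ 2.0; 4.0; 4.0; 4.0; 5.0; 5.0; 7.0; 9.0 ]
printfn "mean = %f, variance = %f" sampleMean sampleVariance
```

Because the state is a small immutable tuple, the same `step` function works on lazy sequences that do not fit in memory; skewness and kurtosis can be added by extending the tuple with higher-order moment sums.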

Data Sources in F#

Three data formats are used very frequently in data science: CSV (Comma Separated Values), JSON (JavaScript Object Notation) and XML (Extensible Markup Language). The F# Data library (FSharp.Data) implements almost everything you need to access data stored in CSV, JSON and XML formats. Moreover, FSharp.Data implements F# type providers that infer the record structure from a sample document and thus allow the record structure to be checked at compile time.