DATA SOURCES IN F#

Three data formats are used very frequently in data science: CSV (Comma-Separated Values), JSON (JavaScript Object Notation) and XML (Extensible Markup Language). The F# Data library (FSharp.Data) implements almost everything you need to access data stored in these formats. Moreover, FSharp.Data implements F# type providers that infer the record structure from a sample document and thus allow the record structure to be checked at compile time.

CSV Files

For reading and writing CSV files, the FSharp.Data package implements CsvProvider. The provider can be initialized either by passing the Sample parameter, whose value is a string containing a CSV sample or the path of a file that contains one:

type Teams = CsvProvider<"baseball.csv">

or by passing the header and the schema parameters:

type Teams = CsvProvider<"Team,League,Year,RS,RA,W,OBP,SLG,BA,Playoffs,RankSeason,RankPlayoffs,G,OOBP,OSLG",
                         Schema="string,string,int,int,int,float,float,float,float,int,int?,int?,float,float,float">

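Initialization from sample data is perfect for fast prototyping, but when distributing the resulting library or executable it is better to specify the header and schema parameters, or to embed the sample as a resource and name it in the EmbeddedResource parameter (in full, the value has the form "AssemblyName, resourceName"; the fragment below shows only the resource name):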

EmbeddedResource="baseball.csv"

Teams is a type that CsvProvider generates from the sample data. To read a CSV file, call the static method Load, which returns an instance of the Teams type:

let teams = Teams.Load(System.IO.Path.Combine(System.Environment.CurrentDirectory, "baseball.csv"))

The teams instance has a Rows property that returns a sequence of CSV records. Each record has type Teams.Row, which exposes the CSV fields as properties whose names and types are checked at compile time. For example, the following fragment, meant to select the teams that made the playoffs, produces the compile-time error “This expression was expected to have type int, but here has type string” (with Playoffs declared as int, as in the schema above):

let playoffs = teams.Rows |> Seq.filter (fun r -> r.Playoffs = "1")
for r in playoffs do
  printfn "%s" r.Team

The teams instance has Filter and Map methods that perform basic transformations of the CSV records and return a modified instance of the provided type. The Rows property, by contrast, has type seq<Teams.Row>, an ordinary F# sequence, so the functions of the Seq module can transform and analyse it in a multitude of ways:

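let yearOf (r : Teams.Row) = r.Year

// Number of distinct seasons in the file
printfn "%d" (teams.Rows |> Seq.distinctBy yearOf |> Seq.length)

// Playoff teams per season; Playoffs is compared with an int, so this compiles
let playoffs = teams.Rows |> Seq.filter (fun r -> r.Playoffs = 1)
for (year, rs) in (playoffs |> Seq.groupBy yearOf) do
  printfn "%d %d" year (rs |> Seq.length)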
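
The Filter and Map methods mentioned above can be sketched as follows. This is a minimal script, assuming an F# Interactive session that can resolve the FSharp.Data NuGet package; the MiniTeams type and its four-column inline sample are hypothetical stand-ins for the real baseball.csv, and Schema="Playoffs=int" is used so that the 0/1 column is typed as int.

```fsharp
#r "nuget: FSharp.Data"
open FSharp.Data

// Hypothetical inline sample; the real baseball.csv has more columns.
// Playoffs=int overrides the type of that one column by name.
type MiniTeams = CsvProvider<"Team,Year,W,Playoffs\nARI,2012,81,0\nATL,2012,94,1\nNYY,2011,97,1",
                             Schema="Playoffs=int">

// GetSample returns the sample itself as data.
let sample = MiniTeams.GetSample()

// Filter keeps the matching rows and returns another MiniTeams instance.
let playoffTeams = sample.Filter(fun r -> r.Playoffs = 1)

// Map rebuilds every row; here the team code is lower-cased.
let lowered =
    playoffTeams.Map(fun r -> MiniTeams.Row(r.Team.ToLower(), r.Year, r.W, r.Playoffs))

// The result is still a CSV document, so it can be serialized again.
printfn "%s" (lowered.SaveToString())
```

The difference from Seq.filter and Seq.map is the return type: Filter and Map keep the CSV wrapper, so the result can be saved with Save or SaveToString, while the Seq functions return a plain sequence.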

JSON Files

For reading JSON files, the FSharp.Data package implements JsonProvider. The provider is initialized by passing the Sample parameter, whose value is a string or a file that contains a JSON sample or a list of samples:

type Businesses = JsonProvider<"yelp_business.json", SampleIsList=true>

JsonProvider does not support JSON schema definitions, but samples can be embedded as resources in the resulting library or executable for distribution:

EmbeddedResource="yelp_business.json"

Unfortunately, JsonProvider does not support the newline-delimited JSON file format (JSON Lines), but it is easy to turn such a file into a sequence of JSON objects by parsing each line with Seq.map:

open System.IO
...
let businesses = File.ReadLines("yelp_training_set/yelp_training_set_business.json")
                   |> Seq.map Businesses.Parse

Now, analysis of JSON data becomes easy and type-safe:

for (state, bs) in (businesses |> Seq.groupBy (fun b -> b.State)) do
  printfn "%s %d" state (bs |> Seq.length)
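
The line-by-line approach can be tried without the Yelp data set. The sketch below assumes an F# Interactive session that can resolve the FSharp.Data NuGet package; the Biz type, its inline sample list and the name and stars fields are hypothetical, and a list of strings stands in for File.ReadLines.

```fsharp
#r "nuget: FSharp.Data"
open FSharp.Data

// Hypothetical sample list; only the state field is taken from the text above.
type Biz = JsonProvider<"""[ { "name": "Cafe A", "state": "AZ", "stars": 4.5 },
                             { "name": "Diner B", "state": "NV", "stars": 3.0 } ]""",
                        SampleIsList=true>

// Stand-in for File.ReadLines: one JSON object per line.
let lines =
    [ """{ "name": "Cafe A", "state": "AZ", "stars": 4.5 }"""
      """{ "name": "Diner B", "state": "NV", "stars": 3.0 }"""
      """{ "name": "Bar C", "state": "AZ", "stars": 4.0 }""" ]

// Seq.map is lazy, so Seq.cache keeps the parsed objects and
// avoids parsing every line again on each enumeration.
let parsed = lines |> Seq.map Biz.Parse |> Seq.cache

// Seq.countBy is a shorter alternative to groupBy followed by Seq.length.
let counts = parsed |> Seq.countBy (fun b -> b.State) |> Seq.toList

for (state, n) in counts do
    printfn "%s %d" state n
```

Caching matters most when the sequence is enumerated more than once: without it, each pass over a File.ReadLines sequence reads and parses the file again.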

XML Files

For reading XML files, the FSharp.Data package implements XmlProvider. The provider is initialized by passing the Sample parameter, whose value is a string or a file that contains an XML sample or a list of samples:

type News = XmlProvider<"""<?xml version="1.0" encoding="iso-8859-1"?>
<newsitem itemid="2286" id="root" date="1996-08-20" xml:lang="en">
    <title>Sample Title 1</title>
    <headline>Sample Headline 1</headline>
    <byline>Sample Author 1</byline>
    <dateline>Sample Date Line 1</dateline>
    <text>
        <p>Sample Text 1</p>
        <p>Sample Text 2</p>
    </text>
</newsitem>""">

XmlProvider loads or parses one XML document at a time, but with sequence expressions it is easy to build a sequence of lazily parsed XML trees, here from the entries of every zip archive in the current directory:

open System.IO
open System.IO.Compression

let news = seq {
  for file in Directory.EnumerateFiles(".", "*.zip") do
    use zip = ZipFile.OpenRead(file)
    for entry in zip.Entries do
      use reader = new StreamReader(entry.Open())
      yield News.Parse(reader.ReadToEnd()) }

Unlike DOM trees, the typed XML trees generated by XmlProvider are easy to process:

for n in news do
  printfn "%s" n.Headline
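
The same pattern can be tried without any zip archives. The sketch below assumes an F# Interactive session that can resolve the FSharp.Data NuGet package; the Item type, its trimmed-down sample and the two in-memory documents are hypothetical stand-ins for the news files.

```fsharp
#r "nuget: FSharp.Data"
open FSharp.Data

// Hypothetical sample with only two of the child elements shown earlier.
type Item = XmlProvider<"""<newsitem>
    <title>Sample Title 1</title>
    <headline>Sample Headline 1</headline>
</newsitem>""">

// Stand-ins for documents that would normally come from files or archives.
let documents =
    [ "<newsitem><title>First</title><headline>Rates rise</headline></newsitem>"
      "<newsitem><title>Second</title><headline>Markets fall</headline></newsitem>" ]

// Nothing is parsed here; each document is parsed when the sequence is enumerated.
let items = seq { for doc in documents do yield Item.Parse doc }

// Headline is a typed string property, so no casts or node lookups are needed.
let headlines = items |> Seq.map (fun n -> n.Headline) |> Seq.toList

printfn "%A" headlines
```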

Conclusions

Unlike most statically typed languages, F# lets you start analysing data quickly, without creating domain objects or doing tedious work with DOM-like structures. At the same time, type providers generate strongly typed structures, which minimize the silly mistakes that are common in dynamically typed languages.
