OCaml-vega-lite and OCaml dataframes

Hi everyone,

I’ve been working on ocaml-vega-lite, an OCaml representation of Vega-Lite. My priorities for this library are for it to be complete and safe, in the sense that hitting any representable value with to_yojson will yield a syntactically valid Vega-Lite JSON spec. To the extent possible, I’d also like it to be convenient and ergonomic.

I’m planning on supplementing this relatively low-level library with a higher-level one that supports interactive data exploration from environments like utop and IOCaml.

I have mixed feelings about ocaml-vega-lite as it currently stands, and I would love to get some feedback from the list. On the positive side, I feel that ocaml-vega-lite is on track to becoming complete and safe. It’s also usably ergonomic for generating visualizations in programs. The LOC counts for the OCaml in examples/ are comparable to those of the corresponding JSON in test/. The library’s types are relatively easy to discover in utop.

On the negative side, ocaml-vega-lite is still way too heavy to use for interactive data exploration. Compare the bar chart example:

open VegaLite.V2

type row = {
  a : string;
  b : int
} [@@deriving yojson]

let dataValues = [
    {a = "A"; b = 28}; {a = "B"; b = 55}; {a = "C"; b = 43};
    {a = "D"; b = 91}; {a = "E"; b = 81}; {a = "F"; b = 53};
    {a = "G"; b = 19}; {a = "H"; b = 87}; {a = "I"; b = 52}
  ]

let dat = `InlineData InlineData.{
    values = `JSONs (List.map row_to_yojson dataValues);
    format = None
  }

let enc : Encoding.t =
  let xf = PositionFieldDef.(make `Ordinal |> field (`String "a")) in
  let yf = PositionFieldDef.(make `Quantitative |> field (`String "b")) in
  Encoding.(make () |> x (`Field xf) |> y (`Field yf))

let jsonSpec = CompositeUnitSpec.(make (`Mark `Bar)
  |> description "A simple bar chart with embedded data."
  |> data dat
  |> encoding enc
  |> to_yojson)

to the concision of Scala’s Vegas:

Vegas("A simple bar chart with embedded data.").
  withData(Seq(
    Map("a" -> "A", "b" -> 28), Map("a" -> "B", "b" -> 55), Map("a" -> "C", "b" -> 43),
    Map("a" -> "D", "b" -> 91), Map("a" -> "E", "b" -> 81), Map("a" -> "F", "b" -> 53),
    Map("a" -> "G", "b" -> 19), Map("a" -> "H", "b" -> 87), Map("a" -> "I", "b" -> 52)
  )).
  encodeX("a", Ordinal).
  encodeY("b", Quantitative).
  mark(Bar).
  show

I can add a convenience layer in the higher-level library I mentioned; but right now that layer has a wide ergonomics gap to cover, which will mean lots of code and probably lots of ad-hoc decisions on my part. I’d love to find a way to get ocaml-vega-lite to be a bit lighter and more ergonomic without sacrificing completeness or safety.

To illustrate the challenges I’ve encountered, consider the “x” field of Encoding.t. In pseudocode, the possible shapes of this value are:

| `Field of (PositionFieldDef.t = {
    type_: Type.t;
    timeUnit: TimeUnit.t option;
    stack: StackOffset.t option;
    sort: [ `Field of VegaLite.V2.SortField.t | `Order of VegaLite.V2.SortOrder.t ] option;
    scale: Scale.t option;
    field: [ `Repeat of string | `String of string ] option;
    bin : [ `Bool of bool | `Params of VegaLite.V2.BinParams.t ] option;
    axis : VegaLite.V2.Axis.t option;
    aggregate : [ `Mean | `Median | `Stdev | ... ] option;
  })
| `Value of [ `Bool of bool | `Float of float | `String of string | `Int of int ]

Currently, ocaml-vega-lite makes you think about a lot of details when constructing this type:

enc |> x (`Field PositionFieldDef.(make `Ordinal |> field (`String "a")))

By contrast, Vegas lets you say encodeX("a", Ordinal). If I understand correctly, there are two pieces that make this concision possible:

  1. Ad-hoc polymorphism. If you said encodeX(1.0), Vegas would know that the 1.0 means (`Value (`Float 1.0)). This isn’t possible in OCaml, but in theory one could have something like encodeX(`String ("a", `Ordinal)) and encodeX(`Float 1.0) (see the sketch after this list).

  2. A little bit of opinionation on the part of Vegas. In encodeX("a"), the "a" could mean

    • In the Field variant of PositionFieldDef,
      • Either the Repeat or String variant of field
      • scale.scheme or scale.range
      • axis.title, axis.titleAlign or axis.format
    • The Value variant of the x channel

    It looks like Vegas decides that, since “a” is a column in the data, it should be interpreted as the String variant of field. This type of decision would arguably be more at home in the higher-level library.
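To make the polymorphic-variant idea concrete, here is a minimal sketch of a helper built on top of the API shown above. The helper name x_of is hypothetical and not part of ocaml-vega-lite:

let x_of = function
  | `Field (name, typ) ->
    (* a named column with an explicit measurement type *)
    `Field PositionFieldDef.(make typ |> field (`String name))
  | `Value v ->
    (* a constant value for the channel *)
    `Value v

(* Usage: enc |> Encoding.x (x_of (`Field ("a", `Ordinal))) *)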

Any thoughts on how to lighten ocaml-vega-lite up a bit, or on anything else about the library, would be welcome.

Cheers,
Anand


To follow up, after much trial and error I’ve found a way to get concision comparable to Vegas’: https://github.com/apatil/aplomb. The trick was to derive a few functions that “soak up” the required polymorphism and evaluate to generic Vega-Lite types.

As a bonus, that layer adds a little bit of safety as well as concision. It encourages you to only use field references that are present in the dataset you’re visualizing.

The tradeoff is that you have to write a little bit of setup code before you can start plotting, but I don’t think that will detract much from the interactive experience. In utop, the setup could be handled by a script that you pass to -init. From that point, once you get used to Vega-Lite the interactive experience should feel pretty light.

hi, i looked at aplomb on github. i have not used vega-lite before but i do have some experience plotting tables with pandas. in the typing scheme of aplomb it seems that you have to do some setup for each type of table you want to plot, e.g. if a new column is added you need a new plotting module – this still feels quite heavy to me. in my experience it happens now and then that i want to plot various differently-dimensioned tables and compare the plots – this would require several plotting modules in your scheme.

not sure i have a better idea. one thing might be to have a type for the entries instead of the whole row at once, i.e. a variant type like [`Ordinal of int | `Quantitative of float] etc.
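for instance (just a sketch, names illustrative):

type entry = [ `Ordinal of int | `Quantitative of float | `Nominal of string ]
type row = (string * entry) list  (* column name paired with a typed entry *)

let r : row = [ ("a", `Nominal "A"); ("b", `Quantitative 28.) ]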

Thanks, that is great feedback. In hindsight, there’s no reason why the setup stuff can’t be shortened to

type row = {... } [@@deriving aplomb]

Also, if someone wants to bring their own data frame type that isn’t just a list of records, I can document how to use the Make() functor. You or the author of the data frame type would have to write 3-4 functions in a module to pass to the functor.
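For illustration, the functor input might look roughly like this (a hypothetical signature; the actual Make() in aplomb may differ):

module type FRAME = sig
  type t                                       (* the data frame *)
  type row                                     (* a single row *)
  val column_names : t -> string list          (* discover the columns *)
  val rows : t -> row list                     (* enumerate the rows *)
  val row_to_yojson : row -> Yojson.Safe.json  (* serialize a row for Vega-Lite *)
end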

For a data frame type that’s dynamic enough to let you add and remove columns ad hoc, the author of the data frame type could apply the functor once in a library, and then you wouldn’t have to do any setup at all.

That opens up another interesting conversation: how do you typically represent data frames in OCaml? I have been thinking that they should be like optimized lists of records, where the column names and types are known at compile time. You pointed out that it would be nice for them to support dynamic shape manipulations like adding columns, e.g. they could be like optimized maps from string to [ `Int of int list | `Float of float list | `String of string list | ... ].

The benefit of the more dynamic behavior would be that you wouldn’t have to set up all your types ahead of time and that we wouldn’t have to give up as much of the dynamic functionality of R and Python data frames. The cost would be that you’d have to keep checking whether a column exists & has the expected type throughout your code.

My instinct is that the cost outweighs the benefit. I like the idea of checking for errors when reading a dataset, then knowing that the rest of my code will not reference a nonexistent column or make bad assumptions about data types.

However, when checking out a new dataset for the first time it is nice to be able to explore it interactively without first having to somehow scan the column headers & types, then write a row type. Maybe we need both the static and dynamic data frame varieties, or maybe we need derivers that generate static row types from serialized data formats like CSV files, feather and so forth.
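Such a deriver doesn’t exist yet; purely as an illustration, the workflow could look like this (hypothetical ppx syntax):

(* hypothetical ppx that reads data.csv at compile time *)
type row = [%csv_row "data.csv"]  (* would expand to e.g. { a : string; b : int } *)
let frame : row array = [%csv_load "data.csv"]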

I haven’t had a chance to look at your library in detail yet, but it looks perfect for generating plots of build status from the OCaml Build Infrastructure.


That would be sweet! @n4323 raised an issue that can and should be addressed, I’ll hopefully get to it this weekend & update the readme, then it would be great to get your feedback. I’d also be very interested in hearing your thoughts on the right way(s) to represent dataframes in OCaml.

i guess what’s best depends on the use case. if you have millions of records in a huge dataframe – a given dataset you’re working with – i can see that static typing could be the fastest solution. my use cases were multiple transient small dataframes of various shapes used simultaneously, e.g. to plot data. code generation with a ppx might be able to do both, as you mention.

i think ocaml needs good dataframe support in general – again, the static vs dynamic typing issue will come up. for general ideas on dataframes, maybe owl is another place to look, i think @ryanrhymes is considering adding dataset support.


I had indeed planned to introduce dataframes to Owl in July. Then I basically encountered most of the issues you have discussed here, which were more challenging than I expected. Later, work on Owl’s core and the Zoo system distracted me from this; I also need more time to think through the various trade-offs in implementation. But it’s really great to see the discussion started :slight_smile:

i think this question could require careful design considerations. my pandas experience is that it’s powerful but its complex and sometimes cryptic API feels like learning a whole new language. basically whenever i need to do anything in pandas i google on stackoverflow because it’s quite hard to deduce the right way to do it or to even parse the huge documentation.

pandas dataframes are very flexible: e.g. they can be transposed, so that columns and rows interchange their meaning. they also allow hierarchical row and column indexing, and have functions for flattening these indices etc – a lot of operations change the structure of the table. all of this seems incompatible with static row types?

but i’m not sure if all this flexibility is necessarily a good thing for all use cases. there is some appeal to the simplicity of a ‘record’ in a table being an actual record in the language.

another point of reference could be hdf5 (https://support.hdfgroup.org/HDF5/doc1.6/UG/11_Datatypes.html). within an hdf5 file, one has a hierarchical filesystem-like structure with datasets stored at the tree nodes. the datasets are either flat extensible arrays or multidimensional fixed-size arrays; the array elements can have user-defined structured types, e.g. records. these are stored within the hdf5 file so that the file is self-describing.

iiuc, a putative ocaml dataframe would correspond more or less to a flat-array hdf5 dataset with a record datatype.

(I was hoping to get Aplomb to the point of demo-ability this weekend but it’s taking a bit longer, stay tuned.)

@n4323 and @ryanrhymes, thanks for moving the dataframe discussion forward, I’m also really happy to see it happening. Pinging @rizo too as I gather from his github page that he has given some thought to dataframes in OCaml.

@n4323 to your point about HDF5, I feel that the right backend is Arrow https://arrow.apache.org/docs/memory_layout.html. It was designed by the authors of Pandas and Dplyr to be a fast memory representation of data frames that’s exchangeable between languages. I have used hdf5 and it’s great, but arrow has been designed for R/Pandas-like dataframes specifically and would maximize interoperability with libraries built around those dataframes in other languages. We could still put a variety of OCaml interfaces on top of either arrow or hdf5.

To open up the discussion about the interface a bit, there are dataframe representations in other strongly typed, functional languages that we can use for inspiration:

And of course there are many great ideas in dataframe interfaces from dynamically typed languages:

I haven’t had the chance to fully digest all of that but I feel that the big decision to be made in OCaml is how the dataframe should trade off safety, simplicity and flexibility.

With a very flexible type like

type column = [ `Int of int array | `Float of float array | `Bool of bool array | `IntOpt of int option array | ... ]
type dataframe = (string * column) array

we could duplicate most functionality from the R and Pandas-related packages above. However, as @n4323 said, too much flexibility is not always helpful. This type also happens to be syntactically heavy in OCaml because there’s no sugar for Map.t literals. A row would look like this:

let row = [| ("x", `Int 0); ("y", `Float 3.2) |]

Also, with this flexible dataframe type you’d have to add “column-exists” and “column-has-type” checks throughout your code. You’d probably end up saving some keystrokes by using partial pattern matches that may throw exceptions, e.g.

let y_23 : float = frame
  |> columnGet "y" (* may raise if the column doesn't exist *)
  |> function
    | `Float a -> a.(23)
    (* partial pattern match: any other column type raises Match_failure *)

These exceptions would for all intents and purposes be runtime type errors. That’s a bummer; I’m here talking about implementing a dataframe in OCaml rather than using an existing one in R or Python because there have been so many production bugs in my life that could have been prevented by OCaml’s type system.
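Even replacing the exceptions with options only moves the checking to every use site. A sketch, assuming a hypothetical option-returning accessor columnGet_opt:

let y_23 : float option =
  match columnGet_opt "y" frame with
  | Some (`Float a) when Array.length a > 23 -> Some a.(23)
  | _ -> None  (* missing column, wrong column type, or index out of range *)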

As we’ve started to discuss, stronger dataframe types introduce their own problems. To make these explicit, consider the case where a dataframe is an array of records:

type row = {x : int; y : float}
type 'row dataframe = 'row array

This is a pretty darn safe type. I can’t compile code that tries to reference a nonexistent column or that tries to get ints out of a float-valued column. I can’t make a ragged dataframe, all the columns have to have the same length. I can try to get a row index that exceeds the length of the dataframe, but I already have that problem with lists and arrays.
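To illustrate the guarantees:

let df : row dataframe = [| { x = 0; y = 3.2 }; { x = 1; y = 4.5 } |]
(* df.(0).z      -- rejected at compile time: row has no field z *)
(* df.(0).y + 1  -- rejected at compile time: y is a float, not an int *)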

Now, let’s see how we’d implement a subset of the dplyr API (https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html).

  1. Reading in data from a CSV file: We’d have to either manually create the dataframe type in our code, or ppx it directly from the CSV. If we created it manually, we could check it against the CSV file at compile time, similarly to how PGOcaml can check types against Postgres databases at compile time.

  2. filter and arrange: No problem. Any row-based operation is pretty easy (see the sketch after this list). The types for these operations would be:

    filter : 'row dataframe -> ('row -> bool) -> 'row dataframe
    arrange : 'row dataframe -> ('row -> 'row -> int) -> 'row dataframe
    
  3. select: This is getting harder. The select operation takes a dataframe and a column selector and returns a new dataframe containing only the columns in the selector. Say we ppx’ed a type for column references:

    type colRef = [ `x | `y ]
    

    and we want ‘column selector’ to mean a list of column references. The type would be:

    select : 'row dataframe -> colRef list -> 'otherRow dataframe
    

    where 'otherRow is a new type that actually depends on what’s in the colRef list. We can’t even ppx all of these types ahead of time because there are 2^n of them in the number of columns.

    So our ‘select’ has to be either:

    • PPX that generates the type 'otherRow and also generates a function from 'row dataframe to 'otherRow dataframe.
    • Some type-level magic that I don’t know about
    • A composite operation where we ‘melt’ a strongly typed dataframe into a more dynamic one, subset the columns of that, and then ‘freeze’ it into another strongly typed dataframe, checking for errors during the freeze. We’d have to manually create the type of the new dataframe.

    Select needs to feel light and intuitive in any case, as it is a common operation.

  4. Transmute: Not hard. Its type is

    transmute : 'row dataframe -> ('row -> 'otherRow) -> 'otherRow dataframe
    

    so it’s just map for dataframes.

  5. Mutate: This is hard. To distil the difficulty, consider a simpler function that horizontally concatenates two dataframes:

    hconcat : 'row dataframe -> 'otherRow dataframe -> ('row + 'otherRow) dataframe
    

    where what I mean by ('row + 'otherRow) is a record type that includes all the fields in 'row and also all the fields in 'otherRow. I don’t know of a way to represent such types in OCaml aside from manually entering them, which would feel too heavy for an intuitively simple operation like hconcat. As with select, we need either PPX, some kind of language extension, or a melt/freeze.

  6. Summarise: We can get close enough by implementing reduce/fold for data frames (also covered in the sketch after this list).

  7. sample_n and sample_frac: No problem.

  8. group_by: It’s not as clear how this would fit into an OCaml dataframe API; let’s defer it until we understand the simpler operations.
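To make the easy cases concrete, here is a minimal sketch of filter, arrange, transmute and summarise over the array-of-records dataframe above (illustrative code, not a proposed API):

let filter (df : 'row dataframe) (pred : 'row -> bool) : 'row dataframe =
  Array.of_list (List.filter pred (Array.to_list df))

let arrange (df : 'row dataframe) (cmp : 'row -> 'row -> int) : 'row dataframe =
  let sorted = Array.copy df in
  Array.sort cmp sorted;
  sorted

(* transmute is just map for dataframes *)
let transmute (df : 'row dataframe) (f : 'row -> 'otherRow) : 'otherRow dataframe =
  Array.map f df

(* summarise reduces to a fold *)
let summarise (df : 'row dataframe) (init : 'acc) (step : 'acc -> 'row -> 'acc) : 'acc =
  Array.fold_left step init df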

I don’t have ready answers for how we would tackle the hard operations, column subset & horizontal concat. The Haskell Frames package (http://acowley.github.io/Frames/) has an interesting-looking take on column subsetting, but it uses fancy type system extensions.

Looking forward to continuing this discussion.

Anand


It is perfectly possible to implement list concatenation at the type level using GADTs and a few tricks. For instance, suppose that I am interested in specifying a data frame as a heterogeneous list of column types:

type _ elt =
  | Int: int elt
  | Float: float elt

module L = struct
  type ('list,'tail) list =
    | (::) : 'a elt * ('b,'tail) list -> ('a * 'b, 'tail) list
    | []: ('elt,'elt) list
end
open L
let data = [Int;Float]

Then the tail type can be used to implement list concatenation:

let rec (@): type a b c. (a,b) list -> (b,c) list -> (a,c) list =
  fun x y -> match x with
    | a :: q -> a :: (q @ y)
    | [] -> y

let data2 = [Float] @ data

Similarly, one can select a specific index from a type list:

type (_ , _) index =
  | Z: ('a * 'b , 'a) index
  | S: ('s,'r) index -> ( _ * 's, 'r) index

let rec take: type input t output. (input, output) index 
  -> (input, unit) list 
  -> output elt =
  fun n l -> match n, l with
    | _ , [] -> .
    | Z,  a :: _ -> (a:output elt)
    | S k, _ :: q -> take k q

let Float = take Z data2

Going one step further and constructing a list by indexing with a list of indices is perfectly possible, solving point 3 (select):

module Filter = struct
  type ('input,'output,'tail) f =
    | [] : ('input,'output,'output) f
    | (::): ('input,'output) index * ('input, 'a, 'tail) f 
  -> ('input, 'output * 'a, 'tail) f
end

let rec filter: type indices t list. (list,indices,t) Filter.f 
-> (list,unit) L.list -> (indices,t) L.list =
  fun indices list -> match indices with
    | Filter.[] -> []
    | Filter.( a :: q ) -> L.( (take a list) :: filter q list )

For instance,

 let [Float;Float] = filter Filter.[Z; S (S Z)] data2

This can be combined with a dynamic version of the data frame representation via a dynamic type, type dyn = Dyn: ('a,'b) L.list [@@unboxed]. The main advantage of this approach is that it then becomes perfectly possible to read the data frame from a file, while preserving the ability to safely cast to a statically typed version with

let rec cast: type a b. (a,b) L.list -> dyn -> (a,b) L.list option =
  fun spec (Dyn d) -> match spec, d with
    | [], [] -> Some []
    | Float :: q, Float :: q' ->
      begin
        match cast q (Dyn q') with | None -> None | Some t -> Some(Float::t)
      end
    | Int :: q, Int :: q' ->
      begin
        match cast q (Dyn q') with | None -> None | Some t -> Some(Int::t)
      end
    | _ -> None

@octachron Wow, I had no idea OCaml supported types like these! This looks really promising.

The dynamic type

type dyn = Dyn: ('a,'b) L.list [@@unboxed]

doesn’t compile for me, with or without the [@@unboxed] property, on 4.04.0. Which version of OCaml are you on?

I merely forgot -> dyn when copying this type:

type dyn = Dyn: ('a,'b) L.list -> dyn [@@unboxed]
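With that fix, the cast can be exercised as expected:

let ok = cast [Int; Float] (Dyn [Int; Float])   (* Some [Int; Float] *)
let mismatch = cast [Float] (Dyn [Int; Float])  (* None *)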

You might want to look at the slap library, which is an API for BLAS/LAPACK with statically typed sizes. It’s not exactly what you want, but it presents how you can use GADTs in practice for this kind of work, as @octachron showed.


You might also be interested in looking at xarray. In my experience, it offers a nicer API than plain pandas.


I didn’t know about that, it looks very cool. Language-agnostic, performance-optimized sharing of nested structured data. Yay. It seems to fit a similar goal to the HDF5 Dataset specification, with the added property of being tailored to CPU layout and streaming. It seems that conversion to HDF5 is not hard either.

Didn’t know that either. This architecture: https://xray.readthedocs.io/en/stable/why-xarray.html#core-data-structures makes a lot of sense to me as a more general version of a data frame.

The xarray model could potentially fit in with a typed-row approach: A 1D array of (glorified) row records would then correspond to a pandas Series, and an N-D array (with the addition of dimension labels) of row records would correspond to an xarray Dataset.

At first sight, this looks appealing. However, it’s unclear to me whether this generality is just plain better or whether it makes sense to consider series data and 2D tables separately. Also, obviously, I don’t know if and how the typed-row approach can be made general and usable enough using GADT tricks.

Hi everyone,

Thanks @drup, @theindigamer and @n4323 for your insights.

@n4323 I’m also not clear on what tradeoffs ndarray columns would require. Arrow does seem to support ndarray columns: https://github.com/apache/arrow/blob/master/format/Tensor.fbs .

Re: the usability of a dataframe based on GADT tricks, I’m still digesting @octachron’s example and trying to come up with a strawman dataframe that we can use to think about the user experience. Will follow up on that conversation when I’m able to make progress.

In the interest of finishing one thing at a time, I’ve gotten Aplomb to the point that it’s worthwhile to seek more feedback from anyone who has the time and interest. @avsm, I’d love to know how well or poorly it would meet your plotting needs for the build infrastructure. Note that it depends on two unpublished packages:

I’ve incorporated @n4323’s previous feedback about the heaviness of the setup by:

Data management still feels awkward, but it won’t really be possible to polish it until we have a dataframe.

Some specific things on which I would love to get feedback:

Cheers,
Anand