Feedback on first library before release

Hello! I’m rather new to the OCaml community, and am getting ready to release my first ever OCaml library, which is a library for parsing Wikidata JSON!

Before I release it and need to start worrying about making breaking changes, I’d like to get some feedback and make sure I’m not messing anything up majorly. Please take a look and tell me if you have any feedback or things I should do differently before I go ahead with the initial release. Thanks!

A few things I’m particularly worried about, or would like feedback on, are:

  • library organization
  • library interface
  • function naming
  • exception handling
  • making sure I didn’t mess up the build process, and it can actually be built on other people’s machines

EDIT: Oh, and I’m also a little worried about the name. Is Wikidata too broad a name for a library that’s just for parsing Wikidata JSON?


Disclaimer. It is a matter of taste so please don’t take the below for granted. Different people have different views and different experiences.

  1. Refrain from using algebraic data types (variants and records) in public interfaces. Keep everything abstract and provide corresponding accessors. Pros: it will make it easy to add new fields and change the inner representation. Cons: it will make people more annoyed.

  2. Prefer Result.t or Option.t to exceptions. You can provide functions that raise exceptions in addition to total functions. Pros: it will make explicit that a function is non-total and require special attention. Cons: some functions will require too much attention and extra performance cost.

  3. Be very reluctant to use OOP objects. Despite the O in OCaml, it is better not to use objects unless it is really necessary (i.e., unless you need open recursion and plan for implementation inheritance, which are the pros of objects). The cons are tight coupling between components and leaked abstractions.

  4. Keep data as data. Use abstractions for policies and behaviors. This is probably the most important point, and it sort of defeats the whole purpose of your library, which creates a thin abstraction over the Wikidata data model using rigid ADT types. I am not really sure what the purpose of this abstraction would be, and I expect that it will pose more problems than it solves. E.g., using fields of records and constructors of variants to read the fields is cumbersome as they are not first class, so library users will find themselves repeating the same code patterns with different field names. In other words, an ADT is a bad abstraction, so bad that it is better to use stringly-indexed dictionaries or just the ATD-generated data types directly.
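Points 1 and 2 above can be sketched together. This is a minimal illustration only, not the library's actual interface: the module name Entity, the flat string-pair representation, and the function names of_fields, id, id_exn and label are all assumptions made up for the example.

```ocaml
(* Hypothetical module: all names and fields are illustrative only. *)
module Entity : sig
  type t  (* abstract: the representation can change without breaking users *)
  type error = Missing_field of string

  val of_fields : (string * string) list -> t
  val id : t -> (string, error) result  (* total: failure is visible in the type *)
  val id_exn : t -> string              (* convenience variant that raises *)
  val label : t -> string option
end = struct
  (* Assumed representation: a flat association list of string fields. *)
  type t = (string * string) list
  type error = Missing_field of string

  let of_fields fields = fields

  let id t =
    match List.assoc_opt "id" t with
    | Some id -> Ok id
    | None -> Error (Missing_field "id")

  let id_exn t =
    match id t with
    | Ok id -> id
    | Error (Missing_field f) -> invalid_arg ("Entity.id_exn: missing " ^ f)

  let label t = List.assoc_opt "label" t
end

(* Made-up sample data. *)
let complete = Entity.of_fields [ ("id", "Q0"); ("label", "an example") ]
let broken = Entity.of_fields [ ("label", "no identifier") ]
```

Because t is abstract, the association list could later be swapped for a record or a JSON tree without touching any caller.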

To summarize: my suggestion is to hide all data type definitions. You can keep the ATD-generated files and provide them as the low-level interface, or you can hide them, but there is no need to duplicate them in the public interface. Instead, I would suggest focusing on the Wikidata data model as a knowledge representation system and trying to model it with an abstract interface. This will automatically help you avoid OCaml objects, as there is no need to map each Wikidata object to a different type of OCaml object. They should all have the same type.

I would also focus on how your library should be used and develop user stories. The best libraries are those that were developed to solve concrete problems. If you design a library for the sake of library with no application in mind you will end up with an interface that is hard to use.


Thank you so much for the detailed review and critique! I can definitely see that I probably focused too much on making a slightly more OCaml-friendly version of the JSON rather than actually using it to represent the Wikidata data model in OCaml in a way useful to end users. To be clear, I think my library still serves a purpose (in particular the Wikidata JSON format is usually used for displaying information about Entities in a human-readable way which I feel my library can be quite ergonomic for in places, whereas the RDF format is used more for querying) but I can definitely see that there are a lot of things that could be implemented better.
However, there are some things I’m unsure how to fix. In particular, Items and Properties share all but one field, which is why I chose objects: to be able to use row polymorphism. Using abstract types and accessors would require lots of double implementations for simple “represent this thing as a string” tasks, for example. Furthermore, the sheer variety of Wikidata Snak types (there are three types of Snaks, two empty and one which can hold 17 different types of data which are physically stored in 5 different ways) is why they seemed to me to be a natural fit for variant types containing records. I don’t see how one could do this with abstract types without requiring dozens of different accessor functions specific to each data format.
If possible, do you have any good examples of libraries attempting to do similar things as mine in OCaml so that I can study how they solve these issues? One of the reasons my library is very OO is because I based mine partially off of the qwikidata library for Python.
Again, thank you so much!


I agree but see how your strict typing actually hampers the printing task, e.g., citing your own example,

  let entity_string = match e with
  | Item i -> string_of_entity lang i
  | Property p -> string_of_entity lang p in
  print_endline entity_string

Types should serve a concrete purpose, not be there to make things more typed :slight_smile: The purpose is simple: prevent errors or, more precisely, prevent users from using your interface incorrectly. So if your main application is fetching data and transforming it into a human-readable document, then your abstractions should focus on making it easy to fold over text strings and handle possible errors and data anomalies. Conversely, if you envision that your target application has to analyze data, build queries, and so on, then you should use the type system to prevent confusion between entities of different types, i.e., to ensure the well-formedness of the data model. Another application would be actually building Wikidata entries and submitting them; here you can also embrace the OCaml type system to ensure that the data is valid and anomaly-free. And, of course, you can have all the above-mentioned applications at once; in that case, you may think of providing several different interfaces (views) over the same data.

The good old parametric polymorphism fits perfectly (even better) here. E.g., the model for the value type from the Wikibase Data Model, like the rest of the data model, can be directly represented in OCaml,

module type Wikidata = sig
  type 'a value
  type 'a data
  type 'a entity
  type datatype
  type item
  type property
  type iri
  type geo
  type time

  module Value : sig
    type 'a t = 'a value

    val cast : 'a -> unit t -> 'a
    val try_cast : 'a -> unit t -> 'a option
    val forget : 'a t -> unit t

    val compare : 'a t -> 'a t -> int
    val pp : Format.formatter -> 'a t -> unit

    module Base : Base.Comparable.S with type t = unit value
  end

  module Entity : sig
    type 'a t = 'a entity value
    val id : 'a t -> iri
  end

  module Item : sig
    type t = item entity value
  end

  module Property : sig
    type t = property entity value
  end

  module Datatype : sig
    type t = datatype entity value
  end

  module Data : sig
    type 'a t = 'a data value
    val typeid : 'a t -> datatype entity value
  end

  module Geo : sig
    type t = geo data value

    val latitude : t -> float
    val longitude : t -> float
    val altitude : t -> float
  end

  module Time : sig
    type t = time data value
    val year : t -> int
  end
end

and so on. As you can see, we can represent the hierarchy item :> entity :> value directly in OCaml as item entity value. The same goes for data values, which can be represented as time data value or geo data value.

Now, the natural question would be how to implement this interface. There are many options, and the good thing is that you can change your decision later without disturbing the end-users of your library, thanks to the abstract interface.

In fact, this interface could be implemented rapidly by just taking the following implementation for the 'a value type,

type _ value = {
   cls : string;
   obj : Yojson.Safe.t;
}

Yes, the kind of value is erased from the representation and survives only as a type parameter, a trick called phantom typing. Now you need to implement the corresponding interfaces for particular values, e.g., for GeoCoordinateValue, using either Yojson.Safe.Util accessors or Yojson ADT types. You will be in full control of (and fully responsible for) whether the value is attributed with the correct phantom type. The cls field here acts as run-time type information about the value, which makes it easy to cast values up and down. (We might want to cast values, for example, to put them inside a homogeneous collection, like a list or set.) Here are some more pieces of implementation that will make it clearer how you would implement your phantom typing,

open Yojson.Safe.Util

type geo = string
type 'a data = 'a

module Geo = struct
  type t = geo data value
  let latitude geo = to_number @@ member "latitude" geo.obj
  let longitude geo = to_number @@ member "longitude" geo.obj
end

You may immediately notice that having a common representation for values opens an opportunity for code reuse, e.g., we can define a $: operator for specifying object fields (and specify the Wikidata type system)

let ($:) name typ value = typ@@member name value.obj

(* wikidata types *)
let decimal = to_number
let integer = to_int
let string = to_string

and now the implementation of various values becomes pretty trivial and self-describing,

module Geo = struct
  type t = geo data value
  let latitude = "latitude" $: decimal
  let longitude = "longitude" $: decimal
  let altitude = "altitude" $: decimal
end

module Time = struct
  type t = time data value
  let year = "year" $: integer
  let month = "month" $: integer
  (* etc *)
end

And the phantom type system will guarantee that a user will never look for the year field in the geo data value.
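To see that guarantee in action without pulling in Yojson, here is a small self-contained sketch. It assumes a simplified payload (a flat list of numeric fields instead of JSON), and the constructors make_geo and make_time are hypothetical helpers added only so the example can build values.

```ocaml
module Wikidata : sig
  type 'a value
  type 'a data
  type geo
  type time

  (* Hypothetical constructors, present only for this sketch. *)
  val make_geo : latitude:float -> longitude:float -> geo data value
  val make_time : year:int -> month:int -> time data value

  module Geo : sig
    type t = geo data value
    val latitude : t -> float
    val longitude : t -> float
  end

  module Time : sig
    type t = time data value
    val year : t -> int
    val month : t -> int
  end
end = struct
  (* Assumption: numeric fields in an association list stand in for JSON. *)
  type _ value = { cls : string; obj : (string * float) list }
  type 'a data
  type geo
  type time

  (* Field lookup; raises Not_found if the field is absent. *)
  let ( $: ) name typ v = typ (List.assoc name v.obj)

  (* "wikidata types" for this simplified payload *)
  let decimal (x : float) = x
  let integer = int_of_float

  let make_geo ~latitude ~longitude =
    { cls = "globecoordinate";
      obj = [ ("latitude", latitude); ("longitude", longitude) ] }

  let make_time ~year ~month =
    { cls = "time";
      obj = [ ("year", float_of_int year); ("month", float_of_int month) ] }

  module Geo = struct
    type t = geo data value
    let latitude = "latitude" $: decimal
    let longitude = "longitude" $: decimal
  end

  module Time = struct
    type t = time data value
    let year = "year" $: integer
    let month = "month" $: integer
  end
end

let place = Wikidata.make_geo ~latitude:48.5 ~longitude:2.25
let date = Wikidata.make_time ~year:1969 ~month:7
let lat = Wikidata.Geo.latitude place
let year = Wikidata.Time.year date
(* Wikidata.Time.year place   is rejected by the type checker:
   place has type geo data value, not time data value. *)
```

The accessors are only pinned to geo or time by the signature; inside the structure every value has the same representation.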

If at some point you decide that re-parsing the JSON representation of a value is not efficient enough (this would be true if you access fields very often), then you might substitute the representation with data that is already turned into an OCaml representation. Moreover, you can even provide several representations at the same time and provide, for example, a Compiled : Wikidata view that will use the ATD-generated representation under the hood. Or you can use sqlite or some other in-memory database as your representation, or Cap’n’Proto. In other words, you will have a lot of time and options to experiment with your representation while your users (or the rest of your team) are not blocked and can use your library.
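As a rough sketch of what "several representations behind one interface" could look like: the signature GEO, the two modules, and the Describe functor below are all hypothetical names, and the field list again stands in for parsed JSON.

```ocaml
module type GEO = sig
  type t
  val of_fields : (string * float) list -> t
  val latitude : t -> float
  val longitude : t -> float
end

(* Looks fields up on every access (cheap to build, slower to read). *)
module Dynamic_geo : GEO = struct
  type t = (string * float) list
  let of_fields fields = fields
  let latitude t = List.assoc "latitude" t
  let longitude t = List.assoc "longitude" t
end

(* Decodes once into a record (more work up front, fast access). *)
module Compiled_geo : GEO = struct
  type t = { latitude : float; longitude : float }
  let of_fields fields =
    { latitude = List.assoc "latitude" fields;
      longitude = List.assoc "longitude" fields }
  let latitude t = t.latitude
  let longitude t = t.longitude
end

(* Client code written against GEO works with either representation. *)
module Describe (G : GEO) = struct
  let to_string g =
    Printf.sprintf "(%.2f, %.2f)" (G.latitude g) (G.longitude g)
end

module Describe_dynamic = Describe (Dynamic_geo)
module Describe_compiled = Describe (Compiled_geo)

let fields = [ ("latitude", 1.5); ("longitude", 2.25) ]
let from_dynamic = Describe_dynamic.to_string (Dynamic_geo.of_fields fields)
let from_compiled = Describe_compiled.to_string (Compiled_geo.of_fields fields)
```

Both strings come out the same, which is the point: callers cannot tell which representation sits underneath.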


Again, thank you so much for the detailed suggestions! You’ve given me a lot to think about.
I’m not super familiar with GADTs and so I didn’t realize that you could polymorphise in that way, and will certainly consider this during my attempt to rewrite the interface for this library. One thing I’m worried about though is that there are some fields that Properties and Items share that Lexemes (a new type of Entity that I’d like to support at some point) don’t. Is there any way to polymorphise over Properties and Items for fields that Lexemes don’t have?
I do also have some minor quibbles with your suggested interface mostly due to the weird discrepancies between the Wikibase high level Data Model and the Wikidata JSON serialization. For example, it seems to imply infinite nesting of entities inside statements inside entities (fixing this is pretty simple – just represent nested entities with ids – but makes having a Value type that could either be an Entity or a DataValue redundant), and how it doesn’t really represent how each individual DataValue has a Datatype which determines how the data is interpreted. Hopefully though I’ll be able to solve these issues with some relatively minor tweaks and put out a library that works much better with OCaml idioms.
Thanks again!

No GADT were used or harmed in the making of this posting :slight_smile: Just plain OCaml with no advanced features.

Sure, why not. Moreover, you can do it later, e.g., you can introduce a new type 'a lexeme indexed by item and property, without breaking any existing code,

type 'a lexeme

module Lexeme : sig 
   type 'a t = 'a lexeme entity value
   (* the common interface for all lexemes *)
end

and refine the types of property and item as,

module Property : sig 
   type t = property lexeme entity value 
   ...
end

module Item : sig 
   type t = item lexeme entity value
   ...
end

So items are still entity values, and any function applicable to an 'a entity value is still applicable to them, but they are also now lexemes.

Please keep in mind that I am a five-minute expert in the Wikidata model, i.e., I spent five minutes on it, no more :)

Yep, I would expect this. But it doesn’t stop you from hiding these details under a nice interface. You can represent a value as something as simple as an int or a string that holds its IRI, and then store all objects in an external finite mapping, e.g., a map, something like,

type iris = string
type 'a value = iris
type data = Yojson.Safe.t
type state = {
   objs : data Map.Make(String).t;
}

now you can either require an extra parameter for each value operation, e.g.,

module Geo : sig
  type t = geo data value
  val latitude : t -> state -> float
end

Or, capture the recurring pattern state -> 'a which is also known as the Reader Monad, and represent all operations as monadic, e.g.,

module Geo : sig
  type t = geo data value
  val latitude : t -> float query
end

where 'a query is a reader monad. In fact, wrapping your interface into a monad gives you, as the library designer, a lot of freedom in the future, as a monad abstracts the most important part of a computer program: the computation itself. This means that in the future you might opt for a different computation model, e.g., you can plug in a database that uses Lwt, or fetch data on the fly instead of generating the full document.
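For readers who have not met the reader monad before, a minimal sketch of the idea follows. It assumes OCaml 4.08 or later (for the let* binding operator), a store keyed by IRI holding flat numeric fields instead of JSON, and made-up names (field, position, the key "example-coord"); accessors return options here so that a missing object does not raise.

```ocaml
module StringMap = Map.Make (String)

(* Assumed store: IRI -> flat list of numeric fields (stand-in for JSON). *)
type state = { objs : (string * float) list StringMap.t }

(* The reader monad: a computation that needs the state to produce a result. *)
type 'a query = state -> 'a

let return x : 'a query = fun _ -> x

let ( let* ) (m : 'a query) (f : 'a -> 'b query) : 'b query =
 fun state -> f (m state) state

let run (q : 'a query) state = q state

type geo

(* A value is just its IRI; the type parameter is a phantom. *)
type 'a value = Iri of string

(* Look a field up in the store; None if the object or field is missing. *)
let field name (Iri iri : 'a value) : float option query =
 fun state ->
  match StringMap.find_opt iri state.objs with
  | None -> None
  | Some fields -> List.assoc_opt name fields

module Geo = struct
  type t = geo value

  let latitude (v : t) = field "latitude" v
  let longitude (v : t) = field "longitude" v

  (* Two queries composed without mentioning the state explicitly. *)
  let position v : (float * float) option query =
    let* lat = latitude v in
    let* lon = longitude v in
    return
      (match lat, lon with
       | Some a, Some b -> Some (a, b)
       | _ -> None)
end

let state =
  { objs =
      StringMap.singleton "example-coord"
        [ ("latitude", 48.5); ("longitude", 2.25) ] }

let here : Geo.t = Iri "example-coord"
let pos = run (Geo.position here) state
```

Only run ever sees the state, so replacing the map with a database handle would change state, field and run but not Geo.position.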


About the Lexeme thing, I think you misunderstood me – Lexemes are a type of Entity the same way Properties and Items are. I think Lexemes break the parametric polymorphism you are describing, as they do not share most fields with Items and Properties. For example, as Lexemes represent a word in a specific language and writing system they don’t have multilingual labels and instead have lemmas, usually in just one language. If I have

type 'a value
type 'a entity
type item
type property
type lexeme

module Item : sig
    type t = item entity value
    val label : t -> lang -> string
end

module Property: sig
    type t = property entity value
    val label : t -> lang -> string
end

module Lexeme : sig
    type t = lexeme entity value 
    val lemmas: t -> (lang * string) list
end

Then, if I understand correctly, there’s no way to make a function that takes an Entity and a lang and returns the Entity’s label if it’s an Item or Property and the Entity’s first lemma string if it’s a Lexeme.

I guess this could be solved with something like

type entity_var =
    | Item of item entity value
    | Property of property entity value
    | Lexeme of lexeme entity value

But I think that might defeat the purpose.

Ah, I see, still no problem. You need to flip this around: create an abstraction that is shared by all three, and then derive the specific ones from it. You can also create a special abstraction for items and properties, e.g., Labeled.
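One possible reading of this suggestion, as a sketch only: the index type 'a labeled, the function Entity.name, the make constructors, and the record representation are all assumptions invented for the example, not part of the interface proposed earlier.

```ocaml
module Wikidata : sig
  type 'a value
  type 'a entity
  type 'a labeled
  type item
  type property
  type lexeme
  type lang = string

  (* Shared by all three kinds of entity. *)
  module Entity : sig
    type 'a t = 'a entity value
    val id : 'a t -> string
    val name : 'a t -> lang -> string option
  end

  (* Shared by items and properties only. *)
  module Labeled : sig
    type 'a t = 'a labeled entity value
    val label : 'a t -> lang -> string option
  end

  module Item : sig
    type t = item labeled entity value
    val make : id:string -> labels:(lang * string) list -> t
  end

  module Property : sig
    type t = property labeled entity value
  end

  module Lexeme : sig
    type t = lexeme entity value
    val make : id:string -> lemmas:(lang * string) list -> t
    val lemmas : t -> (lang * string) list
  end
end = struct
  type lang = string

  (* Assumed representation: cls is the run-time tag for the entity kind. *)
  type _ value = { cls : string; id : string; names : (lang * string) list }
  type 'a entity
  type 'a labeled
  type item
  type property
  type lexeme

  module Entity = struct
    type 'a t = 'a entity value
    let id v = v.id

    (* Label for items and properties, first lemma for lexemes. *)
    let name v lang =
      match v.cls, v.names with
      | "lexeme", (_, lemma) :: _ -> Some lemma
      | "lexeme", [] -> None
      | _, names -> List.assoc_opt lang names
  end

  module Labeled = struct
    type 'a t = 'a labeled entity value
    let label v lang = List.assoc_opt lang v.names
  end

  module Item = struct
    type t = item labeled entity value
    let make ~id ~labels = { cls = "item"; id; names = labels }
  end

  module Property = struct
    type t = property labeled entity value
  end

  module Lexeme = struct
    type t = lexeme entity value
    let make ~id ~lemmas = { cls = "lexeme"; id; names = lemmas }
    let lemmas v = v.names
  end
end

(* Made-up identifiers and labels. *)
let item = Wikidata.Item.make ~id:"Q0" ~labels:[ ("en", "sample item") ]
let lexeme = Wikidata.Lexeme.make ~id:"L0" ~lemmas:[ ("en", "sample") ]
let item_name = Wikidata.Entity.name item "en"
let lexeme_name = Wikidata.Entity.name lexeme "fr"
(* Wikidata.Labeled.label lexeme "en"   does not type-check:
   a lexeme entity value is not an 'a labeled entity value. *)
```

Entity.name is the "label or first lemma" function asked about earlier; it dispatches on the run-time tag rather than on a variant wrapper.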


Thanks, but I’m not completely sure I understand what you mean by this.
I also am kind of confused about the type of this function – it seems to take something of type 'a and a unit value and attempt to convert the unit value to the 'a, but I don’t really understand what it does with that first argument nor how it is implemented.

Again, is there anything I can read up on or a library I could study to learn more about this kind of interface? I don’t want to take up too much of your time asking questions :slight_smile:

It should be: 'a -> unit t -> 'a t, of course :slight_smile:
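To make the role of that first argument concrete, here is one way such a cast could be implemented. This is a sketch under assumptions, not the design from the posts above: it uses a dedicated 'a witness type for the first argument (rather than a bare 'a), and the names witness, geo_cls, time_cls and create are hypothetical.

```ocaml
module Value : sig
  type 'a t
  type 'a witness  (* hypothetical run-time tag naming the class 'a *)
  type geo
  type time

  val geo_cls : geo witness
  val time_cls : time witness

  val create : 'a witness -> (string * float) list -> 'a t
  val forget : 'a t -> unit t
  val try_cast : 'a witness -> unit t -> 'a t option
  val cast : 'a witness -> unit t -> 'a t  (* raises Invalid_argument *)
end = struct
  (* The phantom parameter is absent from the representation;
     cls carries the same information at run time. *)
  type _ t = { cls : string; obj : (string * float) list }
  type 'a witness = string
  type geo
  type time

  let geo_cls = "globecoordinate"
  let time_cls = "time"

  let create cls obj = { cls; obj }

  (* Rebuilding the record lets the phantom parameter change. *)
  let forget v = { cls = v.cls; obj = v.obj }

  (* The witness says which class the caller expects;
     the cast succeeds only if the stored tag agrees. *)
  let try_cast witness v =
    if String.equal witness v.cls then Some { cls = v.cls; obj = v.obj }
    else None

  let cast witness v =
    match try_cast witness v with
    | Some v -> v
    | None -> invalid_arg "Value.cast: class mismatch"
end

let point = Value.create Value.geo_cls [ ("latitude", 1.0) ]
let date = Value.create Value.time_cls [ ("year", 2000.0) ]

(* Values of different classes in one homogeneous list. *)
let erased = [ Value.forget point; Value.forget date ]

let as_geo = Value.try_cast Value.geo_cls (List.hd erased)
let as_time = Value.try_cast Value.time_cls (List.hd erased)
```

The first argument is never inspected for its own sake; it only tells the function which phantom type to hand back and which tag to check for.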