Advice on representing arbitrary numbers in library

TL;DR: My library needs to parse arbitrarily large and precise numbers from an external source into something useful for its users, and I can’t decide between the options below.


I’m working on a library for both parsing and outputting the KDL Document Language and I’ve run into an issue with how to represent numbers. KDL numbers don’t distinguish between int and float and can be arbitrarily large and precise. As such, no built-in OCaml type is sufficient on its own for representing KDL numbers, and choosing a solution involves a tradeoff between correctness/accuracy, performance, simplicity, and ease of use for library consumers. The options I’m considering are:

  • Just represent everything with arbitrary precision rationals (Q.t). This is simple and always accurate but means I have to convert arbitrary rationals to decimal scientific notation in order to be able to output KDL, which is not only difficult, it’s not even total! It’s also not great for performance.
  • Use a variant type between ints and floats. This is a simple performant solution that’s easy to use for the library consumer, but it means there’s lots of numbers that just aren’t representable accurately, and lots that aren’t representable at all.
  • Use a variant type but with more exotic number types. Modifying the previous option, you can swap out ints for Z.ts to tilt the balance more towards correctness for integers, and/or take a page out of Yojson’s book and add a new variant that stores big floats as strings. This makes the library less nice to use, though, and introduces some weirdness: floats just below Float.max_float are wildly imprecise, while values above it are represented perfectly accurately as strings.
  • Try to parse floats as float but retain a string representing the original value. This is sort of like what kdl-rs does, although I’d have to do it with a custom KdlFloat.t type. This solution is a pretty elegant compromise, although it leads to further questions: should the original string be retained, or formatted into a canonical form? Should I do the same thing with integers, or use Z.ts?
  • Functorize the whole thing over a user-provided number type, potentially including one or more of these other solutions as a built-in number type with the library. I think the benefits and downsides of this are pretty self-explanatory.
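For concreteness, here is a rough sketch of that Yojson-flavoured variant option; the constructor names and the classifier are purely illustrative, not a committed API:

```ocaml
(* Rough sketch of the Yojson-style variant; names are illustrative.
   Intlit/Floatlit keep the raw text when a value doesn't fit the
   corresponding native type. *)
type number =
  | Int of int
  | Intlit of string      (* integer too large for a native int *)
  | Float of float
  | Floatlit of string    (* too large/precise for a float *)

(* Best-effort classifier; a real KDL lexer would already know the
   lexeme's shape and wouldn't need to guess like this. *)
let number_of_string s =
  match int_of_string_opt s with
  | Some i -> Int i
  | None ->
    if not (String.exists (fun c -> c = '.' || c = 'e' || c = 'E') s)
    then Intlit s
    else
      (match float_of_string_opt s with
       | Some f when Float.is_finite f -> Float f
       | _ -> Floatlit s)
```

This is where the weirdness mentioned above shows up: `1e999` overflows float and so survives verbatim as `Floatlit`, while a value just under `Float.max_float` is silently rounded.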

Does anyone have any advice for which to choose? If you think there’s an even better option not included in the above list, feel free to suggest it! Thanks in advance!


…plus to_int, to_float, etc. functions. Anything less ends up being problematic vis-à-vis correctness, and anything more is likely to be pretty cumbersome and/or foreign to the average library consumer, who in general doesn’t think about, and mostly doesn’t care about, arbitrary size or precision.

You should always retain and provide the original string via the API; that’s sort of a given. Whether you simply dump it unmodified when serializing a tree representation of a document is inevitably a serialization-time configuration setting.

I would personally not bother with baking in any particular arbitrary-precision or bigint support; those that care about or need to deal with such things can pass along the incoming string representations to such libraries themselves.


Another option is my OCaml port of the Python implementation of decimal floating point numbers: opam - decimal

This primarily takes care of the representation issue (i.e. it makes sure 0.1 + 0.2 = 0.3, not 0.30000000000000004), but it still requires you to pick a precision (or stick with the default).


First of all, thanks so much for the response!

To be clear, is what you’re suggesting something like:

module Number : sig
  type t
  val of_string : string -> t
  val dump_string : t -> string
  val to_int : t -> int
  val to_int_opt : t -> int option
  val to_float : t -> float
end

I was thinking of at least separating ints and floats into two types so that you can pattern match on which to expect: either something like

module Number : sig
  module Int : sig
    type t
    val dump_string : t -> string
    val to_int_opt : t -> int option
    (* must be an option: int has no infinity to signal overflow with *)
  end
  module Float : sig
    type t
    val dump_string : t -> string
    val to_float : t -> float
  end
  type t = Int of Int.t | Float of Float.t
  val of_string : string -> t
end

or maybe

module Number : sig
  type t
  type num = Int of int | Float of float
  val of_string : string -> t
  val dump_string : t -> string
  val to_num : t -> num
end
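For what it’s worth, a minimal implementation of that second sketch might look like the following (assuming the parser keeps the raw lexeme around; this is illustrative, not a real API):

```ocaml
(* Sketch: keep the original text alongside a best-effort native value,
   so serialization can round-trip while consumers get int/float. *)
module Number : sig
  type t
  type num = Int of int | Float of float
  val of_string : string -> t
  val dump_string : t -> string
  val to_num : t -> num
end = struct
  type num = Int of int | Float of float
  type t = { raw : string; num : num }
  let of_string s =
    let num =
      match int_of_string_opt s with
      | Some i -> Int i
      | None -> Float (float_of_string s)  (* raises on malformed input *)
    in
    { raw = s; num }
  let dump_string t = t.raw  (* original text survives round-tripping *)
  let to_num t = t.num
end
```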

Either way though, I was a little confused when you said

KDL is not a document language like Markdown or HTML; it’s more like JSON or YAML. Of the KDL libraries I looked at, kdl-rs is the only one that always retains the original string for numbers, and that’s only because it’s specifically supposed to be “document-oriented”. Anyway, the reason I think it might be good to format strings into a canonical form is that it would help with implementing an equal function for Kdl.t values: it would be really nice to have

Kdl.(equal (parse_string "node 10_000_000_000") (parse_string "node 1e10"))

and that’s really only possible to implement through comparing strings if the string is kept in a canonical format.
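To make the idea concrete, here is one rough sketch of such a canonicalization: reduce a numeric string to an (integer mantissa, exponent) pair. This is purely illustrative; signs, leading zeros, and malformed input are handled only minimally:

```ocaml
(* Illustrative canonical form for decimal/scientific numeric strings.
   Not production code: no sign handling, no validation. *)
let canonical s =
  (* drop digit separators like the one in 10_000_000_000 *)
  let s = String.concat "" (String.split_on_char '_' s) in
  (* split off an exponent part, if any *)
  let mant, exp =
    match String.index_opt s 'e', String.index_opt s 'E' with
    | Some i, _ | None, Some i ->
      String.sub s 0 i,
      int_of_string (String.sub s (i + 1) (String.length s - i - 1))
    | None, None -> s, 0
  in
  (* fold a decimal point into the exponent *)
  let mant, exp =
    match String.index_opt mant '.' with
    | Some i ->
      let frac = String.length mant - i - 1 in
      String.sub mant 0 i ^ String.sub mant (i + 1) frac, exp - frac
    | None -> mant, exp
  in
  (* strip leading zeros, then trade trailing zeros for exponent *)
  let rec lead m =
    if String.length m > 1 && m.[0] = '0'
    then lead (String.sub m 1 (String.length m - 1)) else m in
  let rec trail m e =
    let n = String.length m in
    if n > 1 && m.[n - 1] = '0' then trail (String.sub m 0 (n - 1)) (e + 1)
    else (m, e) in
  trail (lead mant) exp

let equal_canonical a b = canonical a = canonical b
```

Under this scheme `"10_000_000_000"` and `"1e10"` both canonicalize to `("1", 10)`, so a plain string comparison of canonical forms gives the equality above.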

Thanks again for the response!

Thanks for the suggestion!

I considered using that library actually, and if I end up functorizing over a number type then I might still include it as a built-in option! The main issue was just that using a fixed-precision decimal seemed a bit too much of a departure from the data model that KDL seems to assume for me to use it as the default – most KDL libraries seem to parse numbers into either integers or floats. Plus you have all the normal issues with fixed-point numbers, which can be really inefficient in some pathological cases.

Actually, the decimal library is not fixed-point, it’s decimal floating point. From the API doc: ‘Decimal floating point has finite precision with arbitrarily large bounds’. I think it’s worth a few minutes of your time to check it out :wink:


Which arithmetic operations do you need to perform over these numbers?

If all you need is to carry numbers around and occasionally compare them for equality, it should be enough to represent them as strings and convert them to Q.t or to @yawaramin’s big decimals when testing for equality.

If you need to compute with these numbers, do you need exact results? if not, which roundings are allowed? etc.

Yes, and I made my suggestion with that in mind. It looks like KDL doesn’t make any distinction between integers and reals and decimals, so anything that a library does automatically to interpret numeric strings may well be wrong in various contexts. To that point:

…those values aren’t equal, at least if one is concerned about the precision explicitly represented in the encoded numbers. A less incorrect but still fraught example would be automatically converting file permission masks to a canonical-yet-unhelpful decimal representation; 420 might be the same numeric value as 0o644, but the latter implies a very particular meaning within the context where it might appear.

I appreciate the urge to have a canonical format, but it’s really hard to implement such a canonicalization when the underlying explicit representations are impoverished as in JSON, YAML, and apparently KDL.

Just fyi,

# Decimal.(of_string "10_000_000_000" = of_string "1e10");;
- : bool = true

Well, that’s just incorrect, and really unfortunate. One can’t just expand the significand out of thin air like that.

I learned this the hard way while using JVM languages, and was very happy to lean on the JDK’s BigDecimal class’ treatment of things:

> (new java.math.BigDecimal("10000000000")).equals(new java.math.BigDecimal("1e10"))
false

That’s not universally recognized. Many use cases want bigdecimals to be equal even if their scales don’t match. My port preserves the behaviour of the Python Decimal class, and incidentally the Scala BigDecimal wrapper works the same way.

(To be clear, my prior message was in no way a critique of your port’s accuracy, etc.)

Granted that there are many notions of equality over various domains, but providing a “relaxed” semantic by default in a context where precision is (often) essential is surprising.

Does decimal provide a stricter semantic via a different function? Python’s decimal.compare seems to do the same as its __eq__. (Edit: I guess there’s checking equality of two decimals’ tuples via as_tuple :man_shrugging:)

Oof, sorry for abandoning this thread a little – had a really busy few days – but this has given me a lot to think about!

I think that what I am going to go with is functorizing over a number type along the lines of

module type KDLNumber = sig
  type t
  val of_string : ?type_ann:string -> string -> t
  val to_string : t -> string
  val equal : t -> t -> bool
  val pp : Format.formatter -> t -> unit
end

and then provide a few basic built-in modules like

module Basic : KDLNumber with type t = [ `Float of float | `Int of int ]
module Rational : KDLNumber with type t = Q.t
module CanonicalString : KDLNumber with type t = string
module RawString : KDLNumber with type t = string

While this does certainly increase complexity, it isn’t that out of the ordinary for OCaml libraries and it allows for exact correctness while also getting out of the way of people who just need to be able to parse numbers like 4 and 1.5 with reasonable accuracy.
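For illustration, here’s what the consumer side of such a functorized design might look like. Everything here is a hypothetical stand-in (`KDL_NUMBER`, `Basic`, `Kdl.Make` are illustrative names, not the library’s actual API):

```ocaml
(* Hypothetical number-type signature a consumer would satisfy. *)
module type KDL_NUMBER = sig
  type t
  val of_string : ?type_ann:string -> string -> t
  val to_string : t -> string
  val equal : t -> t -> bool
end

(* A user who only needs everyday numbers could plug in a plain
   int/float variant and accept the usual float lossiness: *)
module Basic : KDL_NUMBER = struct
  type t = [ `Int of int | `Float of float ]
  let of_string ?type_ann:_ s =
    match int_of_string_opt s with
    | Some i -> `Int i
    | None -> `Float (float_of_string s)
  let to_string = function
    | `Int i -> string_of_int i
    | `Float f -> string_of_float f
  let equal = ( = )
end

(* With the functor in place, usage would be something like:
   module K = Kdl.Make (Basic) *)
```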

As for the rest of the thread:

I understand your point here, but I actually think it applies less to KDL than to any of those other formats, as KDL has a feature specifically for attaching that kind of metadata to values: type annotations! KDL allows you to add an annotation to values like so: (annotation)8.888, and reserves annotations like i32 and f64 and decimal128 specifically for specifying the intended data model of numbers. As such, I think that preserving the clues as to the intended data model that you can glean from the raw string is less important.

Oh neat! Definitely looking into adding a built-in Decimal module after 0.1.0 then. It actually turns out KDL already reserves decimal64 and decimal128 as type annotations for IEEE 754-2008 decimal types, though I’m not completely sure whether your decimal is technically conformant to that, or even whether the original Python one is. If so, that would allow you to unambiguously specify the intended precision in the KDL document itself, which would be pretty cool!