Bounds checks for String and Bytes when retrieving (or setting) subparts thereof

Dear all,

I sometimes have binary data that I need to decode. The code usually looks roughly like this:

let guard p e = if p then Ok () else Error e

let ( let* ) = Result.bind

let decode_header str =
  let* () = guard (String.length str >= 16) (`Msg "not enough data") in
  let ts = String.get_int64_le str 0 in     (* bytes 0-7 *)
  let id = String.get_uint16_le str 8 in    (* bytes 8-9 *)
  let ipv4 = String.get_int32_le str 10 in  (* bytes 10-13 *)
  let port = String.get_uint16_le str 14 in (* bytes 14-15 *)
  Ok (ts, id, ipv4, port)

As you may observe, there's a check that str is at least the expected number of octets long. Each String.get_* then performs its own bounds check - even though all of these checks are guaranteed to succeed (otherwise the guard would have returned an error and the remaining code wouldn't be executed).

What I'm curious about is whether someone has experimented with optimizations or type-level information recording the fact that str is at least 16 octets long, so that the bounds checks could be avoided at compile time. Of course I can roll this myself in an unsafe way (by declaring it obvious and using unsafe getters after the length check) – or find some type encoding (very basic dependent types) expressing that the length of str is at least 16. But I'm interested in whether there's prior art (or whether e.g. flambda/flambda2 is smart enough to optimize away the bounds checks)?
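
To make the "unsafe way" concrete, here is a minimal sketch (the Checked module and its function names are just illustrative): the only way to obtain a Checked.t is through the length test, so the unchecked reads can never run without it. As far as I know the stdlib exposes no unsafe variants of the multi-byte getters, so the sketch composes String.unsafe_get by hand; note that the offsets are still not tied to the checked length by the types.

module Checked : sig
  type t
  val check : min_len:int -> string -> t option
  val get_uint16_le : t -> int -> int
end = struct
  type t = string

  (* The single length test; [t] cannot be obtained any other way. *)
  let check ~min_len s = if String.length s >= min_len then Some s else None

  (* Unchecked little-endian 16-bit read, composed from [String.unsafe_get];
     the caller must keep [off + 2 <= min_len]. *)
  let get_uint16_le s off =
    Char.code (String.unsafe_get s off)
    lor (Char.code (String.unsafe_get s (off + 1)) lsl 8)
end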

Thanks a lot for reading, Hannes

9 Likes

We've been thinking about this for a while. It's not done currently (neither in flambda nor in flambda2); we know how to do it in flambda2, but so far it has never been high enough on our priority list to actually implement it.

We usually think more about the slightly harder problem of removing checks in loops (for i = 0 to Array.length a - 1 do ... a.(i) ... done). Adding support only for comparisons with constants would be even easier, but it wouldn't handle versions of your code where you start reading at a non-zero offset, such as the variant sketched below.
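
For illustration, here is the kind of offset-taking variant meant above (a hypothetical adaptation of the original decoder): every bound is now off + constant rather than a constant, so handling only comparisons with constants wouldn't help.

(* Hypothetical variant reading at an arbitrary offset; the guard and all
   accesses involve [off + constant]. *)
let decode_header_at str off =
  let* () =
    guard (off >= 0 && String.length str >= off + 16) (`Msg "not enough data")
  in
  let ts = String.get_int64_le str off in
  let id = String.get_uint16_le str (off + 8) in
  let ipv4 = String.get_int32_le str (off + 10) in
  let port = String.get_uint16_le str (off + 14) in
  Ok (ts, id, ipv4, port)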

4 Likes

Thanks for your reply @vlaviron. Other solutions I have in mind include a ppx rewriter - but I have never worked on such a thing. Is it feasible to do such computations / tracking in a ppx? Is the ppx ecosystem nowadays stable and maintained across new OCaml releases?

I heard from @dinosaure that he worked on using the type system for this several years ago (lavoisier.ml · GitHub) – that's not too satisfying and pretty cumbersome to use.

1 Like

As ppxes work on a syntactic level, there are limits to what you can do. For example, I don't think it is useful to try to write a ppx that translates

let decode_header str =
  let* () = [%guard String.length str >= 16] in
  let ts = String.get_int64_le str 0 in
  let id = String.get_uint16_le str 8 in
  let ipv4 = String.get_int32_le str 10 in
  let port = String.get_uint16_le str 14 in
  Ok (ts, id, ipv4, port)

into a version that uses accessors without bounds checks. The ppx would need to track whether str is shadowed (doable but annoying - see the sketch below); it would be limited in which expressions it can convert (e.g. what if we instead write let port = String.get_uint16_le str (off + 14)?); and it would be somewhat brittle against shadowing of stdlib modules (for example if another, unrelated String module is in scope).
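
To illustrate the shadowing problem (a hypothetical example of mine):

(* If a ppx blindly rewrote the access below into an unchecked read, the
   rebinding of [str] would invalidate the earlier guard. *)
let first_field str =
  let* () = [%guard String.length str >= 16] in
  let str = String.sub str 0 2 in  (* shadows the string that was checked *)
  Ok (String.get_int64_le str 0)   (* the checked getter raises here; an
                                      unchecked one would read out of bounds *)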

Instead, I would look at using a ppx to generate decode_header from some specification language. Maybe it could be something like:

let decode_header = [%decode ( int64_le, uint16_le, int32_le, uint16_le )]

which would use the sizes of the integer types to compute the expected length and the offsets. This would be close to what ppx_cstruct does. One issue is that you can end up generating a lot of code that you don't use, and it can be somewhat inflexible: what if you want to read from a non-zero offset in some cases but always from offset 0 in others? If you add an optional offset argument, you reintroduce some branching and, in some cases, a dead code path.
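
As a rough idea of what such a ppx might expand to (hypothetical output; the field names are made up):

(* The length 16 = 8 + 2 + 4 + 2 and the offsets 0, 8, 10 and 14 are all
   derived from the field sizes at expansion time. *)
let decode_header str =
  let* () = guard (String.length str >= 16) (`Msg "not enough data") in
  let f0 = String.get_int64_le str 0 in
  let f1 = String.get_uint16_le str 8 in
  let f2 = String.get_int32_le str 10 in
  let f3 = String.get_uint16_le str 14 in
  Ok (f0, f1, f2, f3)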

A ppx could be useful to make @dinosaure's lavoisier.ml less painful to use, e.g. by expanding [%peano 16] into Peano.(z s s s s s s s s s s s s s s s t) (I hope I got the count right). But I suspect this may not be the most painful part of lavoisier.ml.

Regarding stability, I think ppxlib has greatly improved the stability of the ppx ecosystem, although I don't think it will completely safeguard a ppx from breaking between OCaml releases.

3 Likes

A ppx reduces the pain a bit (see tensority/examples/ma.ml at master · Octachron/tensority · GitHub for instance), but it does not really alleviate the core issue that the OCaml type system is not great at proving arithmetic properties.

My current opinion is that the best compromise might be to use a core module and phantom types to encode the smallest set of properties that you want to check. For instance, src/Indexing.mli · master · POTTIER Francois / fix · GitLab defines an iteri function whose type reflects that the index provided by iteri is always safe for accessing elements of a vector of the same semi-static size:

val iteri : ('size index -> 'elt -> unit) -> ('size, 'elt) vector -> unit
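
To give a flavour of the pattern, here is a self-contained sketch of my own (not Fix's actual implementation): the phantom 'size brands both a vector and the indices derived from it, so get can use an unchecked access.

module Branded : sig
  type ('size, 'elt) vector
  type 'size index
  type ('elt, 'r) user = { f : 'size. ('size, 'elt) vector -> 'r }
  val with_vector : 'elt array -> ('elt, 'r) user -> 'r
  val iteri : ('size index -> 'elt -> unit) -> ('size, 'elt) vector -> unit
  val get : ('size, 'elt) vector -> 'size index -> 'elt
end = struct
  type ('size, 'elt) vector = 'elt array
  type 'size index = int
  type ('elt, 'r) user = { f : 'size. ('size, 'elt) vector -> 'r }

  (* The polymorphic field of [user] gives each vector a fresh brand. *)
  let with_vector a { f } = f a
  let iteri = Array.iteri

  (* An index branded ['size] can only come from [iteri] over a vector
     with the same brand, so this access cannot be out of bounds. *)
  let get v i = Array.unsafe_get v i
end

(* Contrived usage - [iteri] already passes the element, but it shows that
   [get] needs no bounds check: *)
let sum a =
  Branded.with_vector a
    { f = (fun v ->
        let total = ref 0 in
        Branded.iteri (fun i _ -> total := !total + Branded.get v i) v;
        !total) }
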
2 Likes

I believe Oleg’s page Lightweight Static Guarantees is also a relevant read regarding this pattern of using “branded types” for array accesses, this subsection in particular. But of course the use of constant integer offsets is always going to be awkward to encode in the type system…

1 Like

If the "binary data decoding" code in question is sufficiently self-contained, another option could be to write it in a verification-enabled tool such as Why3, then extract the verified code to OCaml.
In that scenario you would write the code using the unsafe variants of the get/set functions, but equip them with preconditions that require you to prove that the accesses are in bounds. You would then simply prove that decode_header has precondition true and postcondition true, i.e. that it is safe (saying nothing about its result, but asserting that all accesses are indeed in bounds).

Indeed. To clarify a bit, though: it's not actually true that a PPX might break between OCaml releases. At the beginning of a compiler release phase, we release a ppxlib that's compatible with the new compiler (first a preview release and then a stable release). With that ppxlib, all code that worked before, including any PPX, works with the new compiler as well.

However, in some following ppxlib version, we might bump ppxlib's AST to the one of the new compiler. That's the point at which there's indeed no guarantee that we won't break your PPX (unless your PPX is fully stable; more on that in the next paragraph). If your PPX is released on opam-repo (or belongs to a project released on opam-repo), we'll add a ppxlib bound to its opam file to guarantee that your PPX will be installed with a compatible ppxlib version. If your PPX/project additionally builds with dune, then we also open a patch PR on the PPX to make it compatible with the latest ppxlib again. That PR would require work both from us and from you (the PPX author), as you're the only one who really knows the semantics of the PPX. If your PPX is not released on opam-repo, then we don't have visibility over it and can't help patch it.

That being said, depending on the complexity of the PPX, it can be possible to write it in a way that we guarantee never to break it! But that will depend on you and on how you write your PPX. Happy to give more details on that in case you’re interested.

3 Likes