[ANN] Major Release of Base64 / Article

dinosaure · February 8, 2019, 11:02am

MirageOS is a library operating system written from the ground up in OCaml.
It has an impossible and incredibly huge goal to re-implement all of the
world! Looking back at the work accomplished by the MirageOS team, it appears that’s
what happened for several years. Re-implementing the entire stack, in particular
the lower layers that we often take for granted, requires a great attention to
detail. While it may seem reasonably easy to implement a given RFC, a huge
amount of work is often hidden under the surface.

In this article, we will explain the development process we went through, as we
updated a small part of the MirageOS stack: the library ocaml-base64. It’s a
suitable example as the library is small (few hundreds lines of code), but it
needs ongoing development to ensure good quality and to be able to trust it for
higher level libraries (like mrmime).

Updating the library was instigated by a problem I ran into with the existing
base64 implementation while working on the e-mail stack. Indeed, we got some
errors when we tried to compute an encoded-word according to the RFC
2047. So after several years of not being touched, we decided to
update ocaml-base64.

The Critique of Pure Reason

The first problem

We started by attempting to use ocaml-base64 on some examples extracted from
actual e-mails, and we quickly ran into cases where the library failed. This
highlighted that reality is much more complex than you can imagine from reading
an RFC. In this situation, what do you do: try to implement a best-effort
strategy and continue parsing? Or stick to the letter of the RFC and fail? In
the context of e-mails, which has accumulated a lot of baggage over time, you
cannot get around implementing a best-effort strategy.

The particular error we were seeing was a Not_found exception when decoding an
encoded-word. This exception appeared because the implementation relied on
String.contains, and the input contained a character which was not part of the
base64 alphabet.

This was the first reason why we thought it necessary to rewrite ocaml-base64.
Of course, we could just catch the exception and continue the initial
computation, but then another reason appeared.

The second problem

As @clecat and I reviewed RFC 2045, we noticed the following
requirement:

The encoded output stream must be represented in lines of no more than 76
characters each.

See RFC 2045, section 6.8

Pretty specific, but general to e-mails, we should never have more than 78
characters per line according to RFC 822, nor more than 998 characters
according to RFC 2822.

Having a decoder that abided RFC 2045 more closely, including the requirement
above, further spurred us to implement a new decoder.

As part of the new implementation, we decided to implement tests and fuzzers to
ensure correctness. This also had the benefit, that we could run the fuzzer on
the existing codebase. When fuzzing an encoder/decoder pair, an excellent check
is whether the following isomorphism holds:

let iso0 input = assert (decode (encode input) = input)
let iso1 input = assert (encode (decode input) = input)

However, at this point [@hannesm][hannes] ran into another error (see
#20).

The third problem

We started to review the nocrypto implementation of base64, which
respects our requirements. We had some concerns about the performance of the
implementation though, so we decided to see if we would get a performance
regression by switching to this implementation.

A quick benchmark based on random input revealed the opposite, however!
nocrypto’s implementation was faster than ocaml-base64:

ocaml-base64's implementation on bytes (length: 5000): 466 272.34ns
nocrypto's implementation on bytes (length: 5000): 137 406.04ns

Based on all these observations, we thought there was sufficient reason to
reconsider the ocaml-base64 implementation. It’s also worth mentioning that
the last real release (excluding dune/jbuilder/topkg updates) is from Dec.
24 2014. So, it’s pretty old code and the OCaml eco-system has improved a lot
since 2014.

Implementation & review

We started integrating the nocrypto implementation. Of course, implementing
RFC 4648 is not as easy as just reading examples and trying to do
something which works. The devil is in the detail.

@hannesm and @cfcs decided to do a big review of expected behavior
according to the RFC, and another about implementation and security issues.

Canonicalization

The biggest problem about RFC 4648 is regarding canonical inputs. Indeed, there
are cases where two different inputs are associated with the same value:

let a = Base64.decode "Zm9vCg==" ;;
- : string = "foo\n"
let b = Base64.decode "Zm9vCh==" ;;
- : string = "foo\n"

This is mostly because the base64 format encodes the input 6 bits at a time. The
result is that 4 base64 encoded bytes are equal to 3 decoded bytes (6 * 4 = 8 * 3). Because of this, 2 base64 encoded bytes provide 1 byte plus 4 bits. What do
we need to do with these 4 bits? Nothing.

That’s why the last character in our example can be something else than g. g
is the canonical byte to indicate using the 2 bits afterward the 6 bits
delivered by C (and make a byte - 8 bits). But h can be used where we just
need 2 bits at the end.

Due to this behavior, the check used for fuzzing changes: from a canonical
input, we should check isomorphism.

Invalid character

As mentioned above (“The first problem”), how should invalid characters be
handled? This happens when decoding a byte which is not a part of the base64
alphabet. In the old version, ocaml-base64 would simply leak a Not_found
exception from String.contains.

The MirageOS team has taken a stance on exceptions, which is
to “use exceptions for exceptional conditions” - invalid input is hardly one of
those. This is to avoid any exception leaks, as it can be really hard to track
the origin of an exception in a unikernel. Because of this, several packages
have been updated to return a result type instead, and we wanted the new
implementation to follow suit.

On the other hand, exceptions can be useful when considered as a more
constrained form of assembly jump. Of course, they break the control flow, but
from a performance point of view, it’s interesting to use this trick:

exception Found

let contains str chr =
  let idx = ref 0 in
  let len = String.length str in
  try while !idx < len
      do if String.unsafe_get str !idx = chr then raise Found ; incr idx done ;
      None
  with Found -> Some !idx

This kind of code for example is ~20% faster than String.contains.

As such, exceptions can be a useful tool for performance optimizations, but we
need to be extra careful not to expose them to the users of the library. This
code needs to be hidden behind a fancy functional interface. With this in mind,
we should assert that our decode function never leaks an exception. We’ll
describe how we’ve adressed this problem later.

Special cases

RFC 4648 has some detailed cases and while we would sometimes like to work in a
perfect world where we will never need to deal with such errors, from our
experience, we cannot imagine what the end-user will do to formats, protocols
and such.

Even though the RFC has detailed examples, we have to read between lines to know
special cases and how to deal with them.

@hannesm noticed one of these cases, where padding (= sign at the end of
input) is not mandatory:

The pad character “=” is typically percent-encoded when used in an URI [9],
but if the data length is known implicitly, this can be avoided by skipping
the padding; see section 3.2.

See RFC 4648, section 5

That mostly means that the following kind of input can be valid:

let a = Base64.decode ~pad:false "Zm9vCg"
- : string = "foo\n"

It’s only valid in a specific context though: when length is known implicitly.
Only the caller of decode can determine whether the length is implicitly known
such that padding can be omitted. To that end, we’ve added a new optional
argument ?pad to the function Base64.decode.

Allocation, `sub`, `?off` and `?len`

Xavier Leroy has described the garbage collector in the following way:

You see, the Caml garbage collector is like a god from ancient mythology:
mighty, but very irritable. If you mess with it, it’ll make you suffer in
surprising ways.

That’s probably why my experience with improving the allocation policy of
ocaml-git was quite a nightmare. Allowing the user to control
allocation is important for efficiency, and we wanted to ocaml-base64 to be a
good citizen.

At the beginning, ocaml-base64 had a very simple API:

val decode : string -> string
val encode : string -> string

This API forces allocations in two ways.

Firstly, if the caller needs to encode a part of a string, this part needs to be
extracted, e.g. using String.sub, which will allocate a new string. To avoid
this, two new optional arguments have been added to encode: ?off and ?len,
which specifies the substring to encode. Here’s an example:

(* We want to encode the part 'foo' without prefix or suffix *)

(* Old API -- forces allocation *)
Base64.encode (String.sub "prefix foo suffix" 7 3) ;;
- : string = "Zm9v"

(* New API -- avoids allocation *)
Base64.encode ~off:7 ~len:3 "prefix foo suffix" ;;
- : string = "Zm9v"

Secondly, a new string is allocated to hold the resulting string. We can
calculate a bound on the length of this string in the following manner:

let (//) x y =
  if y < 1 then raise Division_by_zero ;
  if x > 0 then 1 + ((x - 1) / y) else 0

let encode input =
  let res = Bytes.create (String.length input // 3 * 4) in
  ...

let decode input =
  let res = Bytes.create (String.length input // 4 * 3) in
  ...

Unfortunately we cannot know the exact length of the result prior to computing
it. This forces a call to String.sub at the end of the computation to return a
string of the correct length. This means we have two allocations rather than
one. To avoid the additional allocation, @avsm proposed to provide a new
type sub = string * int * int. This lets the user call String.sub if
required (and allocate a new string), or use simply use the returned sub for
_blit_ting to another buffer or similar.

Fuzz everything!

There’s a strong trend of fuzzing libraries for MirageOS, which is quite easy
thanks to the brilliant work by @yomimono and @stedolan!
The integrated fuzzing in OCaml builds on American fuzzy lop, which is
very smart about discarding paths of execution that have already been tested and
generating unseen inputs which break your assumptions. My first experience with
fuzzing was with the library decompress, and I was impressed by
precise error it found about a name clash.

Earlier in this article, I listed some properties we wanted to check for
ocaml-base64:

The functions encode and decode should be be isomorphic taking
canonicalization into account:

let iso0 input =
  match Base64.decode ~pad:false input with
  | Error _ -> fail ()
  | Ok result0 ->
    let result1 = Base64.encode_exn result0 in
    match Base64.decode ~pad:true result1 with
    | Error _ -> fail ()
    | Ok result2 -> check_eq result0 result2

let iso1 input =
  let result = Base64.encode_exn input in
  match Base64.decode ~pad:true result0 with
  | Error _ -> fail ()
  | Ok result1 ->
    let result2 = Base64.encode_exn result1 in
    check_eq result0 result2

The function decode should never raise an exception, but rather return a
result type:

let no_exn input =
  try ignore @@ Base64.decode input with _ -> fail ()

And finally, we should randomize ?off and ?len arguments to ensure that we
don’t get an Out_of_bounds exception when accessing input.

Just because we’ve applied fuzzing to the new implementation for a long time, it
doesn’t mean that the code is completely infallible. People can use our library
in an unimaginable way (and it’s mostly what happens in the real world) and get
an unknowable error.

But, with the fuzzer, we’ve managed to test some properties across a very wide
range of input instead of unit testing with random (or not so random) inputs
from our brains. This development process allows fixing the semantics of
implementations (even if it’s not a formal definition of semantics), but
it’s better than nothing or outdated documentation.

Conclusion

Based on our recent update to ocaml-base64, this blog post explains our
development process as go about rewriting the world to MirageOS, one bit at a
time. There’s an important point to be made though:

ocaml-base64 is a small project. Currently, the implementation is about 250
lines of code. So it’s a really small project. But as stated in the
introduction, we are fortunate enough to push the restart button of the computer
world - yes, we want to make a new operating system.

That’s a massive task, and we shouldn’t make it any harder on ourselves than
necessary. As such, we need to justify any step, any line of code, and why we
decided to spend our time on any change (why we decided to re-implement git
for example). So before committing any time to projects, we try to do a deep
analysis of the problem, get feedback from others, and find a consensus between
what we already know, what we want and what we should have (in the case of
ocaml-base64, @hannesm did a look on the PHP implementation and the Go
implementation).

Indeed, this is a hard question which nobody can answer perfectly in isolation.
So, the story of this update to ocaml-base64 is an invitation for you to enter
the arcanas of the computer world through MirageOS ! Don’t be afraid!

NOTE: cross-post of this article

avsm · February 8, 2019, 11:04am

A huge thank you from me to @dinosaure for his diligence and persistence with upstreaming his discoveries to the MirageOS libraries as well

jhw · February 8, 2019, 4:34pm

This is an excellent article! Thanks for writing it up.

p.s. I don’t want to hijack this thread, but I’d like to mention that all three of the problems discussed in this article were already known to me when I wrote the Base64 implementation in my forthcoming Orsetto framework of structured data interchange languages. (Expect an announcement before Monday about its preview release. Also, for the MirageOS people: my implementations are generally not speed demons. I’ve left a lot of room for myself to make performance improvements in the future.)

Topic		Replies	Views
Hannes Mehnert interview about MirageOS and OCaml by Evrone Ecosystem irmin , mirageos , announce , dune	15	2914	June 1, 2020
[ANN] reed-solomon-erasure 1.0.1 Ecosystem announce	13	1861	August 1, 2018
Opam repository archival, phase 2 - OCaml 4.08 is the lower bound Ecosystem announce , opam-repository	21	977	February 26, 2025
What is the preferable solution for the role of standard library? Learning core , standardlibrary	37	9697	December 22, 2017
[ANN] BAP 2.1.0 Release Community announce , bap	26	4712	September 22, 2020