Seq vs List, optimization

Use an array if you want things to go fast.
A bigarray even, if you only have numeric types.

As Jean-Christophe Filliatre has observed, it’s pretty rare that a list is what you really want, and in all those other instances, an array is faster and smaller. Which doesn’t seem to stop me from using lists everywhere, alas.

1 Like

I think everybody does use list as the default, because they are one of the basic data structures supported by the language.
Once (and if ever) you realize you have a performance problem, you change to something faster.

1 Like

There are contexts in which I find infinite lists to be very convenient.

Sometimes I write simulations where I set the initial state, and then use an (expensive) next-state function, and create a lazy sequence using that function. Suppose you run it in a REPL for 1000 steps, and see what happens. You can go back and look at previous states very easily, because you have the history of states stored in the sequence. And if you decide then that you need to look another 10 or 1000 steps further, you don’t have to recompute the first 1000 steps.

This strategy allows a nice modular structure to the code, too. Rather than writing a loop or a recursion in which you have to build in a termination condition, which might have to change for different purposes, you just write the infinite sequence, and then use a take function to get the number of steps you want, or a take_while function that checks for a condition. You can add whatever tests you want without changing the sequence definition, because the tests are external to the recursion that generates the sequence.

I found this strategy quite useful in Clojure, where the default sequences are lazy. I used the strategy in a project in OCaml too, first with Batteries.LazyList and then Core.Sequence, and in the end, I don’t think it was worth it. The small but significant conveniences for me of a lazy sequence were outweighed by the extra complexity of using separate libraries with their own quirks. (Or maybe it was just that particular project, though, where I ended up just re-running sequences all the time. Maybe I would have felt the same if I’d written the simulation in Clojure.)

(Another example–not about sequences, but about a lazy data structure: Despite some past reservations about Haskell, I’m currently learning it with a project that uses binary trees. The original mathematical definition of the trees makes them infinite, so I’m finding it natural to again just write an initial state and a next-branches recursion, and since Haskell is lazy, I don’t have to include a termination condition. In this case the recursion is not expensive, but I get the same modular simplicity to the definition of the tree. For this code it’s not really worth exploring the tree more than a few dozen steps out, since the size of the data grows exponentially, but it still seems natural to model a mathematically infinite tree as a lazy tree.)

A bit of a spoiler for an upcoming release of a few of our libraries at Nomadic Labs…

We had a bug report: calls to some RPCs exposed by some of our binaries would occasionally cause some lag. One of the root causes of the issue was JSON serialisation. The original serialisation scheme was intended for a limited range of uses (especially, small sizes) but then it was used outside of this intended range and some relatively big values were serialised and pushed down the RPC stack.

To circumvent this, we are about to release

  • a “json lexeme sequence” backend for our serialiser library: construct_seq : 'a encoding -> 'a -> json_lexeme Seq.t where json_lexeme = Jsonm.lexeme = [ `Null | `Bool of bool | … | `As | `Ae | `Os | `Oe ]
  • a json lexeme sequence to string sequence converter.

For this second part, we actually have three different converters intended for slightly different uses. They have different granularity, they have different allocation profiles, and they make slightly different assumption most notably about concurrency:

  • string_seq_of_json_lexeme_seq : chunk_size_hint:int -> json_lexeme Seq.t -> string Seq.t which uses one (1) internal buffer of size chunk_size_hint. Consuming one element of the resulting sequence causes several json lexemes to be consumed and printed onto the internal buffer until it is full. When this happens, a snapshot (copy) of the buffer is delivered in the Cons cell. So for chunk-size-hint of, say, 1Ko, the sequence translator uses roughly 1Ko of memory and emits 1Ko chunks of memory that the consumer is responsible for.
  • small_string_seq_of_json_lexeme_seq : json_lexeme Seq.t -> string Seq.t which translates each of the lexeme as a single string. It’s a little bit more than a simple Seq.map because it needs to insert separators and escape strings. It mostly returns statically allocated strings so there are no big allocations at all.
  • blit_instructions_seq_of_jsonm_lexeme_seq : buffer: bytes -> json_lexeme Seq.t -> (bytes * int * int) Seq.t which works somewhat similarly to the first one but uses buffer instead of allocating its own. And it returns a seq of (source, offset, length) which are intended to be blitted onto whatever the consumer wants to propagates the data too. This barely allocates at all (it currently does allocate relatively big chunks when escaping strings, but we have planned to improve this in the future. (The sequence returns a source to blit; this source is physically equal to buffer most of the time but not always; specifically, for large strings that are present within the json data, the sequence just points to them as a source.)

Note that the description above is a simplification: there is a bit more to it than that. Also note that all this is still Work In Progress. Check out Construct seq (!5) · Merge requests · Nomadic Labs / json-data-encoding · GitLab (the value to json lexeme sequence code) and json streaming (!19) · Merge requests · Nomadic Labs / data-encoding · GitLab (the json lexeme sequence to string sequence code).

4 Likes

iter is in each val iter : … function of the stdlib :slightly_smiling_face: . It doesn’t need to be in the stdlib as there is no need for a shared type definition.

Note that while it can be faster (especially with flambda) it’s also more limited in its use cases. The json lexing library described by @raphael-proust would not be possible with iter (or rather only with inversion of control).

1 Like

Something to keep in mind about inversion of control: it’s not trivial to use with an I/O monad. This is one of the reasons that we went for the Seq rather than passing a callback through the abstraction layers. The Seq approach leaves the control near the main loop of our RPC server which knows about Lwt, it thus able to sequence writes correctly and yield as appropriate.