[ANN] SZXX v4 (and Eio)

GitHub link

I’ve recently released v4 of SZXX and it’s now built on Eio instead of Lwt.

It made me realize just how much complexity we’ve all accepted as the cost of doing asynchronous IO.

The code was so thoroughly simplified thanks to Eio’s non-monadic interface (no more colored functions that infect everything with 'a Lwt.t) and its Switch concept that I was able to implement complex features and optimizations I had previously deemed too costly in development time and added complexity.
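To give a rough idea of what that buys you, here is a minimal generic Eio sketch (not SZXX code): there is no 'a Lwt.t threading through the call chain, and anything opened under the Switch is released when its scope exits, even on exceptions.

```ocaml
(* Minimal Eio sketch, not SZXX code: direct-style IO under a Switch.
   The file handle is attached to [sw] and closed automatically when
   [Switch.run] returns, whether normally or by exception. *)
let cat path =
  Eio_main.run @@ fun env ->
  Eio.Switch.run @@ fun sw ->
  let src = Eio.Path.open_in ~sw Eio.Path.(Eio.Stdenv.fs env / path) in
  Eio.Flow.copy src (Eio.Stdenv.stdout env)
```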

v4 is more than 3x faster, much easier to use correctly, and offers stronger memory usage guarantees, all thanks to Eio.


SZXX is a Streaming ZIP, XML and XLSX library.

It can stream data out of these 3 file formats even when reading from a network socket, either in constant memory or with user-defined memory usage guarantees.

All 3 formats are quite “quirky” to say the least. XLSX (aka. OOXML) is infamous for being difficult to stream. I could talk at length about all the different subspecies of ZIP files!

Whenever giving up the non-seekability requirement (needed for network streams, etc.) brings benefits, SZXX offers both interfaces: an easier and/or more performant function that may “jump” around the file, and a more advanced non-seeking function.
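For the non-seeking functions, the caller hands SZXX the data incrementally as it arrives. Purely as an illustration (the names below are mine, not the library’s API), adapting a non-seekable Eio flow such as a socket into a pull-style chunk feed can look like this:

```ocaml
(* Illustrative only: [feed_of_flow] is not SZXX's API. It turns any
   non-seekable Eio flow (e.g. a socket) into a function that returns
   successive chunks until EOF, the shape a streaming parser can consume
   in constant memory. *)
let feed_of_flow flow =
  let buf = Cstruct.create 4096 in
  fun () ->
    match Eio.Flow.single_read flow buf with
    | len -> Some (Cstruct.to_string (Cstruct.sub buf 0 len))
    | exception End_of_file -> None
```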

Hundreds of hours of benchmarking, optimization and testing have gone into this latest release to squeeze out as much performance as possible and I’m extremely pleased with the result.

I hope it proves useful to the OCaml ecosystem. Feel free to ask any questions about SZXX, Eio, XLSX, ZIP, XML, etc.

30 Likes

Thanks for your update and for maintaining the code! SZXX is in my toolbox at work and I’m very happy with it!

1 Like

I’m a bit curious about why you need to embed all the Eio machinery rather than have simple functions to do the IO (and seek, if it’s needed).

One of the reasons why effects are nice for codecs is precisely that they allow you to separate the concern of getting/pushing bytes from the final mechanism that does so (synchronous, asynchronous, whatever; see the pattern here if that doesn’t resonate).
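For concreteness, something in this spirit (the names are purely illustrative):

```ocaml
(* Purely illustrative: a decoder written against [read] never needs to know
   whether the bytes come from a string, a file, a socket or an effect
   handler; the caller picks the mechanism. *)
type reader = { read : bytes -> int -> int -> int }  (* like [input]; returns 0 at EOF *)

let reader_of_string s =
  let pos = ref 0 in
  let read buf off len =
    let n = min len (String.length s - !pos) in
    Bytes.blit_string s !pos buf off n;
    pos := !pos + n;
    n
  in
  { read }
```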

I’d actually be quite interested in being able to parse .xlsx files with OCaml, having relied on ad-hoc tools or Python in the past, but your dependency cone makes it a bit of a non-starter for me (eio, core).

4 Likes

I can understand Eio, because a lot of people will be using it soon (we hope); but I’m puzzled by the use of open! Core everywhere.

I would add that if you were able to switch from Lwt to Eio, it’s because, among other things, the authors of xmlm and decompress chose not to depend on something like Eio or Lwt :wink: .

I don’t think this uses xmlm. But sure, in general, not depending on something unless you absolutely need to is a better path to writing lean, low-maintenance and composable software.

Whew, lots of comments about Core! I haven’t tried removing it before, simply because this project evolved out of a larger closed-source project, and versions prior to v4 relied on it much more than v4 does. So it might happen as part of a version 4.1 or something like that.

Removing Eio would be a tad more difficult, since Eio.Stream is a large part of why the interface is so nice. I’m leveraging Eio heavily and I don’t want to recreate it at a different abstraction level. Processing XLSX in particular involves fairly tricky resource management (I wouldn’t call it a codec), so SZXX makes heavy use of promises, switches, and streams internally, all of which would have to be duplicated. SZXX is probably using more of Eio than the applications that call SZXX! Removing Core, though: yes.
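To illustrate the Eio.Stream point with a toy example (not SZXX’s internals): a bounded stream gives you backpressure for free, which is exactly how “streaming” turns into a memory usage guarantee.

```ocaml
(* Toy example, not SZXX internals: with a capacity of 16, the producer
   blocks as soon as the consumer falls 16 items behind, so memory stays
   bounded no matter how large the input is. *)
let () =
  Eio_main.run @@ fun _env ->
  let rows = Eio.Stream.create 16 in
  Eio.Fiber.both
    (fun () ->
       for i = 1 to 1_000 do Eio.Stream.add rows (Some i) done;
       Eio.Stream.add rows None (* end-of-stream marker *))
    (fun () ->
       let rec drain acc =
         match Eio.Stream.take rows with
         | Some i -> drain (acc + i)
         | None -> Printf.printf "sum = %d\n" acc
       in
       drain 0)
```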

This is why I posted it here, it’s interesting to see what people care about.

I don’t think this uses xmlm. But sure, in general, not depending on something unless you absolutely need to is a better path to writing lean, low-maintenance and composable software.

Correct, it does not use xmlm. About limiting dependencies: it’s a funny feeling to be on the receiving end of a comment I routinely make to others :sweat_smile: Software development is a matter of tradeoffs, and with the number of large improvements in v4, I judged it more pressing to release this version as-is (with Core, just like v1-3) than to delay its release any longer.

2 Likes

I figured it was something like that. Larger, established professional teams tend to rely heavily on Jane Street libraries internally, and their published open-source software tends to reflect that. Congrats on the huge release!

1 Like

I’m not asking you to. I hope at some point we can find simple, function-based IO abstractions for streams and seekable bytes and upstream them into the Stdlib as a good basis for people to use in their codecs (e.g. in the style of @c-cube’s iostream).

In fact, what rather got me curious was:

because that’s personally only one of the aspects of Eio I find rather unconvincing, since it allows users to build complex side-effecting code whose boundaries are large and potentially difficult to understand. So I was curious how you were using that in the library.

I find it funny how everyone seems on board with structured concurrency (basically, your activities’ life spans match the scoping rules of your language) but then has no problem reintroducing something that makes it, once again, more difficult than it could be to understand the life spans of your concurrent code.

2 Likes

Thanks for linking iostream :-).

Now to derail the thread further: I have a branch of it that uses objects. It’s actually a pretty good fit, but I’m worried it’ll put some people off. Not sure where to best ask for feedback about that…

About structured concurrency: please take what I say with a full shaker of salt, but my understanding of switches is that they’re only used to cancel the entire subtree of scopes under the switch? Intuitively that’s like raising an exception in regular blocking code, except that here the lexical scopes form a tree and not just a list, so the effect has to propagate “up” in all the scopes under the switch. But that’s still kind of intuitive, if you want cancellation.
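If my reading of the docs is right, a toy version of that propagation would look roughly like this (hedged, I may be off on the details):

```ocaml
(* My rough understanding of Eio switches: if one fiber fails, its siblings
   under the same switch are cancelled, and the switch re-raises the
   exception once everything has finished. *)
let () =
  Eio_main.run @@ fun env ->
  let clock = Eio.Stdenv.clock env in
  try
    Eio.Switch.run @@ fun sw ->
    Eio.Fiber.fork ~sw (fun () -> Eio.Time.sleep clock 60.0);  (* gets cancelled *)
    Eio.Fiber.fork ~sw (fun () -> failwith "boom")
  with Failure _ -> print_endline "whole subtree cancelled"
```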

1 Like

Switches are values that are manually threaded. That’s both error-prone and means you can have arbitrary interleavings of cancellation scopes that are potentially non-interfering, which I find needlessly confusing.

In affect, cancellation scopes are simply aligned with fiber scopes (which are function scopes), which I find much more palatable to work with and reason about – not to mention it’s one less concept in the picture.

Please just don’t :-)

Global open of a module should be a compiler warning. :nerd_face:
The compiler message should be “At line 42: Hey! This is not Haskell code!”.

1 Like

@dbuenzli Great (on topic!) question. It’s going to be a long answer and I want to caveat it by making it clear that this is just my opinion at this point in time. My opinions on this topic aren’t held as strongly as my opinions on other related topics, so my views are likely to evolve radically in the near future. I’m going to talk about the “Eio Switch” because I don’t have enough experience with other forms of structured concurrency.

I see the Switch as a stopgap measure. It’s a compromise, and I expect that better solutions will take over in the medium to long term.

Yes, the Switch allows users to create overly large boundaries. To that I could point out that any tool can be misused, but that’s ignoring the fact that some tools are easier to misuse than others, or even encourage misuse due to their API design. The Eio developers have been talking about how the docs should emphasize that nesting Switches is a good thing, and that Switch scopes should be as fine-grained as possible.

I don’t see the Switch as a replacement for e.g. a proper dependent-type-based resource management system (or “whatever else comes next”). I see it as a replacement for the ad hoc code that unfortunately NEEDS to simultaneously juggle 2+ streams, 2+ file descriptors and 3+ promises, split across multiple functions responding to various unpredictable events and failures. If the order in which these events occurred were more strongly defined, the code could be organized better, but there exist real situations that can’t be simplified or refactored any further, and you’re left with a kind of “irreducible knot of resource management”. In those situations, the Switch really shines.
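As a deliberately tiny sketch of what I mean (toy code, not SZXX’s actual internals): one switch owning a file handle, a background fiber and the stream between them, all released together however the scope exits.

```ocaml
(* Toy sketch, not SZXX internals: the switch owns the file handle, the
   background fiber and (indirectly) the stream connecting them. Leaving
   [Switch.run], whether normally, by exception or by cancellation,
   releases everything at once. *)
let first_line env path =
  Eio.Switch.run @@ fun sw ->
  let file = Eio.Path.open_in ~sw Eio.Path.(Eio.Stdenv.fs env / path) in
  let lines = Eio.Stream.create 8 in
  Eio.Fiber.fork ~sw (fun () ->
    let buf = Eio.Buf_read.of_flow file ~max_size:max_int in
    Eio.Stream.add lines (Eio.Buf_read.line buf));
  Eio.Stream.take lines
```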

The Switch feels like bringing a GC into a world of manual memory management. You can still shoot yourself in the foot with a GC by creating a web of far-reaching and/or long-lived references across your program. Similarly, the Switch can be passed by side effect and end up recreating a sort of “shadow parallel scope” permeating throughout a code base. Overall, I see it as an improvement over the current state of things in most languages (including in OCaml). The potential for misuse is less than the status quo. Comparing a Switch to a GC trivializes the debate so I won’t push the analogy further.

Another advantage of the Switch is the minimal cognitive overhead it imposes on the developer. It’s fewer things to worry about, and the developer can feel safe in the knowledge that their background thread, file descriptors, etc. will all be closed when exiting the Switch’s scope. That leads to a different style of code, one less encumbered by edge-case management. This in turn brings one more benefit: the code is easier to refactor and evolve over time because it’s smaller and simpler without the weight of resource management. All that without the (often substantial) increase in type-level complexity seen in Rust and friends (which I also see as a kind of stopgap measure, but with almost opposite pros and cons).

I expect better solutions already exist. But the Switch is available today in a high-quality, high-performance implementation, and I think its complexity-to-safety ratio will be difficult to beat.

4 Likes

In other news, I’ve completed the refactor of SZXX from Core to Base (a small subset of Core). It will be released as v4.1 in the next few days.

Since Base doesn’t contain Date and DateTime modules, I’m now using our very own @dbuenzli’s ptime library. The total list of dependencies (5) is now: Base, Ptime, Angstrom, Decompress and Eio (plus a few PPXs). None of the 5 has a significant dependency tree of its own.
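For anyone wondering how XLSX dates map onto Ptime: spreadsheets store dates as fractional days since 1899-12-30 (“serial dates”), so the conversion is roughly the following (an illustrative sketch, not necessarily the exact code inside SZXX):

```ocaml
(* Illustrative sketch, not necessarily SZXX's exact implementation.
   25569 is the number of days between the XLSX epoch (1899-12-30) and the
   Unix epoch (1970-01-01); the time of day is naively treated as UTC and
   the usual Lotus 1-2-3 quirks are ignored. *)
let ptime_of_xlsx_serial (serial : float) : Ptime.t option =
  Ptime.of_float_s ((serial -. 25569.0) *. 86400.0)

(* ptime_of_xlsx_serial 44927.5 = Some <2023-01-01 12:00:00 UTC> *)
```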

I’m not going to reduce the list any further. Parsing XLSX is a lot closer to the duties of a library like CoHTTP or Caqti than it is to parsing JSON, XML or CSV. I think it’s easy to forget that.

4 Likes

I hope the one I’m developing in my copious spare time will be ready in time for comparisons whenever that starts to happen.

I’m refining some of the concepts that were more or less prototyped in my orsetto.cf library, where combinators for encoding and decoding structured data can be composed with combinators for I/O operations with various kinds of side effect semantics, e.g. synchronous, async/non-blocking, effects.

I’ve mostly drafted the simpler emit pipeline, i.e. 1) combinators to construct an 'a Data.scheme value that represents an encoder for type 'a to an Encoding.packet value, 2) combinators for producing a Serial.capsule from a packet and a set of encoder parameters, and 3) abstractions for writing a capsule to a channel that represents the semantics of output side effects.

I still need to draft the more complicated moral equivalent on the input side. Then I need to rewrite the languages in Orsetto to use the new platform.