It’s my pleasure to announce the first release of the bytesrw library:
Bytesrw extends the OCaml Bytes module with composable, memory efficient, byte stream readers and writers compatible with effect-based concurrency.
Except for byte slice lifetimes, these abstractions intentionally separate away resource management and the specifics of reading and writing bytes.
Bytesrw is distributed under the ISC license. It has no dependencies.
Optional support for compressed and hashed bytes depends, at your wish, on the C zlib, libzstd, blake3, libmd, and xxhash libraries.
The only reason I was longing for OCaml algebraic effects was so that I could avoid using them: when you write codecs on byte streams it should not be a concern where your bytes are coming from or headed to. The bytesrw library provides structures to abstract this. Additionally it establishes a buffer ownership discipline that enables byte streams to (de)compose while remaining memory efficient.
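Concretely, the shape of the abstraction can be pictured like this. This is a deliberately simplified sketch with hand-rolled types, not the library’s actual signatures:

```ocaml
(* Simplified sketch of the idea, not bytesrw's actual signatures. *)

(* A slice is a non-empty range of bytes of a [bytes] value; by convention
   the (single) empty slice denotes the end of the stream. *)
type slice = { bytes : bytes; first : int; length : int }

(* A stream reader is a pull function, a stream writer a push function. *)
type reader = unit -> slice
type writer = slice -> unit

(* A codec written against these types neither knows nor cares whether the
   bytes come from a file, a socket, memory or an effect-based scheduler. *)
let copy (read : reader) (write : writer) =
  let rec loop () =
    let s = read () in
    write s;
    if s.length > 0 then loop () (* length = 0 means end of stream *)
  in
  loop ()
```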
I do not expect the library to change much, and it has already seen some use. But it is new, and practice may call for adjustments. Do not hesitate to get in touch if you run into problems or see obvious defects or improvements. I do expect the library to grow more convenience (e.g. for processing lines and UTF) and more optional stream formats over time.
It is not part of the library, but jsont has a JSON codec that uses bytesrw for crunching bytes; see this topic. So if you need to compress those wasteful JSON serializations, you are all set.
It’s a slightly different design than my iostream, and towards a different goal. It corresponds to my buffered reader/writer for the most part. I’m a bit surprised by the non-empty requirement for slices, but it’s a thoughtful design.
If history is any indicator, bytesrw might see good adoption, in which case I’m looking forward to it getting into the stdlib and my using it too :).
I’d however suggest having distinct packages for each feature that has additional dependencies (e.g. zstd, blake, etc.), because it’s then easier to explicitly pull in bytesrw-zstd.
Thanks for your comment @c-cube. I’m glad you like the design.
Having had a cursory look at them, I think they are quite different: there’s no notion of resource in bytesrw, and I don’t see any form of buffer ownership discipline that enables the buffer sharing between readers, writers and their clients that bytesrw allows.
Also note that in bytesrw you can’t choose how much you read and writers can’t choose how much they get written to; see this part of the design notes.
I think it’s good not to introduce the notion of buffering at this level. It can be quite economical. For example, in jsont we work directly on the slices provided by byte stream readers, needing only a temporary 4-byte buffer for those cases where a UTF-8 encoded character spans two slices.
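For illustration, here is roughly how that trick works. This is a hand-rolled sketch with an assumed next_slice function, not jsont’s or bytesrw’s actual code:

```ocaml
(* Sketch only. [next_slice] is assumed to pull the next non-empty slice
   as (bytes, first, length); position bookkeeping after the decode and
   handling of truncated streams are elided. Needs OCaml >= 4.14 for
   [Bytes.get_utf_8_uchar]. *)
let utf_8_decode_at ~next_slice (bytes, first, length) i =
  (* Sequence length, read off the first byte (valid input assumed). *)
  let b0 = Char.code (Bytes.get bytes i) in
  let need =
    if b0 < 0x80 then 1 else if b0 < 0xE0 then 2
    else if b0 < 0xF0 then 3 else 4
  in
  let avail = first + length - i in
  if need <= avail then
    (* Common case: decode directly in the reader's slice, no copy. *)
    Bytes.get_utf_8_uchar bytes i
  else begin
    (* The sequence spans two slices: stash its head in a 4-byte buffer,
       complete it from the next slice and decode from the buffer. *)
    let buf = Bytes.create 4 in
    Bytes.blit bytes i buf 0 avail;
    let (bytes', first', _) = next_slice () in
    Bytes.blit bytes' first' buf avail (need - avail);
    Bytes.get_utf_8_uchar buf 0
  end
```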
I don’t think that’s written in the design notes. So here it goes.
In a corner of my head I remember dealing, a long time ago, with a stream system that allowed empty slices. It was quite brittle and messy to coordinate because not everyone agreed on what they meant. Some stream implementations simply did not expect them and would fail or run into infinite loops, others would interpret them as a signal to flush, and others as the end of stream. It was unpleasant.
That doesn’t mean you should not allow them. But it indicates that if you allow them you should be very clear about their meaning.
Another thing I learned by implementing FRP systems for a language with side effects is that you should not invoke interconnected networks of functions gratuitously if nothing is happening. The semantics of bytesrw’s readers and writers is that you only get to observe finite parts of streams, slice by slice. Does it make sense to observe an empty slice, or an arbitrary number of them, between two bytefuls? Not much, in my opinion, unless you want to take an asynchronous stance, but bytesrw definitely takes the direct-style, blocking stance. It seems to me that for this semantics the natural interpretation of an empty slice is: there are no more bytes in this stream; end of stream (bonus point: we don’t need an option or another signal to denote it).
Deciding this doesn’t solve all of your problems, though. You no longer have the problem of each stream implementer interpreting empty slices in incompatible ways. However you now have another problem: it’s very easy to produce empty slices if you are not careful. That’s the reason why the slice API is very explicit in distinguishing operations that may return empty slices (_or_eod suffixes, option values) from those that do not and loudly raise Invalid_argument if they do.
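In code, the convention amounts to something like this. This is a toy sketch of the discipline, not the actual Bytesrw.Bytes.Slice interface:

```ocaml
(* Toy sketch of the convention, not the actual Bytesrw.Bytes.Slice API. *)
type slice = { bytes : bytes; first : int; length : int }

(* The single, preallocated empty slice: it means end of stream (eod). *)
let eod = { bytes = Bytes.empty; first = 0; length = 0 }
let is_eod s = s.length = 0

(* Operations that promise non-empty slices raise loudly on emptiness... *)
let make bytes ~first ~length =
  if length = 0 then invalid_arg "slice length is 0"
  else { bytes; first; length }

(* ...while the [_or_eod] variants make the empty case explicit by
   collapsing it to [eod]. *)
let make_or_eod bytes ~first ~length =
  if length = 0 then eod else { bytes; first; length }
```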
Having had a cursory look at them, I think they are quite different: there’s no notion of resource in bytesrw, and I don’t see any form of buffer ownership discipline that enables the buffer sharing between readers, writers and their clients that bytesrw allows.
Also note that in bytesrw you can’t choose how much you read and writers can’t choose how much they get written to; see this part of the design notes.
They’re not exactly the same, indeed, but they should work in similar use cases. iostream has a lower-level, unbuffered reader that is BYOB, and I like that bytesrw specifies that the lifetime of the slices it returns lasts only until the next call to read. I consider this close enough to the notion of buffering I value, since it means read can use an internal buffer from which slices come.
I suppose a difference is that it’s harder to write some combinators in the absence of explicit buffers (Iostream.In_buf.t is there partly because it’s the one clean way that I know of to implement input_line and similar things).
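To illustrate the point, here is a sketch with a toy buffered-reader record, not Iostream’s actual In_buf.t interface: with an explicit buffer you can scan for the newline in place and copy out only the line.

```ocaml
(* Toy buffered reader, for illustration only (not Iostream's API).
   [refill] writes fresh bytes at the start of [buf] and returns how many,
   0 meaning end of input. *)
type bufread = {
  buf : bytes;
  mutable pos : int;      (* next unread byte in [buf] *)
  mutable len : int;      (* number of valid bytes in [buf] *)
  refill : bytes -> int;
}

let input_line br =
  let out = Buffer.create 64 in
  let rec loop () =
    if br.pos >= br.len then begin
      br.len <- br.refill br.buf; br.pos <- 0;
      if br.len = 0
      then (if Buffer.length out = 0 then None else Some (Buffer.contents out))
      else loop ()
    end else match Bytes.index_from_opt br.buf br.pos '\n' with
    | Some i when i < br.len ->          (* newline found in the buffer *)
        Buffer.add_subbytes out br.buf br.pos (i - br.pos);
        br.pos <- i + 1; Some (Buffer.contents out)
    | _ ->                               (* consume the buffer and refill *)
        Buffer.add_subbytes out br.buf br.pos (br.len - br.pos);
        br.pos <- br.len; loop ()
  in
  loop ()
```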
I think it’s good not to introduce the notion of buffering at this level. It can be quite economical. For example, in jsont we work directly on the slices provided by byte stream readers, needing only a temporary 4-byte buffer for those cases where a UTF-8 encoded character spans two slices.
I don’t think that’s written in the design notes. So here it goes.
In a corner of my head I remember dealing, a long time ago, with a stream system that allowed empty slices. It was quite brittle and messy to coordinate because not everyone agreed on what they meant. Some stream implementations simply did not expect them and would fail or run into infinite loops, others would interpret them as a signal to flush, and others as the end of stream. It was unpleasant.
That doesn’t mean you should not allow them. But it indicates that if you allow them you should be very clear about their meaning.
That’s great, indeed. I think it’d be good if a standard byte slice accepted empty slices without fuss (perhaps with a smart constructor that returns the preallocated eod if the size is zero), but enforcing the invariant that slices are non-empty in readers/writers is a good idea. And anyway it reflects what lower-level APIs do to signal eod, which is to return 0 as the result of a read or write.
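For example (a hand-rolled sketch, not the library’s actual constructors), a reader pulling from a Unix file descriptor maps the 0-byte read that signals end of file directly to the eod slice:

```ocaml
(* Sketch: a pull function over a Unix file descriptor. The 0 returned by
   [Unix.read] at end of file becomes the empty (eod) slice. *)
type slice = { bytes : bytes; first : int; length : int }
let eod = { bytes = Bytes.empty; first = 0; length = 0 }

let reader_of_fd ?(slice_length = 65536) fd : unit -> slice =
  let buf = Bytes.create slice_length in
  fun () ->
    match Unix.read fd buf 0 slice_length with
    | 0 -> eod                                     (* end of stream *)
    | n -> { bytes = buf; first = 0; length = n }  (* valid until next pull *)
```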
I have used Angstrom in the past to parse highly structured byte streams. Angstrom focuses on the parsing aspect (and claims to be efficient), and I assume this library does something else, but I would be curious to hear about its support for reading structured byte streams.
I’m not sure I understand what you mean by “highly structured byte streams”. But in any case it’s not the aim of this library to provide you with parsing combinators. This library is about streaming bytes and crunching them in an efficient manner.
A mental model for what is being proposed is: composable {In,Out}_channel structures, without the notion of resource and compatible with any kind of IO, including effect-based ones; the library abstracts them as simple functions.
In other words, if you ever had the problem of having to decompress a byte stream coming from an In_channel.t in order to hand the result to another library that accepts an In_channel.t for input, then this library solves that problem, provided both libraries use the byte stream abstractions it provides.
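A hedged sketch of what that composition looks like; the function names below (Bytes.Reader.of_in_channel, Bytesrw_zstd.decompress_reads) are quoted from memory as assumptions and should be checked against the actual interfaces:

```ocaml
(* Hedged sketch: function names are assumptions from memory, check the
   bytesrw and bytesrw.zstd documentation for the real signatures. *)
open Bytesrw

let decode_compressed ic ~decode =
  let r = Bytes.Reader.of_in_channel ic in       (* pull bytes from the channel *)
  let r = Bytesrw_zstd.decompress_reads () r in  (* decompress them on the fly *)
  decode r               (* the consumer only ever sees a byte stream reader *)
```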
That last condition is the reason why, if this scheme works well in practice and gets used by the larger community, I would like to eventually propose the Bytesrw.Bytes part of the library for addition to the standard library. But don’t hold your breath on that for now.
Regarding decoding byte streams into higher-level data structures (which is what I understand by your “highly structured byte streams”), this library offers you nothing more than what {In,Out}_channel do. In fact it offers even less, because you don’t get to choose the size of what you read or get to write (for memory efficiency). A buffering structure should be layered on top and can be quite minimal (for example, as mentioned above, a UTF-8 byte stream decoder can work directly on a stream’s byte slices without copying, except for a 4-byte buffer for those encoded characters that may span two slices).
OK, so no, you don’t get direct, easy support for parsing binary formats. But I don’t really see that as different from parsing UTF-8 based textual formats.
Basically you need a decoding structure that holds the current slice, and then a few primitives to read fixed-size integers (or, more generally, to make sure n bytes can be read) and to handle reads that overlap two slices (I did this a long time ago in this DICOM non-blocking bytes reading abstraction). Once you have your primitives it’s all direct-style read_uint16, read_uint64, and beneath you work directly over the stream slices, eschewing copies except for overlapping reads.
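A minimal sketch of such a decoding structure, with hand-rolled types (not an actual bytesrw module): it holds the current slice and a position, and falls back to a scratch buffer only when a fixed-size read straddles slices.

```ocaml
(* Illustrative sketch with hand-rolled types (not a bytesrw module).
   [next_slice] pulls the next non-empty slice as (bytes, first, length)
   and raises on a premature end of stream. *)
type decoder = {
  next_slice : unit -> bytes * int * int;
  mutable bytes : bytes;  (* current slice *)
  mutable pos : int;      (* next byte to read in [bytes] *)
  mutable max : int;      (* one past the last valid byte in [bytes] *)
}

(* Ensure [n] contiguous bytes and return where to read them; copy into a
   scratch buffer only when the read overlaps two (or more) slices. *)
let take d n =
  if d.max - d.pos >= n then begin
    let b, i = d.bytes, d.pos in d.pos <- d.pos + n; (b, i)
  end else begin
    let scratch = Bytes.create n in
    let have = d.max - d.pos in
    Bytes.blit d.bytes d.pos scratch 0 have;
    d.pos <- d.max;
    let filled = ref have in
    while !filled < n do
      let bytes, first, len = d.next_slice () in
      let use = min len (n - !filled) in
      Bytes.blit bytes first scratch !filled use;
      d.bytes <- bytes; d.pos <- first + use; d.max <- first + len;
      filled := !filled + use
    done;
    (scratch, 0)
  end

let read_uint16 d = let b, i = take d 2 in Bytes.get_uint16_be b i
let read_uint64 d = let b, i = take d 8 in Bytes.get_int64_be b i (* as int64 *)
```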
ZIP is a little bit different because in ZIP you need to seek (the central directory lives at the end of the archive) if you want to work in a streaming fashion.
I think that formats that need seeking can be handled by providing a function, say seek: int -> unit, and functions reader and writer that give you a reader or writer that starts pulling and pushing bytes from the seeked point. I have refrained for now from introducing a Bytes.Seekable.t structure, but having a type for that could eventually be a good idea (if only to share the code that creates them from an In_channel, fds or whatever IO abstraction you deal with).
The idea is that, for example in the case of ZIP, you would have a handle on an archive abstraction that has a seek function. After having seeked to and parsed the central directory, a function like Zipc.File.to_binary_string would simply seek to the right place and return a decompressing reader from the seeked point.
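In OCaml terms the handle could look roughly like this; a purely hypothetical sketch, nothing of the sort exists in the library today:

```ocaml
(* Hypothetical sketch of a seekable byte source; not part of bytesrw. *)
type slice = { bytes : bytes; first : int; length : int }
type reader = unit -> slice

type seekable = {
  seek : int -> unit;       (* move the read position in the underlying IO *)
  reader : unit -> reader;  (* a reader pulling bytes from the current position *)
}

(* E.g. for a ZIP archive: seek to a member's offset found in the central
   directory, then stream its (decompressing) contents. *)
let read_member (s : seekable) ~offset ~decompress_reads =
  s.seek offset;
  decompress_reads (s.reader ())
```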
Another library in the same space is my iostream, which uses objects to provide modular, buffered and unbuffered, optionally seekable versions of the stdlib channels.
Unlike bytesrw however, iostreams are resources and provide a close method (and possibly a flush as well for buffered writers). The rest is similar, including the possibility of applying compression to a stream.
Hopefully one of these abstractions makes it one day into the stdlib, but I’m also not holding my breath.