[ANN] Bytesrw 0.1.0 – Composable byte stream readers and writers

Hello,

It’s my pleasure to announce the first release of the bytesrw library:

Bytesrw extends the OCaml Bytes module with composable, memory-efficient byte stream readers and writers compatible with effect-based concurrency.

Except for byte slice lifetimes, these abstractions intentionally abstract away resource management and the specifics of reading and writing bytes.

Bytesrw is distributed under the ISC license. It has no dependencies.

Optional support for compressed and hashed bytes depends, at your wish, on the zlib, libzstd, blake3, libmd and xxhash C libraries.

The only reason I was longing for OCaml algebraic effects was so that I could avoid using them: when you write codecs on byte streams, it should not be your concern where the bytes come from or where they are headed. The bytesrw library provides structures to abstract this. Additionally, it establishes a buffer ownership discipline that enables byte streams to (de)compose while remaining memory efficient.
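To make the composition claim concrete, here is a minimal sketch of a filter that turns any reader into another reader without knowing where the bytes come from. The code is mine, not from the library's docs; the Bytes.Reader and Bytes.Slice functions it uses follow my reading of the bytesrw documentation, so treat the exact signatures as assumptions:

```ocaml
(* Sketch only: assumes Bytesrw's Bytes.Reader.{make,read,of_string,
   to_string} and Bytes.Slice.{is_eod,to_bytes,of_bytes}. *)
open Bytesrw

(* A filter: wrap any reader into a reader of uppercased bytes.
   It neither knows nor cares where the bytes come from. *)
let uppercase (r : Bytes.Reader.t) : Bytes.Reader.t =
  let read () =
    let slice = Bytes.Reader.read r in
    if Bytes.Slice.is_eod slice then slice (* end of stream *) else
    Bytes.Slice.of_bytes
      (Stdlib.Bytes.uppercase_ascii (Bytes.Slice.to_bytes slice))
  in
  Bytes.Reader.make read

let () =
  let r = uppercase (Bytes.Reader.of_string "hello, stream") in
  print_string (Bytes.Reader.to_string r)  (* HELLO, STREAM *)
```

The optional compression and hashing streams presumably compose in the same shape: functions from reader to reader, or writer to writer.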

I do not expect the library to change much and it has already seen some use. But it's new, and practice may call for adjustments. Do not hesitate to get in touch if you run into problems or see obvious defects or improvements. I do expect the library to grow more conveniences (e.g. for processing lines and UTF) and more optional stream formats over time.

Homepage: https://erratique.ch/software/bytesrw
Docs: https://erratique.ch/software/bytesrw/doc or odig doc bytesrw
Install: opam install bytesrw conf-zlib conf-zstd conf-libblake3 conf-libmd conf-xxhash (opam PR)

This first release was made possible thanks to a grant from the OCaml Software Foundation. I also thank my donors for their support.

Best,

Daniel


Not in the library itself, but jsont has a JSON codec that uses bytesrw for crunching bytes; see this topic. So if you need to compress those wasteful JSON serializations, you are all set.


It’s a slightly different design from my iostream, and towards a different goal. It corresponds to my buffered reader/writer for the most part. I’m a bit surprised by the non-empty requirement for slices, but it’s a thoughtful design.

If history is any indicator, bytesrw might see good adoption, in which case I’m looking forward to it getting into the stdlib and my using it too :).

I’d however suggest having distinct packages for each feature that has additional dependencies (e.g. zstd, blake3, etc.), because it then becomes easier to explicitly pull in bytesrw-zstd.


Thanks for your comment @c-cube. I’m glad you like the design.

Having had a cursory look at them, I think they are quite different: there’s no notion of resource in bytesrw, and I don’t see any form of buffer ownership discipline enabling the buffer sharing between readers, writers and their clients that bytesrw allows.

Also note that in bytesrw you can’t choose how much you read, and writers can’t choose how much they get written to; see this part of the design notes.

I think it’s good not to introduce the notion of buffering at this level. It can be quite economical. For example, in jsont we work directly on the slices provided by the byte stream reader, needing only a temporary 4-byte buffer for the cases where a UTF-8 encoded character spans two slices.

I don’t think that’s written in the design notes. So here it goes.

In a corner of my head I remember dealing, a long time ago, with a stream system that allowed empty slices. It was quite brittle and messy to coordinate, because not everyone agreed on what they meant. Some stream implementations would simply not expect them and fail or run into infinite loops, others would interpret them as a signal for flushing, and others as the end of stream. It was unpleasant.

That doesn’t mean you should not allow them. But it indicates that if you allow them you should be very clear about their meaning.

Another thing I learned by implementing FRP systems for a language with side effects is that you should not invoke interconnected networks of functions gratuitously when nothing is happening. The semantics of bytesrw’s readers and writers is that you only get to observe finite parts of streams, slice by slice. Does it make sense to observe an empty slice, or an arbitrary number of them, between two bytefuls? Not much, in my opinion, except if you want to take an asynchronous stance; but bytesrw definitely takes the direct-style, blocking stance. It seems to me that for these semantics the natural interpretation of an empty slice is: there are no more bytes in this stream; end of stream (bonus point: we don’t need an option or another signal to denote it).
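Concretely, with that interpretation a consumer is just a loop that stops on the first empty slice. A minimal sketch, assuming the Bytes.Reader.read, Bytes.Slice.is_eod and Bytes.Slice.length functions:

```ocaml
open Bytesrw

(* Fold over a stream slice by slice; Slice.eod, the only empty
   slice, signals the end of stream: no option type is needed. *)
let count_bytes (r : Bytes.Reader.t) =
  let rec loop n =
    let slice = Bytes.Reader.read r in
    if Bytes.Slice.is_eod slice then n
    else loop (n + Bytes.Slice.length slice)
  in
  loop 0

let () = assert (count_bytes (Bytes.Reader.of_string "abc") = 3)
```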

Now, deciding this doesn’t solve all your problems. You no longer have the problem of each stream implementer interpreting empty slices in incompatible ways, but you have another one: it’s very easy to produce empty slices if you are not careful. That’s the reason why the slice API is very explicit in distinguishing operations that may return empty slices (_or_eod suffixes, option values) from those that do not and loudly raise Invalid_argument if they do.
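For illustration, a sketch of the two constructor families as I understand them (make_or_eod being the _or_eod counterpart of Bytes.Slice.make; treat both signatures as assumptions):

```ocaml
open Bytesrw

let empty = Stdlib.Bytes.create 0

(* Slices are non-empty: an empty range loudly raises. *)
let () =
  match Bytes.Slice.make empty ~first:0 ~length:0 with
  | _ -> assert false
  | exception Invalid_argument _ -> ()

(* The _or_eod variant returns the Slice.eod sentinel instead. *)
let () =
  let s = Bytes.Slice.make_or_eod empty ~first:0 ~length:0 in
  assert (Bytes.Slice.is_eod s)
```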


> Having had a cursory look at them, I think they are quite different: there’s no notion of resource in bytesrw, and I don’t see any form of buffer ownership discipline enabling the buffer sharing between readers, writers and their clients that bytesrw allows.
>
> Also note that in bytesrw you can’t choose how much you read, and writers can’t choose how much they get written to; see this part of the design notes.

They’re not exactly the same, indeed, but they should work in similar use cases. iostream has a lower-level unbuffered reader that is BYOB (bring your own buffer), and I like that bytesrw specifies that the lifetime of the slices it returns extends only until the next call to read. I consider this close enough to the notion of buffering I value, since it means read can use an internal buffer from which the slices come.

I suppose a difference is that it’s harder to write some combinators in the absence of explicit buffers (Iostream.In_buf.t is there partly because it’s the one clean way I know of to implement input_line and similar things).

> I think it’s good not to introduce the notion of buffering at this level. It can be quite economical. For example, in jsont we work directly on the slices provided by the byte stream reader, needing only a temporary 4-byte buffer for the cases where a UTF-8 encoded character spans two slices.
>
> I don’t think that’s written in the design notes. So here it goes.
>
> In a corner of my head I remember dealing, a long time ago, with a stream system that allowed empty slices. It was quite brittle and messy to coordinate, because not everyone agreed on what they meant. Some stream implementations would simply not expect them and fail or run into infinite loops, others would interpret them as a signal for flushing, and others as the end of stream. It was unpleasant.
>
> That doesn’t mean you should not allow them. But it indicates that if you allow them you should be very clear about their meaning.

That’s great, indeed. I think it’d be good if a standard byte slice accepted empty slices without fuss (perhaps with a smart constructor that returns the preallocated eod when the size is zero), but enforcing the invariant that slices are non-empty in readers/writers is a good idea.

And anyway it reflects what lower-level APIs do to signal eod, which is to return 0 as the result of a read or write. :+1:
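That correspondence makes adapting such APIs direct. A sketch, assuming Bytes.Reader.make and the Slice.make_or_eod constructor mentioned above (reader_of_fd itself is a hypothetical helper, not something bytesrw necessarily provides):

```ocaml
open Bytesrw

(* reader_of_fd is a hypothetical helper, not part of bytesrw.
   Unix.read returning 0 maps directly to Slice.eod. Reusing one
   buffer is fine here: slices are only valid until the next read. *)
let reader_of_fd ?(slice_length = 65536) fd =
  let b = Stdlib.Bytes.create slice_length in
  let read () =
    let n = Unix.read fd b 0 slice_length in
    Bytes.Slice.make_or_eod b ~first:0 ~length:n
  in
  Bytes.Reader.make read
```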

Anyway, looking forward to seeing how this goes.
