What a coincidence, I wrote an
Avro library very recently. The
paint is still fresh. However, it might be worth giving it a try, as this is
exactly the use case Avro targets: many rows of relatively simple data,
encoded as binary. It also supports deflate (gzip-style) compression, applied
per “block” of N rows, with N configurable. And there’s no need to worry about
endianness.
It typically uses code generation from a schema (a JSON file).
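For illustration, a minimal record schema might look like this (the record and field names here are hypothetical, not from the original post):

```json
{
  "type": "record",
  "name": "Row",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "value", "type": "double"}
  ]
}
```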
There are Avro libraries for Java (with the whole Spark ecosystem) and
also for Python (see “fastavro”).