What's the best way to save and plot a huge amount of data in a file?

Context

I’m writing a heuristic algorithm to solve the travelling salesman problem.
I’m in a Windows environment using diskuv-ocaml.

In its fastest mode, my algorithm runs ~20-25 million simulations, and for each simulation I want to save the length found and a timestamp.

Current solution

For now, I store the data in an int * float list of (length, time) pairs, so as not to slow down the algorithm at runtime. After it finishes its run (~30 minutes), I save the data to a .txt file.
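For reference, the accumulation loop looks roughly like this (a simplified sketch; the simulation itself is stubbed out and all names are illustrative):

```ocaml
(* Sketch: accumulate (length, time) pairs in a list during the run,
   then dump everything to a text file at the end.
   The constant 42 stands in for the real simulation result. *)
let run () =
  let results = ref [] in
  for _ = 1 to 1_000 do               (* ~25M iterations in the real run *)
    let length = 42 in                (* placeholder for one simulation *)
    let time = Sys.time () in
    results := (length, time) :: !results
  done;
  let oc = open_out "results.txt" in
  List.iter (fun (l, t) -> Printf.fprintf oc "%d %f\n" l t)
    (List.rev !results);
  close_out oc
```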

Then I plot the data from the .txt file using Python and Matplotlib, but that is very memory-expensive and takes a long time.

Envisaged solution

I could use a library like vaex instead of Matplotlib, but to do that I would need to produce .hdf5 or Apache Arrow files instead of txt.
The best thing I could think of is to convert my list to an array after the algorithm finishes, and then use the hdf5 package so I can use vaex in a Python script.

Do you think that’s the best way to do it?

I assume it’s memory expensive because you need to read the entire txt file into memory before plotting? Another option might be to save all the data into a SQLite DB and load and plot the data from there in batches.

In fact having the data in a SQL DB gives you more options, like doing a query to roll up the data using statistical (window) functions and then plot the aggregated data for even faster plots.


From your description, it sounds like your algorithm is run over and over on simulated input, each time generating a (length, time) pair? If so, then your algorithm isn’t the memory-expensive thing, yes? Because it dumps out a pair, then goes around and does another run, which doesn’t need to look at previous outputs, yes?

In which case, I would first find a plotting package that performs adequately, not worrying about what input format it needs. And once I found one that performed adequately, I would write a converter from my output format to that input format. Eventually, I might change my code to output in that format, but it wouldn’t be a major concern, since the conversion should also be cheap, memory-wise.

Unless I’m mistaken, 25M rows is a lot of plotting, even for a 5K monitor. How do you smooth it out? If you aren’t already, you might want to aggregate the data a bit.
If you are looking for a subset, a database is probably the right choice, as already mentioned above.

The first thing you could do is write (in OCaml) a binary file (2 × 8 bytes × 25M = 400 MB) that is readable by packages such as NumPy and Matplotlib. Various flat formats are available (caution about endianness), in addition to the venerable CSV.
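For example, one way to write IEEE-754 doubles with a fixed (little-endian) byte order from OCaml, so that NumPy can read them back with something like np.fromfile(path, dtype='<f8') - a sketch, with made-up function names:

```ocaml
(* Write a float as 8 little-endian bytes, regardless of platform endianness. *)
let write_f64_le oc (x : float) =
  let b = Bytes.create 8 in
  Bytes.set_int64_le b 0 (Int64.bits_of_float x);
  output_bytes oc b

(* Dump (length, time) pairs as a flat sequence of doubles:
   length0, time0, length1, time1, ... *)
let dump_pairs file pairs =
  let oc = open_out_bin file in
  List.iter
    (fun (len, t) ->
      write_f64_le oc (float_of_int len);  (* store length as a double too *)
      write_f64_le oc t)
    pairs;
  close_out oc
```

Storing the int lengths as doubles keeps the file a single homogeneous array, which is the easiest shape for NumPy to consume.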

The second step would be to partition the data and generate a histogram (with a chosen bin width) or even use a kernel density estimate. That way you are not plotting 25M points, which also sidesteps the issue of many points being equal, but a summary.

See Histograms and Density Plots in Python | by Will Koehrsen | Towards Data Science, but the examples shown are one-dimensional only.
Your data, however, is two-dimensional (length and time); I am less familiar with heatmaps, two-dimensional partitions, bubble charts, and so on. Google is your friend.

Since this is OCaml here, a counting fold (something along the lines of let count partition elmt = ... in List.fold_left count empty data) could do it before writing to disk, especially if you don’t need to keep the raw data (probably in an array).
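Concretely, such a counting fold might look like this (a sketch; the bin width and the hash-table representation are arbitrary choices):

```ocaml
(* Bin lengths into a fixed-width histogram with a single fold.
   The result maps bin index (length / bin_width) to a count. *)
let histogram ?(bin_width = 10) lengths =
  let tbl = Hashtbl.create 97 in
  List.fold_left
    (fun tbl len ->
      let bin = len / bin_width in
      let count = try Hashtbl.find tbl bin with Not_found -> 0 in
      Hashtbl.replace tbl bin (count + 1);
      tbl)
    tbl lengths
```

Writing out only the (bin, count) pairs reduces 25M rows to however many bins you choose.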

What a coincidence, I wrote an Avro library very recently. The paint is still fresh. However, it might be worth giving it a try, as this is exactly the targeted use case: many rows of relatively simple data, encoded as binary. It also supports gzip compression (per “block” of N rows, with N configurable). And there’s no need to worry about endianness.

It typically uses code generation from a schema (a JSON file).
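For the data in this thread, such a schema might look something like this (the record and field names are just an example, not anything the library prescribes):

```json
{
  "type": "record",
  "name": "Simulation",
  "fields": [
    { "name": "length", "type": "long" },
    { "name": "time", "type": "double" }
  ]
}
```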

There are libraries for Avro in Java (with the whole Spark ecosystem) and also in Python (see “fastavro”).


Do you have any resources I could look at in order to do that?

I’m using a moving average (10,000 points converted into 1 point representing the average length and time) to plot fewer points. But doing it in Python is very expensive for the 25M inputs.

You might want to look at Owl, which includes wrappers for some plot libraries.

I didn’t have a good experience with the OCaml hdf5 package. It might be worth a try, and might work for you, but some functionality is broken, it’s not well documented, and it hasn’t been maintained. Maybe at some point someone will fork the repo to develop it further.


vaex looks like the package I need. I’m not sure whether I should store my data in a list while running or do it in another way.

Again, I would start with “what package and input format do the job best” and then arrange for my algorithm to generate in that format. For sure, I doubt that a list is your best choice: you might want to switch to using arrays, at a minimum. But really, getting the stuff out-of-heap quickly would be best, and if your plotting package runs in a different process from your algorithm, that shouldn’t be problematic.
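For instance, preallocating flat arrays instead of consing a list avoids both the per-element list overhead and a final list-to-array conversion (a sketch; the record type and names are made up):

```ocaml
(* Preallocate flat arrays for the results instead of building a list.
   A Bigarray would work too, and can be shared with C or Python tooling. *)
type results = {
  lengths : int array;
  times : float array;
  mutable next : int;     (* index of the next free slot *)
}

let create n =
  { lengths = Array.make n 0; times = Array.make n 0.0; next = 0 }

let record r len t =
  r.lengths.(r.next) <- len;
  r.times.(r.next) <- t;
  r.next <- r.next + 1
```

With ~25M entries this is two contiguous 200 MB and 100 MB blocks rather than millions of small heap cells, which is also much kinder to the GC.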

How can I use an Avro file to plot my results ?

You can use your favorite python plotting library and fastavro.

Store it in jsonb format in a PostgreSQL database?

Why don’t you do the moving average in ocaml and save the resulting data? It should be very fast, and then plotting the result with your favorite plotting library would be fast too.
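A non-overlapping window average (the 10,000-to-1 reduction mentioned earlier) is only a few lines in OCaml. A sketch over arrays, with an illustrative function name:

```ocaml
(* Collapse each window of [w] consecutive (length, time) samples into one
   averaged point. Leftover samples that don't fill a window are dropped. *)
let downsample w lengths times =
  let n = Array.length lengths / w in
  Array.init n (fun i ->
      let sum_l = ref 0.0 and sum_t = ref 0.0 in
      for j = i * w to ((i + 1) * w) - 1 do
        sum_l := !sum_l +. float_of_int lengths.(j);
        sum_t := !sum_t +. times.(j)
      done;
      (!sum_l /. float_of_int w, !sum_t /. float_of_int w))
```

With w = 10_000, the 25M samples shrink to 2,500 points, which any plotting library handles instantly.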

Sure. Here are some resources:

EDIT: the nice thing is that you can test out all the calculations using just the sqlite3 CLI tool before you actually implement them in code.
