What's the best way to save and plot a huge amount of data in a file?

Context

I’m writing a heuristic algorithm to solve the travelling salesman problem.
I’m in a Windows environment using diskuv-ocaml.

In its fastest mode, my algorithm runs ~20-25 million simulations, and for each simulation I want to save the length found and a timestamp.

Current solution

For now, I store the data in an int * float list of (length, time) pairs, so as not to slow down the algorithm at runtime. After it finishes its run (~30 minutes), I save the data to a .txt file.
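For reference, the accumulation loop looks roughly like this (a simplified sketch; the simulation itself is stubbed out and all names are illustrative):

```ocaml
(* Sketch: accumulate (length, time) pairs in a list during the run,
   then dump everything to a text file at the end.
   The constant 42 stands in for the real simulation result. *)
let run () =
  let results = ref [] in
  for _ = 1 to 1_000 do               (* ~25M iterations in the real run *)
    let length = 42 in                (* placeholder for one simulation *)
    let time = Sys.time () in
    results := (length, time) :: !results
  done;
  let oc = open_out "results.txt" in
  List.iter (fun (l, t) -> Printf.fprintf oc "%d %f\n" l t)
    (List.rev !results);
  close_out oc
```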

Then I plot the data from the .txt file using Python and Matplotlib, but that is very memory-expensive and takes a long time.

Envisaged solution

I could use a library like vaex instead of Matplotlib, but to do that I would need to produce .hdf5 or Apache Arrow files instead of txt.
The best thing I could think of is to convert my list to an array after the algorithm finishes, and then use the hdf5 package so I can use vaex in a Python script.

Do you think that’s the best way to do it?

I assume it’s memory expensive because you need to read the entire txt file into memory before plotting? Another option might be to save all the data into a SQLite DB and load and plot the data from there in batches.

In fact having the data in a SQL DB gives you more options, like doing a query to roll up the data using statistical (window) functions and then plot the aggregated data for even faster plots.


From your description, it sounds like your algorithm is run over and over on simulated input, each time generating a (length, time) pair? If so, then your algorithm isn’t the memory-expensive thing, yes? Because it dumps out a pair, then goes around and does another run, which doesn’t need to look at previous outputs, yes?

In which case, I would first find a plotting package that performs adequately, not worrying about what input format it needs. And once I found one that performed adequately, I would write a converter from my output format to that input format. Eventually, I might change my code to output in that format, but it wouldn’t be a major concern, since the conversion should also be cheap, memory-wise.

Unless I’m mistaken, 25M rows is a lot of plotting, even for a 5K monitor. How do you smooth it out? If you aren’t already, you might want to aggregate the data a bit.
If you are looking for a subset, a database is probably the right choice, as already mentioned above.

The first thing you could do is write (in OCaml) a binary file (2 × 8 bytes × 25M = 400 MB) that is readable by packages such as NumPy and Matplotlib. Various flat formats are available (caution about endianness), in addition to the venerable CSV.
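For example, one way to write IEEE-754 doubles with a fixed (little-endian) byte order from OCaml, so that NumPy can read them back with something like np.fromfile(path, dtype='<f8') - a sketch, with made-up function names:

```ocaml
(* Write a float as 8 little-endian bytes, regardless of platform endianness. *)
let write_f64_le oc (x : float) =
  let b = Bytes.create 8 in
  Bytes.set_int64_le b 0 (Int64.bits_of_float x);
  output_bytes oc b

(* Dump (length, time) pairs as a flat sequence of doubles:
   length0, time0, length1, time1, ... *)
let dump_pairs file pairs =
  let oc = open_out_bin file in
  List.iter
    (fun (len, t) ->
      write_f64_le oc (float_of_int len);  (* store length as a double too *)
      write_f64_le oc t)
    pairs;
  close_out oc
```

Storing the int lengths as doubles keeps the file a single homogeneous array, which is the easiest shape for NumPy to consume.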

The second step would be to partition the data and generate a histogram (with a chosen bin width) or even use a kernel density estimate. That way you are not plotting 25M points, which also sidesteps the issue of many points being equal, but a summary.

See Histograms and Density Plots in Python | by Will Koehrsen | Towards Data Science, but the examples shown are one-dimensional only.
Your data, however, is two-dimensional (length and time); I am less familiar with heatmaps, two-dimensional partitions, bubble charts, and so on. Google is your friend.

Since this is OCaml here, a counting fold (something along the lines of let count partition elmt = ... in List.fold_left count empty data) could do it before writing to disk, especially if you don’t need to keep the raw data (probably in an array).
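Concretely, such a counting fold might look like this (a sketch; the bin width and the hash-table representation are arbitrary choices):

```ocaml
(* Bin lengths into a fixed-width histogram with a single fold.
   The result maps bin index (length / bin_width) to a count. *)
let histogram ?(bin_width = 10) lengths =
  let tbl = Hashtbl.create 97 in
  List.fold_left
    (fun tbl len ->
      let bin = len / bin_width in
      let count = try Hashtbl.find tbl bin with Not_found -> 0 in
      Hashtbl.replace tbl bin (count + 1);
      tbl)
    tbl lengths
```

Writing out only the (bin, count) pairs reduces 25M rows to however many bins you choose.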

What a coincidence, I wrote an Avro library very recently. The paint is still fresh. However, it might be worth giving it a try, as this is exactly the targeted use case: many rows of relatively simple data, encoded as binary. It also supports gzip compression (per “block” of N rows, with N configurable). And there’s no need to worry about endianness.

It typically uses code generation from a schema (a JSON file).
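For the data in this thread, such a schema might look something like this (the record and field names are just an example, not anything the library prescribes):

```json
{
  "type": "record",
  "name": "Simulation",
  "fields": [
    { "name": "length", "type": "long" },
    { "name": "time", "type": "double" }
  ]
}
```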

There are libraries for Avro in Java (with the whole Spark ecosystem) and also in Python (see “fastavro”).


Do you have any resources I could look at in order to do that?

I’m using a moving average (10,000 points converted into 1 point representing the average length and time) to plot fewer points. But doing it in Python is very expensive for the 25M inputs.

You might want to look at Owl, which includes wrappers for some plot libraries.

I didn’t have a good experience with the OCaml hdf5 package. It might be worth a try, and might work for you, but some functionality is broken, it’s not well documented, and it hasn’t been maintained. Maybe at some point someone will fork the repo to develop it further.


vaex looks like the package I need. I’m not sure whether I should store my data in a list while running or do it in another way.

Again, I would start with “what package and input format do the job best” and then arrange for my algorithm to generate in that format. For sure, I doubt that a list is your best choice: you might want to switch to using arrays, at a minimum. But really, getting the stuff out-of-heap quickly would be best, and if your plotting package runs in a different process from your algorithm, that shouldn’t be problematic.
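For instance, preallocating flat arrays instead of consing a list avoids both the per-element list overhead and a final list-to-array conversion (a sketch; the record type and names are made up):

```ocaml
(* Preallocate flat arrays for the results instead of building a list.
   A Bigarray would work too, and can be shared with C or Python tooling. *)
type results = {
  lengths : int array;
  times : float array;
  mutable next : int;     (* index of the next free slot *)
}

let create n =
  { lengths = Array.make n 0; times = Array.make n 0.0; next = 0 }

let record r len t =
  r.lengths.(r.next) <- len;
  r.times.(r.next) <- t;
  r.next <- r.next + 1
```

With ~25M entries this is two contiguous 200 MB and 100 MB blocks rather than millions of small heap cells, which is also much kinder to the GC.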

How can I use an Avro file to plot my results ?

You can use your favorite python plotting library and fastavro.

Store it in jsonb format in a PostgreSQL database?

Why don’t you do the moving average in ocaml and save the resulting data? It should be very fast, and then plotting the result with your favorite plotting library would be fast too.
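A non-overlapping window average (the 10,000-to-1 reduction mentioned earlier) is only a few lines in OCaml. A sketch over arrays, with an illustrative function name:

```ocaml
(* Collapse each window of [w] consecutive (length, time) samples into one
   averaged point. Leftover samples that don't fill a window are dropped. *)
let downsample w lengths times =
  let n = Array.length lengths / w in
  Array.init n (fun i ->
      let sum_l = ref 0.0 and sum_t = ref 0.0 in
      for j = i * w to ((i + 1) * w) - 1 do
        sum_l := !sum_l +. float_of_int lengths.(j);
        sum_t := !sum_t +. times.(j)
      done;
      (!sum_l /. float_of_int w, !sum_t /. float_of_int w))
```

With w = 10_000, the 25M samples shrink to 2,500 points, which any plotting library handles instantly.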

Sure. Here are some resources:

EDIT: the nice thing is that you can test out all the calculations using just the sqlite3 CLI tool before you actually implement them in code.
