Hi, I’m trying to write a program that deals with sparse files and works on 32-bit platforms, but I’m running into some issues with the Unix module from the standard library:
Unix.ftruncate and Unix.lseek take an int argument, which only gives me a range of 0 to 2GB.
On Linux this is handled transparently for C programs: libc dispatches ftruncate() to the ftruncate64 system call, and on BSD off_t is 64 bits, so this is just the standard library being stupid; the platforms natively support files larger than 2GB.
Does anyone know of a library that exposes these libc functions with larger integers? (int64 will do, even if it cuts the maximum length in half.)
Bigarray supports memory-mapping of files (which is why it depended on unix in previous versions of OCaml). Some overlays (like my bigstring) build on that.
@c-cubeBigarray's map_file passes PROT_READ | PROT_WRITE to mmap() on Linux, which caues the mmap to fail when you ask for a MAP_SHARED on a file descriptor opened with O_RDONLY.
If you call it with MAP_PRIVATE (shared = false in the map_file interface) it works, since private mappings of read-only files can be modified, but it has the caveat that it requires allocating the entire size of the file in your process memory. If the file is, say, 30TB, that usually doesn’t work because you don’t have that much RAM.
Consider this C program; you cannot do the equivalent with Bigarray.Genarray.map_file:
Which is why I asked if someone knew of a library with bindings to mmap that support read-only files, since I don’t feel very comfortable writing C bindings with destructors in OCaml yet.
EDIT: Of course this will segfault your application if you actually write to the Bigarray. I don’t understand why OCaml doesn’t have an immutable array type, but that’s another discussion.
I believe mmap with MAP_PRIVATE has copy-on-write semantics (according to the manpage). Running a simple test with a 10GB file (my MacBook has 4GB RAM) reveals that the process allocates at least 10GB of virtual memory, but htop says the physical memory usage is ~600M (page cache?) while the program runs at steady state. I may be misinterpreting, but since my laptop didn’t grind to a halt, I don’t think it actually allocates everything if you’re not writing to the bigarray.
let ba =
let fd = Unix.openfile "ten-gb.dat" [ O_RDONLY ] 0o600 in
Unix.map_file fd Bigarray.char Bigarray.c_layout false [| 10_000_000_000 |]
|> Bigarray.array1_of_genarray
;;
let () =
(* 32-bit sum of bytes in file *)
let sum = ref 0 in
let mask = 0xFFFF_FFFF in
for i = 0 to 9_999_999_999 do
let byte = Char.code ba.{i} in
sum := (!sum + byte) land mask
done;
Printf.printf "sum is %x\n" !sum
;;
@bcc32 MAP_PRIVATE does not reserve physical pages on its own until you start modifying them; for example, it is used for the global memory of shared libraries. Otherwise this would be a security problem. (See /proc/self/maps, where rw-p signifies PROT_READ | PROT_WRITE | MAP_PRIVATE.)
The problem is the userland/kernel address-space split, which is in turn usually configured based on your chipset’s bus address range; on my system userland gets 4TB.
Try mmap’ing a 5TB or 17TB file; unless you’re using some very funky kernel (and processor), that should give you an error.
You can create a sparse file using truncate -s 5TB big.file if you want to test without actually writing 5TB to your disk.
EDIT: Sorry, just to clarify: MAP_SHARED doesn’t count towards this limit since it’s not a private allocation, so it is limited by the virtual address space instead (so on 32-bit platforms that still doesn’t get you very far).
Using MAP_PRIVATE was just a hack to work around the problem that I couldn’t mmap read-only files.
@cfcs Ah, I misunderstood the problem then. I’m not able to test your suggestion, but I thought you were referring to running out of physical memory in your previous post. Thanks for clarifying.