I initially planned to do this by writing a function which uses the Eio API:
(* Assumed context: the snippet uses [Path], [Flow] and ( / ) unqualified. *)
open Eio
let ( / ) = Eio.Path.( / )

let cp ?(debug = false) ~source ~dest () =
  let debug_info =
    if debug then (fun p -> Eio.Std.traceln "%a" Eio.Path.pp p)
    else Fun.const ()
  in
  let debug_error =
    if debug then (fun p ex -> Eio.Std.traceln "%a: %a" Eio.Path.pp p Eio.Exn.pp ex)
    else (fun _ _ -> ())
  in
  let rec aux ~source ~dest =
    match Path.kind ~follow:false source with
    | `Directory ->
      begin
        match Path.mkdir dest ~perm:0o700 with
        | () -> debug_info dest
        | exception ex -> debug_error dest ex
      end;
      Path.read_dir source
      |> List.iter (function
        | item when String.starts_with ~prefix:"." item -> ()
        | item -> aux ~source:(source / item) ~dest:(dest / item))
    | `Regular_file ->
      begin
        (* Switch.run @@ fun sw -> *)
        Path.with_open_in source @@ fun source ->
        Path.with_open_out ~create:(`Or_truncate 0o700) dest @@ fun dest ->
        Flow.copy source dest
      end
    | _ -> ()
  in
  begin
    match Path.mkdir dest ~perm:0o700 with
    | () -> debug_info dest
    | exception ex -> debug_error dest ex
  end;
  aux ~source ~dest
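For reference, a minimal sketch of how such a function might be driven from Eio_main.run (the directory names here are placeholders, not part of the original code):

let () =
  Eio_main.run @@ fun env ->
  let fs = Eio.Stdenv.fs env in
  (* "input" and "output" are placeholder directory names *)
  cp ~debug:true ~source:(fs / "input") ~dest:(fs / "output") ()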
Now, unfortunately, this function is very slow and I am questioning my approach (a 114 MB folder takes around a minute).
I don't know how operating systems implement copy, but I feel like my implementation will necessarily be less efficient. I doubt that operating systems explicitly traverse and copy like this.
Are there any optimizations I can do to my code to make the speed acceptable, or is it a better approach to test for the kind of operating system, and then call out to external utilities?
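(For comparison, and purely as a sketch: if one did decide to call out to an external utility, Eio can spawn the system cp directly. The function name below is illustrative, and this only works on Unix-like systems that provide cp(1).)

(* Hypothetical fallback: let the system cp do the work.
   [proc_mgr] would come from [Eio.Stdenv.process_mgr env];
   [source] and [dest] are plain string paths here. *)
let cp_external ~proc_mgr ~source ~dest =
  Eio.Process.run proc_mgr [ "cp"; "-r"; source; dest ]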
You could attempt to spawn a new fiber for each file copy (up to a limit); on an SSD/NVMe drive that should speed up copies.
You can also try to separate directories from files when traversing, and process files before directories; otherwise you may end up exploring a very deeply nested directory tree before performing any copies.
I remember trying a nested directory lister with domainslib and it was quite fast (faster than find, or rg --files). I'll have to see whether I still have the code.
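(Not that code, but a rough illustration of the idea: list a tree in parallel with Domainslib, spawning one task per subdirectory and keeping files separate from directories. Names are illustrative; it must be called from inside Domainslib.Task.run.)

let rec parallel_list pool dir =
  let entries = Sys.readdir dir |> Array.to_list in
  (* separate subdirectories from plain files *)
  let dirs, files =
    List.partition (fun e -> Sys.is_directory (Filename.concat dir e)) entries
  in
  (* recurse into each subdirectory as its own task *)
  let pending =
    List.map
      (fun d ->
        Domainslib.Task.async pool (fun () ->
          parallel_list pool (Filename.concat dir d)))
      dirs
  in
  List.map (Filename.concat dir) files
  @ List.concat_map (Domainslib.Task.await pool) pending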
To add some context: I need this for a simple build system (for forester).
I simply replaced List.iter with Eio.Fiber.List.iter and ran into a "too many open files" error. Ignoring the directory which causes this (node_modules), using fibers offers a 30-second speedup:
16.869218s vs 44.092870s
Now, I obviously need to work on this further so that the error can't occur, but this is already a huge improvement!
I tried various values for max_fibers: 10, 100, 500, 1000; there was about 5 seconds of variation between them. Low values of max_fibers allow the function to succeed when copying the large node_modules folder, but it's slower.
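For reference, a sketch of the bounded version, reusing the names from the function above (max_fibers:100 is just an example value, not a recommendation):

(* At most [max_fibers] entries of one directory are processed concurrently.
   Note this bounds each directory level, not the total number of files
   open across the whole recursion. *)
Path.read_dir source
|> Eio.Fiber.List.iter ~max_fibers:100 (function
  | item when String.starts_with ~prefix:"." item -> ()
  | item -> aux ~source:(source / item) ~dest:(dest / item))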
1. It might be worth reading the source code of cp(1).

2. Failing that, it might be worth strace(1)-ing a run of cp -r to see what it does in the way of syscalls. At a minimum, before employing I/O concurrency, it's probably worth checking whether UNIX cp does so; similarly for parallelism.

3. It might be worth dividing your problem into subproblems, of which at least two jump out:

3.a. Copying a large file. Copy files of increasing size (powers of 2) with cp(1) and with your code, and see what the difference is. If it's large, that's worth working on.

3.b. Copying small files, in increasing numbers. Make a directory of (let's say) 1k files and copy that directory, making the number of files a power of 2. Again, see what the difference between cp -r and your code is; if it's large, that's worth working on.

4. It's probably worth checking these for three different cases:

4.a. all the data fits handily in memory (so you're not actually I/O-bound)
4.b. your persistent device is an SSD
4.c. your persistent device is an HDD

The reason 4.b and 4.c might differ (might: it's been a long time since I worked on I/O-intensive code) is that for HDDs you sometimes need to present much more in the way of concurrent writes and reads in order to fully saturate the drive (because of the built-in elevator in the drive firmware). I think the term of art used to be "queue depth" back in the day.
But really, it shouldn't be necessary to go to step #4: I think that #1-#3 should be enough to identify what's going wrong.
P.S. And maybe step #0 is just to check whether your "cp clone" is I/O-bound, or whether it's consuming more CPU than you think reasonable. For starters, you should be reusing memory buffers in such a way that, for a large file, the GC never kicks in, regardless of how big the file is; and ideally that should be true for a directory full of files too, as long as the number of such files is reasonable (that is to say, maybe you use GCed storage for the filenames and book-keeping data, but not for the contents of files as they're being copied).
It is shocking how often the source of performance problems in GCed systems code comes down to "didn't reuse buffers".
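(A cheap way to run the "step #0" check is to compare GC counters around a copy; the helper below is only a sketch. If the counts climb with file size, the copy path is allocating per read rather than reusing buffers.)

let with_gc_report label f =
  let before = Gc.quick_stat () in
  let result = f () in
  let after = Gc.quick_stat () in
  Printf.printf "%s: %d minor / %d major collections\n%!" label
    (after.Gc.minor_collections - before.Gc.minor_collections)
    (after.Gc.major_collections - before.Gc.major_collections);
  result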
We do quite a lot of copying in Dune as well. The two pieces of advice I have would be:

1. Profile your program (e.g., make a flamegraph from a run that takes longer than a couple of seconds). You'll immediately see if it spends a lot of time in the GC, for example.

2. There are some platform-specific APIs you can use to let the kernel do the copy by itself, without the data being copied to userspace. In Dune we use sendfile on Linux. I believe Eio will try to do something similar, but maybe you have to use eio_linux, and judging from "Hang in eio_linux tests · Issue #319 · ocaml-multicore/eio · GitHub", maybe that's disabled?
Also "strace" your eio code (with the posix backend).
Also, have you measured the performance of copying a single file with both the "uring" and posix backends of Eio (there is an env var you can use to switch between them)? There is probably quite a lot that can be tweaked there.
A quick look at the Eio code shows these inefficiencies:

try
  let buf = Cstruct.create 4096 in
  while true do
    let got = Src.single_read src buf in

(similar code in the POSIX and Eio backends, in flow.ml)
Also Eio has an optional block_size=4096 in sched.ml (but also n_blocks=64, so probably OK)
IIRC OCaml's Unix module would accept up to 64k in one go, so increasing this buffer size might help reduce the number of syscalls.
Yes, maybe the default block size should be changed. You can always write a loop yourself with whatever size you like.
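A sketch of such a loop, assuming a single 64 KiB buffer allocated up front and reused for the whole file (the size is an example, not a measured optimum):

let copy_with_buffer ?(size = 65536) src dst =
  let buf = Cstruct.create size in   (* allocated once, reused every iteration *)
  try
    while true do
      let got = Eio.Flow.single_read src buf in
      Eio.Flow.write dst [ Cstruct.sub buf 0 got ]
    done
  with End_of_file -> ()             (* single_read signals EOF this way *)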
I'd also suggest running eio-trace on a small example. We haven't profiled the FS stuff at all yet and there's probably some more low-hanging fruit there!
Ideally, using fs (unconfined access) should be as fast as passing string pathnames around, and using confined access (which is safer) might be even faster if you always use open_dir so that the paths the kernel has to resolve are only ever one step. I wouldn't be surprised if it's doing something inefficient in some cases at the moment, though.
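(As a sketch of the one-step-path idea, with illustrative names rather than Eio internals: open the directory once, and every subsequent child lookup the kernel performs resolves a single path component under that already-open directory.)

let iter_entries ~sw dir f =
  let dir = Eio.Path.open_dir ~sw dir in
  Eio.Path.read_dir dir
  |> List.iter (fun item -> f Eio.Path.(dir / item))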