How to write an efficient 'cp -R' clone with Eio?

In order to make the program I am working on portable to Windows, I need to replace this function:

let copy_to_dir ~env ~cwd ~source ~dest_dir =
  run_process ~quiet:true ~env ~cwd ["cp"; "-R"; source; dest_dir ^ "/" ]

I initially planned to do this by writing a function which uses the Eio API:

(* assumes [open Eio], so that [Path] and [Flow] refer to [Eio.Path] and [Eio.Flow] *)
let cp ?(debug = false) ~source ~dest () =
  let debug_info =
    if debug then (fun p -> Eio.Std.traceln "%a" Eio.Path.pp p)
    else Fun.const ()
  in
  let debug_error =
    if debug then (fun p ex -> Eio.Std.traceln "%a: %a" Eio.Path.pp p Eio.Exn.pp ex)
    else (fun _ _ -> ())
  in
  let rec aux ~source ~dest =
    match Path.kind ~follow:false source with
    | `Directory ->
        (match Path.mkdir dest ~perm:0o700 with
         | () -> debug_info dest
         | exception ex -> debug_error dest ex);
        Path.read_dir source
        |> List.iter (function
             | item when String.starts_with ~prefix:"." item -> ()
             | item -> aux ~source:Path.(source / item) ~dest:Path.(dest / item))
    | `Regular_file ->
        Path.with_open_in source @@ fun source ->
        Path.with_open_out ~create:(`Or_truncate 0o700) dest @@ fun dest ->
        Flow.copy source dest
    | _ -> ()
  in
  (match Path.mkdir dest ~perm:0o700 with
   | () -> debug_info dest
   | exception ex -> debug_error dest ex);
  aux ~source ~dest

Unfortunately, this function is very slow, and I am now questioning my approach (a 114 MB folder takes around a minute to copy).

I don’t know how operating systems implement copy, but I feel like my implementation will necessarily be less efficient. I doubt that operating systems explicitly traverse and copy like this.

Are there any optimizations I can do to my code to make the speed acceptable, or is it a better approach to test for the kind of operating system, and then call out to external utilities?

You could try spawning a new fiber for each file copy (up to a limit); on an SSD/NVMe drive that should speed up copies.
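A minimal sketch of that idea, reusing the `aux` recursion from the question's code and the `~max_fibers` argument of `Eio.Fiber.List.iter` (the limit of 64 here is an arbitrary choice):

```ocaml
(* Sketch: bounded-concurrency variant of the directory iteration.
   ~max_fibers caps how many entries are processed at once, which also
   bounds (per directory level) how many files are open simultaneously. *)
Path.read_dir source
|> Eio.Fiber.List.iter ~max_fibers:64 (function
     | item when String.starts_with ~prefix:"." item -> ()
     | item -> aux ~source:Path.(source / item) ~dest:Path.(dest / item))
```

Note that this limit is per directory level, so deeply nested trees can still multiply it.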

You can also try separating directories from files when traversing, and process files before directories. Otherwise you may end up exploring a very deeply nested directory tree before performing any copies.

I remember trying a nested directory lister with domainslib and it was quite fast (faster than find or rg --files). I’ll have to see whether I still have the code.


To add some context, I need this in the context of a simple build system (for forester).

I simply replaced List.iter with Eio.Fiber.List.iter and ran into a “too many open files” error. Ignoring the directory which causes this (node_modules), using fibers gives a roughly 30-second speedup:

16.869218s vs 44.092870s

Now, I obviously need to further work on this so that the error can’t occur, but this is already a huge improvement!
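One way to make the error impossible regardless of nesting depth would be a global semaphore instead of (or in addition to) per-directory `~max_fibers`, since per-level limits multiply as the tree gets deeper. A sketch, assuming the `Path`/`Flow` aliases from the code above; the limit of 100 is an arbitrary choice:

```ocaml
(* Global cap on concurrent file copies, shared by all fibers.
   Each copy holds two descriptors (source + dest), so the real FD
   cost is about twice this number. *)
let copy_slots = Eio.Semaphore.make 100

let copy_file ~source ~dest =
  Eio.Semaphore.acquire copy_slots;
  Fun.protect ~finally:(fun () -> Eio.Semaphore.release copy_slots)
    (fun () ->
      Path.with_open_in source @@ fun src ->
      Path.with_open_out ~create:(`Or_truncate 0o700) dest @@ fun dst ->
      Flow.copy src dst)
```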


What did you use for the max_fibers value?

I tried various values: 10, 100, 500, and 1000; there was about 5 seconds of variation between them. Low values of max_fibers let the function succeed when copying the large node_modules folder, but it’s slower.


Several thoughts:

  1. it might be worth reading the source code of cp(1).

  2. Failing that, it might be worth running cp -r under strace(1) to see what it does in the way of syscalls.

    At a minimum, before looking to employ I/O concurrency, it’s probably worth checking whether UNIX cp does so. Similarly for parallelism.

  3. it might be worth dividing your problem into subproblems, of which there are at least two that jump out:
    3.a. copying a large file. Copy files of increasing size (power of 2) with cp(1) and your code, see what the difference is. If it’s large, that’s worth working on.
    3.b. copying small files – increasing numbers of them. Make a directory of (let’s say 1k) files and copy that directory. Make the # of files a power of 2. Again, see what the difference between cp -r and your code is; if it’s large, that’s worth working on.

  4. It’s probably worth checking these for the three different cases:
    4.a. all data fits handily in memory (so you’re not actually I/O-bound)
    4.b. your persistent device is an SSD
    4.c. your persistent device is an HDD

The reason 4.b and 4.c might (might – it’s been a long time since I worked on I/O-intensive codes) be different is that for HDDs, you sometimes need to present much more in the way of concurrent writes and reads in order to fully saturate the drive (b/c of the built-in elevator in the drive firmware). I think the term of art used to be “queue depth” back in the day.

But really, it shouldn’t be necessary to go to step #4: I think that #1-#3 should be enough to identify what’s going wrong.
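For steps 3.a/3.b, a tiny timing helper is enough to compare your code against cp on the same inputs. A sketch (nothing Eio-specific; `Unix.gettimeofday` is from the stdlib unix library, and the `label`/`f` names are made up here):

```ocaml
(* Hypothetical micro-benchmark helper: run a thunk and print its
   wall-clock duration, e.g.  time "copy 64MiB" (fun () -> cp ... ()). *)
let time label f =
  let t0 = Unix.gettimeofday () in
  let r = f () in
  Printf.printf "%s: %.3fs\n%!" label (Unix.gettimeofday () -. t0);
  r
```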

P.S. And maybe step #0 is just to check whether your “cp clone” is I/O-bound, or if it’s consuming more CPU than you think reasonable. For starters, you should be reusing memory-buffers in such a way that for a large file, the GC never kicks in, regardless of how big the file is; and ideally that should be true for a directory full of files too, as long as the number of such files is something reasonable (that is to say, maybe you use GCed storage for the filenames and book-keeping data, but not for the contents of files as they’re being copied).

It is shocking how often the source of performance problems in GCed systems code comes down to “didn’t reuse buffers”.
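In Eio terms, buffer reuse might look like the sketch below: allocate one Cstruct per fiber up front and drive the copy loop with it, instead of letting each copy allocate its own chunk buffers (the function name is hypothetical):

```ocaml
(* Copy src to dst through a single preallocated buffer. The buffer
   must not be shared between concurrently running fibers. *)
let copy_with_buf ~buf src dst =
  try
    while true do
      let got = Eio.Flow.single_read src buf in
      Eio.Flow.write dst [ Cstruct.sub buf 0 got ]
    done
  with End_of_file -> ()

let buf = Cstruct.create (64 * 1024)  (* reused for every file in this fiber *)
```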


We do quite a lot of copying in Dune as well. The 2 pieces of advice I have would be:

  • profile your program (e.g., make a flamegraph from a run that takes longer than a couple of seconds). You’ll immediately see if it spends a lot of time in the GC, for example.
  • there are some platform-specific APIs you can use to let the kernel do the copy by itself without the data passing through userspace. In Dune we use sendfile on Linux; I believe Eio will try to do something similar, but maybe you have to use eio_linux, and judging from “Hang in eio_linux tests” (issue #319 on ocaml-multicore/eio, GitHub), maybe that’s disabled?

Also strace your Eio code (with the POSIX backend).

Also have you measured the performance of copying a single file with both the ‘uring’ and posix backends of Eio (there is an env var you can use to switch between them)? There is probably quite a lot that can be tweaked there.

A quick look at the Eio code shows these inefficiencies:

      let buf = Cstruct.create 4096 in
      while true do
        let got = Src.single_read src buf in

(there is similar code in the POSIX and Eio backends). Eio also has an optional block_size = 4096 default (but also n_blocks = 64, so that is probably OK).

IIRC OCaml’s Unix module would accept up to 64k in one go, so increasing this buffer size might help reduce the number of syscalls.

Yes, maybe the default block size should be changed. You can always write a loop yourself with whatever size you like.

I’d also suggest running eio-trace on a small example. We haven’t profiled the FS stuff at all yet and there’s probably some more low-hanging fruit there!

Ideally, using fs (unconfined access) should be as fast as passing string pathnames around, and using confined access (which is safer) might be even faster if you always use open_dir so that paths the kernel has to resolve are only ever one step. I wouldn’t be surprised if it’s doing something inefficient in some cases at the moment though.
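A sketch of that open_dir pattern (the `visit` function is hypothetical; note that directories opened under a switch stay open until the switch finishes, so a single long-lived switch here trades FD usage for resolution speed):

```ocaml
(* Recurse via open_dir so each name the kernel resolves is a single
   component relative to an already-open directory handle. *)
let rec visit ~sw dir =
  Eio.Path.read_dir dir
  |> List.iter (fun item ->
       let p = Eio.Path.(dir / item) in
       match Eio.Path.kind ~follow:false p with
       | `Directory -> visit ~sw (Eio.Path.open_dir ~sw p)
       | _ -> () (* copy the regular file here *))
```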

Another dirty option could be to call a Windows-specific program such as robocopy.