I’m trying to write a concurrent web crawler (of sorts) in OCaml, delegating downloading to wget invoked via Unix.open_process_in but the processes are leaked until an invocation dies with EAGAIN. This program repros the problem:
let hostname () =
  let ch = Unix.open_process_in "/bin/hostname" in
  let wd = input_line ch in
  close_in ch;
  wd

let () =
  for i = 1 to 10000 do
    ignore (hostname ())
  done
which dies with:
Fatal error: exception Unix.Unix_error(Unix.EAGAIN, "create_process", "/bin/sh")
I believe the child process’s PID needs to be waited on for it to end, so I assume this isn’t happening. But I don’t have the process, just an in_channel, and all I can do is close it, which I have done, and yet I still appear to be leaking.
How should this be coded to ensure the child processes are terminated appropriately?
Hi. I think you’re supposed to use Unix.close_process_in to close the channel returned by Unix.open_process_in.
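For what it’s worth, here is a minimal sketch of that suggestion, assuming the program is linked against the unix library. Unix.close_process_in closes the channel and also waits for the child, returning its exit status, whereas a plain close_in only closes the channel. The names first_line and spawn_many are hypothetical helpers made up for this illustration.

```ocaml
(* Build with the unix library, e.g.:
   ocamlfind ocamlopt -package unix -linkpkg leak.ml *)

(* Hypothetical helper: run [cmd] through the shell, return its first
   line of output, and reap the child before returning. *)
let first_line cmd =
  let ch = Unix.open_process_in cmd in
  let line = try Some (input_line ch) with End_of_file -> None in
  (* close_process_in closes the channel AND waits for the child,
     so no zombie process is left behind. *)
  match Unix.close_process_in ch, line with
  | Unix.WEXITED 0, Some l -> Ok l
  | Unix.WEXITED 0, None -> Error "no output"
  | Unix.WEXITED n, _ -> Error (Printf.sprintf "exit code %d" n)
  | Unix.WSIGNALED s, _ -> Error (Printf.sprintf "killed by signal %d" s)
  | Unix.WSTOPPED s, _ -> Error (Printf.sprintf "stopped by signal %d" s)

(* Hypothetical helper: run [cmd] [n] times and count the successes. *)
let spawn_many n cmd =
  let ok = ref 0 in
  for _ = 1 to n do
    match first_line cmd with
    | Ok _ -> incr ok
    | Error _ -> ()
  done;
  !ok

(* Same shape as the repro: 10000 sequential invocations. *)
let () =
  Printf.printf "%d successful runs\n" (spawn_many 10_000 "/bin/hostname")
```

Returning a result rather than raising is just one possible design choice; with millions of wget calls it lets the caller log a failed download and carry on.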
In addition to what @emillon said, I wouldn’t call hostname() in a loop 1000x yielding the same result, but do it once outside.
I recently did a crawler and focused the OCaml part on content extraction and did it as a Unix filter reading stdin, writing stdout. All the rest is
Another question here is, why use an out-of-process wget instead of a concurrent web client library like, say, Piaf?
That’s it! No more leaking processes, thanks.
That was the minimal repro of such a leak. I’m not doing that in the actual solution. I’m calling wget 1.6 million times with different arguments and it was dying <<1% of the way through. It was easier to demonstrate with something benign like hostname.
the code simplicity seems unbeatable.
It’s 2 nested curl statements, about 6 LOC. All state is external and visible. The Unix filter is easy to test and re-run (idempotent) without side effects with all benefits.
I’ve tried other web client libraries both in OCaml (e.g. Cohttp) and elsewhere and keep hitting the problem that they don’t spoof browsers either at all or not very well and, consequently, get kicked by servers.
Also, maturity. How many users does Linux’ wget have vs OCaml’s Piaf library (v0.1 released 11 days ago)?