I’m trying to write a concurrent web crawler (of sorts) in OCaml, delegating downloading to wget invoked via Unix.open_process_in, but the child processes are leaked until an invocation dies with EAGAIN. This program reproduces the problem:
let hostname () =
  let ch = Unix.open_process_in "/bin/hostname" in
  let wd = input_line ch in
  close_in ch;
  wd

let () =
  for _ = 1 to 10000 do
    ignore (hostname ())
  done
I believe the child process’s PID needs to be waited on for the process to be reaped, so I assume this isn’t happening. But I don’t have the process handle, just an in_channel; all I can do is close it, which I have done, and yet I still appear to be leaking.
How should this be coded to ensure the child processes are terminated appropriately?
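A channel returned by Unix.open_process_in should be closed with Unix.close_process_in rather than plain close_in: close_process_in both closes the channel and waits on the child, reaping its PID, whereas close_in only closes the file descriptor and leaves a zombie behind. A sketch of the repro with that one change (loop count reduced here for brevity; requires linking the unix library):

```ocaml
let hostname () =
  let ch = Unix.open_process_in "/bin/hostname" in
  let wd = input_line ch in
  (* Unix.close_process_in closes the channel AND waitpids the child,
     so its PID is reaped and no zombie accumulates. *)
  let _status : Unix.process_status = Unix.close_process_in ch in
  wd

let () =
  (* With each child reaped, this loop no longer exhausts the
     process table and never hits EAGAIN. *)
  for _ = 1 to 1000 do
    ignore (hostname ())
  done
```

The returned Unix.process_status can also be inspected (e.g. to detect a non-zero wget exit code) instead of being discarded.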
In addition to @emillon’s answer: I wouldn’t call hostname() in a loop when it yields the same result every time; call it once, outside the loop.
I recently wrote a crawler and focused the OCaml part on content extraction, implementing it as a Unix filter that reads stdin and writes stdout. All the rest is sh and curl. http://mro.name/radio-privatkopie
That was a minimal repro of the leak, not the actual program. The real solution calls wget 1.6 million times with different arguments, and it was dying well under 1% of the way through. It was easier to demonstrate with something benign like hostname.
It’s two nested curl invocations, about 6 LOC. All state is external and visible. The Unix filter is easy to test and re-run (it’s idempotent) without side effects, with all the benefits that brings.
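The sh + curl side described above might look something like this minimal sketch; the file names, curl flags, and the ./extract filter name are illustrative assumptions, not the author’s actual script:

```shell
#!/bin/sh
# Hypothetical sketch: fetch each URL listed in the input file into a
# cache, then pipe the cached page through an OCaml extraction filter.
# Usage: crawl.sh url-list.txt
mkdir -p cache
while read -r url; do
  # Derive a stable cache file name from the URL.
  out="cache/$(printf '%s' "$url" | tr -c 'A-Za-z0-9' '_')"
  # Fetch only if missing, so re-runs are idempotent.
  [ -f "$out" ] || curl -sL -A 'Mozilla/5.0' -o "$out" "$url"
  # The OCaml filter reads stdin and writes stdout.
  ./extract < "$out"
done < "${1:-/dev/null}"
```

Because all state lives in the cache directory, an interrupted crawl can simply be re-run and picks up where it left off.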
I’ve tried other web client libraries, both in OCaml (e.g. Cohttp) and elsewhere, and keep hitting the same problem: they don’t spoof browsers, either at all or not very well, and consequently get kicked by servers.
Also, maturity: how many users does GNU wget have vs. OCaml’s Piaf library (v0.1 released 11 days ago)?