I’m trying to write a concurrent web crawler (of sorts) in OCaml, delegating downloading to wget invoked via Unix.open_process_in, but the child processes are leaked until an invocation dies with EAGAIN. This program reproduces the problem:
let hostname () =
  let ch = Unix.open_process_in "/bin/hostname" in
  let wd = input_line ch in
  close_in ch;
  wd

let () =
  for _ = 1 to 10000 do
    ignore (hostname ())
  done
I believe the child process’s PID needs to be waited on for the process to be reaped, so I assume this isn’t happening. But I don’t have the process handle, just an in_channel; all I can do is close it, which I have done, and yet I still appear to be leaking.
How should this be coded to ensure the child processes are terminated appropriately?
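A channel returned by Unix.open_process_in should be closed with Unix.close_process_in rather than plain close_in: close_process_in both closes the channel and waits on the child, reaping its PID, whereas close_in only closes the file descriptor and leaves a zombie behind. A sketch of the repro with that one change (loop count reduced here for brevity; requires linking the unix library):

```ocaml
let hostname () =
  let ch = Unix.open_process_in "/bin/hostname" in
  let wd = input_line ch in
  (* Unix.close_process_in closes the channel AND waitpids the child,
     so its PID is reaped and no zombie accumulates. *)
  let _status : Unix.process_status = Unix.close_process_in ch in
  wd

let () =
  (* With each child reaped, this loop no longer exhausts the
     process table and never hits EAGAIN. *)
  for _ = 1 to 1000 do
    ignore (hostname ())
  done
```

The returned Unix.process_status can also be inspected (e.g. to detect a non-zero wget exit code) instead of being discarded.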
In addition to @emillon’s answer: I wouldn’t call hostname() in a loop when it yields the same result every time; call it once, outside the loop.
I recently wrote a crawler and focused the OCaml part on content extraction, implementing it as a Unix filter that reads stdin and writes stdout. All the rest is sh and curl. http://mro.name/radio-privatkopie
That was a minimal repro of the leak, not the actual program. The real solution calls wget 1.6 million times with different arguments, and it was dying well under 1% of the way through. It was easier to demonstrate with something benign like hostname.
It’s two nested curl invocations, about 6 LOC. All state is external and visible. The Unix filter is easy to test and re-run (it’s idempotent) without side effects, with all the benefits that brings.
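The sh + curl side described above might look something like this minimal sketch; the file names, curl flags, and the ./extract filter name are illustrative assumptions, not the author’s actual script:

```shell
#!/bin/sh
# Hypothetical sketch: fetch each URL listed in the input file into a
# cache, then pipe the cached page through an OCaml extraction filter.
# Usage: crawl.sh url-list.txt
mkdir -p cache
while read -r url; do
  # Derive a stable cache file name from the URL.
  out="cache/$(printf '%s' "$url" | tr -c 'A-Za-z0-9' '_')"
  # Fetch only if missing, so re-runs are idempotent.
  [ -f "$out" ] || curl -sL -A 'Mozilla/5.0' -o "$out" "$url"
  # The OCaml filter reads stdin and writes stdout.
  ./extract < "$out"
done < "${1:-/dev/null}"
```

Because all state lives in the cache directory, an interrupted crawl can simply be re-run and picks up where it left off.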
I’ve tried other web client libraries, both in OCaml (e.g. Cohttp) and elsewhere, and keep hitting the same problem: they don’t spoof browsers, either at all or not very well, and consequently get kicked by servers.
Also, maturity: how many users does GNU wget have vs. OCaml’s Piaf library (v0.1 released 11 days ago)?