High memory usage with Lwt

Disclaimer: The following question is stated a bit terribly.

I have a TaskQueue which is built on Lwt_pool and constrains the number of concurrent operations to some low value (the exact number is irrelevant here).

The function getHtml returns a promise of a relatively long string.

When I execute a few thousand of such operations, going greedily down (and thus recursing deeper and deeper), my application starts to eat a lot of memory. I assume that it’s because the promises returned by getHtml stay in memory as long as the bound promise is not resolved. It also makes sense, since in general I could attach another callback to the resolved promise.

Is there a way to have a promise that computes its result and then frees the associated resources?

I realize I could change the algorithm so that it’s not recursive (and honestly, I probably will), but I’m curious whether there’s a simple solution I’m not seeing.

open Lwt.Infix  (* provides >>= *)

let rec scrape ~depth ~queue links =
  match links with
  | [] -> Lwt.return false
  | l ->
      (* Race all sub-scrapes; pickFirst resolves as soon as one
         of them resolves to true. *)
      pickFirst
        (List.map
           (fun uri ->
             TaskQueue.submit queue (fun () ->
                 let uriString = Uri.to_string uri in
                 (if isVisited uriString then Lwt.return `Done
                  else begin
                    markVisited uriString ;
                    (* Starts a fetch *)
                    getHtml uriString
                    >>= fun fetchResult ->
                    if checkCondition fetchResult then Lwt.return `Success
                    else extractLinks fetchResult
                  end)
                 >>= fun links ->
                 (* I don't need the html that's been fetched anymore *)
                 match links with
                 | `Continue links -> scrape ~depth:(depth + 1) ~queue links
                 | `Done -> Lwt.return false
                 | `Success -> Lwt.return true ) )
           l)

You’re asking for competing things: if you want a promise that discards its value, you can just use Lwt.ignore_result. However, you still need the result of a promise, so you have to keep it until the scheduler passes it to the handler.
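For illustration, a hypothetical one-liner reusing the OP’s getHtml:

(* The fetch still runs to completion; only its value is discarded
   (an exception would be re-raised into the main loop). So this
   frees nothing earlier than an ordinary bind would. *)
let () = Lwt.ignore_result (getHtml "http://example.com")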

In your program, the problem has nothing to do with promises or the way they are handled or garbage-collected, as you may suspect. Most likely, the problem is that you’re creating lots of promises in parallel, and they all start to work as soon as they are created. It can be especially problematic if your consumers (i.e., the extractLinks or checkCondition functions) are slower than the producers (i.e., getHtml) and you’re just unable to process data at the desired rate.

You need to throttle your pipeline. There are many ways to do this; I usually use a pool of workers and a pool of tasks, implemented around Lwt_sequence.
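Here’s a minimal sketch of that pattern. I build mine around Lwt_sequence, but Lwt_stream keeps the sketch self-contained: n workers pull thunks from a shared queue, so at most n tasks are in flight no matter how fast new tasks are produced.

open Lwt.Infix

(* tasks is the task pool; each worker loops: take a task, run it,
   repeat. When the stream is closed, workers drain the queue and
   exit. *)
let run_workers n (tasks : (unit -> unit Lwt.t) Lwt_stream.t) =
  let rec worker () =
    Lwt_stream.get tasks >>= function
    | None -> Lwt.return_unit
    | Some task -> task () >>= worker
  in
  Lwt.join (List.init n (fun _ -> worker ()))

Producers push thunks with the function returned by Lwt_stream.create; closing the stream (push None) lets the workers finish what’s queued and stop.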

I do throttle it using Lwt_pool.
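For reference, TaskQueue is little more than this (a simplified sketch):

module TaskQueue = struct
  (* A pool of n unit “tokens”: submit runs a thunk only once a
     token is free, so at most n thunks run concurrently. *)
  type t = unit Lwt_pool.t

  let create n : t = Lwt_pool.create n (fun () -> Lwt.return_unit)

  let submit (q : t) f = Lwt_pool.use q (fun () -> f ())
end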

Then you’re not showing us enough code. What’s the type of TaskQueue.submit?
Also, it looks quite suspicious that you’re using the List module, not Lwt_list, for mapping uris to promises.

Speculatively, you’re just creating too big a task queue; maybe some site has too many links.

Again, this is only a guess, but it looks like you’re looking in the wrong direction. Lwt doesn’t introduce any heavy overhead.

TaskQueue.submit is the same as Lwt_pool.use. Why is using map suspicious? I’m using it because I then perform an operation on the list of promises (in pickFirst).

You’re right, I skipped one function call, so the code was incomplete. I’ve added pickFirst, which picks the first promise that resolves to true.
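Roughly, it looks like this (a simplified sketch, not my exact code):

open Lwt.Infix

(* Wait until at least one promise resolves; return true as soon as
   any of them resolves to true (cancelling the rest), otherwise
   keep waiting on the remaining pending promises. *)
let rec pickFirst = function
  | [] -> Lwt.return false
  | pending ->
      Lwt.nchoose_split pending >>= fun (resolved, pending) ->
      if List.exists (fun b -> b) resolved then begin
        List.iter Lwt.cancel pending ;
        Lwt.return true
      end
      else pickFirst pending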

If we assume that a webpage has 10 links on average, then after going 3 levels down there are going to be 1 + 10 + 100 + 1000 = 1111 nodes in the resulting tree of promises. Holding them in memory means holding the results of getHtml too, if I understand correctly. And I feel like that’s the root of the problem.

Of course I might be wrong here.

Also, you’re right that I can solve it with better “throttling”, but it would involve using a more complicated queue instead of this simple recursive algorithm.
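For the record, the non-recursive version I have in mind is something like this sketch (hypothetical; it reuses the functions from my snippet above and bounds concurrency with Lwt_stream.iter_n instead of TaskQueue; unlike pickFirst, it doesn’t stop at the first success):

open Lwt.Infix

(* An explicit frontier of URIs instead of a tree of nested
   promises: each iteration fetches one page, pushes newly found
   links back onto the frontier, and lets the html go out of scope
   immediately. A real version would also track outstanding work
   and close the stream (push None) once the frontier empties. *)
let scrape_iterative ~concurrency seeds =
  let frontier, push = Lwt_stream.create () in
  List.iter (fun u -> push (Some u)) seeds ;
  Lwt_stream.iter_n ~max_concurrency:concurrency
    (fun uri ->
       let uriString = Uri.to_string uri in
       if isVisited uriString then Lwt.return_unit
       else begin
         markVisited uriString ;
         getHtml uriString
         >>= fun fetchResult ->
         if checkCondition fetchResult then Lwt.return_unit
         else
           extractLinks fetchResult
           >>= fun result ->
           (match result with
            | `Continue links -> List.iter (fun u -> push (Some u)) links
            | _ -> ()) ;
           Lwt.return_unit
       end)
    frontier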

Just a brief update: we discussed this issue on Discord around when it was first posted, and some time later @wokalski found that the cause was likely not a memory leak in the traditional sense, but rather undesirably large growth of the major heap between major GC cycles. He may have more details at some point in the future 🙂
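For anyone who lands here with similar symptoms, one knob worth experimenting with (a suggestion only, not something verified on this program) is the major GC’s pacing via OCaml’s Gc module:

(* Lowering space_overhead (default 120) makes the major GC run
   more eagerly, trading CPU time for less heap growth between
   major collections. *)
let () =
  let c = Gc.get () in
  Gc.set { c with Gc.space_overhead = 40 }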