How to compare Lists

How can I compare more than two lists and return the longest from them? (Is it possible when I do not know an exact number of lists?)

you can save both the current longest list and its length so you only need to iterate once.

Naive idea: pass a list of lists, recursively compute the length of each element and save the max into an accumulator variable.

You mean you need to find the longest list from a list of lists? Then you just need to find the max element, e.g.,

(* file demo.ml *)
open Base

let longest = List.max_elt ~compare:(fun x y ->
    compare (List.length x) (List.length y))

To build use (once)

dune init exe demo --libs base

To run use

dune exec ./demo.exe
1 Like

Maybe you want to compute lengths lazily, for instance if you have one list that is much longer than the others. Essentially, you’d want a generalization of List.compare_lengths. You may do it like this.

(* Return the index of the (first) longest list. *)
let longest_list : 'a list list -> int =
  let pop_one tail =
    begin match tail with
    | (_, [])    -> None
    | (i, _::tl) -> Some (i, tl)
    end
  in
  let rec longest tails =
    begin match tails with
    (* no candidate remaining: won’t happen *)
    | [] ->
        assert false
    (* only one candidate remaining: we have a winner *)
    | (i, _) :: [] ->
        i
    (* at least two candidates remaining: continue selection *)
    | (i, _) :: _ ->
        (* eliminate empty candidates, pop an element from others: *)
        begin match List.filter_map pop_one tails with
        (* if we have just eliminated all remaining candidates, then
         * then they were all the same length; we return the first one
         * (not fair!) *)
        | []     -> i
        (* otherwise, continue selection *)
        | tails' -> longest tails'
        end
    end
  in
  fun lists ->
    begin match lists with
    | [] -> raise Not_found (* cannot find the longest of no lists *)
    | _  -> longest (List.mapi (fun i li -> (i, li)) lists)
    end

The idea of this code is to repeatedly pop an element from each list, eliminating tails as soon as they become empty. When only one tail remains, that is the tail of your longest list. In order to recover the full list at that step, you need to have paired the tails with some bit of info. Here I pair them with the index of their corresponding list in the list of lists but, depending on your needs, could pair it with the list itself (using (fun _ li -> (li, li))).

Wouldn’t you take advantage of List.compare_lengths here? (In the standard library since OCaml 4.05).

open Base

let longest = List.max_elt ~compare:List.compare_lengths

Thank you everyone <3

I am not sure that this function is in the Base's List, iirc it is not.

Should be then available as Caml.List.compare_lengths.

It’s worth noting, @Maelan, that while worrying about the cost of List.length, the function you’ve introduced to replace it does masses of allocation! You can generalise List.compare_lengths without needing Base; max_elt is just a fold:

let biggest_list = function
| [] -> invalid_arg "biggest_list"
| lists ->
    let f max list =
      if List.compare_lengths list max > 0 then
        list
      else
        max
    in
    List.fold_left f [] lists

You can get a sense of the allocations going on in your function by compiling code with -dlambda and looking for makeblock calls (there’s always one at the end of a module for the module itself): e.g. the above biggest_list:

(let
  (biggest_list/80 =
    (function lists/82
      (if lists/82
        (let
          (f/83 =
             (function max/85 list/86
               (if
                 (>
                    (apply (field 1 (global Stdlib__list!)) list/86 max/85)
                   0)
                 list/86 max/85)))
          (apply (field 22 (global Stdlib__list!)) f/83 0a lists/82))
        (apply (field 0 (global Stdlib!)) "biggest_list"))))

compared, for example, with your pop_one:

(let
  (pop_one/81 =
     (function tail/83
        (let (*match*/161 =a (field 1 tail/83))
           (if *match*/161
             (makeblock 0
               (makeblock 0 (field 0 tail/83) (field 1 *match*/161)))
             0a))))
1 Like

I would advise against mixing two standard libraries. I also strive for simplicity and readability and don’t think that novice users need to be overwhelmed by different variants of libraries and premature optimizations of such a simple function.

1 Like

Indeed! My function minimizes the number of cons cells read (so has the right asymptotic complexity), but it does keep allocating lists and pairs (options do not really matter, I hope they are optimized away with enough inlining). We can avoid it using mutable fields and a mutable linked list, but… at that point I don’t think there is any advantage over your function.

Initially I thought that, maybe, folding with List.compare_lengths might have a worse complexity, but now that I think about it, I realize it is just a factor 2*. So your allocation-less function is much better.

[*: If L is the length of the longest list, and ℓ1, ℓn are the lengths of all lists but the longest one, then the number of cons cells read by my function is ℓ1 + … + ℓn + max(ℓ1, …, ℓn), while with your function it is 2ℓ1 + … + 2ℓn, because List.compare_lengths xs ys reads 2 * min (length xs) (length ys) cons cells.]

Your complexity analysis ignores the call to List.filter_map in your function - it builds a fresh list of lists each time with one element removed from each list (and empty lists then filtered). That’s expensive - much more expensive than traversing the lists.

A micro-benchmark: two random lists of lists, both with a total of million items (i.e. same number of cons cells). l1 has 40730 lists where the largest individual list is 49 elements long; l2 has just 35 lists where the largest individual list is 49813 elements. Calling that 100 times with each function gives something comparable for me.

In bytecode, biggest_list takes roughly 2.5 seconds on l1 and 1.5s on l2 and performs no allocations so is unimpeded by the GC. longest_list takes 21s on l1 (and does 4389 minor collections and 200 major collections) and 6s on l2 (4264 minor collections and 0 major collections).

Native code doesn’t affect the relative performance of the functions, and neither benefits from flambda with -O3.