Testing with different numbers of lines from the same file

Hey, I have a 10-million-line file that I run tests with. I test with different numbers of lines, e.g. first with 1k lines, then 100k lines, then 1m lines. Rather than creating several files with different numbers of lines, what would be a feasible way of achieving this? I was thinking of using a counter, or maybe reading all the lines of the file and then taking the first 100k. However, 1m seems like a lot.

A counter would work. What you might have tried and had problems with is breaking out of the loop early, which this does with a local exception:

let fold_take (type acc) (f : acc -> string -> acc) (init : acc) limit ic =
  (* f runs before the limit check, so limit must be at least 1 *)
  assert (limit > 0);

  let exception Return of acc in
  try
    In_channel.fold_lines
      (fun (n, acc) line ->
        let acc = f acc line in
        if n >= limit then raise (Return acc);
        n + 1, acc)
      (1, init) ic
    |> snd
  with Return v -> v

Usage:

# fold_take (fun () line -> print_endline ("::" ^ line)) () 2 stdin;;
1
::1
2
::2
- : unit = ()

This early-exit pattern comes up a few times in the thread "What is the programming pattern with multiple if else branching" (post #5 by jbeckford).
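
If the exception feels heavy, the counter can also be threaded through a plain recursive loop. This is only a minimal sketch using the standard library's In_channel module (OCaml 4.14 or later); the names fold_first_lines and count_lines are made up for illustration and are not part of any library.

```ocaml
(* Hypothetical helper: fold [f] over at most [limit] lines of [ic]. *)
let fold_first_lines f init limit ic =
  let rec go n acc =
    if n >= limit then acc (* limit reached: stop without reading further *)
    else
      match In_channel.input_line ic with
      | None -> acc (* end of file before the limit *)
      | Some line -> go (n + 1) (f acc line)
  in
  go 0 init

(* Hypothetical example: count the lines actually seen, capped at [limit]. *)
let count_lines path limit =
  In_channel.with_open_text path (fun ic ->
      fold_first_lines (fun n _line -> n + 1) 0 limit ic)
```

Unlike fold_take above, this version accepts a limit of 0 (it then reads nothing), and a file shorter than the limit simply ends the fold early.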

Stdlib I/O buffers aggressively, so reading line by line should be fine for performance. I’d avoid reading the whole file into memory only to take n lines from it. A more advanced option is probably memory-mapping the file and working with (index, length) pairs and bigstringaf. Or, using Angstrom, which is built on bigstringaf:

open Angstrom

let line = take_while (( <> ) '\n') <* char '\n'
let iter_lines f = many (line >>| f) *> return ()

let () =
  let fd = Unix.openfile Sys.argv.(1) [O_RDONLY] 0 in
  let st = Unix.fstat fd in
  let map =
    Bigarray.array1_of_genarray
      (Unix.map_file fd Bigarray.char Bigarray.c_layout false [|st.st_size|])
  in
  Angstrom.parse_bigstring ~consume:Angstrom.Consume.All
    (iter_lines print_endline) map
  |> Result.get_ok
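
Coming back to the original question of running one test at several sizes: a stdlib-only sketch could reopen the file for each size and hand the test a lazy sequence of the first n lines. This assumes OCaml 4.14 or later for Seq.of_dispenser and Seq.take; with_first_lines and run_at_sizes are hypothetical names, and "big.txt" below is a placeholder path.

```ocaml
(* Hypothetical helper: apply [test] to a lazy sequence of the first [n]
   lines of the file at [path]. *)
let with_first_lines path n test =
  In_channel.with_open_text path (fun ic ->
      (* The sequence is ephemeral: [test] must consume it at most once,
         and before the channel is closed. *)
      Seq.of_dispenser (fun () -> In_channel.input_line ic)
      |> Seq.take n
      |> test)

(* Hypothetical driver: run the same test at several sizes, reopening the
   file each time so every run starts from the first line. *)
let run_at_sizes path sizes test =
  List.map (fun n -> (n, with_first_lines path n test)) sizes
```

A call might look like run_at_sizes "big.txt" [1_000; 100_000; 1_000_000] Seq.length, which pairs each requested size with the number of lines actually read. Only one line is held in memory at a time unless the test itself keeps them.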

Thank you, Julian, this works. #resolved