Hey, I have a 10 million line file I run tests with. I test with different numbers of lines from it, e.g. first with 1k lines, then 100k, then 1M. Rather than creating several files with different line counts, what would be a feasible way of achieving this? I was thinking of using a counter, or maybe reading all the lines of the file and then taking the first 100k. However, 1M seems like a lot to hold in memory.
A counter would work. What you might've tried, and had trouble with, is breaking out of the loop early, which is what this does:
let fold_take (type acc) (f : acc -> string -> acc) (init : acc) limit ic =
  (* the limit check happens after [f] runs, so at least one line is
     always read; hence [limit] must be positive *)
  assert (limit > 0);
  let exception Return of acc in
  try
    In_channel.fold_lines
      (fun (n, acc) line ->
        let acc = f acc line in
        if n >= limit then raise (Return acc);
        (n + 1, acc))
      (1, init) ic
    |> snd
  with Return v -> v
usage in the toplevel (the bare 1 and 2 are lines typed on stdin):
# fold_take (fun () line -> print_endline ("::" ^ line)) () 2 stdin;;
1
::1
2
::2
- : unit = ()
This pattern of using a local exception to break out of a fold comes up a few times, e.g. in What is the programming pattern with multiple if else branching - #5 by jbeckford
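If you'd rather avoid the exception entirely, a plain recursive loop over In_channel.input_line does the same job. A minimal sketch, assuming OCaml >= 4.14 for the In_channel module (fold_take' is just a name I made up here):

(* read at most [limit] lines, folding [f] over them; no exceptions needed *)
let fold_take' f init limit ic =
  let rec go n acc =
    if n > limit then acc
    else
      match In_channel.input_line ic with
      | None -> acc (* end of file before [limit] lines *)
      | Some line -> go (n + 1) (f acc line)
  in
  go 1 init

The trade-off is that you give up fold_lines' ready-made loop in exchange for explicit control flow.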
Stdlib I/O buffers aggressively, so reading line-by-line should be fine for performance. I'd avoid reading the whole file into memory just to take the first n lines. A more advanced option is memory-mapping the file and working with (index, length) pairs and bigstringaf. Or, building on the mmap idea with Angstrom:
open Angstrom

(* a line is everything up to the next '\n', which is then consumed *)
let line = take_while (( <> ) '\n') <* char '\n'
let iter_lines f = many (line >>| f) *> return ()

let () =
  let fd = Unix.openfile Sys.argv.(1) [O_RDONLY] 0 in
  let st = Unix.fstat fd in
  (* map the whole file into memory as a char bigarray *)
  let map =
    Bigarray.array1_of_genarray
      (Unix.map_file fd Bigarray.char Bigarray.c_layout false [|st.st_size|])
  in
  (* note: with Consume.All the file must end in '\n', or parsing fails *)
  Angstrom.parse_bigstring ~consume:Angstrom.Consume.All
    (iter_lines print_endline) map
  |> Result.get_ok
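To tie that back to the original question, here is a hedged sketch of the (index, length) idea for taking only the first n lines out of the mapped file. first_lines and line_of are hypothetical names, not from any library, and it assumes bigstringaf's memchr and substring; check the signatures against the version you have installed. map is the same bigarray as in the snippet above:

(* collect (offset, length) pairs for the first [limit] lines *)
let first_lines (map : Bigstringaf.t) limit =
  let len = Bigstringaf.length map in
  let rec go pos n acc =
    if n >= limit || pos >= len then List.rev acc
    else
      match Bigstringaf.memchr map pos '\n' (len - pos) with
      | -1 -> List.rev ((pos, len - pos) :: acc) (* final line, no '\n' *)
      | nl -> go (nl + 1) (n + 1) ((pos, nl - pos) :: acc)
  in
  go 0 0 []

(* copy a single line out only when you actually need the string *)
let line_of map (off, len) = Bigstringaf.substring map ~off ~len

Nothing is copied until line_of is called, so bumping the limit from 1k to 1M only grows the list of offsets, not the number of string copies you pay for up front.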
Thank you, Julian, this works. #resolved