Can performance be improved for this code?

I wrote an awk script a little while back to convert some chess data to CSV. It handles about 160Mb of chess PGN data in about 1.9 seconds piped into it via stdin. I built an equivalent OCaml version yesterday, and stripped it down to something more-or-less the same (awk version does a bit more) so that I could measure performance. I’ve included the code below.

I’m running it as native code (built with dune), piping in the same file. It processes the file in 2.7 seconds versus 1.9 seconds with AWK (run using mawk). I was really hoping to be able to substantially beat the awk version.

Is there something that stands out that I can do to improve this code in order to beat my awk version? I’d like to keep it immutable though please.

Ocaml version:

module StringMap = Map.Make(String)

let re_pairs = Re2.create_exn "([A-Za-z]+) \"([^\"]+)"
let re_line = Re2.create_exn "^(.+)$"

let to_pairs x =
  try
    let match_ = Re2.first_match_exn re_pairs x in
    let k = Re2.Match.get match_ ~sub:(`Index 1) in
    let v = Re2.Match.get match_ ~sub:(`Index 2) in
    Some ((Option.value k ~default:""),(Option.value v ~default:""))
  with
    _ -> try Some ("Game", (Re2.find_first_exn re_line x))
         with _ -> None

let rec to_map ?(m=StringMap.empty) f =
  match f () with
  | None -> to_map ~m:m f
  | Some ("Game" as k,v) -> StringMap.add k v m
  | Some (k,v) -> to_map ~m:(StringMap.add k v m) f

let to_csv m =
  let f k= StringMap.find k m in
  Printf.printf "%s,%s,%s,%s,%s,%s,%s\n"
    (f "White") (f "Black") (f "TimeControl")
    (f "Result") (f "UTCDate") (f "WhiteElo")
    (f "BlackElo")

let rec parse() =
  let read() = to_pairs (read_line()) in
  try
    (to_map read) |> to_csv;
    parse()
  with End_of_file -> ()

let ()=parse()

AWK reference version:

{
    st = index($0, " ");
    n  = substr($0, 2, st-2);
    v  = substr($0, st+1);
    gsub(/["\]\.]/, "", v)
    x[n] = v;
    if (n == "Termination") {
	tc = x["TimeControl"];
	gsub(/[\+\-]/, ",", tc);
	res = "0";
	if (x["Result"] == "1-0") res="1";
	if (x["Result"] == "0-1") res="-1";
	printf "%s,%s,%s,%s,%s,%s,%s\n",
	    x["White"], x["Black"], tc, res,
	    x["UTCDate"], x["WhiteElo"], x["BlackElo"];
	delete x;
    }
}

You are creating a lot of useless closures and your parse function is not tail recursive. Try to make your parsing code look like:

let parse () =
  let rec loop acc = match read_line () with 
  | exception End_of_file -> acc 
  | line -> 
       match to_pairs line with 
       | None -> loop acc 
       | Some (k, v) -> loop (StringMap.add k v acc) 
  in
  loop StringMap.empty
3 Likes

Nice :slight_smile: That cuts off a few lines of code too. I’ve also cut out one of the regex expressions. It now runs in 2.2s versus awk at 1.9s. Latest version below. What more can be done?

module StringMap = Map.Make(String)

let re_pairs = Re2.create_exn "([A-Za-z]+) \"([^\"]+)"

let to_pairs x =
  try
    let match_ = Re2.first_match_exn re_pairs x in
    let k = Re2.Match.get match_ ~sub:(`Index 1) in
    let v = Re2.Match.get match_ ~sub:(`Index 2) in
    Some ((Option.value k ~default:""),(Option.value v ~default:""))
  with
    _ -> if (String.length x) > 0 then Some ("Game", "") else None

let to_csv m =
  let f k= StringMap.find k m in
  Printf.printf "%s,%s,%s,%s,%s,%s,%s\n"
    (f "White") (f "Black") (f "TimeControl")
    (f "Result") (f "UTCDate") (f "WhiteElo")
    (f "BlackElo")

let () =
  let rec loop m = match read_line () with
  | exception End_of_file -> ()
  | line -> 
       match to_pairs line with
       | None -> loop m
       | Some ("Game",_) -> (to_csv m; loop StringMap.empty)
       | Some (k, v) -> loop (StringMap.add k v m)
  in
  loop StringMap.empty

This may be a useful hint. If I reduce the whole programme to nothing but reading stdin it still takes 1.8s (of what is otherwise 2.2s total run time) !!!

let () =
  let rec loop () = match read_line () with
  | exception End_of_file -> ()
  | _ -> loop()
  in
  loop()

To that effect – unless it is the recursion itself – it seems to me that read_line may be the culprit. I tried the Jane Street version from Stdio: it made no difference.

I don’t know if it’s much faster, but for reading files I now do

let data = Arg.read_arg "day_2021_01.txt"

You get an array of strings you can then manipulate.

The program mostly handles data and therefore I’d expect the bottleneck to be I/O. Not much can be done about that. So 1.8s is probably your minimum for this amount of data.
The question is what takes up the other 0.4s. Two culprits jump to mind:

  1. Printf, which may be costly. You can try a Buffer to combine the strings.
  2. StringMap, which is seeing a lot of allocation. Try a Hashtbl instead to see if it makes a difference.
1 Like

Also, it seems like mawk is hyper-optimized for these kinds of tasks. It may not be a reasonable target to compete with.

1 Like

Do you need to use standard input or could you read from a regular file? If the latter, you might want to look into Unix.map_file: OCaml library : Unix

Not sure how I’d read one line at a time from there, but it would give you more direct access to the file’s contents.

2 Likes

awk in general is pretty damn fast for this kind of tasks

2 Likes

Thank you for the tips! I’m not keen to use Hashtbl because it is mutable.

I checked to see how long it took to just pipe the data directly from the file into /dev/null and it was 1.74 seconds. So I don’t think that it is OCaml related.

You are exactly right – its about the remainder. awk does it in 1.9-1.74 ~ 0.16s whilst OCaml needs 0.46s – about 2-3x slower.

printf takes about 0.2secs of the time.

1 Like

That’s useful to know, but I do need to use stdin. Besides, I assume awk would also be faster if I ran it over a file directly :slight_smile:

Isn’t Printf, especially after the GADT rewrite, plenty efficient? And it’s already buffered…

I wonder how slightly improving the memory layout, getting rid of try, and reducing Re calls would improve perf… Something like:

type matched = None | Game | Pair of string * string

let to_pairs x =
  match Re2.first_match re_pairs x with
  | Ok m ->
      let [|_;k;v|] = Re2.Match.get_all m in
      Pair (Option.value k ~default:"", Option.value v ~default:"")
  | Error _ ->
      if String.length x <> 0 then Game else None

(I don’t use Re2 so IDK if I’m using the API right. specifically whether get_all returns the entire match on the 0th index)

I tried all the bits except “get_all” – it returns an array of string option and its very ugly. The rest of it didn’t make a difference. I would doubt that get_all would make a difference since the get function is indexed.

Why are you using Re2 anyway ? The AWK version doesn’t use regular expressions, so by using them in OCaml you introduce an unnecessary cost (if your aim is to compare both implementations). You can replicate the AWK script using only the String module (except for gsub, but it’s not clear why you need that).

5 Likes

Care to share the data-file, and instructions for how to invoke, so we can reproduce?

So I re-wrote to_pairs, dropping regex, and using only the String module. Run time is now at ~2.0s versus AWKs ~1.9s.

let to_pairs x =
  let n = String.length x in
  match String.sub x 0 1 with
  | exception _ -> None
  | "[" ->
     (match String.index_from_opt x 2 ' ' with
     | Some i -> Some (String.sub x 1 (i-1), String.sub x (i+2) (n-i-4))
     | None   -> None)
  | "1" -> Some ("Game", "")
  | _ -> None

It feels like this is about as close as its going to get – beating awk in OCaml doesn’t feel likely for this sort of simple processing.

1 Like

Sure no problem!

Data is here.

Copy/paste the awk code from above to parse.awk and run:

time bzcat lichess_db_standard_rated_2013-01.pgn.bz2 | \
mawk -f parse.awk | wc -l

Replace the mawk bit with own exe and see if you can beat awk :slight_smile:

I think that’s right. It’s just a tiny bit of awk interpretation and mostly just calls efficient C code.

Oh, hm, the way you’re doing “time” isn’t going to give you the CPU time for the actual program. I did it differently:

chet@digg:~/Hack/Camlp5/src/lichess$ bzcat lichess_db_standard_rated_2013-01.pgn.bz2 | time mawk -f parse.awk | wc -l
1.28user 0.03system 0:03.02elapsed 43%CPU (0avgtext+0avgdata 2644maxresident)k
16inputs+0outputs (1major+148minor)pagefaults 0swaps
121332
chet@digg:~/Hack/Camlp5/src/lichess$ bzcat lichess_db_standard_rated_2013-01.pgn.bz2 | time ./lichess  | wc -l
2.46user 0.11system 0:03.78elapsed 68%CPU (0avgtext+0avgdata 21988maxresident)k
0inputs+0outputs (0major+3118minor)pagefaults 0swaps
121332

which means 1.28sec for mawk, 2.46sec for OCaml.

I haven’t done anything to try to optimize the OCaml; just figured I’d report back.

on my machines the two are nearly identical:

real    0m 4.14s
user    0m 1.75s
sys     0m 0.08s

for mawk, and

real    0m 4.17s
user    0m 1.32s
sys     0m 0.10s

for ocaml