Pattern matching on a lazy list

I’m trying to rewrite this Perl code in OCaml. The Perl code is horrible and very imperative. It strikes me it could be better if I could use OCaml and pattern matching.

You can think of this code as something which “compresses” disk images into a simpler format, eg. a series of bytes 0 0 0 1 2 1 2 might be written as @3 (1 2)*2. The Perl script takes a disk image and tries to perform that transformation.

For example the part of the code that looks for short-period repeated data might be rewritten as:

match xs with
| a :: b :: xs when a = b ->
  let n = count_repeats [a] xs in
  printf "%d*%d" a (n+2)
| a :: b :: c :: d :: xs when [a; b] = [c; d] ->
  let n = count_repeats [a; b] xs in
  printf "(%d %d)*%d" a b (n+2)
[etc]

given a suitable count_repeats function.

This is easy enough, but the files are huge and I don’t want to have to load them all into memory. So it would also be nice to use a lazy list of some kind.

Can pattern matching be used over a lazy list like this? I tried to use the lazy(pattern) feature added in OCaml 3.11, but firstly I don’t think it can be applied to this case, and secondly I can’t really see how it would be used at all. (Does anyone have an example of real code actually using this feature?)

1 Like

Hi Rich,

What follows is just a sketch (and is not tested), but perhaps it can serve as inspiration.

If you define a type of lazy lists by

type 'a t = Nil | Cons of 'a * 'a t Lazy.t

then I think you can express what you want by

match xs with
| Cons (a, lazy (Cons (b, xs))) when a = b ->
    ...
| Cons (a, lazy (Cons (b, lazy (Cons (c, lazy (Cons (d, xs))))))) when [a; b] = [c; d] ->
    ...

Of course, everything depends on the list xs actually being read lazily, eg

let read_file ic : char t =
  let buf = Bytes.create 101 in
  let max = ref 0 in
  let pos = ref 0 in
  let rec loop () =
    if !pos >= !max then (* refill buffer *)
      (pos := 0; max := input ic buf 0 (Bytes.length buf));
    if !max = 0 then Nil else
    let c = Bytes.get buf !pos in (incr pos; Cons (c, lazy (loop ())))
  in
  loop ()

will read a file into a lazy list without ever reading more than 101 bytes at a time.

Cheers,
Nicolás

5 Likes

To add to @nojb’s excellent answer, you can also redefine the list syntax for nefarious purposes (achieving identical results as if you had used Cons and Nil, but with a fun confusion factor):

type 'a llist =
  | ( :: ) of 'a * 'a llist Lazy.t
  | []

let naturals =
  let rec inner () = Random.int 5 :: lazy (inner ()) in
  lazy (inner ())

let rec consume = function
  | lazy [] -> ()
  | lazy (n :: lazy (m :: xs)) when n = m -> Printf.printf "2 * %d\n%!" n; consume xs
  | lazy (n :: xs) ->                        Printf.printf "1 * %d\n%!" n; consume xs

;; consume naturals

Doing this is sometimes forgivable, provided llist is kept inside a module that is only opened locally.

4 Likes

Richard, here’s an implementation using Camlp5’s functional stream parsers. It’s not the best, partially b/c the implementation of functional stream parsers (pa_fstream) has a bug in the compilation of parsers (lack of tail-recursion optimization). So I coded the one function that was affected by-hand.

[Well, there’s another, but it’s not as grave there, and decided to leave it with the possibility of stack-overflow so I can just go and fix pa_fstream, since the bug is simple-to-fix.]

I don’t find this code particularly compelling (or beautiful), and it sure isn’t efficient (even interpreted Perl beats it on the 1M disk example in your perl script).

1 Like

Richard,

I had some thoughts about using pcre in order to make the matching more efficient. Then some more thoughts, and then some more. In the end, it seems like one can generate a single regexp that performs all the desired matching:

  1. for zeroes (more than some minimum number)
  2. for repeats of one-char up to {max_period=8}-char
  3. and if nothing else matches, then for a single char
    in a single regexp.

Sadly, this means there’s no need to use lazy lists, functional streams, or anything else from FP. Just … well, the gynormous sledgehammer of the full power of regexps with capture groups and backreferences. Which is (IIRC) somewhere “up there” in the algorithmic complexity hierarchy. And (of course) trust that the regexp compiler/runtime implementers have worked really, really hard for decades at getting the algorithms right. It also means that however fast one makes an Ocaml version, one can make a Perl version that is … more-or-less just as fast.

The observation that makes it work, is that regexp backreferences can be used in the match-regexp, and are enormously powerful. For instance,

$s !~ m/^(11+)\1+$/

is successful for $s that are ones of length a prime number. Insane, I know. So of course, you can write repetitions the same way. And you can use disjunction and capture-groups to match … well, here’s the full generated regexp:

# D2d2.all_pat ;;
- : string =
"(?:(\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00)\\x00*)|(?:(.)\\g{-1}+)|(?:(..)\\g{-1}+)|(?:(...)\\g{-1}+)|(?:(....)\\g{-1}+)|(?:(.....)\\g{-1}+)|(?:(......)\\g{-1}+)|(?:(.......)\\g{-1}+)|(?:(........)\\g{-1}+)|(.)"

There are 10 disjuncts

#0: at least 8 zero-bytes
#1: a char, followed by at least one repetition of that char – remember the char in a capture-group
#2#8: the same as #1, but for sequences 2, 3, 4, … 8 chars.
#9: a single char

Here’s a single disjunct: (?:(....)\\g{-1}+) and what that means is that we (a) match four arbitrary chars, (b) memorize that in a capture-group, and ( c) match at least one (or more) instance of those same four chars. The \\g{-1} references the immediately-preceding capture-group. And in the entire regexp, capture-groups are numbered left-to-right, starting with 1.

When this regular expression is successfully matched, the first capture-group that is non-empty (notice that every arm has a capture-group) tells us which arm of the disjunction was matched, and that plus the length of the matched string tells us how many repetitions (if any). From there, generating the desired output is pretty straightforward.

The only hitch (right now) is that the program reads in the entire file in one go. But this is straightforward to fixup: for zeroes or other repetitions, reduce the number of repetitions by one, shift aside the matched data, read-and-append more data from the file, and re-apply the arm of the disjunction that was matched. All very straightforward to code up.

The code for this is in the same project, in file: https://github.com/chetmurthy/sandbox-public/blob/master/nbdkit/d2d2.ml

and the key bits of code start around the value all_re.

Unsurprisingly, it’s a lot faster than the Perl script (but then, it ought to be, b/c all the heavy lifting is done by the C regexp runtime).

$ /usr/bin/time ./disk2data.pl disk
nbdkit data data="
  @0x1b8 167 171 122 252 0 0 0 0 2 0 131 32 32 0 1 0 0 0 255 7 
  @0x1fe 85 170 
  " size=1048576
0.29user 0.00system 0:00.29elapsed 100%CPU (0avgtext+0avgdata 4628maxresident)k
0inputs+0outputs (0major+278minor)pagefaults 0swaps
$ /usr/bin/time ./d2d2 disk
@0x1b8 167 171 122 252 0*4 2 0 131 32*2 0 1 0*3 255 7 @0x1fe 85 170
@0x100000
0.00user 0.00system 0:00.00elapsed 100%CPU (0avgtext+0avgdata 5044maxresident)k
0inputs+0outputs (0major+649minor)pagefaults 0swaps
2 Likes