Richard,
I had some thoughts about using pcre
in order to make the matching more efficient. Then some more thoughts, and then some more. In the end, it seems like one can generate a single regexp that performs all the desired matching:
- for zeroes (more than some minimum number)
- for repeats of one-char up to {max_period=8}-char
- and if nothing else matches, then for a single char
in a single regexp.
Sadly, this means there’s no need to use lazy lists, functional streams, or anything else from FP. Just … well, the gynormous sledgehammer of the full power of regexps with capture groups and backreferences. Which is (IIRC) somewhere “up there” in the algorithmic complexity hierarchy. And (of course) trust that the regexp compiler/runtime implementers have worked really, really hard for decades at getting the algorithms right. It also means that however fast one makes an Ocaml version, one can make a Perl version that is … more-or-less just as fast.
The observation that makes it work, is that regexp backreferences can be used in the match-regexp, and are enormously powerful. For instance,
$s !~ m/^(11+)\1+$/
is successful for $s
that are ones of length a prime number. Insane, I know. So of course, you can write repetitions the same way. And you can use disjunction and capture-groups to match … well, here’s the full generated regexp:
# D2d2.all_pat ;;
- : string =
"(?:(\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00)\\x00*)|(?:(.)\\g{-1}+)|(?:(..)\\g{-1}+)|(?:(...)\\g{-1}+)|(?:(....)\\g{-1}+)|(?:(.....)\\g{-1}+)|(?:(......)\\g{-1}+)|(?:(.......)\\g{-1}+)|(?:(........)\\g{-1}+)|(.)"
There are 10 disjuncts
#0: at least 8 zero-bytes
#1: a char, followed by at least one repetition of that char – remember the char in a capture-group
#2…#8: the same as #1, but for sequences 2, 3, 4, … 8 chars.
#9: a single char
Here’s a single disjunct: (?:(....)\\g{-1}+)
and what that means is that we (a) match four arbitrary chars, (b) memorize that in a capture-group, and ( c) match at least one (or more) instance of those same four chars. The \\g{-1}
references the immediately-preceding capture-group. And in the entire regexp, capture-groups are numbered left-to-right, starting with 1
.
When this regular expression is successfully matched, the first capture-group that is non-empty (notice that every arm has a capture-group) tells us which arm of the disjunction was matched, and that plus the length of the matched string tells us how many repetitions (if any). From there, generating the desired output is pretty straightforward.
The only hitch (right now) is that the program reads in the entire file in one go. But this is straightforward to fixup: for zeroes or other repetitions, reduce the number of repetitions by one, shift aside the matched data, read-and-append more data from the file, and re-apply the arm of the disjunction that was matched. All very straightforward to code up.
The code for this is in the same project, in file: https://github.com/chetmurthy/sandbox-public/blob/master/nbdkit/d2d2.ml
and the key bits of code start around the value all_re
.
Unsurprisingly, it’s a lot faster than the Perl script (but then, it ought to be, b/c all the heavy lifting is done by the C regexp runtime).
$ /usr/bin/time ./disk2data.pl disk
nbdkit data data="
@0x1b8 167 171 122 252 0 0 0 0 2 0 131 32 32 0 1 0 0 0 255 7
@0x1fe 85 170
" size=1048576
0.29user 0.00system 0:00.29elapsed 100%CPU (0avgtext+0avgdata 4628maxresident)k
0inputs+0outputs (0major+278minor)pagefaults 0swaps
$ /usr/bin/time ./d2d2 disk
@0x1b8 167 171 122 252 0*4 2 0 131 32*2 0 1 0*3 255 7 @0x1fe 85 170
@0x100000
0.00user 0.00system 0:00.00elapsed 100%CPU (0avgtext+0avgdata 5044maxresident)k
0inputs+0outputs (0major+649minor)pagefaults 0swaps