I’m happy to announce the release of Regenerate, a tool to generate test cases for regular expression engines. You can read the details in the link below, or play around with the online demo.
The blog post focuses on the user side of the story, so here is the developer side, mostly of interest to OCaml developers.
Modularity/Implementation, or how to explore performance trade-offs in OCaml
The core of the algorithm does some fancy interleaving and unusual walks over streams of collections of words. Right away, one big question arises: which data structures should we choose?
The collections mostly need set operations such as union, intersection, and difference; some list-like operations such as append; and some fancier things such as n-way merges and Cartesian products. Furthermore, laziness and persistence play a big role as well.
Preliminary groundwork
Instead of trying to do some guesswork, I decided to specify formally all the functions that I needed to implement the algorithms. I ended up with this signature:
module type S = sig
type elt
type t
val empty : t
val is_empty : t -> bool
val return : elt -> t
val append : t -> t -> t
val union : t -> t -> t
val inter : t -> t -> t
val diff : t -> t -> t
val merge : t list -> t
val of_list : elt list -> t
val to_seq : t -> elt Sequence.t
val memoize : t -> t
end
This has several advantages: it made the requirements for the data structures very clear, it separated the data structure from the algorithm, and it reflected the theory quite closely, since it represents exactly the operations of the algebra we were working with.
Note the presence of an explicit memoize function, which allowed me to also consider transient data structures.
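To make the signature concrete, here is a toy instance over strictly sorted lists. This is only a sketch, not Regenerate's actual implementation: append (which concatenates languages and needs word operations), to_seq, and memoize are omitted, and memoize would be the identity anyway since lists are fully evaluated.

```ocaml
(* Toy instance of S: collections as strictly sorted lists.
   Set operations become linear merges over the sorted representation. *)
module Sorted_list (Elt : sig type t val compare : t -> t -> int end) = struct
  type elt = Elt.t
  type t = elt list  (* invariant: strictly increasing *)

  let empty = []
  let is_empty = function [] -> true | _ -> false
  let return x = [x]

  let rec union a b = match a, b with
    | [], l | l, [] -> l
    | x :: a', y :: b' ->
      let c = Elt.compare x y in
      if c < 0 then x :: union a' b
      else if c > 0 then y :: union a b'
      else x :: union a' b'

  let rec inter a b = match a, b with
    | [], _ | _, [] -> []
    | x :: a', y :: b' ->
      let c = Elt.compare x y in
      if c < 0 then inter a' b
      else if c > 0 then inter a b'
      else x :: inter a' b'

  let rec diff a b = match a, b with
    | l, [] -> l
    | [], _ -> []
    | x :: a', y :: b' ->
      let c = Elt.compare x y in
      if c < 0 then x :: diff a' b
      else if c > 0 then diff a b'
      else diff a' b'

  (* A real implementation would use a proper n-way merge here. *)
  let merge ls = List.fold_left union [] ls
  let of_list l = List.sort_uniq Elt.compare l
end
```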
Functorize all the things
Once we have a proper idea of the operations we need, it's very easy to functorize everything:
module Make
(Word : Word.S)
(Segment : Segments.S with type elt = Word.t)
(Sigma : SIGMA with type t = Segment.t)
: sig .... end
In this case, Segment is the collection of words. I also functorized the definition of words (so I can test with strings, lists of characters, ropes, UTF-8 strings, …) and the definition of the alphabet SIGMA.
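The word part of this setup can be sketched as follows. The module names here are illustrative, not Regenerate's actual ones; the point is that the algorithm is written once against an abstract signature and instantiated with different word representations.

```ocaml
(* What a word must provide to the algorithm. *)
module type WORD = sig
  type t
  val empty : t
  val append : t -> t -> t
end

(* Two interchangeable representations. *)
module String_word = struct
  type t = string
  let empty = ""
  let append = ( ^ )
end

module Char_list_word = struct
  type t = char list
  let empty = []
  let append = ( @ )
end

(* The "algorithm" only ever goes through WORD's operations. *)
module Make_concat (W : WORD) = struct
  let concat_all = List.fold_left W.append W.empty
end

module On_strings = Make_concat (String_word)
module On_char_lists = Make_concat (Char_list_word)
```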
This is the part where you do :science: and implement weird algorithms.
Testing data-structures
Now that we have a functorized implementation of our algorithm, we can easily implement lots of data structures that satisfy the signatures. I only wrote a few implementations of words, but I tested lots of collections. Containers was very useful for this.
Optimizations
You now need to appeal to the flambda gods by adding [@@inline always] on all your functors and passing -O3 to the compiler. You might also want to assign the results of functor applications at the toplevel. With all this, functors should be fully expanded at compile time and should not prevent any optimizations. If you want to tweak things further, consult this guide.
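The shape of that setup looks roughly like this (a sketch with made-up module names, not Regenerate's code):

```ocaml
(* [@@inline always] marks the functor for expansion; with flambda and
   -O3 the body can then be specialized for the argument. *)
module Make_ops (Ord : Set.OrderedType) = struct
  module S = Set.Make (Ord)
  let of_list l = List.fold_left (fun acc x -> S.add x acc) S.empty l
end [@@inline always]

(* Apply the functor at the toplevel, not inside a function, so the
   application itself can be resolved at compile time. *)
module Int_ops = Make_ops (Int)
```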
Bench
Now that everything is set up, you can bench! In my case, I was interested in the rate of words per second emitted, depending on the regular expression considered. The results regarding collections were quite interesting:
If you want more details, look here.
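For reference, a minimal way to measure such a throughput looks like this. This is only a sketch, not the harness used for the real benchmarks; next_word is a hypothetical stand-in for pulling the next word out of the generated language.

```ocaml
(* Run the generator for a fixed slice of CPU time and count the
   words it emits. *)
let words_per_second ?(duration = 1.0) next_word =
  let start = Sys.time () in
  let count = ref 0 in
  while Sys.time () -. start < duration do
    ignore (next_word ());
    incr count
  done;
  float_of_int !count /. duration
```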
Making the API usable
You wrote all your functors and your implementation, you benchmarked it, and you are happy. Unfortunately, for some reason, users tend to consider functor-heavy APIs impossible to use (cough ocamlgraph cough irmin). There are two solutions to this problem: pick a specific instance that works for most people (for example, the most efficient one), or hide all the functors away by making things first-class.
In the case of Regenerate, I settled on a middle ground using first-class modules. The API is still not exactly simple, but I suspect the examples should be enough.
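The general idea of that middle ground can be sketched like this. The names are illustrative, not Regenerate's actual API: the module machinery stays on the inside, and the user passes a packed module around as an ordinary value.

```ocaml
(* A trimmed-down collection signature. *)
module type SEGMENT = sig
  type t
  val of_list : string list -> t
  val union : t -> t -> t
  val to_list : t -> string list
end

(* One possible instance, packed as a first-class module. *)
let sorted_list_impl : (module SEGMENT) =
  (module struct
    type t = string list
    let of_list l = List.sort_uniq compare l
    let union a b = List.sort_uniq compare (a @ b)
    let to_list l = l
  end)

(* The API entry point takes the implementation as a plain argument,
   so no functor application is visible to the user. *)
let union_of_lists (module S : SEGMENT) l1 l2 =
  S.to_list (S.union (S.of_list l1) (S.of_list l2))
```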
Website
This doesn’t get said enough, but js_of_ocaml is bloody fantastic. The code for the website is dead simple and has the remarkable property of being completely uninteresting. I dumped containers, sequence, oseq, fmt, qcheck and all my functors into it; it worked without any modification to the libraries, and the resulting size is just 33KB. End of story.
Thanks for this, Drup! Besides the obvious use cases, there’s an extra area where this tool would be useful: aiding password/passphrase recovery. In the situation where the user sort-of-knows their password but is unsure about a substring, this tool will come in handy to generate a list of all possible passwords that match a given regular expression.