Constant string pattern matching

Let’s say I write something like this:

match lexeme with
| "foo00" -> Foo00
| "foo01" -> Foo01
  ...
| "foo50" -> Foo50
| _ -> FooError

(My real keywords are not uniform like this.)
Does this compile to the obvious straight linear search, or is there any cleverness (like a constant hidden hashtable perhaps)? Or should I have a hashtable myself?

2 Likes

You can play around on godbolt: Compiler Explorer

Looks like a binary search, so pretty clever.

1 Like

@gasche has a comprehensive comment on how OCaml compiles string pattern matching on r/ProgrammingLanguages.

2 Likes

Indeed! Thanks for the mention. Reposting below.


The strategy used by the OCaml compiler is not naive, but not hashing either. It will compile a switch on strings (with a default case) into a tree of switches on characters (in fact words, see below), by reading characters at specific positions in the string. It was implemented by Benoît Vaugon and Luc Maranget.

The position to read is selected as the least discriminating position: we count the number of different values (among the string patterns) for each position, and choose the (leftmost) position with the smallest number of possible values. The idea is that these positions must be tested anyway to eliminate the default case, so you may as well check them first.

For example if you are matching on “aa” vs “ab” (or something else, the default case), and you know that your input has size 2, you may generate either:

switch s[0]:
  case 'a':
    switch s[1]:
      case 'a': goto <case "aa">
      case 'b': goto <case "ab">
      default: goto <default case>
  default: goto <default case>

or

switch s[1]:
  case 'a':
    switch s[0]:
      case 'a': goto <case "aa">
      default: goto <default case>
  case 'b':
    switch s[0]:
      case 'a': goto <case "ab">
      default: goto <default case>
  default: goto <default case>

and the first approach is more interesting, it performs the same tests for each non-default value with shorter code.

This is counter-intuitive: check the least discriminating position first. It is more intuitive at first to check the most discriminating position.

Regarding size: you start by switching on the size of the string, then you are left with groups of the same size, where the same positions are valid.

Regarding characters: instead of looking up strings characters by characters, you can read whole machine word at once, except at the very end of the string. In the case of {"one", "two", "three"}, with the 0-ended representation of C, a single 32bit read is enough to distinguish the three values, so you need a single position switch (after switching on the size first). (In fact the OCaml value represeentation guarantees that strings always contain a full amount of machine words, thanks to some padding after the end of the user-provided part.)

7 Likes

The compiler explorer shows assembly output, but often other intermdiate representations are more readable (to me at least). The example of @copy is

let test x =
    match x with
    | "barfoofoo00" -> 1
    | "barfoofoo01" -> 2
    | "barfoofoo02" -> 3
    | "barfoofoo03" -> 4
    | "barfoofoo04" -> 5
    | "barfoofoo05" -> 6
    | "barfoofoo06" -> 7
    | "barfoofoo07" -> 8
    | "barfoofoo08" -> 9
    | "barfoofoo09" -> 10
    | "barfoofoo10" -> 11
    | "barfoofoo11" -> 12
    | "barfoofoo12" -> 13
    | "barfoofoo13" -> 14
    | "barfoofoo14" -> 15
    | "barfoofoo15" -> 16
    | "barfoofoo16" -> 17
    | "barfoofoo17" -> 18
    | "barfoofoo18" -> 19
    | "barfoofoo19" -> 20
    | _ -> -42

and the -dcmm output is as follows: (remember that integers are tagged in this representation, so for example 39, 2*19+1, is the encoding of 19 in the source.)

(function camlTest__test_267 (x/269: val)
 (catch
   (let size/274 (>>u (load_mut int (+a x/269 -8)) 10)
     (if (!= size/274 2) (exit 1)
       (let cell/272 (load_mut int (+a x/269 0))
         (if (== cell/272 8027225910085312866)
           (let cell/273 (load_mut int (+a x/269 8))
             (if (< cell/273 288230376155197551)
               (if (< cell/273 288230376155001199)
                 (if (< cell/273 288230376154935407)
                   (if (== cell/273 288230376154869871) 3
                     (if (== cell/273 288230376154870127) 23 (exit 1)))
                   (if (== cell/273 288230376154935407) 5
                     (if (== cell/273 288230376154935663) 25
                       (if (== cell/273 288230376155000943) 7 (exit 1)))))
                 (if (< cell/273 288230376155066735)
                   (if (== cell/273 288230376155001199) 27
                     (if (== cell/273 288230376155066479) 9 (exit 1)))
                   (if (== cell/273 288230376155066735) 29
                     (if (== cell/273 288230376155132015) 11
                       (if (== cell/273 288230376155132271) 31 (exit 1))))))
               (if (< cell/273 288230376155328879)
                 (if (< cell/273 288230376155263087)
                   (if (== cell/273 288230376155197551) 13
                     (if (== cell/273 288230376155197807) 33 (exit 1)))
                   (if (== cell/273 288230376155263087) 15
                     (if (== cell/273 288230376155263343) 35
                       (if (== cell/273 288230376155328623) 17 (exit 1)))))
                 (if (< cell/273 288230376155394415)
                   (if (== cell/273 288230376155328879) 37
                     (if (== cell/273 288230376155394159) 19 (exit 1)))
                   (if (== cell/273 288230376155394415) 39
                     (if (== cell/273 288230376155459695) 21
                       (if (== cell/273 288230376155459951) 41 (exit 1))))))))
           (exit 1)))))
 with(1) -83))
2 Likes

Out of personal curiosity are you also nobrowser?

If so then you might find this of interest:
Does TerminusDB rely on a Bloom filter of similar deep down under?

Thanks all, this is awesome. I’ll know I can mostly rely on the pattern match from now on.

@EricGT yes, that is me, of course.