Indeed! Thanks for the mention. Reposting below.
The strategy used by the OCaml compiler is not naive, but not hashing either. It will compile a switch on strings (with a default case) into a tree of switches on characters (in fact words, see below), by reading characters at specific positions in the string. It was implemented by Benoît Vaugon and Luc Maranget.
The position to read is selected as the least discriminating position: we count the number of different values (among the string patterns) for each position, and choose the (leftmost) position with the smallest number of possible values. The idea is that these positions must be tested anyway to eliminate the default case, so you may as well check them first.
For example if you are matching on “aa” vs “ab” (or something else, the default case), and you know that your input has size 2, you may generate either:
switch s[0]:
case 'a':
switch s[1]:
case 'a': goto <case "aa">
case 'b': goto <case "ab">
default: goto <default case>
default: goto <default case>
or
switch s[1]:
case 'a':
switch s[0]:
case 'a': goto <case "aa">
default: goto <default case>
case 'b':
switch s[0]:
case 'a': goto <case "ab">
default: goto <default case>
default: goto <default case>
and the first approach is more interesting, it performs the same tests for each non-default value with shorter code.
This is counter-intuitive: check the least discriminating position first. It is more intuitive at first to check the most discriminating position.
Regarding size: you start by switching on the size of the string, then you are left with groups of the same size, where the same positions are valid.
Regarding characters: instead of looking up strings characters by characters, you can read whole machine word at once, except at the very end of the string. In the case of {"one", "two", "three"}
, with the 0-ended representation of C, a single 32bit read is enough to distinguish the three values, so you need a single position switch (after switching on the size first). (In fact the OCaml value represeentation guarantees that strings always contain a full amount of machine words, thanks to some padding after the end of the user-provided part.)