Numerical ranges and pattern matching


#1

What always surprised me, that OCaml allows to put range of characters like ‘0’…‘9’ but not a range of numbers.
Why is that and maybe it is a time to add this feature in the compiler? Seems very inconvenient and inconsistent.


#2

This is a natural idea, but so far nobody has produced a patch to implement the feature, because it is harder to implement than one may expect. If I remember correctly, character ranges '0'..'4' are exploded into or-patterns '0'|'1'|'2'|'3'|'4', and manipulated in that non-compact form in the pattern-matching compiler (which later optimizes continuous ranges to compact/efficient code, so it doesn’t show in the compiled output). If you tried to extend this to integer ranges, you would get compile-time blowups for any large range. Instead you have to change the structure of the pattern-matching compiler to support a first-class notion of range, but that is an ambitious refactoring to a part of the compiler codebase that is unpleasant to work with. Nobody has done it for now.


#3

So it is more a matter of it requiring difficult work in the compiler than that there’s a reason in principle that it would be a bad idea?


#4

Though numbers could be desugared into | x when x > = 0 && x <= 9, no?


#5

Presumably that’s one method, though there’s a wide variety of tricks in the compiler literature for handling large switch or case style statements in languages with such constructs. Depending on the details of the particular multiway conditional, any one of several implementation techniques might be optimal. Sufficiently careful compilers figure out which one to use based on the specific instance.


#6

Remember that the compiler performs an exhaustivity check of pattern matchings. Introducing integer ranges as patterns with when clauses isn’t satisfactory because they render the detection of missing cases non decidable.


#7

Though numbers could be desugared into | x when x > = 0 && x <= 9 , no?

This desugaring is not correct because when can only exist at the top-level of the whole clause, whereas interval patterns can be used in-depth. For example, consider the or-pattern

(Some (0..9, true) | Some (100..109, false))

#8

Maybe you could desugar this (if you don’t want to modify the actual compiler) into multiple octets and preserve exhausting checking. Not that it would be particularly beautiful or efficient, but numerical range matching is such as a common problem that everyone just implements it with when guards today, which is not very pretty either.

ie something like (with a 31bit int type on x86):

match 123 with
| 0 .. 122 -> 1
| 10 -> .
| 123 .. 4096 -> 2

->

 (* shifted to account for GC bit *)
match '\123',       '\000',      '\000', '\000' with
|  '\000'..'\122',  '\000',      '\000', '\000' -> 1
|     '\010',       '\000',      '\000', '\000' -> .
| (   '\000',    '\123'..'\255', '\000', '\000'
  |   '\000',       '\016',      '\000', '\000'
  |      _  ,    '\000'..'\015', '\000', '\000') -> 2

Incidentally this invokes the exhaustiveness checker which trips somewhat somewhat as this takes 20 seconds to evaluate on my machine after outputting 1648 lines of missing cases, but perhaps that should be fixed upstream anyway to make it abort earlier:

$ ocaml a.ml 
File "./a.ml", line 1, characters 11-283:
Warning 8: this pattern-matching is not exhaustive.
Here is an example of a case that is not matched:
(('\000', '\001', '\000', 'a')|('\000', '\001', 'a', _)|
('\000', '\002', '\000', 'a')|('\000', '\002', 'a', _)|
(* ... another 1646 examples ... *)

If you add a | _ -> 4 to my example it works as intended.

EDIT: Fixed off-by-one, added example:

let mymatch ch ch2 ch3 ch4 = match ch, ch2, ch3, ch4 with
| '\000' .. '\122', '\000', '\000', '\000' -> 1
| '\010',           '\000', '\000', '\000' -> .
| ( '\000', '\123'..'\255', '\000', '\000'
  | '\000', '\016',         '\000', '\000'
  |  _    , '\000'..'\015', '\000', '\000') -> 2
| _ -> 3

let pack_int =
  let int_bytes = (Sys.int_size + 1) / 8 in
  let buf = Bytes.init int_bytes (fun _ -> '\000') in
  let rec loop n acc =
   if n = ~-1 then buf else begin
   Bytes.set buf (7-n) (char_of_int (acc land 0xff)) ;
   loop (n-1) (acc lsr 8) end in
  loop (int_bytes-1)

let pack_int31 x =
  if Sys.int_size <> 63 then pack_int x
  else Bytes.sub (pack_int x) 0 4

let print x =
  let b = (pack_int31 x) in
  print_int Bytes.(mymatch (get b 0) (get b 1) (get b 2) (get b 3)) ;
  print_newline ()

let () = print 0x3231

let () = print 10
let () = print 122
let () = print 123
let () = print 125

this compiles to:

0000000000018d10 <camlA__mymatch_1002>:
   18d10:       48 83 f8 01             cmp    $0x1,%rax
   18d14:       74 0a                   je     18d20 <camlA__mymatch_1002+0x10>
   18d16:       48 3d f7 00 00 00       cmp    $0xf7,%rax
   18d1c:       7d 4a                   jge    18d68 <camlA__mymatch_1002+0x58>
   18d1e:       eb 2c                   jmp    18d4c <camlA__mymatch_1002+0x3c>
   18d20:       48 83 fb 21             cmp    $0x21,%rbx
   18d24:       74 1a                   je     18d40 <camlA__mymatch_1002+0x30>
   18d26:       48 81 fb f7 00 00 00    cmp    $0xf7,%rbx
   18d2d:       7c 1d                   jl     18d4c <camlA__mymatch_1002+0x3c>
   18d2f:       48 83 ff 01             cmp    $0x1,%rdi
   18d33:       75 17                   jne    18d4c <camlA__mymatch_1002+0x3c>
   18d35:       48 83 fe 01             cmp    $0x1,%rsi
   18d39:       74 49                   je     18d84 <camlA__mymatch_1002+0x74>
   18d3b:       eb 0f                   jmp    18d4c <camlA__mymatch_1002+0x3c>
   18d3d:       0f 1f 00                nopl   (%rax)
   18d40:       48 83 ff 01             cmp    $0x1,%rdi
   18d44:       75 06                   jne    18d4c <camlA__mymatch_1002+0x3c>
   18d46:       48 83 fe 01             cmp    $0x1,%rsi
   18d4a:       74 38                   je     18d84 <camlA__mymatch_1002+0x74>
   18d4c:       48 83 fb 01             cmp    $0x1,%rbx
   18d50:       75 16                   jne    18d68 <camlA__mymatch_1002+0x58>
   18d52:       48 83 ff 01             cmp    $0x1,%rdi
   18d56:       75 10                   jne    18d68 <camlA__mymatch_1002+0x58>
   18d58:       48 83 fe 01             cmp    $0x1,%rsi
   18d5c:       75 0a                   jne    18d68 <camlA__mymatch_1002+0x58>
   18d5e:       48 c7 c0 03 00 00 00    mov    $0x3,%rax
   18d65:       c3                      retq   
   18d66:       66 90                   xchg   %ax,%ax
   18d68:       48 83 fb 21             cmp    $0x21,%rbx
   18d6c:       7d 0e                   jge    18d7c <camlA__mymatch_1002+0x6c>
   18d6e:       48 83 ff 01             cmp    $0x1,%rdi
   18d72:       75 08                   jne    18d7c <camlA__mymatch_1002+0x6c>
   18d74:       48 83 fe 01             cmp    $0x1,%rsi
   18d78:       75 02                   jne    18d7c <camlA__mymatch_1002+0x6c>
   18d7a:       eb 08                   jmp    18d84 <camlA__mymatch_1002+0x74>
   18d7c:       48 c7 c0 07 00 00 00    mov    $0x7,%rax
   18d83:       c3                      retq   
   18d84:       48 c7 c0 05 00 00 00    mov    $0x5,%rax
   18d8b:       c3                      retq   
   18d8c:       0f 1f 40 00             nopl   0x0(%rax)

#9

Just for the record - opened bug in the OCaml repo: https://github.com/ocaml/ocaml/issues/8504