Why there is no easy pattern matching for strings?

Hi, I have had this confusion for a while. What I really want is some thing like this:

let rec reverse x= 
  match x with
  |"" -> ""
  |e ^^ s -> (reverse s) ^ e;; 
  (** ^^ is the operator I propose to use for string constructor *)

But it seems this is never implemented. Is there a particular reason? Is this a deliberate design choice or we just don’t think it is not needed?

I am asking because sometimes I found writing a recursive algorithm on string is useful but I have to use an accumulator/index to simulate it which is a bit clumsy.

2 Likes

In short, defining string as a list of char’s is both very inefficient in term of memory and tends to lead to erroneous API when applied to text.

Indeed a text is not only a sequence of chars: consider for instance à la carte. Neither à nor the fine unbreakable space " " can be easily and uniquely mapped to 8 bits encoded characters. And for instance using an utf-8 encoding for this text and applying your reverse function will not give the expected text.

Since writing a API that maps well to the human interpretation of text is hard, a compromise is to use space efficient memory representation for strings in the standard library and let exterior libraries handle the problem of human interpreted texts.

1 Like

A while ago (2004) I wrote mikmatch, which was a syntax extension that would let you match strings against regexps and capture substrings. It was implemented with camlp4, which turned out to be problematic.

In my experience it’s very useful when routinely dealing with a variety of poorly-designed data formats. However, for software engineering in general, it’s rarely useful because data come in as json and there’s no need for further parsing.

Since moving away from bioinformatics in 2007 I only used it in one case that was really worth it and it was for parsing street addresses.

I know a few people are still using mikmatch happily but it doesn’t seem like a good choice unless you really know what you’re doing.

2 Likes

Maybe check out bitstring https://github.com/xguerin/bitstring

I used mikmatch some years ago for log parsing. In my own judgement, it was just what was needed to place OCaml ahead of Python and almost on par with Perl in quick-and-dirty text processing.

3 Likes

Now, this code is not well tested, but I created https://github.com/paurkedal/ppx_regexp. There is no release yet, so you’d have to pin it (opam pin add ppx_regexp git@github.com:paurkedal/ppx_regexp.git) or build manually.

In the original example, discussions about Unicode aside, there is an essential issue, namely that the cut-point between e and s is undetermined. Regular expressions offers a solution, as they can determine not only whether a pattern matches, but the extents of marked subpatterns. The example can now be written

let rec reverse x =
  match%pcre x with
  | "^$" -> ""
  | "(?<e>.)(?<s>.*)" -> reverse s e
1 Like

On the subject of mikmatch, tyre can pretty much act as a modern typed version of mikmatch, without using any syntax extensions. In particular, it does efficient routing and has very nice abstraction feature and other goodies:

let re = Tyre.route [
  str"foo" *> int   --> fun i -> print_int i;
  str"bar" *> float --> fun f -> print_float f;
]

let () = Tyre.exec re "foo4"

Since it’s combinator-based, it’s a bit less terse, but you gain in readibility. I never added a ppx extension for it, but I guess it could be useful (using conventions like in format, for example).

(Note that a reverse function should never be written like that anyway. It’s wrong both in term of efficiency and semantics.)

4 Likes