Regexp matches in the Str library

In the documentation of the Str module (OCaml library : Str), one finds

val matched_string : string -> string

matched_string s returns the substring of s that was matched by the last call
to ... provided ...

This looks like a “fragile” function, since it depends on previous calls to other functions and its behavior is unspecified: there is no way to know “the substring of s that was matched” without looking at the implementation, it seems?

One has string_match : regexp -> string -> int -> bool, so for instance string_match (regexp "a*") "aa" 0 is true, but which substring was matched? It could be "" or "a" or "aa", and it is unspecified which of these matched_string "aa" will return.
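
A quick experiment can settle what matched_string returns here. The sketch below assumes the str library is linked (for example with `ocamlfind ocamlopt -package str -linkpkg`); matched_after is a hypothetical helper, not part of Str. As far as can be told from the implementation, Str is a backtracking matcher: `*` is greedy, but an alternation takes the first branch that succeeds, so the result is not the longest possible match in every case.

```ocaml
(* Hypothetical helper: run string_match at position 0, then read the
   match back immediately, before any other Str call can overwrite it. *)
let matched_after (pattern : string) (s : string) : string option =
  if Str.string_match (Str.regexp pattern) s 0
  then Some (Str.matched_string s)
  else None

let () =
  (* The star is greedy: it consumes as many 'a' as it can. *)
  assert (matched_after "a*" "aa" = Some "aa");
  (* Alternation tries the left branch first, so this is not the longest match. *)
  assert (matched_after "a\\|ab" "ab" = Some "a");
  (* No match at position 0. *)
  assert (matched_after "b" "ab" = None)
```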

Unfortunately, the code (ocaml/str.ml at trunk · ocaml/ocaml · GitHub) has a lot of external declarations and uses of Domain.DLS, which I do not know where to look up, so I cannot understand it.

I would expect to have a function like

longest_match : ?pos:int -> regexp -> string -> string option

(with pos defaulting to 0). Is it possible to have such a function, which is “robust” in that it does not depend on previous function calls?

It looks like on my computer, string_match actually does a “longest match”, but maybe this depends on my architecture or something else? Provided that it is always a longest match (is it?), it looks like

(* The optional argument comes first so that it can be omitted by callers. *)
let longest_match ?(pos = 0) r s =
  if string_match r s pos then Some (matched_string s) else None

would work, because it ensures that matched_string is called immediately after string_match and with the same s? (provided also that if String.length s < pos, then string_match r s pos = false)
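
One caveat on that last proviso: as far as I know, Str.string_match raises Invalid_argument rather than returning false when pos lies outside 0 .. String.length s. A minimal sketch of a wrapper that maps that case to None follows; the name match_at is hypothetical, and the str library is assumed to be linked.

```ocaml
(* Hypothetical wrapper: an out-of-range [pos] yields None instead of an
   exception, and the match is read back right after string_match. *)
let match_at ?(pos = 0) (r : Str.regexp) (s : string) : string option =
  if pos < 0 || pos > String.length s then None
  else if Str.string_match r s pos then Some (Str.matched_string s)
  else None

let () =
  let r = Str.regexp "a*" in
  assert (match_at r "aa" = Some "aa");
  assert (match_at ~pos:1 r "aa" = Some "a");
  (* pos = String.length s is still a valid position: the empty match. *)
  assert (match_at ~pos:2 r "aa" = Some "");
  (* Out of range: None, instead of calling Str.string_match at all. *)
  assert (match_at ~pos:3 r "aa" = None)
```

Note that "a*" can match the empty string, so Some "" and None are different answers: the first is a successful empty match, the second is no match at all.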

Or maybe my wish to have longest_match instead of matched_string reflects bad coding style?

Bonus question: does ocamllex use Str, or does it do its own regexp work?

The str library is very old and has a number of shortcomings, such as an effectful API that depends on global state. It is generally not recommended for new code.

Have you looked at ocaml-re (GitHub - ocaml/ocaml-re: Pure OCaml regular expressions, with support for Perl and POSIX-style strings) instead?
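
For comparison, here is a minimal sketch of the same anchored match with ocaml-re, assuming the re package is installed (for example via opam) and linked. The result is an ordinary value returned by Re.exec_opt, so nothing depends on an earlier call. The name longest_match is borrowed from the question; the body is only an illustration.

```ocaml
(* Sketch using ocaml-re (package "re"). [Re.start] anchors the match at
   [pos], and [Re.longest] asks for longest-match semantics. *)
let longest_match ?(pos = 0) (r : Re.t) (s : string) : string option =
  if pos < 0 || pos > String.length s then None
  else
    (* Compiling on every call keeps the sketch short; real code would
       compile once and reuse the result. *)
    let compiled = Re.compile (Re.seq [ Re.start; Re.longest r ]) in
    match Re.exec_opt ~pos compiled s with
    | Some groups -> Some (Re.Group.get groups 0)
    | None -> None

let () =
  let a_star = Re.rep (Re.char 'a') in
  let a_or_ab = Re.alt [ Re.str "a"; Re.str "ab" ] in
  assert (longest_match a_star "aa" = Some "aa");
  (* With Re.longest, the alternation picks the longer branch. *)
  assert (longest_match a_or_ab "ab" = Some "ab");
  assert (longest_match ~pos:1 a_star "ba" = Some "a")
```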

Cheers,
Nicolas

ocamllex does not use str; it has its own implementation of regular expressions.

Cheers,
Nicolas

To my knowledge, it’s always been known and well-documented that it’s got these shortcomings. From the earliest time I was aware of Str, I remember warnings about multi-thread-un-safety.

My own “instant” instinct is to reach for pcre, but maybe I should look into ocaml-re.

Then tell me where a newcomer should search for this kind of information/documentation. I have already been told to read the manual (in more or less polite ways; not here, which is more welcoming), but when I read the manual page (OCaml - The str library: regular expressions and string processing) and the API page (OCaml library : Str), I don’t see any mention of it.

Heh indeed, now that I search, I can’t find any such warning. And yet, I know that I’ve seen them many times, b/c that’s why I was led to seek out pcre, lo these many years.

But OTOH, for a newcomer, these issues are irrelevant, right? Multi-threaded programming is inherently unsafe, and newcomers who aren’t sufficiently well-versed in ferreting out the pitfalls shouldn’t be doing it, I would think. [Full disclosure: I’ve been making this argument about Java programming for decades, and have the (ahem) receipts to back it up, so this isn’t just about OCaml.] And for single-threaded programming, for the simple use-cases, Str is fine. Sure, the API is a bit unsafe, but for getting started, it’s fine. Heck, I use it too, when I just need to write some throwaway regexp code.

To your initial question–yes, as you noticed, the Str module does in fact work in an imperative style with global state, which is not really very functional-style. But, as you also noticed, it can also be wrapped in a safer, more functional-style function. Personally I find the Str module quite handy and quite safe to use, provided I take some care with it.
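
To make the global-state pitfall concrete, here is a small sketch (the function stale_match is hypothetical, and the str library is assumed to be linked). matched_string only applies the positions recorded by the most recent match to whatever string it is given, so an unrelated match in between silently changes the answer.

```ocaml
(* Hypothetical demonstration of stale match state in Str. *)
let stale_match () : string =
  let hel = Str.regexp "hel" and wo = Str.regexp "wo" in
  (* First match: records positions 0..3 for "hello". *)
  ignore (Str.string_match hel "hello" 0);
  (* Unrelated second match: overwrites them with positions 0..2. *)
  ignore (Str.string_match wo "world" 0);
  (* Reads positions 0..2 out of "hello", not the original match. *)
  Str.matched_string "hello"

let () = assert (stale_match () = "he")
```

A wrapper that calls matched_string right after string_match, on the same string, avoids this within a single thread of control.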

Regarding your specific longest_match function, if that is what you need for your code, I say there’s absolutely nothing wrong with it. That’s the nice thing about coding–you can build safe, immutable abstractions on top of ‘unsafe’, mutable ones.

I doubt a warning is really necessary, since it is obvious from the documentation on matched_string and cognate functions, and from the signatures of the various search functions, that hidden state is kept, and that is probably what you were thinking about. I imagine this goes back to the mists of time, when thinking about state in OCaml was less developed. By today’s standards I would also argue that the standard library’s use of exceptions is excessive. End-of-file is not an exceptional condition, for example.

For what it is worth, in OCaml 5 the Str module is domain-safe, as each domain keeps its own match state. It remains unsafe with the Thread module, but as I understand it, in OCaml 5 you are encouraged to use effects instead, and with effects you have control of when your code is to yield to the event loop.

For sure, somebody reading the API docs would (as @user1 did) conclude that there’s something smelly in there. But I really do remember (though, gray hairs, not too many left, maybe it’s a false memory) such a warning. Ah, well.

FWIW, back in the bad old days (maybe still today) Java’s Calendar object had a thread-unsafe method that wasn’t obviously unsafe – the implementors just did a piss-poor job, is all. Boy howdy that blew up some commercial web-apps.