Full_split in pcre

I don’t know if this is a good place to ask this, but I’m having trouble understanding what Pcre.full_split is supposed to do, in particular what the Group and NoGroup elements in its return value mean.

The ocaml-pcre documentation doesn’t give me any help – it just says “Should behave exactly as in PERL” – and the pcre documentation doesn’t mention a split function at all. Perl doesn’t have a full_split, but as described in perlfunc, Perl’s split automatically includes the values of matching groups from the regex in its result. So, with the documented behavior of Str.full_split in mind, which returns the actual values of the delimiter strings with a tag on them, I figure Pcre.full_split is supposed to do something similar.

After some experimentation, it seems that Pcre.full_split returns a Text value for every text string in between delimiters and a Delim value for each delimiter itself. Each Delim is then followed, in order, by either a Group or a NoGroup value for each (...) subgroup in the regex. So, for instance,

# full_split ~pat:"([xy])([uv])" "abxucd";;
- : split_result list =
[Text "ab"; Delim "xu"; Group (1, "x"); Group (2, "u"); Text "cd"]

where the single delimiter xu becomes a Delim giving the full value of the delimiter string, plus Groups giving the values of the two subgroups in the regex.

So far so good. What I’m confused about is what happens when a particular subgroup doesn’t match (because it is part of an alternative not taken). In this case, Perl returns undef in the relevant location, and my guess was that NoGroup plays this role in Pcre. Sometimes this is right:

# full_split ~pat:"(x)|(u)" "abxcd";;
- : split_result list =
[Text "ab"; Delim "x"; Group (1, "x"); NoGroup; Text "cd"]

# full_split ~pat:"(x)|(u)" "abucd";;
- : split_result list =
[Text "ab"; Delim "u"; NoGroup; Group (2, "u"); Text "cd"]
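Going by the two outputs above, the Group and NoGroup values seem to follow their Delim in positional order, so the flat list can be folded into one entry per delimiter. The following is a minimal sketch under that assumption. It defines a local copy of the split_result constructors so that it runs without the pcre library; the piece type and the regroup function are hypothetical names for illustration, not part of Pcre.

```ocaml
(* Local mirror of the constructors seen in the toplevel output above,
   so this sketch does not need the pcre library. *)
type split_result =
  | Text of string
  | Delim of string
  | Group of int * string
  | NoGroup

(* Hypothetical regrouped form: each delimiter carries its subgroups,
   with None standing for a subgroup that did not take part in the match. *)
type piece =
  | Chunk of string
  | Sep of string * string option list

let regroup (items : split_result list) : piece list =
  (* Emit the delimiter collected so far, if any; its groups were
     accumulated in reverse order. *)
  let flush pending acc =
    match pending with
    | Some (d, gs) -> Sep (d, List.rev gs) :: acc
    | None -> acc
  in
  let rec go acc pending = function
    | [] -> List.rev (flush pending acc)
    | Text t :: rest -> go (Chunk t :: flush pending acc) None rest
    | Delim d :: rest -> go (flush pending acc) (Some (d, [])) rest
    | Group (_, s) :: rest ->
      (match pending with
       | Some (d, gs) -> go acc (Some (d, Some s :: gs)) rest
       | None -> go acc None rest (* group without a delimiter: ignored *))
    | NoGroup :: rest ->
      (match pending with
       | Some (d, gs) -> go acc (Some (d, None :: gs)) rest
       | None -> go acc None rest)
  in
  go [] None items
```

Applied to the second output above, regroup would give [Chunk "ab"; Sep ("u", [None; Some "u"]); Chunk "cd"], which is closer to what Perl reports with undef in the unmatched position.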

Oddly, NoGroup isn’t tagged by the number of the group the way Group is (though they always seem to appear in order, so in both cases the number is strictly speaking redundant). But more importantly, sometimes a non-matching group seems to be represented by a Group with the empty string rather than a NoGroup:

# full_split ~pat:"(x)|(u)" "abxcduef";;
- : split_result list =
[Text "ab"; Delim "x"; Group (1, "x"); NoGroup; Text "cd"; Delim "u";
 Group (1, ""); Group (2, "u"); Text "ef"]

Why does it give Group (1, "") instead of NoGroup? Is this a bug?
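To spot this pattern in a larger result, one option is to list every Group whose captured string is empty, together with the delimiter it belongs to. This is only a rough diagnostic sketch: the type is again a local copy of the constructors shown above, empty_groups is a hypothetical helper name, and an empty Group can also be legitimate (a subgroup such as (x*) may match the empty string), so the output is a list of candidates, not proof of a bug.

```ocaml
(* Local mirror of the constructors seen in the toplevel output above,
   so this sketch does not need the pcre library. *)
type split_result =
  | Text of string
  | Delim of string
  | Group of int * string
  | NoGroup

(* Return (delimiter, group number) for every Group whose captured
   string is empty. These are candidates for "should have been NoGroup". *)
let empty_groups (items : split_result list) : (string * int) list =
  let rec go current acc = function
    | [] -> List.rev acc
    | Delim d :: rest -> go (Some d) acc rest
    | Text _ :: rest -> go None acc rest
    | Group (n, "") :: rest ->
      (match current with
       | Some d -> go current ((d, n) :: acc) rest
       | None -> go current acc rest)
    | (Group _ | NoGroup) :: rest -> go current acc rest
  in
  go None [] items

(* The result reported above for "(x)|(u)" on "abxcduef". *)
let reported =
  [Text "ab"; Delim "x"; Group (1, "x"); NoGroup; Text "cd"; Delim "u";
   Group (1, ""); Group (2, "u"); Text "ef"]

let candidates = empty_groups reported
```

Here candidates is [("u", 1)]: group 1 of the second delimiter, which is the entry in question.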

When I was implementing pa_ppx_regexp I ran into these perplexing issues also. And what’s worse, pcre differs from re. Joy. And for sure, I think re does it better than pcre does.

So I implemented my own version of full_split[1]:

# [%split {|(x)|(u)|} / strings re_perl] "abxcd";;
- : [> `Delim of string * string option * string option | `Text of string ]
    list
= [`Text "ab"; `Delim ("x", Some "x", None); `Text "cd"]
# [%split {|(x)|(u)|} / strings pcre] "abxcd";;
- : [> `Delim of string * string option * string option | `Text of string ]
    list
= [`Text "ab"; `Delim ("x", Some "x", None); `Text "cd"]
# [%split {|(x)|(u)|} / strings re_perl] "abxcduef";;
- : [> `Delim of string * string option * string option | `Text of string ]
    list
=
[`Text "ab"; `Delim ("x", Some "x", None); `Text "cd";
 `Delim ("u", None, Some "u"); `Text "ef"]
# [%split {|(x)|(u)|} / strings pcre] "abxcduef";;
- : [> `Delim of string * string option * string option | `Text of string ]
    list
=
[`Text "ab"; `Delim ("x", Some "x", None); `Text "cd";
 `Delim ("u", None, Some "u"); `Text "ef"]

I don’t remember whether I dug into the pcre code to determine whether my work-alike was slower than the pcre version; at some point I should do that, I guess.

[1] https://github.com/camlp5/pa_ppx_regexp/blob/master/runtime/pa_ppx_regexp_runtime.ml


I have been debugging your problem, and I believe that there’s a bug in the wrapper code of pcre. I’ll confirm and send a PR over to the current maintainer.

FYI: I opened an issue, “A possible bug in full_split ?” (Issue #29 in mmottl/pcre-ocaml on GitHub). Please feel free to comment, etc.


Makes sense, thanks!

Sorry for the long delay; this issue should be fixed in the development version now. Please feel free to test it before I make a release. I will also try to port any fixes to the newer PCRE2 library.

Thanks! How do I install the development version correctly to test it? I tried cloning the git repository and running dune install, but then I get segfaults when trying to run Pcre matching functions. I think I have all the correct other libraries installed; it works if I instead install pcre from opam.

Two things:

(1) to install from source, you can git clone the repo, then (in the top of the source directory pcre-ocaml)

$ opam install .

(2) but beware: I found a core-dump with a fairly simple regexp, so maybe there’s a problem.

Please try again with the latest fix, and make sure that you don’t have an old version of pcre-ocaml installed on your system. Probably the easiest way is to execute opam pin . in the development directory before testing. Otherwise, byte-code compiled tests could mistakenly link dynamically against an older installed version of the C stubs, which can cause segfaults.

Works for me! Thank you!

It seems the newer PCRE2-OCaml is basically a complete rewrite, so there isn’t anything for me to do there. The old PCRE-OCaml will remain maintained for the time being, but new projects are encouraged to use more modern alternatives.

Markus pointed out to me that semgrep seems to be actively working on a PCRE2 ocaml wrapper. A bit over a year ago, @tobil4sk ported Markus’ pcre wrapper to pcre2, and I volunteered to maintain it (b/c nobody else seemed interested). But it seems that semgrep is actively working on their version, and so, if anybody from that organization sees this, and wants to take over pcre2-ocaml, I would be happy to help make that happen soonest.

I mean, I haven’t done any maintenance on it, b/c AFAICT nothing has broken but honestly, somebody who’s actively working on it would be better.