Full_split in pcre

I don’t know if this is a good place to ask this, but I’m having trouble understanding what Pcre.full_split is supposed to do, in particular what the Group and NoGroup elements in its return value mean.

The ocaml-pcre documentation doesn’t help: it just says “Should behave exactly as in PERL”, and the pcre documentation doesn’t mention a split function at all. Perl doesn’t have a full_split, but as described in perlfunc, its split automatically includes the values of any capturing groups from the regex in the returned list. So, with the documented behavior of Str.full_split in mind, which returns the actual delimiter strings with a tag on them, I figure Pcre.full_split is supposed to combine the two: return the tagged delimiters together with the values of the capturing groups.

After some experimentation, it seems that Pcre.full_split returns a Text value for every string between delimiters and a Delim value for each delimiter itself. Each Delim is followed by one Group or NoGroup value for each (...) subgroup in the regex, in order. So, for instance,

# full_split ~pat:"([xy])([uv])" "abxucd";;
- : split_result list =
[Text "ab"; Delim "xu"; Group (1, "x"); Group (2, "u"); Text "cd"]

where the single delimiter xu becomes a Delim giving the full value of the delimiter string, plus Groups giving the values of the two subgroups in the regex.
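
To make that shape concrete, here is a minimal sketch that regroups the flat list so each delimiter carries its own captures. The split_result type below is a local copy of the constructors seen in the output above, so the snippet runs without the library; the piece type and the regroup function are hypothetical names of my own, not part of Pcre.

```ocaml
(* Local copy of the constructors seen in the toplevel output above;
   with the real library you would use Pcre.split_result instead. *)
type split_result =
  | Text of string
  | Delim of string
  | Group of int * string
  | NoGroup

(* Hypothetical structured form: each delimiter carries its captures. *)
type piece =
  | Plain of string
  | Sep of string * string option list

(* Attach the Group/NoGroup items that directly follow a Delim to it. *)
let regroup (items : split_result list) : piece list =
  let rec captures acc = function
    | Group (_, s) :: rest -> captures (Some s :: acc) rest
    | NoGroup :: rest -> captures (None :: acc) rest
    | rest -> (List.rev acc, rest)
  in
  let rec go acc = function
    | [] -> List.rev acc
    | Text s :: rest -> go (Plain s :: acc) rest
    | Delim d :: rest ->
      let caps, rest = captures [] rest in
      go (Sep (d, caps) :: acc) rest
    (* A capture with no preceding Delim is not expected; drop it. *)
    | (Group _ | NoGroup) :: rest -> go acc rest
  in
  go [] items
```

Applied to the result above, this should give [Plain "ab"; Sep ("xu", [Some "x"; Some "u"]); Plain "cd"]. It relies on the captures always appearing in group order right after their Delim, which is what I have observed but not something the documentation promises.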

So far so good. What I’m confused about is what happens when a particular subgroup doesn’t match (because it is part of an alternative not taken). In this case, Perl returns undef in the relevant location, and my guess was that NoGroup plays this role in Pcre. Sometimes this is right:

# full_split ~pat:"(x)|(u)" "abxcd";;
- : split_result list =
[Text "ab"; Delim "x"; Group (1, "x"); NoGroup; Text "cd"]

# full_split ~pat:"(x)|(u)" "abucd";;
- : split_result list =
[Text "ab"; Delim "u"; NoGroup; Group (2, "u"); Text "cd"]

Oddly, NoGroup isn’t tagged by the number of the group the way Group is (though they always seem to appear in order, so in both cases the number is strictly speaking redundant). But more importantly, sometimes a non-matching group seems to be represented by a Group with the empty string rather than a NoGroup:

# full_split ~pat:"(x)|(u)" "abxcduef";;
- : split_result list =
[Text "ab"; Delim "x"; Group (1, "x"); NoGroup; Text "cd"; Delim "u";
 Group (1, ""); Group (2, "u"); Text "ef"]

Why does it give Group (1, "") instead of NoGroup? Is this a bug?
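
In the meantime, a possible stopgap is to rewrite empty Group values as NoGroup after the fact. This is only a sketch under a strong assumption: that no group in the pattern can legitimately match the empty string, which holds for (x)|(u) but not in general. The type is again a local copy of the constructors shown above, and drop_empty_groups is a hypothetical helper name, not a Pcre function.

```ocaml
(* Local copy of the constructors seen in the toplevel output above. *)
type split_result =
  | Text of string
  | Delim of string
  | Group of int * string
  | NoGroup

(* Heuristic: treat a Group with an empty string as an unmatched group.
   Assumption: no group in the pattern can really match "". If one can,
   this erases genuine information and should not be used. *)
let drop_empty_groups (items : split_result list) : split_result list =
  List.map (function Group (_, "") -> NoGroup | item -> item) items
```

On the last result above, this turns Group (1, "") into NoGroup and leaves everything else alone. It hides the symptom rather than explaining it.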

When I was implementing pa_ppx_regexp I ran into these same perplexing issues. And what’s worse, pcre differs from re. Joy. For what it’s worth, I think re handles this better than pcre does.

So I implemented my own version of full_split[1]:

# [%split {|(x)|(u)|} / strings re_perl] "abxcd";;
- : [> `Delim of string * string option * string option | `Text of string ]
    list
= [`Text "ab"; `Delim ("x", Some "x", None); `Text "cd"]
# [%split {|(x)|(u)|} / strings pcre] "abxcd";;
- : [> `Delim of string * string option * string option | `Text of string ]
    list
= [`Text "ab"; `Delim ("x", Some "x", None); `Text "cd"]
# [%split {|(x)|(u)|} / strings re_perl] "abxcduef";;
- : [> `Delim of string * string option * string option | `Text of string ]
    list
=
[`Text "ab"; `Delim ("x", Some "x", None); `Text "cd";
 `Delim ("u", None, Some "u"); `Text "ef"]
# [%split {|(x)|(u)|} / strings pcre] "abxcduef";;
- : [> `Delim of string * string option * string option | `Text of string ]
    list
=
[`Text "ab"; `Delim ("x", Some "x", None); `Text "cd";
 `Delim ("u", None, Some "u"); `Text "ef"]

I don’t remember whether I dug into the pcre code to determine whether my work-alike was slower than the pcre version; at some point I should do that, I guess.

[1] https://github.com/camlp5/pa_ppx_regexp/blob/master/runtime/pa_ppx_regexp_runtime.ml


I have been debugging your problem, and I believe that there’s a bug in the wrapper code of pcre. I’ll confirm and send a PR over to the current maintainer.

FYI: I opened an issue, “A possible bug in full_split ?” (Issue #29 on mmottl/pcre-ocaml on GitHub). Please feel free to comment, etc.


Makes sense, thanks!