I don’t know if this is a good place to ask this, but I’m having trouble understanding what Pcre.full_split is supposed to do, in particular what the Group and NoGroup elements in its return value mean.
The ocaml-pcre documentation doesn’t help: it just says “Should behave exactly as in PERL”, and the pcre documentation doesn’t mention a split function at all. Perl doesn’t have a full_split, but as described in perlfunc, its split does automatically include the values of matching groups from the regex in the result. So, with the documented behavior of Str.full_split in mind, which returns the actual delimiter strings with a tag on them, I figure Pcre.full_split is supposed to do something along these lines.
After some experimentation, it seems that Pcre.full_split returns a Text value for every text string in between delimiters and a Delim value for each delimiter itself. Each Delim is followed, in order, by either a Group or a NoGroup value for each (...) subgroup in the regex. So, for instance,
# full_split ~pat:"([xy])([uv])" "abxucd";;
- : split_result list =
[Text "ab"; Delim "xu"; Group (1, "x"); Group (2, "u"); Text "cd"]
where the single delimiter xu becomes a Delim giving the full value of the delimiter string, plus Groups giving the values of the two subgroups in the regex.
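To make that shape concrete, here is a minimal sketch of how I read the flat list: each Delim owns the Group/NoGroup entries that immediately follow it. The split_result type below is a local copy of the constructors so the snippet compiles without the library, and chunk and chunks are my own hypothetical names, not part of Pcre.

```ocaml
(* Local copy of the constructors of Pcre.split_result, so this compiles
   without the library; with ocaml-pcre installed, use Pcre.split_result. *)
type split_result =
  | Text of string
  | Delim of string
  | Group of int * string
  | NoGroup

(* Hypothetical helper type: one piece of text, or one delimiter together
   with the Group/NoGroup entries that followed it in the flat list. *)
type chunk =
  | Piece of string
  | Sep of string * split_result list

let chunks (items : split_result list) : chunk list =
  (* Collect the run of Group/NoGroup entries at the head of the list. *)
  let rec take_groups acc = function
    | ((Group _ | NoGroup) as g) :: rest -> take_groups (g :: acc) rest
    | rest -> (List.rev acc, rest)
  in
  let rec go acc = function
    | [] -> List.rev acc
    | Text s :: rest -> go (Piece s :: acc) rest
    | Delim d :: rest ->
        let groups, rest' = take_groups [] rest in
        go (Sep (d, groups) :: acc) rest'
    | (Group _ | NoGroup) :: rest ->
        (* An entry with no preceding Delim; this sketch just skips it. *)
        go acc rest
  in
  go [] items

(* The output from the example above, written out as a literal. *)
let example =
  chunks [Text "ab"; Delim "xu"; Group (1, "x"); Group (2, "u"); Text "cd"]
```

Here example comes out as one Piece, one Sep carrying both groups, and one more Piece.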
So far so good. What I’m confused about is what happens when a particular subgroup doesn’t match (because it is part of an alternative not taken). In this case, Perl returns undef in the relevant location, and my guess was that NoGroup plays this role in Pcre. Sometimes this is right:
# full_split ~pat:"(x)|(u)" "abxcd";;
- : split_result list =
[Text "ab"; Delim "x"; Group (1, "x"); NoGroup; Text "cd"]
# full_split ~pat:"(x)|(u)" "abucd";;
- : split_result list =
[Text "ab"; Delim "u"; NoGroup; Group (2, "u"); Text "cd"]
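If NoGroup really is the counterpart of Perl’s undef, then each entry after a Delim could be read as an optional string. Here is a minimal sketch of that reading, assuming the entries always come in group order as in the two outputs above. by_position is my own hypothetical helper, and the type is again a local copy of the Pcre.split_result constructors rather than the library’s definition.

```ocaml
(* Local copy of the constructors of Pcre.split_result (see ocaml-pcre). *)
type split_result =
  | Text of string
  | Delim of string
  | Group of int * string
  | NoGroup

(* Number the Group/NoGroup entries at the head of the list by position,
   starting at 1. Assumption: entries appear in group order, so the number
   stored inside Group is ignored and position is used instead. *)
let by_position (entries : split_result list) : (int * string option) list =
  let rec go i = function
    | Group (_, s) :: rest -> (i, Some s) :: go (i + 1) rest
    | NoGroup :: rest -> (i, None) :: go (i + 1) rest
    | _ -> [] (* stop at the first Text or Delim, or at the end *)
  in
  go 1 entries

(* The entries following each Delim in the two outputs above. *)
let first_case = by_position [Group (1, "x"); NoGroup; Text "cd"]
let second_case = by_position [NoGroup; Group (2, "u"); Text "cd"]
```

Under that assumption, None stands where Perl would give undef.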
Oddly, NoGroup isn’t tagged by the number of the group the way Group is (though they always seem to appear in order, so in both cases the number is strictly speaking redundant). But more importantly, sometimes a non-matching group seems to be represented by a Group with the empty string rather than a NoGroup:
# full_split ~pat:"(x)|(u)" "abxcduef";;
- : split_result list =
[Text "ab"; Delim "x"; Group (1, "x"); NoGroup; Text "cd"; Delim "u";
Group (1, ""); Group (2, "u"); Text "ef"]
Why does it give Group (1, "") instead of NoGroup? Is this a bug?