I don’t know if this is a good place to ask this, but I’m having trouble understanding what
Pcre.full_split is supposed to do, in particular what the
NoGroup elements in its return value mean.
The ocaml-pcre documentation doesn’t give me any help – it just says “Should behave exactly as in PERL” – and the pcre documentation doesn’t mention a split function at all. Perl doesn’t have a
full_split, but as described in perlfunc, Perl's split does automatically include the values of any capturing groups in the regex. So, with the documented behavior of Str.full_split in mind, which returns the actual delimiter strings with a tag on them, I figure
Pcre.full_split is supposed to do something like this.
After some experimentation, it seems that
Pcre.full_split returns a
Text value for every text string in between delimiters and a
Delim value for each delimiter itself, followed by, for each
(...) subgroup in the regex, either a
Group or a
NoGroup value, in order following the
Delim. So, for instance,
# full_split ~pat:"([xy])([uv])" "abxucd";;
- : split_result list = [Text "ab"; Delim "xu"; Group (1, "x"); Group (2, "u"); Text "cd"]
where the single delimiter
xu becomes a
Delim giving the full value of the delimiter string, plus
Groups giving the values of the two subgroups in the regex.
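One way to check this reading is to see whether the Text and Delim entries alone rebuild the input, with the Group entries being extra copies of parts of the delimiter. The following is only a sketch: the variant type is a local copy that I am assuming mirrors Pcre.split_result, the helper name reassemble is made up for illustration, and the list is the output shown above rather than a fresh call into the library.

```ocaml
(* Local copy of the variant so this runs without linking pcre;
   assumed to mirror Pcre.split_result. *)
type split_result =
  | Text of string
  | Delim of string
  | Group of int * string
  | NoGroup

(* Hypothetical helper: concatenate only the Text and Delim payloads,
   dropping Group and NoGroup entries. *)
let reassemble (parts : split_result list) : string =
  String.concat ""
    (List.filter_map
       (function
         | Text s | Delim s -> Some s
         | Group _ | NoGroup -> None)
       parts)

(* The result reported above for ~pat:"([xy])([uv])" on "abxucd". *)
let observed =
  [Text "ab"; Delim "xu"; Group (1, "x"); Group (2, "u"); Text "cd"]

let rebuilt = reassemble observed
```

For this one output, rebuilt equals the original string "abxucd". Whether that holds for every pattern is an assumption I have not verified.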
So far so good. What I’m confused about is what happens when a particular subgroup doesn’t match (because it is part of an alternative not taken). In this case, Perl returns
undef in the relevant location, and my guess was that
NoGroup plays this role in Pcre. Sometimes this is right:
# full_split ~pat:"(x)|(u)" "abxcd";;
- : split_result list = [Text "ab"; Delim "x"; Group (1, "x"); NoGroup; Text "cd"]
# full_split ~pat:"(x)|(u)" "abucd";;
- : split_result list = [Text "ab"; Delim "u"; NoGroup; Group (2, "u"); Text "cd"]
NoGroup isn’t tagged by the number of the group the way
Group is (though they always seem to appear in order, so in both cases the number is strictly speaking redundant). But more importantly, sometimes a non-matching group seems to be represented by a
Group with the empty string rather than a
NoGroup:
# full_split ~pat:"(x)|(u)" "abxcduef";;
- : split_result list = [Text "ab"; Delim "x"; Group (1, "x"); NoGroup; Text "cd"; Delim "u"; Group (1, ""); Group (2, "u"); Text "ef"]
Why does it give
Group (1, "") instead of
NoGroup? Is this a bug?
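To make the inconsistency easier to see, here is a small sketch that maps each Delim to the list of group values that follow it, using None for NoGroup and Some s for Group (_, s), which is roughly how I would expect Perl's undef to translate. The variant type is again a local copy assumed to mirror Pcre.split_result, groups_by_delim is a made-up helper name, and the input list is the output pasted above, not a new call into the library.

```ocaml
(* Local copy of the variant so this runs without linking pcre;
   assumed to mirror Pcre.split_result. *)
type split_result =
  | Text of string
  | Delim of string
  | Group of int * string
  | NoGroup

(* Hypothetical helper: for each Delim, collect the group entries that
   follow it, in order. NoGroup becomes None, Group (_, s) becomes Some s.
   Text entries are skipped; group entries before any Delim are ignored. *)
let groups_by_delim (parts : split_result list)
    : (string * string option list) list =
  (* Push the delimiter currently being collected, if any, onto acc. *)
  let close acc = function
    | None -> acc
    | Some (d, gs) -> (d, List.rev gs) :: acc
  in
  (* Attach one group value to the delimiter currently being collected. *)
  let add g = function
    | None -> None
    | Some (d, gs) -> Some (d, g :: gs)
  in
  let rec go acc cur = function
    | [] -> List.rev (close acc cur)
    | Delim d :: rest -> go (close acc cur) (Some (d, [])) rest
    | Group (_, s) :: rest -> go acc (add (Some s) cur) rest
    | NoGroup :: rest -> go acc (add None cur) rest
    | Text _ :: rest -> go acc cur rest
  in
  go [] None parts

(* The result reported above for ~pat:"(x)|(u)" on "abxcduef". *)
let observed =
  [Text "ab"; Delim "x"; Group (1, "x"); NoGroup; Text "cd";
   Delim "u"; Group (1, ""); Group (2, "u"); Text "ef"]

let summary = groups_by_delim observed
(* summary = [("x", [Some "x"; None]); ("u", [Some ""; Some "u"])] *)
```

In this form the first delimiter reports its unmatched group as None, while the second reports its unmatched group as Some "", which cannot be told apart from a group that really matched the empty string.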