Regular Expression

Hello,
So I wanted to convert a string input into a string list, splitting it wherever a certain pattern matches. I found out about the Str module and its regexp functionality, which seemed to be what I was looking for.
But I can’t quite understand how to use it.
Example

I have a list of keywords to match:
let lst = ["pig"; "cow"; "cat"]
I figured I could convert it into a regexp via :
let reg = Str.regexp (String.concat "//|" lst)
Using the "//|" as the “alternative” operator and then use
let lst2 = Str.full_split reg "pigcowcat"
But it has no effect.


Try replacing the slashes with backslashes.
And I would recommend testing on the string "111pig222cow333cat444".
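For reference, here is a minimal sketch of the corrected version, run on the suggested test string. It assumes the str library is linked. The name `keywords` and the call to `Str.quote` (which neutralizes any special characters inside a keyword) are additions for illustration and do not come from the posts above.

```ocaml
(* Sketch only; needs the str library. *)
let keywords = ["pig"; "cow"; "cat"]

(* In an OCaml string literal, "\\|" is the two characters \| ,
   which is Str's alternation operator. *)
let reg = Str.regexp (String.concat "\\|" (List.map Str.quote keywords))

(* full_split keeps both the delimiters and the text between them. *)
let parts = Str.full_split reg "111pig222cow333cat444"
(* parts = [Text "111"; Delim "pig"; Text "222"; Delim "cow";
            Text "333"; Delim "cat"; Text "444"] *)
```

With the original "//|" separator, the pattern is a single literal string containing two slashes and a bar, so nothing in the input matches it.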


Nice it worked, thanks.
But now I need a regular expression for positive floats/ints, and I've come up with this:
\\([0-9]*.?[0-9]*\\)
But it matches everything except the first character for some reason.
Conceptually, it should match:
90, 0.9, .9, .90, 90. but also .
How can I prevent the last case?

I just figured out that . matches any character, so we need to use \\ postfixed to match a literal . :
\\([0-9]*.\\?[0-9]*\\)
But still, it doesn’t match integers or floats.

I would try the prefixed form, not the postfixed one.

Thanks, the doc is very confusing sometimes.
This will work for matching positive ints/floats:
"[0-9]+\\.?[0-9]*\\|\\.[0-9]+"
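A quick way to check that pattern against the cases listed earlier is sketched below. It assumes the str library is linked; `number_re` and `is_number` are hypothetical names, and comparing `Str.match_end ()` with the string length is just one way to require that the whole input matches.

```ocaml
(* Sketch only; needs the str library. *)
let number_re = Str.regexp "[0-9]+\\.?[0-9]*\\|\\.[0-9]+"

(* True when the pattern matches at position 0 and consumes the whole string. *)
let is_number s =
  Str.string_match number_re s 0 && Str.match_end () = String.length s

let () =
  (* The accepted forms from the thread. *)
  assert (List.for_all is_number ["90"; "0.9"; ".9"; ".90"; "90."]);
  (* The lone dot, the empty string and trailing junk are rejected. *)
  assert (not (is_number "."));
  assert (not (is_number ""));
  assert (not (is_number "9a"))
```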

When it comes to conversions, I used tyre by @Drup with great pleasure. lib/name.ml in mro/Tagger (on Codeberg.org, a mirror of https://code.mro.name/mro/Tagger) was a first try; I might make it simpler today, as I learnt going along. Really nice is the ‘unparse’, which you won’t see elsewhere.


Just a tiny, minor suggestion (actually two):

  1. I find that using Perl to test out regexps is nice, b/c i can do so on the command-line, viz.
echo 0.0 | perl -n -e 'print $_ if $_ =~ m/[0-9]+\.?[0-9]*|\.[0-9]+/'
  2. and then, I use a raw-string constant to write the string, viz.
{|[0-9]+\.?[0-9]*|\.[0-9]+|}

so I don’t need to escape backslashes.

[you’ll notice that I didn’t escape the “|”, but that’s b/c perl regexps don’t require it (and hence, PCRE won’t either)]
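The point about quoted strings can be checked directly, as in the small sketch below. The names `escaped` and `quoted` are made up for illustration. Note that the alternation is written `\|` here because this version targets Str, whereas the Perl/PCRE form quoted above uses a bare `|`.

```ocaml
(* The same Str pattern written twice: once as an ordinary string literal
   with doubled backslashes, once as a quoted string {|...|}. *)
let escaped = "[0-9]+\\.?[0-9]*\\|\\.[0-9]+"
let quoted = {|[0-9]+\.?[0-9]*\|\.[0-9]+|}

(* Both literals denote exactly the same sequence of characters. *)
let () = assert (escaped = quoted)
```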


once the floats float (beyond being fixed-point decimals), they tend to look like -9.37e-5.


the last few times I needed a regexp for float literals, I went and copied the one out of the OCaml compiler source (parsing/lexer.mll, search for “float_literal”). It’s not the same syntax as any particular regex engine (at least, not that I know of) but should be straightforward enough to transliterate.


Parsing ints and floats is trickier than it seems at first glance. I went through a lot of iterations to account for positive/negative, floats, ints, NaN, infinity, scientific notation etc. to settle on my current code: decimal.ml in yawaramin/ocaml-decimal on GitHub (at commit 08c57183c8673b5058bd6010570e69f0201c03c7).

The core regex is:

let finite_r = Str.regexp {|^[+-]?[0-9]*\.?[0-9]*\(e[+-]?[0-9]+\)?$|}

Note, if you use a verbatim string literal {|...|} instead of "...", then you don’t need to escape your backslashes. EDIT: whoops, Chet already mentioned that.

Yawar, I don’t know the context in which you use this, but … doesn’t this regexp match the string "." and also "" ? Maybe it’s really late and I’m not thinking straight …

I remember when I was writing a JSON parser that JS accepts .0 as a float, but OCaml does not, and only accepts 0.0.

ETA: and also the string "e0" ?
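The three strings in question can be tried against the regexp exactly as quoted above. This is only a sketch: it assumes the str library is linked, `matches` is a hypothetical helper, and it says nothing about how the surrounding decimal code treats these inputs afterwards.

```ocaml
(* Sketch only; needs the str library. The pattern is copied from the post above. *)
let finite_r = Str.regexp {|^[+-]?[0-9]*\.?[0-9]*\(e[+-]?[0-9]+\)?$|}

let matches s = Str.string_match finite_r s 0

let () =
  (* Every part of the pattern is optional, so these all match. *)
  assert (matches ".");
  assert (matches "");
  assert (matches "e0");
  (* An ordinary float in scientific notation matches too. *)
  assert (matches "-9.37e-5");
  (* Letters other than the exponent marker do not. *)
  assert (not (matches "abc"))
```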

uh, just realised nobody brought up xkcd-wisdom so far: xkcd: Perl Problems


Oh come now. regexps are an amazing tool, and I routinely use them to solve problems that would require significant amounts of code. By shrinking the code down to a single line, it’s actually more comprehensible and easier-to-check. Sure, then you have to really stare at it, but I’m quite convinced that it’s better than a pile of parsing logic.

I wrote a regexp once that exactly parsed an XML start-tag. And used that in a regexp that parsed the next syntactic element (start/end-tag, xmldecl, chars, char-escape) in XML. Was much, much better than something lower-level.

I mean, here’s one: a regexp (for sedlex) for JSON floating-point numbers (read right off the spec, IIRC):

let digit = [%sedlex.regexp? '0'..'9']
let int = [%sedlex.regexp? '0' | ( ('1'..'9') , (Star digit) )]
let frac = [%sedlex.regexp? '.' , (Star digit)]
let ne_frac = [%sedlex.regexp? '.' , (Plus digit)]
let exp = [%sedlex.regexp? ('e' | 'E') , (Opt ('-' | '+')) , (Plus digit)]
let json_number = [%sedlex.regexp? (Opt '-') , int, Opt ne_frac, Opt exp]
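For readers following the thread with Str rather than sedlex, the same grammar can be transliterated by hand. The sketch below is such a transliteration, not sedlex code and not taken from the post: `json_number_re` and `is_json_number` are hypothetical names, and it assumes the str library is linked.

```ocaml
(* Sketch only; needs the str library.
   Optional minus, then "0" or a non-zero digit followed by digits,
   then an optional non-empty fraction, then an optional exponent. *)
let json_number_re =
  Str.regexp {|-?\(0\|[1-9][0-9]*\)\(\.[0-9]+\)?\([eE][+-]?[0-9]+\)?|}

(* True when the pattern matches at position 0 and consumes the whole string. *)
let is_json_number s =
  Str.string_match json_number_re s 0 && Str.match_end () = String.length s

let () =
  assert (List.for_all is_json_number ["0"; "-0"; "10"; "1.5e+10"; "3E8"]);
  (* Leading zeros, a bare fraction, a trailing dot, an empty exponent
     and a leading plus sign are all rejected. *)
  assert (not (List.exists is_json_number ["01"; ".5"; "1."; "1e"; "+1"]))
```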

Here’s another example from 2020.

This problem started out as a complex perl script that worked char-by-char, and eventually it boiled-down to basically a small and simple program that used a -really- big regexp to solve the same problem. Much faster, too.
https://discuss.ocaml.org/t/re-pattern-matching-on-a-lazy-list/6159/2

Hi Chet, you’re mostly right. I deliberately allow matching "." and "" and parse them into 0. "e0" throws an exception–there’s other code that handles parsing the string into a valid decimal.


That’s quite useful indeed.

In fact, the input will come from users and should follow mathematical conventions, which include the constant e. That rules out this notation for floats, because it could evidently be misinterpreted.

No need to suggest anything more to me; I’m done with regexps for now, since this one did its job wonderfully.

@Nolord don’t confuse Euler’s number e with the marker meaning “take what follows as an exponent of base 10”. And be careful: quite a few parsers choke on explicit leading plus signs.

@Chet_Murthy in fact I love regexps. They’re like salt to me – all over the place but I’m careful about the quantities.


God I love 'em!

print "prime\n" if $x !~ m,^(11+)\1+$, ;

[courtesy of Jon Orwant >20yr ago]


There never was any confusion. I was just asking how an algorithm could determine whether -9.37e-5 means -9.37*10^-5 or -9.37*e - 5. It can’t, because both make sense, so I chose the second option, since there is no other way to write “e”.
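A small sketch of what that choice means in practice, assuming the str library is linked and reusing the positive int/float pattern from earlier in the thread. The helper `leading_number` is hypothetical and stands in for whatever tokenizer reads the user input: the number pattern stops before the e, which is then left for the rest of the expression to interpret as the constant.

```ocaml
(* Sketch only; needs the str library. *)
let number_re = Str.regexp "[0-9]+\\.?[0-9]*\\|\\.[0-9]+"

(* Returns the number found at the start of the input, if there is one. *)
let leading_number s =
  if Str.string_match number_re s 0 then Some (Str.matched_string s) else None

let () =
  (* Only "9.37" is consumed; "e-5" remains for the expression parser. *)
  assert (leading_number "9.37e-5" = Some "9.37");
  (* A bare "e" is never read as part of a number. *)
  assert (leading_number "e-5" = None)
```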