Regular Expression

Nolord · January 23, 2022, 11:15am

Hello,
I want to convert a string input into a string list, splitting wherever a certain pattern matches. I found the Str module and its regexp functionality, which seemed to be what I was looking for.
But I can’t quite understand how to use it.
Example

I have a list of keywords to match:
let lst = ["pig"; "cow"; "cat"]
I figured I could convert it into a regexp via :
let reg = Str.regexp (String.concat "//|" lst)
Using the "//|" as the “alternative” operator and then use
let lst2 = Str.full_split reg "pigcowcat"
But it has no effect.

Kakadu · January 23, 2022, 11:30am

Try replacing the slashes with backslashes.
And I would recommend testing on the string "111pig222cow333cat444".
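
For reference, a minimal sketch of the corrected version, assuming the str library is linked (for example `(libraries str)` in a dune file). Wrapping each keyword in Str.quote is my own addition so that keywords containing special characters are matched literally, and show_item is only a hypothetical helper for printing.

```ocaml
(* Build an alternation regexp from a keyword list and split on it.
   Requires linking the str library. *)
let keywords = ["pig"; "cow"; "cat"]

(* In Str, alternation is written \| , which is "\\|" inside an ordinary
   OCaml string literal. Str.quote escapes special characters in a keyword
   (an addition, not part of the original post). *)
let reg = Str.regexp (String.concat "\\|" (List.map Str.quote keywords))

(* full_split keeps both the text between matches and the matches themselves. *)
let parts = Str.full_split reg "111pig222cow333cat444"

(* Hypothetical helper, only used for display. *)
let show_item = function
  | Str.Text s -> "Text " ^ s
  | Str.Delim s -> "Delim " ^ s

let () = List.iter (fun p -> print_endline (show_item p)) parts
```

With the original input "pigcowcat" every piece comes back as a Delim, since there is no text between the keywords.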

Nolord · January 23, 2022, 12:15pm

Nice it worked, thanks.
But now I need the regular expression for a positive float/int, and I’ve come up with this:
\\([0-9]*.?[0-9]*\\)
But it matches everything except the first character for some reason.
In principle it should match:
90, 0.9, .9, .90, 90. but also .
How can I prevent the last case?

Nolord · January 23, 2022, 1:06pm

I just figured out that . matches any character, so we need a postfixed \\ to match a literal . :
\\([0-9]*.\\?[0-9]*\\)
But still, it doesn’t match integers or floats.

Kakadu · January 23, 2022, 3:01pm

I would try the prefixed form, not the postfixed one.

Nolord · January 23, 2022, 3:28pm

Thanks, the doc is very confusing sometimes.
This will work for matching a positive int/float:
"[0-9]+\\.?[0-9]*\\|\\.[0-9]+"

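A quick way to check that pattern against the cases listed earlier, as a sketch that assumes the str library is linked. matches_fully is a hypothetical helper, not something from the thread: Str.string_match only anchors the start of the match, so the helper also compares the end of the match with the length of the string.

```ocaml
(* Check the final pattern against the cases from the earlier post.
   Requires linking the str library. *)
let positive_number = Str.regexp "[0-9]+\\.?[0-9]*\\|\\.[0-9]+"

(* Hypothetical helper: true only when the match covers the whole string.
   Str.string_match by itself accepts a match of any prefix. *)
let matches_fully re s =
  Str.string_match re s 0 && Str.match_end () = String.length s

let () =
  List.iter
    (fun s -> Printf.printf "%S -> %b\n" s (matches_fully positive_number s))
    ["90"; "0.9"; ".9"; ".90"; "90."; "."]
```

On these inputs the first five are accepted and the lone "." is rejected, which is the behaviour asked for above.
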
mro · January 23, 2022, 3:30pm

When it comes to conversions, I used tyre by @Drup with great pleasure. lib/name.ml in mro/Tagger (on Codeberg.org, a mirror of https://code.mro.name/mro/Tagger) was a first try; I might make it simpler today, as I learnt going along. Really nice is the ‘unparse’, which you won’t see elsewhere.

Chet_Murthy · January 24, 2022, 6:16am

Just a tiny, minor suggestion (actually two):

I find that using Perl to test out regexps is nice, b/c I can do so on the command line, viz.

echo 0.0 | perl -n -e 'print $_ if $_ =~ m/[0-9]+\.?[0-9]*|\.[0-9]+/'

and then, I use a raw-string constant to write the string, viz

{|[0-9]+\.?[0-9]*|\.[0-9]+|}

so I don’t need to escape backslashes.

[you’ll notice that I didn’t escape the “|”, but that’s b/c perl regexps don’t require it (and hence, PCRE won’t either)]
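
To make the escaping point concrete, here is a small sketch comparing the two spellings of the Str pattern from earlier in the thread. Note that for Str the alternation bar stays escaped (`\|`) even inside a quoted string literal, since an unescaped `|` is an ordinary character in Str syntax; only the doubling of backslashes goes away. The names escaped and quoted are just for illustration.

```ocaml
(* Two spellings of the same Str pattern. *)

(* Ordinary string literal: every backslash has to be doubled. *)
let escaped = "[0-9]+\\.?[0-9]*\\|\\.[0-9]+"

(* Quoted string literal: backslashes are written as-is. *)
let quoted = {|[0-9]+\.?[0-9]*\|\.[0-9]+|}

(* Both literals denote exactly the same string. *)
let () = assert (String.equal escaped quoted)
```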

mro · January 24, 2022, 9:37am

once the floats float (beyond being fixed-point decimals), they tend to look like -9.37e-5.

Chet_Murthy · January 25, 2022, 2:47am

the last few times I needed a regexp for float literals, I went and copied the one out of the OCaml compiler source (parsing/lexer.mll, search for “float_literal”). It’s not the same syntax as any particular regex engine (at least, not that I know of) but should be straightforward enough to transliterate.

yawaramin · January 25, 2022, 3:21am

Parsing ints and floats is trickier than it seems at first glance. I went through a lot of iterations to account for positive/negative, floats, ints, NaN, infinity, scientific notation etc. to settle on my current code, ocaml-decimal/decimal.ml at 08c57183c8673b5058bd6010570e69f0201c03c7 · yawaramin/ocaml-decimal · GitHub

The core regex is:

let finite_r = Str.regexp {|^[+-]?[0-9]*\.?[0-9]*\(e[+-]?[0-9]+\)?$|}

Note, if you use a verbatim string literal {|...|} instead of "...", then you don’t need to escape your backslashes. EDIT: whoops, Chet already mentioned that.

Chet_Murthy · January 25, 2022, 7:14am

Yawar, I don’t know the context in which you use this, but … doesn’t this regexp match the string "." and also "" ? Maybe it’s really late and I’m not thinking straight …

I remember when I was writing a JSON parser that JS accepts .0 as a float, but OCaml does not, and only accepts 0.0.

ETA: and also the string "e0" ?

mro · January 25, 2022, 8:41am

uh, just realised nobody brought up xkcd-wisdom so far: xkcd: Perl Problems

Chet_Murthy · January 25, 2022, 8:47am

Oh come now. regexps are an amazing tool, and I routinely use them to solve problems that would require significant amounts of code. By shrinking the code down to a single line, it’s actually more comprehensible and easier-to-check. Sure, then you have to really stare at it, but I’m quite convinced that it’s better than a pile of parsing logic.

I wrote a regexp once that exactly parsed an XML start-tag. And used that in a regexp that parsed the next syntactic element (start/end-tag, xmldecl, chars, char-escape) in XML. Was much, much better than something lower-level.

I mean, here’s one: a regexp (for sedlex) for JSON floating-point numbers (read right off the spec, IIRC):

let digit = [%sedlex.regexp? '0'..'9']
let int = [%sedlex.regexp? '0' | ( ('1'..'9') , (Star digit) )]
let frac = [%sedlex.regexp? '.' , (Star digit)]
let ne_frac = [%sedlex.regexp? '.' , (Plus digit)]
let exp = [%sedlex.regexp? ('e' | 'E') , (Opt ('-' | '+')) , (Plus digit)]
let json_number = [%sedlex.regexp? (Opt '-') , int, Opt ne_frac, Opt exp]
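
For comparison with the Str-based patterns earlier in the thread, here is my own hand transliteration of json_number into Str syntax. It is only a sketch, assuming the str library is linked; matches_fully is a hypothetical helper that checks the match spans the whole string.

```ocaml
(* A hand transliteration of the sedlex json_number definition into Str
   syntax: optional '-', then int, optional non-empty fraction, optional
   exponent. Requires linking the str library. *)
let json_number =
  Str.regexp {|-?\(0\|[1-9][0-9]*\)\(\.[0-9]+\)?\([eE][-+]?[0-9]+\)?|}

(* Hypothetical helper: true only when the match covers the whole string. *)
let matches_fully re s =
  Str.string_match re s 0 && Str.match_end () = String.length s

let () =
  List.iter
    (fun s -> Printf.printf "%S -> %b\n" s (matches_fully json_number s))
    ["0"; "-12.5e3"; "1E+10"; "01"; ".5"; "1."]
```

The one-line Str version is denser than the named sedlex pieces, which is the trade-off being discussed here.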

Chet_Murthy · January 25, 2022, 9:05am

Here’s another example from 2020.

This problem started out as a complex perl script that worked char-by-char, and eventually it boiled-down to basically a small and simple program that used a -really- big regexp to solve the same problem. Much faster, too.
https://discuss.ocaml.org/t/re-pattern-matching-on-a-lazy-list/6159/2

yawaramin · January 25, 2022, 3:52pm

Hi Chet, you’re mostly right. I deliberately allow matching "." and "" and parse them into 0. "e0" throws an exception–there’s other code that handles parsing the string into a valid decimal.
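
For anyone following along, a quick check of what the regexp by itself accepts, assuming the str library is linked. accepts is a hypothetical helper; the surrounding parsing code in ocaml-decimal is not reproduced here, so this only exercises the pattern.

```ocaml
(* What finite_r accepts on its own, before any further parsing.
   Requires linking the str library. *)
let finite_r = Str.regexp {|^[+-]?[0-9]*\.?[0-9]*\(e[+-]?[0-9]+\)?$|}

(* Hypothetical helper: the trailing $ in the pattern already forces the
   match to reach the end of the string (or of its first line). *)
let accepts s = Str.string_match finite_r s 0

let () =
  List.iter
    (fun s -> Printf.printf "%S -> %b\n" s (accepts s))
    [""; "."; "e0"; "-9.37e-5"; "1e"; "abc"]
```

The pattern alone does match "", "." and "e0", so rejecting "e0" has to happen in the code that runs after the match.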

Nolord · January 25, 2022, 6:16pm

That’s quite useful indeed.

In fact, the input will be user input and should follow mathematical conventions, which include the constant e, so this way of writing floats is not allowed, because it could obviously be misinterpreted.

No need to suggest me anything more, I’m done with regexp for now since it did its job wonderfully.

mro · January 25, 2022, 7:21pm

@Nolord don’t confuse the Euler number e with the indicator meaning “take the following as the exponent to the base 10”. And be careful: quite a few parsers choke on leading explicit plus signs.

@Chet_Murthy in fact I love regexps. They’re like salt to me – all over the place but I’m careful about the quantities.

Chet_Murthy · January 25, 2022, 7:36pm

God I love 'em!

print "prime\n" if $x !~ m,^(11+)\1+$, ;

[courtesy of Jon Orwant >20yr ago]

Nolord · January 25, 2022, 7:40pm

There never was any confusion. I was just asking: how can an algorithm determine whether -9.37e-5 is -9.37*10^-5 or -9.37*e - 5? It can’t, because both make sense, so I chose the second option, since you can’t express “e” in a different manner.
