String manipulation

Hello,

I have the string below:

let str = "MACROBUTTON AbaisserEnCorpsDeTexte \"[Click here and insert a PICTURE (mandatory)]\""

I want to extract “[Click here and insert a PICTURE (mandatory)]” from this string.
Is it possible to do it with only one or two lines of OCaml code?

Thank you in advance.

Certainly. The simplest code that respects your specification:

let str = "[Click here and insert a PICTURE (mandatory)]"

If your specification was, in fact, to extract the first string between double quotes:

let extract s = match String.split_on_char '"' s with
  |  _ :: s :: _ -> Some s | _ -> None

Or, if you wanted all strings between double quotes, in order of appearance:

let extract s = List.rev @@ snd @@ List.fold_left (fun (p,l) x -> not p,
  if p then x :: l else l) (false, []) @@ String.split_on_char '"' s

Note that the unnatural constraint of one or two lines requires unnatural code indentation.

In OCaml 4.11 and later, all strings between double quotes can be obtained with

let get s = String.split_on_char '"' s |> List.filteri (fun i _ -> i mod 2 = 1)
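To see why the odd positions are the quoted parts, here is a small sketch on a made-up input. The sample string and the name `quoted` are only for illustration, and `List.filteri` assumes OCaml 4.11 or later.

```ocaml
(* Splitting on '"' alternates between text outside and inside quotes,
   so the pieces at odd indices are the quoted ones. *)
let quoted s =
  String.split_on_char '"' s |> List.filteri (fun i _ -> i mod 2 = 1)

let () =
  (* The split yields ["say "; "hello"; " and "; "bye"; ""]. *)
  List.iter print_endline (quoted {|say "hello" and "bye"|})
  (* prints "hello" then "bye" *)
```

Note that an unterminated quote is not detected: the text after the last lone quote is returned as if it were quoted.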

Thank you for the feedback.

I think my requirement was not clear.
I need to get the third field from the string.
For example, my string is: "MACROBUTTON AbaisserEnCorpsDeTexte \"[Click here and insert a PICTURE (mandatory)]\""

and the third field of this string is “[Click here and insert a PICTURE (mandatory)]”.

I do not want to extract the string between double quotes. The third field can be anything; this is just an example.

Other examples of strings are:
"MACROBUTTON AbaisserEnCorpsDeTexte ..."
"MACROBUTTON CheckFail PASS" etc.

In the above examples, the third fields are “…” and “PASS”.

Then you can just split on ' ', discard the first two elements, and concatenate the rest. Or you can scan with scanf,

let extract s = Scanf.sscanf s "%s %s %s" (fun _ _ z -> z)

or write a parser with Angstrom, or … I am not sure what the issue is here.
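One caveat with the scanf route: `%s` stops at whitespace, so a third `%s` captures only the first word of a multi-word field. A minimal sketch of a variant that takes the rest of the line instead, assuming the third field is everything after the second one (the name `third_field` is made up for this example):

```ocaml
(* "%[^\n]" reads every character up to a newline or the end of input.
   A space in the format matches any amount of whitespace, so runs of
   several spaces between fields are skipped as well. *)
let third_field s = Scanf.sscanf s "%s %s %[^\n]" (fun _ _ rest -> rest)

let () =
  print_endline (third_field "MACROBUTTON CheckFail PASS");
  print_endline
    (third_field
       "MACROBUTTON AbaisserEnCorpsDeTexte \"[Click here and insert a PICTURE (mandatory)]\"")
```

The double quotes, when present, are kept as part of the returned field.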


(1) If your string is really a record of sorts, with space-separated fields, where each field either contains no whitespace or is double-quoted when it does, AND the number of fields is FIXED, I would suggest using Pcre (or Str) and writing a regexp with capture groups (the parentheses). That’ll get you whichever field you wish.

(2) If the above is true, except that the number of fields is unbounded, then you’ll probably need to write something that iterates down the string to the field of interest. You can do that with Pcre again, since it can match from a given starting position.

This will require that you carefully specify the syntax of fields. So for instance, if the double-quoted field can itself contain escaped double-quotes, you’ll need to make sure your regex accounts for that.
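As an illustration of iterating down the string, here is a hand-written sketch that uses only the standard library rather than Pcre. The field syntax is an assumption, not something confirmed in this thread: fields are separated by one or more spaces, a field starting with a double quote runs to the next double quote (escaped quotes are not handled), and any other field runs to the next space. The name `nth_field` is hypothetical.

```ocaml
(* Return the n-th field (0-based) of s, or None if there are fewer fields. *)
let nth_field n s =
  let len = String.length s in
  (* Index of the first non-space character at or after i. *)
  let rec skip_spaces i =
    if i < len && s.[i] = ' ' then skip_spaces (i + 1) else i
  in
  (* Index just past the field that starts at i. *)
  let field_end i =
    if s.[i] = '"' then
      match String.index_from_opt s (i + 1) '"' with
      | Some j -> j + 1 (* include the closing quote *)
      | None -> len     (* unterminated quote: take the rest *)
    else
      match String.index_from_opt s i ' ' with
      | Some j -> j
      | None -> len
  in
  let rec go k i =
    let i = skip_spaces i in
    if i >= len then None
    else
      let j = field_end i in
      if k = 0 then Some (String.sub s i (j - i)) else go (k - 1) j
  in
  go n 0
```

With this sketch, `nth_field 2 "MACROBUTTON CheckFail PASS"` evaluates to `Some "PASS"`, and a quoted third field is returned with its quotes.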

If you’re working on a lot of problems like this, I think learning how to use regular expressions is going to be really, really valuable. And this will be true regardless of which language you choose: indeed, in Perl/Python/Ruby/etc., you’ll need them even more so than in OCaml.

Hope this helps.


Hello,
I tried this string manipulation with a PCRE regular expression, but I ended up with an incomplete regular expression.

The regular expression I tried for matching the MACROBUTTON fields is:
[^ ]+

But, as you can quickly guess, it breaks when the third field of MACROBUTTON contains a space (e.g. “[Click here and insert a PICTURE (mandatory)]”).

Could you please suggest some improvements to this regular expression?
Sorry for the very late reply.

Thank you in advance.

The regexp you cite treats the space as the separator between fields. But that separator also appears inside fields. I’d suggest you first write a regexp to match a field, and then string them together with the regexp to match the separator. What is a regexp that matches the entire line?

At this point, it might be useful to back up and work through the lexical analysis chapter in a good compilers textbook, and/or the O’Reilly book on regexps.
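One possible whole-line regexp can be sketched with Str, under the assumption that the first two fields contain no spaces and the third field is simply the rest of the line. This needs the `str` library linked in (for example via the `str` findlib package), and the names `line_re` and `third_field` are made up for the example.

```ocaml
(* Field, separator, field, separator, rest of line.
   In Str syntax, \( \) delimit a capture group. *)
let line_re = Str.regexp {|^\([^ ]+\) +\([^ ]+\) +\(.*\)$|}

(* Return the third capture group if the whole line matches. *)
let third_field s =
  if Str.string_match line_re s 0 then Some (Str.matched_group 3 s) else None
```

Here `[^ ]+` is the field regexp, ` +` is the separator regexp, and `.*` stands in for a third field that may itself contain spaces. A stricter field syntax, such as requiring double quotes around fields with spaces, would replace the last group.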

If your fields are separated by a single space character, then the following simple function could be used:

let extract s = match String.split_on_char ' ' s with
    | _ :: _ :: rest -> String.concat " " rest
    | _ -> failwith "invalid format"

The idea is to split the string into words, drop the first two, and rebuild the remainder by joining the remaining words with single spaces.

No, fields are separated by one or more space characters.
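With runs of several spaces, `String.split_on_char ' '` produces empty strings between consecutive spaces. A minimal sketch of the same split-and-rejoin idea that filters those out first, with the caveat that runs of spaces inside the third field are collapsed to a single space (returning an option instead of raising is a choice made for this example):

```ocaml
(* Split on spaces, drop the empty pieces that come from consecutive
   spaces, skip the first two fields, and rejoin the remaining words. *)
let extract s =
  match List.filter (fun w -> w <> "") (String.split_on_char ' ' s) with
  | _ :: _ :: (_ :: _ as rest) -> Some (String.concat " " rest)
  | _ -> None (* fewer than three fields *)
```

If the spacing inside the third field has to be preserved exactly, scanning past the first two fields and taking the rest of the string unchanged is the safer route.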