Hello,
I am new to the ocamllex lexer generator.
I am trying to lex C++11 raw string literals using ocamllex (see https://en.cppreference.com/w/cpp/language/string_literal).
I am using the regular expression below to do this:
(['L' 'u' 'U']|""|"u8")? 'R' '"' ([^ '(' ')' '\\' ' ']*) '(' _* ')' ([^ '(' ')' '\\' ' ']*) '"'
But the _* in the middle is not right: with it I cannot correctly capture all the raw string literals in the C++ code below.
#include <iostream>
using namespace std;

void testCpp11String1() {
    cout << R"-DELIM-(I can write
a multi-line
string ()
")-DELIM-" << endl;
}

void testCpp11String2() {
    cout << R"-DELIM-(I can put double quotes like this " " in the string as long as it is not terminated by )delimiter" where delimiter is -DELIM-)-DELIM-" << endl; // there is an odd number of double quotes in this line
}

int main() {
    testCpp11String1();
    testCpp11String2();
    return 0;
}
Please give me some ideas.
Thank you in advance
I think your regular expression, among other things, will fail to make sure the end delimiter is the same as the first one. This is something a regular expression (without things like back-references which I think ocamllex regexes don’t have) cannot capture.
The standard thing to do in ocamllex to handle things like string literals or comments is to define a separate entry point. It's more powerful and readable than a regex, and it can handle what we want here. What I propose below is just a skeleton: you'll want to tune it a lot to handle newlines, escapes and so on.
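let delimiter = (* a regex for valid delimiters here *)

rule token = parse
  ...
  | "R\"" (delimiter as delim) '('
      { raw_literal delim (Buffer.create 32) lexbuf }
  ...

and raw_literal delim buf = parse
  (* One or more characters other than ')'. Using + rather than * avoids
     matching the empty string, which would loop forever at end of input. *)
  | [^')']+
      {
        Buffer.add_string buf (Lexing.lexeme lexbuf);
        raw_literal delim buf lexbuf
      }
  (* A candidate terminator: accept it only if both delimiters agree.
     This relies on delimiter matching neither ')' nor '"'. *)
  | ')' (delimiter as delim2) '"'
      {
        if delim = delim2 then
          RAW_LIT (Buffer.contents buf)
        else begin
          Buffer.add_string buf (Lexing.lexeme lexbuf);
          raw_literal delim buf lexbuf
        end
      }
  (* A ')' that does not start a terminator is ordinary content. *)
  | ')'
      {
        Buffer.add_char buf ')';
        raw_literal delim buf lexbuf
      }
  (* Unterminated literal: fail instead of falling through. *)
  | eof
      { failwith "unterminated raw string literal" }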
I use a buffer to accumulate what we've seen so far, since we want a single token at the end. Maybe there is another way, but I don't know of one.
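To try the delimiter check in isolation, without running ocamllex, here is a rough plain-OCaml sketch of the same idea. The function name scan_raw_literal and its signature are made up for illustration. It assumes the caller has already read the opening delimiter and passes the index just after the opening '('. Unlike the entry point above, it compares against the exact closing sequence, so it does not need a delimiter regex at all.

```ocaml
(* Hypothetical helper, not part of ocamllex or the lexer above.
   [delim] is the delimiter read between R" and '(' ; [pos] is the index
   in [src] just after that '('. Returns the literal's contents and the
   index just past the closing quote, or None if it is unterminated. *)
let scan_raw_literal (delim : string) (src : string) (pos : int)
  : (string * int) option =
  let closing = ")" ^ delim ^ "\"" in
  let clen = String.length closing in
  let n = String.length src in
  let buf = Buffer.create 32 in
  let rec go i =
    if i + clen > n then None                 (* not enough input left *)
    else if String.sub src i clen = closing then
      Some (Buffer.contents buf, i + clen)    (* matching terminator found *)
    else begin
      Buffer.add_char buf src.[i];            (* ordinary content *)
      go (i + 1)
    end
  in
  go pos

let () =
  (* A shortened version of the second test string from the question. *)
  let src = "R\"-DELIM-(a \" b )delimiter\" c)-DELIM-\" << endl;" in
  match scan_raw_literal "-DELIM-" src 10 with
  | Some (contents, _) -> print_endline contents
  | None -> print_endline "unterminated"
```

The ")delimiter\"" inside the text is skipped because it is not followed by the same delimiter that opened the literal, which is exactly the check the regex could not express.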
It sounds easier to me to handle the type prefixes in the parser, so that the handling can be shared with the other kinds of strings, which is why I didn't include them. Maybe I'm wrong.
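Wherever you end up handling the prefix, you will probably want to turn it into a value that ordinary and raw strings can share. A minimal sketch, where the type and function names are hypothetical and only the prefix spellings come from the C++11 rules:

```ocaml
(* Hypothetical representation of the encoding prefix of a string literal. *)
type encoding = Ordinary | Wide | Utf8 | Utf16 | Utf32

(* Map the text found before the opening quote (or before R") to an
   encoding. Returns None for anything that is not a C++11 prefix. *)
let encoding_of_prefix = function
  | "" -> Some Ordinary
  | "L" -> Some Wide
  | "u8" -> Some Utf8
  | "u" -> Some Utf16
  | "U" -> Some Utf32
  | _ -> None
```

A token could then carry a pair such as (encoding, contents), whether the contents came from a regular or a raw literal.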
Hope this helps. There's a nice tutorial about ocamllex here.
Hello “threepwood”,
Thank you very much for your detailed reply. It works very well!