Lexer rule for C++11 raw string literals

Hello,

I am new to the ocamllex lexer generator.
I am trying to match C++11 raw string literals using ocamllex. [https://en.cppreference.com/w/cpp/language/string_literal]
I am using the regular expression below to do this:
(['L' 'u' 'U']|""|"u8")? 'R' '"' ([^ '(' ')' '\\' ' ']*) '(' _* ')' ([^ '(' ')' '\\' ' ']*) '"'

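But the _* part is the problem: it matches any character, and since ocamllex takes the longest match, it can run past the real closing delimiter. I am not able to capture all the raw string literals in the C++ code below.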

#include <iostream>

using namespace std;

void testCpp11String1() {
	cout << R"-DELIM-(I can write
a multi-line
string ()
")-DELIM-" << endl;
}

void testCpp11String2() {
	cout << R"-DELIM-(I can put double quotes like this " " in the string as long as it is not terminated by )delimiter" where delimiter is -DELIM-)-DELIM-" << endl; // there is an odd number of double quotes in this line
}

int main() {
	testCpp11String1();
	testCpp11String2();
	return 0;
} 

Please give me some ideas.
Thank you in advance

I think your regular expression, among other things, cannot guarantee that the closing delimiter is the same as the opening one. A regular expression cannot express this without something like back-references, which I think ocamllex regexes don’t have.
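To make the point concrete, here is a minimal sketch in plain OCaml (no ocamllex involved) of the check that a single regex cannot do: the closing sequence has to be built from the delimiter captured at the start. The names find_from and scan_raw are hypothetical helpers made up for this illustration, and the sketch assumes the input starts exactly at the R and does not validate which characters are legal in a delimiter.

```ocaml
(* Hypothetical helper: index of the first occurrence of [sub] in [s]
   at or after position [from], if any. *)
let find_from (s : string) (from : int) (sub : string) : int option =
  let n = String.length s and m = String.length sub in
  let rec go i =
    if i + m > n then None
    else if String.sub s i m = sub then Some i
    else go (i + 1)
  in
  go from

(* [scan_raw s] expects [s] to start with R"delim( and returns the body of
   the raw literal, or None if the opening or closing sequence is missing. *)
let scan_raw (s : string) : string option =
  let n = String.length s in
  if n < 2 || s.[0] <> 'R' || s.[1] <> '"' then None
  else
    match String.index_from_opt s 2 '(' with
    | None -> None
    | Some lparen ->
      let delim = String.sub s 2 (lparen - 2) in
      (* The closing sequence depends on the delimiter captured above. *)
      let closing = ")" ^ delim ^ "\"" in
      let start = lparen + 1 in
      (match find_from s start closing with
       | None -> None
       | Some stop -> Some (String.sub s start (stop - start)))
```

With the second test function from the question, the search skips over )delimiter" because it is not the captured -DELIM-, and stops only at )-DELIM-".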

The standard thing to do in ocamllex to handle things like string literals or comments is to define a separate entry point. It’s more powerful and readable than a regex, and it can handle what we want here. What I propose below is just a skeleton; you’ll want to tune it quite a bit to handle new lines, escapes, and so on.

let delimiter = (* a regex for valid delimiters here *)

rule token = parse
  ...
  | "R\"" (delimiter as delim) '('
    { raw_literal delim (Buffer.create 32) lexbuf }
  ...
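and raw_literal delim buf = parse
  | [^ ')']+
    {
      (* Any run of characters other than ')' is part of the literal.
         '+' ensures the rule never matches the empty string, which
         would loop forever at end of input. *)
      Buffer.add_string buf (Lexing.lexeme lexbuf);
      raw_literal delim buf lexbuf
    }
  | ')' (delimiter as delim2) '"'
    {
      (* Candidate terminator: accept it only if the delimiters agree. *)
      if delim = delim2 then
        RAW_LIT (Buffer.contents buf)
      else begin
        Buffer.add_string buf (Lexing.lexeme lexbuf);
        raw_literal delim buf lexbuf
      end
    }
  | ')'
    {
      (* A ')' that does not start a terminator is ordinary content. *)
      Buffer.add_char buf ')';
      raw_literal delim buf lexbuf
    }
  | eof
    {
      (* Unterminated literal; raise whatever error fits your lexer. *)
      failwith "unterminated raw string literal"
    }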

I use a buffer to accumulate what we’ve seen so far, since we want a single token at the end. There may be another way, but I don’t know of one.
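Since the chunks appended to the buffer can span several lines, one piece of the tuning mentioned above is keeping the lexer's line counter in sync. A minimal sketch, assuming you rely on the standard Lexing positions: track_newlines and line_after are hypothetical names for this illustration, while Lexing.new_line and Lexing.from_string are the standard library functions.

```ocaml
(* Hypothetical helper: advance the line counter of [lexbuf] once per
   newline contained in [chunk], using the standard Lexing.new_line. *)
let track_newlines (lexbuf : Lexing.lexbuf) (chunk : string) : unit =
  String.iter (fun c -> if c = '\n' then Lexing.new_line lexbuf) chunk

(* Small demonstration outside of any generated lexer: the line number
   recorded in a fresh lexbuf after "seeing" [chunk]. *)
let line_after (chunk : string) : int =
  let lexbuf = Lexing.from_string "" in
  track_newlines lexbuf chunk;
  lexbuf.Lexing.lex_curr_p.Lexing.pos_lnum
```

In the raw_literal actions this would be called as track_newlines lexbuf (Lexing.lexeme lexbuf) before appending to the buffer. Note that Lexing.new_line sets the beginning-of-line offset to the current position, so line numbers come out right but column information after a multi-line chunk is only approximate unless you set pos_bol yourself.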

It sounds easier to me to handle the type prefixes in the parser; that way you can share the handling with other kinds of strings, so I didn’t include them. Maybe I’m wrong.

Hope this helps. There’s a nice tutorial about ocamllex here.


Hello “threepwood”,

Thank you very much for your detailed reply. It works very well!