Recommendations for parsing structured text files with specific annotations/commands

Hello,

I would like to parse structured text files containing some specific commands, in order to obtain an OCaml sum type that I will use in my program. For now, I’m using text files in Markdown format with additional LaTeX-like commands (e.g. \mycommand{arg1, arg2}).

To parse the Markdown I plan to use omd, with some regexps to parse my commands.

Do you have any recommendations regarding libraries/frameworks to use for such a job? Has anybody done similar things?

I’m not especially tied to Markdown or LaTeX-like commands and will happily switch to another format if a library already provides the necessary parsing.

Best regards,
david

I would look into Angstrom. It is very flexible and combines traditional scanning and parsing. Classic scanning/parsing using lex/yacc only works with languages that are designed that way. Many real-world languages don’t separate scanning and parsing well enough.

I tend to use MParser when I want to retain line number information, but Angstrom is perfectly good indeed.

If you can arrange for prefix/suffix strings to come from some well-understood fixed and small set, and that they are not present in the text of your specific commands, I’d go with regexp to pull out your specific commands as strings, and then … well, anything you want to parse them.

If your example \mycommand{arg1,arg2} is an example of what you need to parse, then it’s harder, because “}” is too common. But perhaps if you just scan for matching braces, you can use that to find where the command ends.
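
To make the brace-matching idea concrete, here is a minimal sketch using only the standard library. The name find_closing_brace is made up for this example, and it assumes braces are never escaped (no \{ or \}) inside a command:

```ocaml
(* Hypothetical helper, not from any library.
   Return the index of the '}' that closes the '{' at position [start],
   or None if [start] is not a '{' or the braces are unbalanced.
   Assumption: braces are never escaped inside a command. *)
let find_closing_brace (s : string) (start : int) : int option =
  let n = String.length s in
  if start < 0 || start >= n || s.[start] <> '{' then None
  else
    (* [depth] counts the braces opened but not yet closed. *)
    let rec go i depth =
      if i >= n then None
      else
        match s.[i] with
        | '{' -> go (i + 1) (depth + 1)
        | '}' -> if depth = 1 then Some i else go (i + 1) (depth - 1)
        | _ -> go (i + 1) depth
    in
    go (start + 1) 1
```

Once you know where the opening brace of a command is, this tells you where the command ends even if an argument itself contains a braced group.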

What I’m saying is: it can be a lot of trouble to parse your entire file using some parser, only to extract some little bits. In your case, that would be running a markdown parser. So if there’s a way to regard the file as just text, and find your particular strings in some other way, that can be effective.
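
As a rough sketch of treating the file as plain text, something like the following splits a string into text runs and commands without any Markdown parser. The segment type and the function names are invented for this example. It assumes command names are ASCII letters only and that arguments contain neither braces nor commas:

```ocaml
(* Hypothetical sum type for this sketch. *)
type segment =
  | Text of string
  | Command of string * string list  (* name, arguments *)

(* Assumption: command names consist of ASCII letters only. *)
let is_name_char c = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')

(* Try to read a command whose backslash is at index [i].
   On success, return the command and the index just past its '}'.
   Assumption: arguments contain no braces and no commas. *)
let command_at (s : string) (i : int) : (segment * int) option =
  let n = String.length s in
  let rec name_end j =
    if j < n && is_name_char s.[j] then name_end (j + 1) else j
  in
  let j = name_end (i + 1) in
  if j = i + 1 || j >= n || s.[j] <> '{' then None
  else
    match String.index_from_opt s j '}' with
    | None -> None
    | Some k ->
      let name = String.sub s (i + 1) (j - i - 1) in
      let raw = String.sub s (j + 1) (k - j - 1) in
      let args =
        if String.trim raw = "" then []
        else List.map String.trim (String.split_on_char ',' raw)
      in
      Some (Command (name, args), k + 1)

(* Split [s] into plain-text runs and commands, in document order. *)
let segments (s : string) : segment list =
  let n = String.length s in
  (* Emit the pending text run [start, stop) if it is non-empty. *)
  let flush start stop acc =
    if stop > start then Text (String.sub s start (stop - start)) :: acc
    else acc
  in
  (* [start] is where the pending text run began; [i] is the cursor. *)
  let rec go start i acc =
    if i >= n then List.rev (flush start n acc)
    else if s.[i] = '\\' then begin
      match command_at s i with
      | Some (cmd, next) -> go next next (cmd :: flush start i acc)
      | None -> go start (i + 1) acc
    end
    else go start (i + 1) acc
  in
  go 0 0 []
```

For instance, segments "See \\mycommand{arg1, arg2} here." gives [Text "See "; Command ("mycommand", ["arg1"; "arg2"]); Text " here."]. The Text pieces can then be handed to a Markdown library if needed, or left alone.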

At some point I had a similar idea of porting my blog to the Scribble language, which looks like this:

1 Getting Started (racket-lang.org)

#lang scribble/base
 
@title{On the Cookie-Eating Habits of Mice}
 
If you give a mouse a cookie, he's going to ask for a
glass of milk.
 
@include-section["milk.scrbl"]
@include-section["straw.scrbl"]

Of course, to do that, I started with writing a parser for Scribble in OCaml. And to do that I—of course—started with making a parser combinator library… I got quite far, but abandoned it in favor of using Pandoc.

Here’s my incomplete Scribble parser, if you’re looking for inspiration: new.keleshev.com/scribble at master · keleshev/new.keleshev.com · GitHub

Thank you @Chet_Murthy for pointing out that the commands should be different enough from the regular text to be easy to parse! I think this is my case but I’ll keep that in mind.

Thank you @lindig, @darrenldl and @keleshev for your suggestions. They seem a bit overkill to me for now, but I’ll keep them (Angstrom, MParser) in mind in case my first approach does not work.
