Does anyone know of an OCaml library to parse Scala programs? I’ve searched a bit on the Web but I could not even find a Yacc grammar for Scala (there’s an official EBNF spec, but not sure how easy it is to translate in a working LALR(1) grammar).
Is there are reason you need a library in OCaml? I would expect most Scala parsing tools to be written in Scala.
I can’t find any existing libraries, so if you really need to do it in OCaml, I think these are your options:
Write JNI bindings to an existing JVM library for parsing Scala. I’d imagine that the Scala compiler exposes its parser as a library. There are probably some other Scala projects that can parse Scala code too. You can use camljava to use the JNI from OCaml.
Use the ANTLR4 Scala grammar and generate C++ code. You will have to write a lot of binding code to make it usable from OCaml.
Write the parser in your OCaml yourself. I don’t think this is very easy, but neither are the other options.
I wrote a library a while ago for making Earley parsers (I think they’re pretty underrated). If you want to use it I would warn you that it is not polished and contains at least one bug. You could also try writing a specialised one for the Scala grammar, which could probably be nicer in terms of typing. Not sure how the difficulty of that would compare to (a) writing your own with a parser generator like menhir or (b) writing your own with a parser combinator library like angstrom.
This is a long shot attempt.
Look through GitHub - scalafiddle/scalafiddle-core: Source code for the scalafiddle.io compilation service to see how it accesses scala compiler from scalajs. Hack it to expose a scala parser, if it’s exposed from compiler. Define a function for converting the tree to Json. Compile it for nodejs. Use jsoo to invoke the exposed functionality. Define deserializer in ocaml.
Cool stuff. How does it differ from the other early parser ocaml libraries:
- GitHub - rlepigre/ocaml-earley: Parsing library based on Earley Algorithm
- GitHub - tomjridge/tjr_simple_earley: Simple implementation of a parsing algorithm inspired by Earley
One thing I like about yacc/menhir is that you have some static checks that your grammar is unambigous. Can you have the same kind of guarantees with your tool?
Oh cool, I wasn’t aware of those! They definitely look more polished than mine! I don’t think you do get guarantees like that; one major use of Earley parsers is in NLP where the grammars usually are ambiguous (and you want to get all parses or something). But the nice thing about them is that you can get a parser pretty directly from the BNF (my library consumes RFC5234 ABNF) so as long as you trust whoever wrote the spec you’re probably OK.
I did this in 2016 I think. It resulted in a patch to clean up the grammar (my single contribution to the Scala project):
Unfortunately I cannot now locate the parser code
I do remember that the grammar was surprisingly nice. Probably this doesn’t help much.
Try harder @thomas_ridge ! Also the lexer for Scala seems to not be trivial, especially how it must handle newlines. Do you remember troubles writing the lexer?
I asked a former Scala dev, he suggests the following:
Maybe have a look at scalaparse, a subproject of FastParse a combinator parser library.
It seems scalaparse is fairly good and complete but also reasonably simple (~700 loc) so it could serve as a starting point or serve as some kind of specification.
If you’re willing to write some Scala code (which might be easy-to-write, or there might even be a tool, there might be a way to convert a Scala parse-tree into … JSON? s-expressions? something that is easy-to-parse unambiguously.
Can you explain a little about what your goal is? Maybe there’s a way to avoid most of the parsing burden? I once needed to do some aspect-oriented code-injection for Java, and was able to get by with just lexing plus parenthesis (well, parens, brackets, braces) counting. And a regular expression to recognize method declaration headers. It was a hairy Perl regexp, but really, so much easier than writing a Java parser.
There is a maintained tree-sitter grammar for Scala and Tree-Sitter is the pure C library, so you could just use FFI to call it.
You could even create a generic OCaml binding for the Tree-Sitter and publish it to the official GitHub organization. Good examples in similar languages are:
the people from onivim wrote some bindings oni2/src/reason-tree-sitter at master · onivim/oni2 · GitHub
Yes, I know about tree-sitter-scala, @mjambon actually extended reason-tree-sitter and wrote a binding for tree-sitter ( GitHub - returntocorp/ocaml-tree-sitter: Generate OCaml parsers based on tree-sitter grammars. ) but in our experiments tree-sitter-scala could not parse most scala projects out there, there was a 85% parsing success rate. We could extend tree-sitter-scala, and we tried a little, but the grammar in tree-sitter-scala differs quite a bit from the official EBNF spec, so I was wondering if someone wrote a working, complete, scala parser in OCaml. Apparently someone did, @thomas_ridge, but he lost the code … I guess scalaparse from fastparse mentioned before could be a good starting point. Thanks everybody!
I just came across ocaml-tree-sitter the other day. How battle hardened are the tree-sitter parsers currently? The last time I played around with some of them they were not complete. Clearly the original purpose of tree-sitter was for a better text editor experience. My guess is that being 85% complete is probably enough to greatly improve text edition but isn’t so great if you are trying to use it to parse and generate code. The last few percent are the hard miles in my experience when trying to create a complete parser for a language… all the weird and wonderful corner cases rear their heads.
It depends on the language and how much work was put into it. One nice thing about tree-sitter is error recovery. Most often, parsing errors due to unsupported syntax can be skipped over, affecting only a small region of the source file. This is useful for code browsing and code search. With error recovery, a given file could be parsed with 90% success rather than 0%. That’s how we get to 99.8% success overall for typescript.
Even tree-sitter grammars that are 100% complete are not necessarily usable for code generation. tree-sitter-ocaml parses all valid OCaml code (as far as I know, there may be bugs that I didn’t find yet), but it parses some invalid code too (which can be a problem for some applications), and may attach attributes to wrong nodes.