Parser for the Scala programming language?

aryx · March 26, 2021, 12:12pm

Does anyone know of an OCaml library to parse Scala programs? I’ve searched a bit on the Web but I could not even find a Yacc grammar for Scala (there’s an official EBNF spec, but not sure how easy it is to translate in a working LALR(1) grammar).

mnxn · March 26, 2021, 4:01pm

Is there are reason you need a library in OCaml? I would expect most Scala parsing tools to be written in Scala.

I can’t find any existing libraries, so if you really need to do it in OCaml, I think these are your options:

Write JNI bindings to an existing JVM library for parsing Scala. I’d imagine that the Scala compiler exposes its parser as a library. There are probably some other Scala projects that can parse Scala code too. You can use camljava to use the JNI from OCaml.
Use the ANTLR4 Scala grammar and generate C++ code. You will have to write a lot of binding code to make it usable from OCaml.
Write the parser in your OCaml yourself. I don’t think this is very easy, but neither are the other options.

roddy · March 27, 2021, 1:37am

I wrote a library a while ago for making Earley parsers (I think they’re pretty underrated). If you want to use it I would warn you that it is not polished and contains at least one bug. You could also try writing a specialised one for the Scala grammar, which could probably be nicer in terms of typing. Not sure how the difficulty of that would compare to (a) writing your own with a parser generator like menhir or (b) writing your own with a parser combinator library like angstrom.

kanishka · March 27, 2021, 2:20am

This is a long shot attempt.

Look through GitHub - scalafiddle/scalafiddle-core: Source code for the scalafiddle.io compilation service to see how it accesses scala compiler from scalajs. Hack it to expose a scala parser, if it’s exposed from compiler. Define a function for converting the tree to Json. Compile it for nodejs. Use jsoo to invoke the exposed functionality. Define deserializer in ocaml.

aryx · March 27, 2021, 9:53am

Cool stuff. How does it differ from the other early parser ocaml libraries:

GitHub - rlepigre/ocaml-earley: Parsing library based on Earley Algorithm
GitHub - tomjridge/tjr_simple_earley: Simple implementation of a parsing algorithm inspired by Earley

One thing I like about yacc/menhir is that you have some static checks that your grammar is unambigous. Can you have the same kind of guarantees with your tool?

roddy · March 27, 2021, 12:20pm

Oh cool, I wasn’t aware of those! They definitely look more polished than mine! I don’t think you do get guarantees like that; one major use of Earley parsers is in NLP where the grammars usually are ambiguous (and you want to get all parses or something). But the nice thing about them is that you can get a parser pretty directly from the BNF (my library consumes RFC5234 ABNF) so as long as you trust whoever wrote the spec you’re probably OK.

thomas_ridge · March 27, 2021, 2:08pm

I did this in 2016 I think. It resulted in a patch to clean up the grammar (my single contribution to the Scala project):

Unfortunately I cannot now locate the parser code

I do remember that the grammar was surprisingly nice. Probably this doesn’t help much.

aryx · March 27, 2021, 3:28pm

Try harder @thomas_ridge ! Also the lexer for Scala seems to not be trivial, especially how it must handle newlines. Do you remember troubles writing the lexer?

dbuenzli · March 27, 2021, 5:11pm

I asked a former Scala dev, he suggests the following:

Maybe have a look at scalaparse, a subproject of FastParse a combinator parser library.

It seems scalaparse is fairly good and complete but also reasonably simple (~700 loc) so it could serve as a starting point or serve as some kind of specification.

HTH.

Chet_Murthy · March 28, 2021, 4:55pm

If you’re willing to write some Scala code (which might be easy-to-write, or there might even be a tool, there might be a way to convert a Scala parse-tree into … JSON? s-expressions? something that is easy-to-parse unambiguously.

Can you explain a little about what your goal is? Maybe there’s a way to avoid most of the parsing burden? I once needed to do some aspect-oriented code-injection for Java, and was able to get by with just lexing plus parenthesis (well, parens, brackets, braces) counting. And a regular expression to recognize method declaration headers. It was a hairy Perl regexp, but really, so much easier than writing a Java parser.

XVilka · March 29, 2021, 5:50am

There is a maintained tree-sitter grammar for Scala and Tree-Sitter is the pure C library, so you could just use FFI to call it.
You could even create a generic OCaml binding for the Tree-Sitter and publish it to the official GitHub organization. Good examples in similar languages are:

Khady · March 29, 2021, 6:21am

the people from onivim wrote some bindings oni2/src/reason-tree-sitter at master · onivim/oni2 · GitHub

aryx · March 29, 2021, 7:03am

Yes, I know about tree-sitter-scala, @mjambon actually extended reason-tree-sitter and wrote a binding for tree-sitter ( GitHub - returntocorp/ocaml-tree-sitter: Generate OCaml parsers based on tree-sitter grammars. ) but in our experiments tree-sitter-scala could not parse most scala projects out there, there was a 85% parsing success rate. We could extend tree-sitter-scala, and we tried a little, but the grammar in tree-sitter-scala differs quite a bit from the official EBNF spec, so I was wondering if someone wrote a working, complete, scala parser in OCaml. Apparently someone did, @thomas_ridge, but he lost the code … I guess scalaparse from fastparse mentioned before could be a good starting point. Thanks everybody!

Jason_Nielsen · March 29, 2021, 6:03pm

I just came across ocaml-tree-sitter the other day. How battle hardened are the tree-sitter parsers currently? The last time I played around with some of them they were not complete. Clearly the original purpose of tree-sitter was for a better text editor experience. My guess is that being 85% complete is probably enough to greatly improve text edition but isn’t so great if you are trying to use it to parse and generate code. The last few percent are the hard miles in my experience when trying to create a complete parser for a language… all the weird and wonderful corner cases rear their heads.

mjambon · March 29, 2021, 7:36pm

It depends on the language and how much work was put into it. One nice thing about tree-sitter is error recovery. Most often, parsing errors due to unsupported syntax can be skipped over, affecting only a small region of the source file. This is useful for code browsing and code search. With error recovery, a given file could be parsed with 90% success rather than 0%. That’s how we get to 99.8% success overall for typescript.

314eter · March 29, 2021, 7:56pm

Even tree-sitter grammars that are 100% complete are not necessarily usable for code generation. tree-sitter-ocaml parses all valid OCaml code (as far as I know, there may be bugs that I didn’t find yet), but it parses some invalid code too (which can be a problem for some applications), and may attach attributes to wrong nodes.

aryx · May 11, 2021, 1:11pm

I ended up porting the recursive descent parser in the Scala compiler to OCaml …
I think it was the fastest way to get a working parser from OCaml …

github.com

returntocorp/pfff/blob/develop/lang_scala/parsing/Parser_scala_recursive_descent.ml

(* Yoann Padioleau
 *
 * Copyright (C) 2021 R2C
 *
 * This program is free software; you can redistribute it and/or
 * modify it under the terms of the GNU General Public License (GPL)
 * version 2 as published by the Free Software Foundation.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * file license.txt for more details.
*)
open Common
module T = Token_scala
module TH = Token_helpers_scala
module Flag = Flag_parsing

open Token_scala
(*open AST_scala*)

This file has been truncated. show original

Topic		Replies	Views
Looking for an OCaml parser in ANTLR or JFlex Ecosystem	6	1881	April 16, 2021
OCaml Parser combinator library as powerful as fastparse(Scala)? Ecosystem	27	4222	November 25, 2022
My way from LALR parsing to combinator parsing Community	6	2743	March 12, 2021
I created an OCaml grammar for ANTLR4 (Earley parser compatible) Community	0	153	March 10, 2025
Rust parser in OCaml? Learning	10	2079	October 1, 2020

Parser for the Scala programming language?

Related topics