Parsing shell commands

Hey everyone,

As a first project, I’m trying to write a simple shell in OCaml. I want to be able to parse commands that look like this:

ENV_VAR=value ENV_VAR=value command arg1 arg2

I’ve written a menhir/ocamlyacc grammar that parses commands and arguments just fine, but fails when they are preceded by environment variables. Here is the relevant parser code:

%{
  type expr = Empty | Expr of (string * string) list * string * string list
%}

%token <string> WORD
%token <string> KEY
%token EQ
%token EOL

%start main
%type <(expr, string) result> main

%%

main: expr EOL { Ok $1 }
  | error EOL  { Error "Parse error" }
;

expr:              { Empty }
  | WORD           { Expr ([], $1, []) }
  | WORD args      { Expr ([], $1, $2) }
  | env WORD       { Expr ($1, $2, []) }
  | env WORD args  { Expr ($1, $2, $3) }
;

/* ...arg parsing... */

env: var     { [$1] }
  | env var  { $2 :: $1 }
;

var: KEY EQ WORD { ($1, $3) }
;

Here is the relevant lexer rule:

{
  open Parser
  exception Eof
  exception SyntaxError
}

let char = [^ '\\' ' ' '\t' '\n' '|' '=']
let alpha = ['a'-'z' 'A'-'Z']
let keyChar = ['a'-'z' 'A'-'Z' '0'-'9' '_']

rule token = parse
  | [' ' '\t']          { token lexbuf }
  | '\n'                { EOL }
  | '='                 { EQ }
  | char+ as w          { WORD (w) }
  | alpha keyChar* as w { KEY (w) }
  | eof                 { raise Eof }
  | _                   { raise SyntaxError }

Adding a separator between environment variables and the rest of the expression didn’t seem to solve the problem, so my understanding is that the parser (or perhaps the lexer, in this case) cannot distinguish between KEYs and WORDs in certain cases.

Is there a better way to approach this problem? Any advice or insight is appreciated.

There are a lot of shells out there in the wild, and most of 'em OSS. I’d go dig up a few, say, bash, csh, zsh, ash, and have a quick look at how they parse: which formalisms they use (LALR? LL? hand-coded recursive descent (hope not!)), and maybe I could even steal some code, if they use LALR.

At least, that’s how I’d start.

ETA: it won’t help with the “how do I design the AST & evaluator” question, but it might help with parsing and lexing.

There is at least one problem in the lexer: the char+ rule matches a superset of alpha keyChar* and appears first, so KEY tokens are never produced — every key-like word is swallowed by the WORD rule.
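Since every KEY is lexically also a valid WORD, reordering the lexer rules alone won’t fix it (longest-match still prefers char+ for inputs like `foo-bar`). What real shells do instead is lex everything as plain words and classify `NAME=VALUE` forms contextually: a word counts as an assignment only while it precedes the command name (POSIX “assignment word” recognition). A minimal self-contained sketch of that classification pass, assuming whitespace-split words (all names here are made up, not from your grammar):

```ocaml
type token = Assign of string * string | Word of string

(* A valid variable name: a letter followed by letters, digits, or '_'. *)
let is_name s =
  String.length s > 0
  && (match s.[0] with 'a'..'z' | 'A'..'Z' -> true | _ -> false)
  && String.for_all
       (function 'a'..'z' | 'A'..'Z' | '0'..'9' | '_' -> true | _ -> false)
       s

(* Classify raw words: NAME=VALUE is an assignment only before the first
   non-assignment word (the command name); after that, '=' is literal. *)
let classify words =
  let rec go before_cmd = function
    | [] -> []
    | w :: rest ->
      (match (if before_cmd then String.index_opt w '=' else None) with
       | Some i when is_name (String.sub w 0 i) ->
         Assign (String.sub w 0 i,
                 String.sub w (i + 1) (String.length w - i - 1))
         :: go true rest
       | _ -> Word w :: go false rest)
  in
  go true words

let () =
  assert (classify ["FOO=1"; "BAR=2"; "echo"; "X=3"]
          = [Assign ("FOO", "1"); Assign ("BAR", "2");
             Word "echo"; Word "X=3"])
```

With assignments recognized before parsing proper, the grammar no longer needs to distinguish KEY from WORD at all; the parser just consumes a (possibly empty) run of assignment tokens followed by words.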

There is also a grammar-based parser in Cicada, a POSIX shell written in Rust. It might be helpful for your project to check its internals too.

This is a bit off-topic since @AjayMT is probably not aiming for a fully POSIX-compatible shell, but I should mention the CoLiS project (currently ongoing), which focuses on building tools for static analysis of shell scripts (install scripts from Debian packages being the main target).

One achievement of the project is Morbig, a “trustworthy shell parser” implemented in OCaml, which parses POSIX shell statically and has been validated against the specification.
They manage to use a menhir-generated parser to do the bulk of the work, and then use menhir’s incremental API to instrument the parser in order to handle the unorthodox parts of the specification.
The companion paper is quite a fun read: https://niols.fr/materials/papers/sle18.pdf

@Chet_Murthy you’d be surprised… To quote the paper above:

The parser of Dash 0.5.7 is made of 1569 hand-crafted lines of C code. This parser is hard to understand because it is implemented by low-level mechanisms that are difficult to relate to the high-level specification of the POSIX standard: for example, lexing functions are implemented by means of gotos and complex character-level manipulations; the parsing state is encoded using activation and deactivation of bit fields in one global variable; some speculative parsing is done by allowing the parser to read the input tokens several times, etc.

In fact, POSIX shell is very challenging to parse statically because lexing may depend on parsing, or even on evaluation (!) in full generality (because of aliases)… The difficulties are well described in the paper.
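To see concretely why aliases make parsing depend on evaluation, consider a sketch like the following (in bash, which disables alias expansion in non-interactive shells unless `expand_aliases` is set): whether `loop` starts a `for` construct or names an ordinary command cannot be decided without first *executing* the alias definition.

```shell
#!/usr/bin/env bash
shopt -s expand_aliases   # bash turns aliases off in scripts by default
alias loop='for i in 1 2; do'
# The next line only parses because the alias above has already run;
# it expands to: for i in 1 2; do echo "$i"; done
loop echo "$i"; done
```

A static parser never runs the `alias` command, so in full generality it cannot know how to tokenize the lines that follow it.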


There is also ShellCheck for static analysis of shell scripts. A production-ready tool, but written in Haskell.

Interesting. I wonder if Dash has a claim to fame of being really low-overhead? I know that back in the day, the Yacc authors (or maybe somebody else) showed that they could achieve some speedup by (in FP terms) partially evaluating the stack automaton and actions with the tables, resulting in a massive goto-nest. One presumes that the same thing is true of lexing.

In any case, it seems that bash itself uses a Yacc grammar. I’m not at a real machine, so can’t check what its lexer is written in/with.

The reason people use ash/dash for scripts instead of bash is dramatically better execution time. I suspect, though, that parsing overhead is a small part of that.

Agreed. And I can find no “breakdown” of the reasons for which dash is faster than bash. Which, as an engineer, I find “interesting”, but hey, I get it: software -is-. Ah, well.

FWIW, it’s clear that dash is wildly faster than bash. Wildly.

Dash is just a modified version of ash, FYI. It wasn’t intended to be a faster shell as such; it was written to be an unencumbered reimplementation of the Bourne shell in an era when few such things existed. It’s used as /bin/sh on the various BSDs. Debian forked it and modified it.