Is OCaml vulnerable to https://trojansource.codes/?

Described on that page:

The trick is to use Unicode control characters to reorder tokens in source code at the encoding level. … These visually reordered tokens can be used to display logic that, while semantically correct, diverges from the logic presented by the logical ordering of source code tokens. … Compilers and interpreters adhere to the logical ordering of source code, not the visual order.

Can the OCaml parser be tricked into doing this with Unicode strings in the source code?

More info: explanation of how this was fixed in Rust: Security advisory for rustc (CVE-2021-42574) | Rust Blog

4 Likes

This is more an editor issue than a language vulnerability. There is no language features nor parser involved, it is a matter of editors following the unicode semantics to display

"left-to-right-text\u{202E} small right-to-left text \u{2066} a left-to-right isolate text\u{2069}  back to right-to-left \u{2066}"

as

"left-to-right-text‮ small right-to-left text ⁦ a left-to-right isolate text⁩  back to right-to-left ⁦"

Any language that allow unicode string literal with a single delimiter is concerned.
(in particular, the issue is visible on

{|left-to-right-text‮ small right-to-left text ⁦ a left-to-right isolate text⁩  back to right-to-left|} 

)
It would be doable to emits a warning for all string literals that contain some “graphical control characters” but I am not sure if this is warranted.

I am also wondering if the the hieroglyph control character could also be theoretically abused to obtain a similar effect (if there is anyone that has implemented them).

Well the language feature involved is Unicode string literals and comments.

Here’s a repro of fig. 5-6 from the paper for OCaml

cat - > a.ml <<EOF
let access_level = "user"
let main () =
  if access_level <> "user<U+202E><U+2066> (* Check if admin *)<U+2069><U+2066>" then
    Printf.printf "You are an admin.\n"

let () = main ()
EOF
ocamlc a.ml
./a.out
You are an admin.

I wonder how practical such attacks can be. But OTOH what rust decided to do, i.e. it seems to warn on occurences of the problematic code points is not entirely unreasonable.

Checking comments and string literals for these shouldn’t be too hard now that we have UTF-8 decoders in the Stdlib :–)

9 Likes

Especially since the level of support for these Unicode features remains uneven so it’s likely someone will discover – e.g. cut and pasting the example above in my terminal reveals something weird is happening, as does git diff and the example doesn’t work in Safari – but it does in all other three major browsers.

Also impacts VSCode, which seems like a big problem:

Indeed, that is a problem.

However, if one is using gopcaml-mode in GNU Emacs, then the structural highlights clearly show the sneaky unicode trickery for what it is.
2021-11-02-124836_3840x1080_scrot

I guess the message is clear: if you care about security, then proprietary editors like VScode just aren’t up to scratch, and GNU Emacs is the way forward!

3 Likes

It is interesting to see that syntactic highlighting is giving away the trick since (* Check if admin *) is colored with the string literal color:

(* This is really a comment *)
if access_level <> "user‮⁦ (* This is not a comment *)⁩⁦" then

OT: I think VScode is free software nowadays.

Anyways according to the paper (see table II appendix I) you should be using vim on Windows® :–)

Indeed that’s not mentioned in the paper. But OTOH when playing with the example I personally didn’t notice.

Also we now maybe have a good argument to justify that OCaml has no curly braces.

For those people who are interested in how all this manifests itself on github, I made a gist with the example above.

The first commit has no vulnerability, the second one introduces it. So you get a big warning in the diff and on the file that there are hidden bidi characters being used.

See for yourself here. Github’s announcement of the feature.

8 Likes

Confirmation of @octachron 's remark about the coloring of (* Check if admin *).

OCaml 4.13.1+flambda - Emacs 29.0.5 with Tuareg mode &
the Modus-vivendi theme (options ‘yellow-comments’ and 'green-strings)

I think research!rsc: On “Trojan Source” Attacks explains the essential point. Although the same trick can be applicable to OCaml programs, this is not a issue of compilers, but a issue of review processes.

I think it is right to say that one should not freak out about this and that it’s being a tad overblown. But I disagree with the idea that this is not a language issue and nothing should be done about it.

First it is easy to design a language which is not affected by the issue so claiming it’s not a language issue is a bit dubious.

Second disallowing some Unicode code points or allow some of them only in escaped form for security reasons is routinely done in the specification of file formats (an obvious example being U+0000 which always makes your C strings happy). I don’t really see a strong argument on why we shouldn’t strive to do the same in the file formats in which our programs are stored, especially when the mecanisms in cause may be of little use to the programming practice, except for obfuscated code contests.

4 Likes