The trick is to use Unicode control characters to reorder tokens in source code at the encoding level. … These visually reordered tokens can be used to display logic that, while semantically correct, diverges from the logic presented by the logical ordering of source code tokens. … Compilers and interpreters adhere to the logical ordering of source code, not the visual order.
Can the OCaml parser be tricked into doing this with Unicode strings in the source code?
This is more an editor issue than a language vulnerability. No language feature or parser is involved; it is a matter of editors following Unicode semantics to display
"left-to-right-text\u{202E} small right-to-left text \u{2066} a left-to-right isolate text\u{2069} back to right-to-left \u{2066}"
as
"left-to-right-text small right-to-left text a left-to-right isolate text back to right-to-left "
Any language that allows Unicode string literals with a single delimiter is affected.
(in particular, the issue is visible on
{|left-to-right-text small right-to-left text a left-to-right isolate text back to right-to-left|}
)
It would be doable to emit a warning for all string literals that contain some “graphical control characters”, but I am not sure this is warranted.
I am also wondering whether the hieroglyph control characters could theoretically be abused to obtain a similar effect (if anyone has implemented them).
Well, the language features involved are Unicode string literals and comments.
Here’s a repro of fig. 5-6 from the paper for OCaml
cat - > a.ml <<EOF
let access_level = "user"
let main () =
  if access_level <> "user<U+202E><U+2066> (* Check if admin *)<U+2069><U+2066>" then
    Printf.printf "You are an admin.\n"
let () = main ()
EOF
ocamlc a.ml
./a.out
You are an admin.
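To see why the comparison fails, it can help to look at the bytes the compiler actually receives. This is only a minimal sketch: it assumes the literal holds the four code points shown in the repro, written with OCaml's `\u{...}` escapes so that nothing is hidden by the editor. The name `disguised` is made up for the illustration.

```ocaml
(* The literal from the repro, with the bidi controls written as escapes. *)
let disguised = "user\u{202E}\u{2066} (* Check if admin *)\u{2069}\u{2066}"

let () =
  (* In logical order the apparent comment is part of the string,
     so the literal is not equal to the plain user string. *)
  assert (disguised <> "user");
  (* String.escaped renders every non-ASCII byte as a decimal \ddd escape,
     which makes the hidden characters visible. *)
  print_endline (String.escaped disguised)
```

Each hidden control character is three bytes in UTF-8, so it shows up as three `\ddd` escapes in the printed line.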
I wonder how practical such attacks can be. But OTOH what Rust decided to do, which it seems is to warn on occurrences of the problematic code points, is not entirely unreasonable.
Checking comments and string literals for these shouldn’t be too hard now that we have UTF-8 decoders in the Stdlib :–)
Especially since the level of support for these Unicode features remains uneven, so it’s likely someone will notice: e.g. cutting and pasting the example above into my terminal reveals that something weird is happening, as does git diff, and the example doesn’t work in Safari, though it does in the other three major browsers.
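The check suggested above could look roughly like the following. This is a sketch, not what any compiler does: it scans a whole UTF-8 string rather than only comments and string literals, the function names are hypothetical, and the set of flagged code points (the embedding, override and isolate controls, U+202A to U+202E and U+2066 to U+2069) is an assumption about what one would want to warn on. It relies on `String.get_utf_8_uchar`, available in the Stdlib from OCaml 4.14.

```ocaml
(* Bidirectional formatting controls: U+202A..U+202E (embeddings,
   overrides, pop) and U+2066..U+2069 (isolates, pop isolate). *)
let is_bidi_control (u : Uchar.t) : bool =
  let c = Uchar.to_int u in
  (c >= 0x202A && c <= 0x202E) || (c >= 0x2066 && c <= 0x2069)

(* Return the (byte offset, code point) of every bidi control found in
   the UTF-8 encoded string [s]. Invalid byte sequences decode to
   U+FFFD and are skipped over. *)
let find_bidi_controls (s : string) : (int * int) list =
  let len = String.length s in
  let rec loop i acc =
    if i >= len then List.rev acc
    else
      let d = String.get_utf_8_uchar s i in
      let u = Uchar.utf_decode_uchar d in
      let acc =
        if is_bidi_control u then (i, Uchar.to_int u) :: acc else acc
      in
      loop (i + Uchar.utf_decode_length d) acc
  in
  loop 0 []

let () =
  find_bidi_controls "user\u{202E}\u{2066}"
  |> List.iter (fun (off, cp) -> Printf.printf "byte %d: U+%04X\n" off cp)
```

Run on the short string above, this reports U+202E at byte 4 and U+2066 at byte 7. A real warning would presumably be wired into the lexer so that it only fires inside literals and comments.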
However, if one is using gopcaml-mode in GNU Emacs, then the structural highlights clearly show the sneaky unicode trickery for what it is.
I guess the message is clear: if you care about security, then proprietary editors like VScode just aren’t up to scratch, and GNU Emacs is the way forward!
For those who are interested in how all this manifests itself on GitHub, I made a gist with the example above.
The first commit has no vulnerability, the second one introduces it. So you get a big warning in the diff and on the file that there are hidden bidi characters being used.
I think research!rsc: On “Trojan Source” Attacks explains the essential point. Although the same trick can be applied to OCaml programs, this is not an issue of compilers but an issue of review processes.
I think it is right to say that one should not freak out about this and that it’s being a tad overblown. But I disagree with the idea that this is not a language issue and nothing should be done about it.
First, it is easy to design a language that is not affected by the issue, so claiming it’s not a language issue is a bit dubious.
Second, disallowing some Unicode code points, or allowing some of them only in escaped form, for security reasons is routinely done in the specification of file formats (an obvious example being U+0000, which always makes your C strings happy). I don’t really see a strong argument for why we shouldn’t strive to do the same in the file formats in which our programs are stored, especially when the mechanisms in question may be of little use to programming practice, except for obfuscated code contests.