Not to detract from your main point about Rust being massive, but this comparison is not entirely fair: the standard OCaml build uses a binary bootstrap too. An experimental source bootstrap exists, but it will add several hours to the total build time.
The “binary” bootstrap you are talking about is platform-independent bytecode.
@mneumann I never claimed otherwise. In the context of reproducibility, the meaningful distinction is whether it’s human-readable source code or not.
One supposes, in the spirit of the (justly) famous Ken Thompson “login.c” + “c compiler” backdoor he inserted into UNIX that didn’t show up in any sources: That Time Ken Thompson Wrote a Backdoor into the C Compiler
Yes, Thompson’s epic hack is a relevant case study. There’s more discussion about motivation at reproducible-builds.org. It’s not that we don’t trust the compiler developers, but they are targets for bad actors who want to steal credentials and execute supply chain attacks. Such attacks become that much easier when the repo changes are difficult to audit.
To any moderators reading this: may I request forking off the “reproducibility” comments into a separate thread?
Can’t we trust the in-source binary bootstrap if it’s in a commit signed by the developers? After all, we trust their signed commits anyway, so it doesn’t introduce any new attack surface.
I think the idea is that transparency is preferable to blind trust, and actual source code is the key to that transparency. It’s not that tricky source code couldn’t conceal a backdoor, but there are more people in the community who can read source than can decipher raw bytecode.
And since each iteration of the bootstrap is built from its predecessor, it’s not just about trusting a specific update; we need to trust the entire chain of bootstraps (as illustrated by the Ken Thompson link, to pick a less serious example).
an interesting write-up with a historical demo can be found on research!rsc: Running the “Reflections on Trusting Trust” Compiler
I’d assume that the binary bytecode bootstrap, which is part of the Ocaml git repo, is reproducible. This means that for the same OCaml compiler source code you’d get exactly the same binary bytecode as output. If this assumption holds true, you could just look at the OCaml compiler source code… or am I missing something?
I think that one challenge in these discussions is that people can mean different things when they say “reproducible”. For example:
Suppose that you’re a biologist who wants to reproduce a result by one of your colleagues. The experiment requires a reagent, but the ingredients aren’t disclosed. Your colleague makes some assertions about how the reagent works, and they send you a batch that they made themself. You produce results consistent with the original findings, and these results are identical over several trials.
In this scenario, even though the results were consistent, one could say that you didn’t reproduce anything, because you didn’t replicate an experimental input from known ingredients, nor can you characterize how it works without taking your colleague’s word.
Just because the binary output of a compiler is reproducible and deterministic does not mean that the compiler has not been subverted. As a consequence, you cannot deduce anything from the purity of the source code. That was the whole point of Thompson’s thought experiment. Once you have used a tainted compiler at some point in the past, you can never be sure of anything down the line. (Said otherwise, the only safe compiler is the compiler you do not use.)
one thing to note though: Ken Thompson’s thought experiment and demo is just that: a thought experiment and a demo. the tainted compiler detects that it is building itself based on name only. it’s impossible that a past c compiler could have foreseen what a modern compiler would be named or what its sources would look like.
and so you can use a tainted compiler that predates ocaml, use it to bootstrap ocaml, and be confident that the tainted compiler won’t successfully taint the built ocaml compiler.
we break the chains of self-reproduction of these poisoned compilers all the time by making new languages!
(i guess theoretically, someone could be watching the PL and compiler research space and mess with all the compilers that are used for bootstraping (c for ocaml and go, ocaml for rust, etc.). but it’s just conspirational thinking at that point.)
Making new languages does not help. Indeed, the x86 dialect has hardly changed the last few decades. So, instead of detecting some input language, you could instead use the output language as the criteria. If the tainted payload notices that the program is writing a sequence of bytes that corresponds to the x86 opcodes for a write system call, it could easily interfere with it and output itself along the way. Perhaps the tainted code will be unlucky and it will sometimes corrupt a .jpg image that happens to look like some x86 code. But most of the time, the tainted code will achieve its primary goal, that is, it will corrupt any new executable, which includes compiler, assembler, linker, etc.
Thompson was using the input language for his experiment, but it is even simpler if you are targetting the output language. At the time of Thompson, there were lots of output languages and few input languages, so targeting the input language was the most effective. Nowadays, it is the opposite; you have a bazillion of programming languages, but only three mainstream output languages. (While I have used x86 as an example, the same reasoning obviously holds for all the mainstream instruction sets.)
Not never: Thompson’s though experiment was followed up by Diverse Double-Compiling, also mentioned Cox’s article above, which is formally proven to either detect or counter a trusting-trust attack[1] on a given untrusted compiler, as long as you can access a different trusted compiler, and the untrusted one can produce reproducible builds.
There’re some edge-cases that require repeating the experiment for each new build, but those cases also make the attack harder to hide. ↩︎
Yes, and the “trusted” compiler can even be a slow, partial implementation that’s only adequate for building the other compiler. That’s essentially what camlboot is. (In its current form, it can only bootstrap older compilers - up to 4.07 I think.)
What makes it possible to “trust” camlboot (assuming that someone actually takes time for an audit) is that it’s fully source-bootstrapped . So in addition to validating the official compiler’s bytecode periodically with Diverse Double-Compilation, it is in theory possible to replace it.
[Edit: the camlboot authors did in fact use Diverse Double-Compilation to confirm the integrity of the OCaml 4.07 bootstrap.]
wonder if Ocaml was bootstrapped inside a FASM implementation.. Would compile down to x86-64 binary in minutes.
There is no such thing as a fully source-bootstrapped compiler nowadays. (There were some exceptions back in the past, since some processors were directly able to execute Forth, Lisp, or even Fortran code. But it has been decades since then.) For example, Camlboot relies on a precompiled Scheme compiler (or interpreter). You might trust that this Scheme compiler is not corrupted. And you might also trust that the static/dynamic linker of your computer is not corrupted either. And also your operating system. (Remember: Thompson’s ultimate goal was to corrupt the operating system, not just the compiler.) And all those other pieces that you need to trust. If any of these pieces were compiled by a corrupted compiler, how can you trust that the resulting OCaml bytecode is untainted?
Sure, the Camlboot project goes a very long way toward guaranteeing that the OCaml compiler is untainted. But believing that it goes all the way is naive.
By the way, are you aware that this “fully source-bootstrapped” compiler you are talking of contains a very large piece of automatically generated code that is impossible for a human to review? I am talking about the transition tables of the lexing automaton. But surely, it is not possible to subvert this automaton so that it surreptitiously includes a sequence of tokens in the input of the compiler. Or is it?
Fair - “reproducible” is more of a Platonic ideal where the best we can do is approach it asymptotically. I should have said “as source-bootstrapped as possible with modern techniques.” I also didn’t realize that camlboot’s “MiniML” compiler stage still depends on Guile (self-bootstrapped), instead of something like Mes (which has a pure C implementation). However, Mes keeps improving and may one day be suitable as a Guile replacement for bootstrap seeds.
As for the integrity of the linker, OS, etc: some system distributions take this more seriously than others. Guix (which is closely associated with the camlboot work) has been making steady progress on reducing the size of its binary seeds. Even Debian has been introducing basic reproducibility requirements for packages, which is a start.
As for the lexer, I’m not sure which file you mean. Is it this one? In any case, I’m pretty sure that I’d rather review changes to any single source file than changes to 3MB of bytecode, especially if the size of the binary diff was not well correlated with the size of the source diff that produced it.
I’m not clear on the goal of your comments, but I hope it’s not that we should despair and accept this an unsolvable problem. Many people are working toward a full(ish) chain of trust from different directions, but if they demanded an immediate and total payoff for their work, nothing would ever be accomplished.
No, I was talking about the following one: camlboot/miniml/interp/lex/lexer.ml at 4.09 · gasche/camlboot · GitHub . Not sure which repository is supposed to be the source of truth regarding Camlboot nowadays.
It is an unsolvable problem and we should stop saying things that could lead less security-inclined people believe that they have a fully secure computing environment. But that does not mean that we should despair either. Any step that reduces the attack surface is a victory and I am glad the Camlboot project exists.
Computer security as a whole is an unsolvable problem, to be sure. But suppose that the total size of an OS’ binary seed were reduced to the sub-kilobyte range (as Guix hopes to accomplish one day). At some point, we reach a scale where binary audits by experts become possible, no? And by the nature of their tiny scope, these blobs would seldom change. If all of that somehow came true then I would count this as a practical solution to reproducibility, if not a satisfying one. Of course, we’ll also need processors with open source silicon and microcode…
I can’t speak for the developers, but GitHub - gasche/camlboot: Experiment on how to bootstrap the OCaml compiler · GitHub does show that it’s 222 commits behind GitHub - Ekdohibs/camlboot: Experiment on how to bootstrap the OCaml compiler · GitHub , so I can take a fair guess.
I see no equivalent file in the latter repo. Interestingly, I do see an interpreter bytecode file elsewhere in the repo that shares some of the same symbol names. I’m not certain of its purpose there, since the Makefile does have a rule to build it. Maybe this is a way to shorten debug cycles for dependent code, since a full build of the project takes hours.