[ANN] BAP 2.1.0 Release

The Carnegie Mellon University Binary Analysis Platform (CMU BAP) is a suite of utilities and libraries that enables analysis of programs that are represented as machine code (aka binaries). CMU BAP is written in OCaml and uses plugin-based architecture to enable extensibility. We also have a domain-specific language, called Primus Lisp, that we use to write analysis, specify verification conditions, interact with the built-in SMT solver, and model the semantics of machine instructions and functions.

The 2.1.0 Release is very rich in new features but the most prominent addition is the new symbolic executor mode for the Primus framework. We also significantly updated the Primus framework, integrated it with our new Knowledge Base, which was introduced in the BAP 2.0 release; we made our interpreter much faster; we added the systems and components facilities, inspired by Common Lisp; and we implemented a gradual type checker for Primus Lisp with type inference. We also added an ability to represent machine instructions as intrinsic functions so now it is possible to express their semantics using Primus Lisp since we added IEEE754 primitives to the Lisp interpreter.

As usual, we upgraded BAP to the newer versions of the Core library and OCaml (we now support OCaml versions from 4.07 to 4.09). We also significantly improved our build times and added an optional omake backend, which we are using in-house.

From the user perspective, one of the key features of BAP as an analysis platform is that you can run BAP on binaries that you can’t run otherwise, either because they need special hardware or software, or need to interact with the outside world. In the past couple of months, we have run BAP on various firmware and found numerous zero-day vulnerabilities, particular, we were able to find critical vulnerabilities in the VxWorks operating system that runs on, potentially, billions of devices including mission-critical and military appliances.

As always, questions, suggestions, and opinions are very welcome!

13 Likes

Thanks for your announcement. BAP is an inspirational project – it does things that one may think that “should” be done in C/C++/Rust instead of OCaml. The fact that it is programmed in OCaml (with bindings to other programs of course) makes it such an important data point.

@ivg Can you share some thoughts on this if possible – what have you lost performance wise when doing things like binary analysis and other CPU intensive stuff over C/C++? The gains are clear – modeling power in OCaml is probably much better than C/C++. Garbage collection is also probably very useful too.

TL;DR I am convinced about the positives. I was wondering if you can give me a few words on the negatives of using OCaml in your specific area of binary analysis

1 Like

Hi @ivg, thanks for your work.

Currently I’m working on optimized matrix-multiplication code generation and I am comparing two slightly different versions of the code, one using a standard vector load inside a if, the other one using a mask vector load. It appears that the “mask” version performs significantly better that the other one when compiled with gcc, while the two versions behave similarily on clang. I’m trying to investigate this issue, do you think your tool could help me find out the difference ? Reading assembly is painful… If you think it could help, can you hint me where and how to start ?
Thanks,
Nicolas

1 Like

Read the code in bil – look at BAP documentation. It’s an intermediate language that makes reading assembly much easier.

2 Likes

(There is a command line utility available in the bap executable that will allow you to see code in bil. I don’t remember the specifics but it should be easy to figure out the correct bap command invocation by seeing the bap help. So you don’t even need to write any OCaml code. Simply point it at the executable and try to understand the bil output)

1 Like

For an honest comparison, we would need the same project implemented in both OCaml and C/C++/Rust and then apply it to the common set of benchmarks, something that is done for smtlib. Alt-Ergo, an SMT solver written in OCaml, shows us that it is quite competitive with solvers written in C/C++ and even winning in some categories, including categories that include parallel execution (for those who think that OCaml has problems in the multicore world).

In other words, my personal impression is that OCaml performance is in general close to the performance of C/C++/Rust and for our tasks is not that important. Both radare2 and bap are quire slow on large binaries, e.g., Google Chrome, which is a 100Mb beast full of code. Btw, IDA Pro the current state of the art commercial disassembler can’t reconstruct the control flow graph of chome, while we can (well at least in our experimental branch).

To summarize, performance is far from being an issue with binary analysis in OCaml. The main negative of BAP being written in OCaml is that it is written in OCaml :slight_smile: Many people are disregarding OCaml as a niche language suitable only for academic purposes, some people even hate OCaml due to painful experiences in the college and, in general, there are very few OCaml programmers and very few people at all want to learn something new. So the fact the you have to code in OCaml is the main limitation of BAP, mostly because it raises the bar, from learning binary analysis (which is already very hard) to learning OCaml first and then learning binary analysis.

It is very unfortunately mostly because there are no real reasons for OCaml unpopularity, it is an extremely simple but powerful language with clean syntax and mature infrastructure. But languages spread like viruses, exponentially, and for some reason OCaml get a little bit less traction then some other languages, even if it is clearly better :slight_smile: That is why one of our missions is to evangelize OCaml and help people to learn it. By the way, we plan an interactive BAP tutorial during the upcomming OCaml 2020 Workshop, stay tuned.

TL;DR; there are no negatives, modulo limited user/contributors base. Even boxed integers is not a problem, thanks to zarith.

9 Likes

Sure, you can just do bap ./exe -d and it will output a readable IR. For example, this simple C function

void print_endline(const char *msg) {
    char *p = msg;
    while (*p) putchar(*p++);
    putchar('\n');
}

after being compiled and disassembled back with BAP will have the following IR,

00001220: sub print_endline(print_endline_result)
0000125e: print_endline_result :: out u32 = RAX
000004e3: 
000004ea: #53 := RBP
000004ed: RSP := RSP - 8
000004f0: mem := mem with [RSP, el]:u64 <- #53
000004f7: RBP := RSP
00000508: RSP := RSP - 0x20
00000521: mem := mem with [RBP - 0x18, el]:u64 <- RDI
00000528: RAX := mem[RBP - 0x18, el]:u64
0000052f: mem := mem with [RBP - 8, el]:u64 <- RAX
00000537: goto %00000533

00000533: 
0000053d: RAX := mem[RBP - 8, el]:u64
00000544: RAX := pad:64[pad:32[mem[RAX]]]
00000551: #57 := low:8[RAX]
00000563: ZF := 0 = #57
0000056d: when ~ZF goto %00000567
0000123f: goto %000005ab

000005ab: 
000005b0: RDI := 0xA
000005b9: RSP := RSP - 8
000005bc: mem := mem with [RSP, el]:u64 <- 0x400731
000005bf: call @putchar with return %000005c1

000005c1: 
000005c8: RSP := RBP
000005cb: RBP := mem[RSP, el]:u64
000005ce: RSP := RSP + 8
000005d7: #59 := mem[RSP, el]:u64
000005da: RSP := RSP + 8
000005de: call #59 with noreturn

00000567: 
00000577: RAX := mem[RBP - 8, el]:u64
0000057e: RDX := low:64[RAX + 1]
00000585: mem := mem with [RBP - 8, el]:u64 <- RDX
0000058c: RAX := pad:64[pad:32[mem[RAX]]]
00000593: RAX := pad:64[extend:32[low:8[RAX]]]
0000059a: RDI := pad:64[low:32[RAX]]
000005a3: RSP := RSP - 8
000005a6: mem := mem with [RSP, el]:u64 <- 0x40071C
000005a9: call @putchar with return %00000533

I think it is pretty readable without a vocabulary, but if need we have a formal specification (see the releases page for the pdf) for it (even formalized in Coq).

You can also use -dcfg or -dgraph to see the graphs, e.g., see this demo. But the main power of BAP is an ability to write tools that analyze programs that are that big that no human can cope with it. So, at the end of the day you could integrate BAP in your CI to verify that your synthesized codes are up to the standards.

ProTip: use the --optimization-level=3 option, to let BAP eliminate unnecessary flag computations in x86/x86-64.

2 Likes

As usual thank you for the detailed answer! Yes, OCaml’s limited user base is a problem but I see that changing with all the awesome and mature tooling that we now have (ocamllsp/merlin, dune, opam), the major improvements in the compiler over the last few versions, better error reporting etc.

With multicore finally on the horizon I think OCaml’s time has arrived. (I think OCaml is pretty awesome even without multicore).

Last question: Do you use flambda in your production releases of bap? Does it improve runtime performance? How much difference does it make for you? 10%? 20%? Curious here too…

Yes, we do, when we release binary packages, e.g., debs and rpms.

The difference exists but there is a caveat, unless you specify -O2 or better -O3 flambda is not really jumping in, so just compiling BAP or any other project under flambda will not give you anything. Unfortunately, setting -O{2,3} for bap is not enough is we spend most of the time in our dependencies, e.g., core_kernel which also doesn’t enable optimizations.

This is a quite unfortunate situation as all the hard work done by the flambda developers is not really used. I wish ocamlopt had an environment variable that we can set to enable optimization even if a package is not enabling it.

Speaking of the numbers, I am seeing improvements between 10% and 30% that are mostly coming from significantly reduced pressure on the garbage collector.

2 Likes

You could probably pin your core_kernel dependency and add -O2 right?

Or do you hesitate to do that for reliability/correctness/convenience reasons?

I am surprised that core_kernel does not enable optimizations…

Sure, but it doesn’t scale :slight_smile: Pinning the whole suite of Janestreet is not something one would like to do manually.

For a long time -O2 and -O3 was not present in vanilla (non flambda) OCaml, so adding this option would break the installation process. I am myself not sure from which version it is available. At least starting from 4.07.0 it is there.

1 Like

Does it not work to export OCAMLPARAM=_,O3=1 and then initialize an +flambda opam switch with the dependencies in order to compile everything with -O3?

7 Likes

Great, so we already have one! I remember I was using it for adding binary annotations, but completely forgot about it since then) I will try and rebuild the world with -O3!

4 Likes

OK, so after recompiling everything with -O3, I’ve got about 40% improvement in such simple task as binary disassembling, which is fantastic, nearly twice as fast. On more complex benchmarks the results are less impressive, something between 10% to 20%.

The cons. It took a lot of time to compile everything (I can’t tell how much, because in the middle I run out of disk space and had to revive my system) and the binaries are now roughly 4-5 times larger:

du -hs ~/.opam/*
3.6G    /home/ivg/.opam/4.09.0
15G     /home/ivg/.opam/4.09.0+flambda
6 Likes

This was such an awesome find!! How big are some of the important binaries now in MB – again just curious.

This above reply needs more visibility so others can adopt in their own projects!

(Maybe a short blog post if the idea appeals to you. So when people type in flambda in a search engine they might come to your post and learn from your findings!)

BAP Announcement

(for those who clicked this link, that I have accidentally posted on reddit)

The Carnegie Mellon University Binary Analysis Platform (CMU BAP) is a suite of utilities and libraries that enables analysis of programs that are represented as machine code (aka binaries). CMU BAP is written in OCaml and uses plugin-based architecture to enable extensibility. We also have a domain-specific language, called Primus Lisp, that we use to write analysis, specify verification conditions, interact with the built-in SMT solver, and model the semantics of machine instructions and functions. The BAP 2.1.0 Release offers lots of new features and a new symbolic executor based on z3. Enjoy!


Well, fasten your seatbelts and get ready for the show:

$ du -h $(which bap)
92M     /home/ivg/.opam/4.09.0+flambda/bin/bap

That’s a lot, but bap uses a plugin based architecture, so most of the functionality is shipped in plugins, which are loaded dynamically on-demand, so how big are the plugins?

$ du -hc $(opam config var lib)/bap/*.plugin | grep total
628M    total

So the total distribution of bap will weight 700Mb. Well if we will compress all this it will take only 180Mb which is twice the size of our current debian package compressed with the same algorithm1. So it is not that bad as it might look like. Especially, since this distribution has llvm, z3, core_kernel, ocaml, and many more packages in it and can be run standalone.

We will probably switch our CI and packaging scripts to use OCAMLPARAM, the only concern is whether our CI infrastructure will be able to sustain such a heavy load. My desktop yesterday wasn’t)

And yes, will be keeping people in posted. I would also like to investigate why there is such a volatility in the results, e.g., 40% vs 10%.


1)We used to compress our plugins, but stopped doing this for performance reasons.

3 Likes

I wish so, but there is a long way to go. The other three languages I program in (Go, Rust, and TypeScript) have such decent tooling, from simple editors and language servers to powerful IDEs from JetBrains. Go and Rust have decent compilers. Dependency management is straightforward too (I’m still figuring out opam). Dune is awesome, I’m sure it can compete with Cargo.
OCaml also has significantly fewer libraries than those. I couldn’t find a usable MongoDB driver for OCaml. I couldn’t find a simple embedded key-value store (Go has lots of those: BoltDB, Badger, BuntDB, etc). Even Haskell has a richer ecosystem. I really like OCaml, but a lot more has to be done. I’m not skilled enough to take on such projects now, but I’m learning. Hopefully someday I’ll be able to port such useful projects to OCaml.

3 Likes

There is Irmin… https://irmin.org/

3 Likes

Oh, yea. I know of Irmin. Nice project. Now I have ideas on what I could build with it. :star_struck: