[ANN] BAP 2.0 Release

The Carnegie Mellon University Binary Analysis Platform (CMU BAP) is a suite of utilities and libraries that enables analysis of programs in their machine representation. BAP is written in OCaml, relies on dynamically loaded plugins for extensibility, and is widely used for security analysis, program verification, and reverse engineering.

This is a major update that includes lots of new features, libraries, and tools, as well as improvements and bug fixes to the existing code. The following small demo showcases the modern BAP look and feel.
In this announcement we would like to focus on two very important features of BAP 2.0:

The Knowledge Base

The Knowledge Representation and Reasoning System, or the Knowledge Base (KB) for short, is the central part of our new architecture. The KB is a step forward from the conventional approach to staging multiple analyses, in which dependent analyses (aka passes) are ordered topologically based on their dependencies. The KB is inspired by logic programming and enables optimal and seamless staging of mutually dependent analyses. Mutually dependent analyses are also present in source-based program analysis, but they are much more evident in the field of binary analysis and reverse engineering, where even such basic artifacts as the control flow graph and the call graph are not available as ground truth (and in general are not even computable).

Object properties in the KB are represented with directed-complete partially ordered sets. The KB also imposes a monotonicity restriction, which requires that all updates to a property are monotonic, i.e., each subsequent value of the same property is a refinement of the previous value. These restrictions enable the KB to compute the least fixed point of any property. A property representation can optionally be refined into a complete lattice, which gives KB users extra control over how properties are computed.
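
To give an intuition, a property domain boils down to roughly the following interface; this is an illustrative sketch, not the actual BAP interface:

(* an illustrative sketch of what a property domain provides *)
module type Domain = sig
  type t
  val empty : t                                  (* least element: nothing is known yet *)
  val order : t -> t -> [`LT | `EQ | `GT | `NC]  (* partial order; `NC means not comparable *)
  val join  : t -> t -> (t, string) result       (* merge two opinions; Error on a conflict *)
end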

By storing all information in an external location, the KB addresses the scalability issue that is so relevant to binary analysis and reverse engineering. In the future, we plan to implement a distributed storage for our Knowledge Base, as well as experiment with other inference engines. Soon it should also be possible to work with the knowledge base from non-OCaml programs, including our BAP Lisp dialect. That practically turns the knowledge base into a common runtime for binary analysis. In the current version of BAP, the Knowledge Base state is fully serializable and portable between different versions of BAP, OCaml, and even between native and bytecode runtimes. The Knowledge Base state can be imported into an application and is integrated with the BAP caching system.

New Program Representation

Employing the tagless-final embedding together with our new Knowledge Base, we were able to achieve our main goal: switching to an extensible program representation without compromising any existing code that uses the current, non-extensible BAP Intermediate Language (BIL). The new representation allows us to add new language features (like floating-point operations or superscalar pipeline manipulations) without breaking (or even recompiling) the existing analyses. It also facilitates the creation of new intermediate languages, which can all coexist with each other, making it easier to write formally verified analyses.
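
To give a flavor of why the tagless-final style is extensible, here is a toy sketch; it is not the actual BIL or Core Theory interface, just an illustration of the technique: the language is a module signature, analyses are functors over it, and new features are added as new signatures without touching or recompiling existing code.

(* the core language is a signature *)
module type Core = sig
  type exp
  val int : int -> exp
  val add : exp -> exp -> exp
end

(* a later extension, e.g. floating-point support, is just another signature *)
module type Fp = sig
  include Core
  val fadd : exp -> exp -> exp
end

(* an existing "analysis" (here, pretty-printing) written only against Core *)
module Pretty : Core with type exp = string = struct
  type exp = string
  let int = string_of_int
  let add x y = Printf.sprintf "(%s + %s)" x y
end

(* extending the analysis to Fp reuses Pretty without modifying it *)
module Pretty_fp : Fp with type exp = string = struct
  include Pretty
  let fadd x y = Printf.sprintf "(%s +. %s)" x y
end

(* programs are functors over the features they need *)
module Example (L : Fp) = struct
  let t = L.(fadd (add (int 1) (int 2)) (int 3))
end

let () = print_endline (let module E = Example (Pretty_fp) in E.t)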

Links

10 Likes

The knowledge representation/query feature sounds exciting – thank you for the announcement.

I had a very quick look at the BAP documentation “Knowledge representation and reasoning system” – the authors have spent a lot of time precisely documenting all the interfaces!

However, there are a lot of very specific technical terms, and those not steeped in the BAP way of doing things can get lost after reading a paragraph or two. The definitions are overwhelming.
In general, a lot of us learn by the “example method”: a small example is provided that communicates the essential features. After that it becomes easier to read the odoc documentation.

What would be awesome is a small example of the new knowledge representation and reasoning system. How can I add some knowledge to the KB and how can I query it?

Also, how are you going to deal with the scalability side of things? Usually a LOT of data can be generated for non-trivial programs. How do you deal with the usual problems of large query times? Is there a storage backend for the KB?

3 Likes

Yes, I suggested creating something like a Semmle QL (or Infer AL) query language, probably based on Datalog.

1 Like

That’s a very fair point. We need examples indeed. The main challenge here is that the Knowledge Library is so generic and operates with such abstract objects that it is hard to give a good example without unnecessarily limiting the scope of the library, at least in the mind of the reader (on the other hand, not giving any initial scope is probably worse).

Additionally, I haven’t found an impressive example that would showcase the power of the KB while still being easy to understand. I need something that involves partial knowledge, mutual recursion, and conflicting opinions :slight_smile: If you have any ideas, please share. I hope I will find something. Besides, we also have a blog, and we have even started a series of blog posts about the knowledge base, which are less formal and more concrete. We will continue soon.

In general, the usage is very simple: you define a class of objects, e.g.,

open Bap_knowledge
type student
let student = Knowledge.Class.declare "Student"

You can then define properties that objects of the class share, e.g.,

let name = Knowledge.Class.property student "name" Knowledge.Domain.string

The name value has type (student,string) Knowledge.slot and can be used to access the name property of any student.

To set the value of a property we use the provide operator, but we need a student to give it a name, so we first create a new student,

open Knowledge.Syntax (* provides the >>= and >>| monadic operators *)

let new_student given =
  Knowledge.Object.create student >>= fun s ->
  Knowledge.provide name s given >>| fun () ->
  s

We can access the property value of a student using collect, e.g.,

let student_name s = Knowledge.collect name s
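
Putting these pieces together, a small computation in the knowledge monad might look like this (a sketch; how such a computation is ultimately evaluated against a knowledge base state is omitted here):

let check =
  new_student "Alice" >>= fun s ->
  student_name s >>| fun n ->
  assert (n = "Alice")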

And so on. Basically, the Knowledge Base provides a runtime for a language with objects. It can also be seen as a heterogeneous dictionary on steroids (it is quite optimized under the hood and is faster and more memory-efficient than Jane Street’s or OCaml’s standard maps, or even hashtables - underneath it is a rather convoluted binary tree).

In fact, the Knowledge library is indeed a runtime: it is the runtime for the BAP Lisp language, which we will release a little bit later. The main difference from a common runtime is that we do not provide real mutability. What we have is a very limited form of mutability, in which you can only add more information to a value, e.g., you can go from None to Some 42, but you can’t change Some 42 to Some 43. This limited form of mutability prevents the common problems of mutable state while still being useful. Thanks to the monotonicity restriction (no information can be retracted) and the internal fixed point, the program always sees a consistent state at any point in time.
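
Concretely, for an option-valued property the update rule could be pictured like this (an illustrative sketch, not the KB implementation):

(* refining a property value: adding information is fine,
   retracting or contradicting it is a conflict *)
let refine old fresh = match old, fresh with
  | None, x | x, None -> Ok x               (* None -> Some 42 is allowed   *)
  | Some a, Some b when a = b -> Ok old     (* agreeing opinions are merged *)
  | Some _, Some _ -> Error "conflict"      (* Some 42 -> Some 43 is not    *)

let () =
  assert (refine None (Some 42) = Ok (Some 42));
  assert (refine (Some 42) (Some 43) = Error "conflict")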

This brings us to the scalability question. First of all, we have already introduced an extra layer of indirection by externalizing the state of our computation from the OCaml runtime to our own runtime. This lets us stop the computation at any moment, migrate it to another process, and resume it later. We can also pass knowledge between different processes and employ parallelization when it is possible (e.g., in binary analysis we may analyze independent libraries in parallel). Right now we have a custom implementation of the runtime in OCaml, but we could later move to other backends, such as Redis (at least as the storage).

But eventually (and very soon) the knowledge base will grow beyond an acceptable size, mostly because it also produces a lot of intermediate knowledge. Indeed, if we wrote our own runtime, then we probably need to implement a GC for it. And this is the crucial idea: since our knowledge is monotonic, instead of implementing a conservative garbage collector, which would negatively affect performance, we can just delete arbitrary objects, even if they are still referenced. Because monotonicity is enforced by the knowledge base, we can remove any knowledge from the context without compromising correctness. It is like forgetting something without worrying about it, because you can always repeat the process. In that sense, the knowledge base is more like a cache, and the garbage collection problem turns into the cache invalidation problem. That is also not a trivial problem, but in our case it only affects performance, not correctness (in the worst-case scenario, if we have evicted all knowledge, we just need to repeat the whole cycle of computations from the beginning). This is all yet to be implemented; right now we are not cleaning our knowledge base and rely a lot on temporary scoped objects, so we can analyze rather big programs and still fit into memory.

5 Likes

Thank you for your reply. Indeed the KB is quite abstract in scope so I understand the problems you have in explaining via concrete examples.

The wording of your response (and words like monotonicity) reminds me of Prolog or Datalog. Is the KB intended to be a kind of Datalog? My understanding/guess so far is that while doing binary analysis you add facts to this KB about various aspects of the program. The facts you add can only be refinements of the previous facts that you added. So your results can only get more specific as more facts are added (hence the monotonicity).

Also, do you have some sort of internal indexing? Otherwise running queries against the KB might get slow.

OTOH you talk about objects and then I start wondering if there are aspects of graph databases here…

I guess what would be helpful in the absence of code (or even better than code, because you can be more tentative in your language) would be something like: the KB can be/is intended to be used for the following kinds of applications: (a) … (b) …

1 Like

Yes, it is heavily inspired by our five-year-long work on implementing a Datalog engine for binary analysis, so the KB is strongly influenced by logic programming.

Totally correct.

The Knowledge Base is not a deductive database of the kind used as a backing store for Datalog implementations. The concept of a knowledge base is derived from expert systems, such as CLIPS. We do not store facts as a sequence of tuples; instead, we store object properties, and the objects form ontologies (in other words, we do not have a flat table of all objects; each class of objects is stored in its own table or, following the runtime metaphor, each class is stored in its own heap). An object is a key in our database and is a scalar value (an int63 in OCaml parlance). The value of an object is represented as a binary tree (a WAVL tree), which is optimized for storing a small number of fields in a compact representation.

The rules themselves are OCaml functions and are not stored in the knowledge base; they are registered when the system is loaded. A rule is triggered only once, and only if the object property doesn’t already have a value. If more than one rule is registered for a property, their results are merged using the domain’s merge operator, which is induced by the associated partial ordering relation (or provided by the user). If a rule is recursive, or references another rule that in turn depends on the current rule, then we compute a fixed point, building the dependency graph on the fly, inspired by François Pottier’s lazy fixed point paper. Once the least fixed point is computed there is no need to recompute it: the registered rules can’t produce any more information.
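
For instance, using the student example from above, a rule is registered by promising a computation for a slot, roughly along these lines (a sketch; the exact function names may differ):

(* sketch: a rule for the name property; it is invoked at most once per
   object, and only when the property is requested and has no value yet *)
let () =
  Knowledge.promise name (fun _s ->
      (* here one would derive the name from other properties of the object;
         returning the domain's empty value ("") means "no opinion" *)
      Knowledge.return "")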

Another important feature is that we employ temporary objects a lot when we are doing a query. For example, a typical query looks like: “let x be an object of class program whose binary representation is s, encoded with the armv7 encoding in LittleEndian; what is the set of destinations reachable from this program?” The query then triggers disassemblers, lifters, and so on, to compute the set of destinations, and then the object x is deleted. Thus there is no trace of x in the knowledge base and no space is lost. Of course, every time we need the set of destinations for a structurally identical object we will repeat all the computations. But in our case it is faster to trigger the whole chain again than to store the semantics of every byte in the binary in the knowledge base. However, once we finish the binary reconstruction (which is also a fixed point computation), we store the semantics of those instruction objects that we have proved to be reachable and executable. To summarize, using scoped (automatic) objects and some amount of domain knowledge, we can find a balance between scalability and performance. Ideally this should be done by the knowledge base itself, but the knowledge base can’t do it for free.
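
Such a scoped query could be sketched like this, where program, binary_representation, and destinations are hypothetical stand-ins for the actual classes and slots:

(* sketch only: Object.scoped creates an object that is destroyed when
   the computation returns, leaving no trace in the knowledge base;
   the class and slot names here are hypothetical *)
let reachable_destinations bytes =
  Knowledge.Object.scoped program (fun x ->
      Knowledge.provide binary_representation x bytes >>= fun () ->
      Knowledge.collect destinations x)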

Yet another feature of our representation is that not all properties are persistent. In fact, persistence is optional, and we even store first-class modules in our knowledge base. It is at the user’s discretion which properties should be stored when the knowledge base is serialized, and thus persist across different runs of the program, and which should be ephemeral and exist only during the current run. This essentially enables a sort of tree-shaking optimization, where the knowledge base is stored and all unnecessary or easy-to-recreate facts are removed. And, as always, truth maintenance guarantees that we won’t lose correctness (we may lose time later, though).
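
For example, reusing the student class from the example above, a persistent property could be declared along these lines (a sketch; the exact persistence interface may differ):

(* sketch: only slots declared with ~persistent are written out when the
   knowledge base is serialized; all other properties exist only during
   the current run *)
let nickname =
  Knowledge.Class.property student "nickname"
    ~persistent:Knowledge.Persistent.string
    Knowledge.Domain.string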

3 Likes