[ANN] First release of `conan`, the detective to recognize your file

Conan, an OCaml detective to recognize your file

I’m glad to announce the first experimental release of conan. This tool/library helps us to recognize the MIME type of a given file. More concretely, conan is a reimplementation of the command file:

$ file --brief --mime-type image.png
image/png

This tool was made to replace our old ocaml-magic-mime project, which recognizes the MIME type of a given file via its extension and a static database. That approach has a security problem: ocaml-magic-mime trusts the user’s input.
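To make the problem concrete, here is a minimal sketch of an extension-based lookup. The function `mime_of_extension` and its tiny table are hypothetical, written only for illustration; this is not the API of ocaml-magic-mime.

```ocaml
(* Hypothetical extension-based lookup, for illustration only.
   The answer depends solely on the file name, never on the contents. *)
let mime_of_extension filename =
  match String.lowercase_ascii (Filename.extension filename) with
  | ".png" -> "image/png"
  | ".html" -> "text/html"
  | ".pdf" -> "application/pdf"
  | _ -> "application/octet-stream"
```

With such a function, anyone who controls the file name controls the answer: renaming an arbitrary file so that it ends in `.png` is enough to have it reported as `image/png`.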

The MIME type of a file is not a user input (unlike the file name). It’s a property of the file itself. A more secure way to recognize the type of a file is to inspect its contents and compare them against a database of known formats.

This is how file, and especially libmagic, work. They have a magic database which describes the magic numbers of various formats. They then traverse the contents of the given file and compare the values found there with what is expected for a given MIME type.
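The idea of a magic database can be sketched in a few lines. The table below is a hand-written assumption holding three well-known signatures; the real libmagic database is far richer, and the names `magic_table`, `matches_at` and `sniff` are hypothetical.

```ocaml
(* Hypothetical, hand-written table: (offset, expected bytes, MIME type). *)
let magic_table =
  [ (0, "\x89PNG\r\n\x1a\n", "image/png");
    (0, "GIF87a", "image/gif");
    (0, "GIF89a", "image/gif");
    (0, "%PDF-", "application/pdf") ]

(* [matches_at contents off expected] checks that [expected] appears in
   [contents] at byte offset [off], without reading out of bounds. *)
let matches_at contents off expected =
  let len = String.length expected in
  off >= 0
  && off + len <= String.length contents
  && String.sub contents off len = expected

(* Return the MIME type of the first rule that matches, if any. *)
let sniff contents =
  List.find_map
    (fun (off, expected, mime) ->
      if matches_at contents off expected then Some mime else None)
    magic_table
```

Here the answer comes from the bytes themselves, so the file name plays no role. `List.find_map` requires OCaml 4.10 or later.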

The goal of Conan/ocaml-magic-mime

The main problem with this approach is the inherent assumption that we manipulate a file from a file-system. However, as we have said many times, MirageOS does not have a file-system by default (but you can add one if you want).

This is why we made ocaml-magic-mime: to be able to recognize the MIME type of something (a file, a simple string, a stream, etc.).

You are probably wondering why we need to recognize the MIME type at all.

In many protocols such as HTTP or email, we are able to transfer files. The usual way is to let the sender state the MIME type of the transferred files. A concrete example is the Content-Type field used by HTTP. Indeed, it tells the client the MIME type of the given document - and this way, the client is able to open the document with the right application.

This is where libmagic comes in: the server uses it to recognize the MIME type as part of processing a request, for example when it transfers an image.

We aim to have less and less C code, so we started to re-implement file as our own tool to recognize files and let our HTTP server tell the client the MIME type. Instead of trusting the extension of the given file and our static database, we can now offer something more secure.

Finally, we took the opportunity to re-use the existing database (under the 2-clause BSD license) for our project, and to provide a simple tool, conan.file, which does the same thing as file. So the challenge was to parse and process this database - and to create a little DSL in OCaml to describe file formats.

Into MirageOS

With this DSL, we are able to serialize a given database (such as the libmagic database) as a simple OCaml value which can be linked into a larger program such as a unikernel. In this way, we are able to create a complete unikernel which can recognize the MIME type of some contents.

The goal is bigger than that because, as we said, such a small piece of software is used by many protocols. Of course, our goal is to integrate our conan library into our HTTP server and enable it to transfer files regardless of their extensions or other user inputs.

You can see an example here: unikernel.ml

We optimized the given database to keep only MIME information. Indeed, file tells you more than the MIME type: it gives a full description of the file, such as the size of an image or the bitrate of a piece of music. Of course, for programmatic use, this information is useless. So we deleted all of it, and we are finally able to build a simple statically-linked unikernel of ~6.5 MB.

Updating an old project

The file command is pretty old (1987) and it’s implemented in C. There is no real standard for its DSL and, of course, it does not take advantage of a type system such as OCaml’s. Indeed, each rule of the DSL consists of:

  1. an offset into the given file
  2. a “type” of the value at this offset
  3. a test to compare this value
  4. a part of the resulting description, which can print the value

For an OCaml developer, it’s clear that we can not mix potatoes and carrots in our salad (even if it’s good). Indeed, for us, the value has a type 'a, the test must be 'a -> 'a -> bool and the description must take an 'a value to print it.

So conan (like Format) is a nice example of the power of GADTs: the same type is kept throughout the process, so we are able to prove that the description of a format is well typed. When we started to implement the DSL, we focused on an implementation via GADTs to keep the type information and to avoid mistakes when we compare a value from the given file with the database - of course, we tweaked some details of the description, which relies on a C-like printf.
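The four parts of a rule and the typing constraint can be sketched with a small GADT. Every name here (`ty`, `test`, `rule`, `read`, `eval_test`, `run`) is a hypothetical stand-in chosen for this example and is not conan’s actual API; the description is also simplified to a plain OCaml function rather than a printf-like format.

```ocaml
(* Hypothetical types: the "type" of the value found at an offset. *)
type 'a ty =
  | Byte : char ty
  | Short_le : int ty (* 16-bit little-endian integer *)
  | Str : int -> string ty (* fixed-length string *)

(* A test on a value of type ['a]. *)
type 'a test =
  | Equal : 'a -> 'a test
  | Any : 'a test

(* A rule packs offset, type, test and description around one hidden ['a]:
   the compiler rejects a rule whose parts disagree on that type. *)
type rule = Rule : int * 'a ty * 'a test * ('a -> string) -> rule

(* Read a value of the requested type, with bounds checks. *)
let read : type a. a ty -> string -> int -> a option =
 fun ty s off ->
  match ty with
  | Byte -> if off >= 0 && off < String.length s then Some s.[off] else None
  | Short_le ->
      if off >= 0 && off + 2 <= String.length s then
        Some (Char.code s.[off] lor (Char.code s.[off + 1] lsl 8))
      else None
  | Str len ->
      if off >= 0 && len >= 0 && off + len <= String.length s then
        Some (String.sub s off len)
      else None

let eval_test : type a. a test -> a -> bool =
 fun test v -> match test with Equal expected -> v = expected | Any -> true

(* Apply a rule: read, test, then describe the value if the test passes. *)
let run rule contents =
  match rule with
  | Rule (off, ty, test, describe) -> (
      match read ty contents off with
      | Some v when eval_test test v -> Some (describe v)
      | _ -> None)

let png_rule =
  Rule (1, Str 3, Equal "PNG", (fun s -> "image/" ^ String.lowercase_ascii s))

let word_rule =
  Rule (0, Short_le, Any, (fun n -> Printf.sprintf "first word: 0x%04x" n))
```

A rule such as `Rule (0, Short_le, Equal "PNG", string_of_int)` does not compile, because the test compares strings while the value read is an integer. That is the kind of mistake the type system is meant to catch.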

This is the major advantage of conan over file: we are more rigorous about what we consume and what we do to recognize a file.

Then, as we said, and especially for the MirageOS project, we decided to abstract over syscalls (or functions to get values from a file) to be able to use conan in a unikernel. It’s mostly a matter of design, and conan is obviously compatible with lwt, async (or multicore). But it allows file recognition to be used in many contexts (MirageOS, Linux or Windows?).
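One common way to get this kind of abstraction in OCaml is a functor over a small signature. The sketch below is an assumption about the general shape, not conan’s real interface: `SOURCE`, `read_at`, `Make` and both backends are hypothetical names, and a real design would also abstract over the scheduler (lwt, async), which is left out here.

```ocaml
(* Hypothetical signature: everything the core needs from its environment. *)
module type SOURCE = sig
  type t

  val read_at : t -> off:int -> len:int -> string option
end

(* The core only knows about [SOURCE]; it never touches a file-system. *)
module Make (S : SOURCE) = struct
  let is_png src = S.read_at src ~off:0 ~len:8 = Some "\x89PNG\r\n\x1a\n"
end

(* In-memory backend: usable where no file-system exists. *)
module String_source = struct
  type t = string

  let read_at s ~off ~len =
    if off >= 0 && len >= 0 && off + len <= String.length s then
      Some (String.sub s off len)
    else None
end

(* File-backed backend using only the standard library. *)
module Channel_source = struct
  type t = in_channel

  let read_at ic ~off ~len =
    try
      seek_in ic off;
      Some (really_input_string ic len)
    with End_of_file | Sys_error _ -> None
end

module In_memory = Make (String_source)
module On_disk = Make (Channel_source)
```

`In_memory.is_png` works on a plain string, for example a request body held in memory, while `On_disk.is_png` expects a channel opened with `open_in_bin`. The recognition logic is written once and shared by both.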

Finally, this design splits the project cleanly into multiple parts: the core is only about the DSL, and derivations of conan (such as conan-unix or conan-lwt) are more about access to and representation of files in specific contexts.

As with other MirageOS projects, we implemented a fuzzer which checks that recognition never breaks the control flow via an exception from an unimplemented feature, and we tried to write as many tests as we can.

Status of the project

The project is usable as of its first release. However, we did not implement the whole of libmagic because:

  1. it’s not really essential for our purpose
  2. it’s buggy from the type-system point-of-view

This mostly means that, given libmagic's database, we are not sure to handle all cases, and it’s likely that errors still occur for some patterns. But this is where you come in. We definitely need conan to be widely used so that we can improve it through a feedback loop between you and us.

So we warn you that conan can fail and you should be aware of that when you use it. However, we are committed to improving it over time and we need your help to do that.

I appreciate the detailed introduction…really gives a good overview of the software!!

Great solution, though I am not convinced that guessing is beneficial in general. The data has a purpose it is either fit for (i.e. usable) or garbage. If usable, the (real) mimetype is known after the fact for free and useless before. Am I wrong?

Sometimes the mime-type is necessary for formal reasons (e.g. Content-Type http header) but still the consumer has to make sense of the actual data. No matter the declared mime-type. Be it guessed or user input.

It always smells like redundancy and untrustworthiness.

P.S.: one occasion they are useful, however, is to trick browsers into displaying or hiding things depending on whether they are text/foo or application/bar.

I can agree with you :) ! conan is not the final solution to this particular problem - and, after this implementation, I really think that such software is bad in essence. However, we can not be fully radical and close our eyes to an existing ecosystem - and God knows that we made many mistakes.

For me, conan is a possible solution, and I let the user choose whether to use it or not. If you take a look at HTTP libraries (CoHTTP / httpaf / h2), they don’t care about the MIME type and focus on the protocol only - and I think it’s a good design, where such a problem (recognizing the MIME type of files) must be the responsibility of the end-user (not the HTTP implementer).

That said, if you want to implement an HTTP server and you know statically that you should provide only HTML and maybe CSS documents, you should not use conan :) . But such a perfect context, where you know everything about what you should transfer to your client, is not common. In some contexts, you don’t have full control over the files that you should transfer to the client. This is where conan can come in as a solution - but, again, it’s not the best one and it is your responsibility to make a choice.
