UNICODE support in Objective CAML runtime system


I am using the String_val(v) macro from mlvalues.h to extract the contents of a string received from OCaml code. It works fine for ordinary strings. But does it also support Unicode strings?

Thank you in advance.

Yes, because “unicode string” is not really a meaningful concept.

You are mixing a container (string, aka an array of bytes) with a standard for describing human scripts (Unicode), whose text can be stored inside strings using a specific encoding such as UTF-8 or UTF-16.

The OCaml FFI is only concerned with the container type string, not the interpretation of its contents.
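
To make the container/encoding distinction concrete, here is a minimal sketch in plain C. The helper utf8_code_points is hypothetical (it is not part of the OCaml runtime) and assumes well-formed UTF-8; it only shows that the byte length of a string and the number of Unicode code points it encodes are different things.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count the code points in a NUL-terminated UTF-8
   string by skipping continuation bytes (those of the form 10xxxxxx).
   Assumes well-formed input; no validation is performed. */
size_t utf8_code_points(const char *s) {
    size_t n = 0;
    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;  /* this byte starts a new code point */
    }
    return n;
}
```

For "café" encoded as UTF-8 (the bytes "caf\xC3\xA9"), strlen reports 5 bytes while this helper reports 4 code points. The container only knows about the 5 bytes.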


For working with Unicode check out these:

Hello XVilka,
Thank you for these suggestions, but I do not want to use any external libraries like these.

Perhaps I did not specify my requirement clearly.
The fact is that the string sent from the OCaml code to the C interface is encoded in UTF-8.

The string is a path of a zip file and I want to extract it using zlib library functions.

Previously I was using


to extract the zip file contents.

However, it was not working with the path which is encoded in UTF-16

I resolved this issue with unzOpen2 and a callback as below:

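/* Defined below; declared here so it can be installed as the open callback. */
voidpf ZCALLBACK open_zip_file_callback(voidpf opaque, const char* filename, int mode);

CAMLprim value ml_minizip_unzopen(value path) {
    unzFile uzf;
    zlib_filefunc_def zfd;
    value result;
    /* Initialise every callback to the defaults, then override only "open". */
    fill_fopen_filefunc(&zfd);
    zfd.zopen_file = &open_zip_file_callback;
    uzf = unzOpen2(String_val(path), &zfd);
    if (uzf == NULL) caml_failwith("Minizip.unz_open: failure");
    result = caml_alloc_custom(&unzfile_ops, sizeof(unzFile), 0, 1);
    ML_UNZFILE(result) = uzf;
    return result;
}

voidpf ZCALLBACK open_zip_file_callback(voidpf opaque, const char* filename, int mode) {
    FILE* file = NULL;
    const wchar_t* mode_fopen = NULL;
    wchar_t* ws;
    int len;
    if ((mode & ZLIB_FILEFUNC_MODE_READWRITEFILTER) == ZLIB_FILEFUNC_MODE_READ)
        mode_fopen = L"rb";
    else if (mode & ZLIB_FILEFUNC_MODE_EXISTING)
        mode_fopen = L"r+b";
    else if (mode & ZLIB_FILEFUNC_MODE_CREATE)
        mode_fopen = L"wb";
    if ((filename == NULL) || (mode_fopen == NULL)) return NULL;
    /* First call returns the required length in wide characters, including the NUL. */
    len = MultiByteToWideChar(CP_UTF8, 0, filename, -1, NULL, 0);
    if (len <= 0) return NULL;
    ws = malloc(len * sizeof(wchar_t));
    if (ws == NULL) return NULL;
    /* The last argument is a count of wide characters, not a size in bytes. */
    MultiByteToWideChar(CP_UTF8, 0, filename, -1, ws, len);
    file = _wfopen(ws, mode_fopen);
    free(ws);
    return file;
}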

Now, this is working fine.

If your issue was transcoding to the Windows UTF-16 encoding, you could have used the functions described in the interfacing with Windows Unicode APIs section of the manual.

I tried that too. But it did not work for me.

That’s interesting to know. Do you remember what your issue was?

If I understand correctly you are compiling your code with UNICODE defined, in which case the function unzOpen expects a “wide” (UTF-16) encoded string. As you have observed, String_val will “typically” return UTF-8-encoded strings (by design).

So indeed, you need to recode the string data before you call unzOpen; however the callback is not necessary. You can simply set up your MultiByteToWideChar logic before the call to unzOpen. You can also use the function caml_stat_strdup_to_os, as suggested by @octachron.
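
What the recoding step amounts to can be sketched in portable C. The helpers utf8_next and utf8_to_utf16 below are hypothetical: they stand in for MultiByteToWideChar(CP_UTF8, ...) only so that the sketch runs outside Windows, they assume well-formed UTF-8, and they do no validation. In a real stub you would call MultiByteToWideChar or caml_stat_strdup_to_os instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one code point from well-formed UTF-8 and
   advance *p past the bytes consumed. No validation is performed. */
static uint32_t utf8_next(const unsigned char **p) {
    const unsigned char *s = *p;
    uint32_t cp;
    int extra;
    if (s[0] < 0x80)      { cp = s[0];        extra = 0; }
    else if (s[0] < 0xE0) { cp = s[0] & 0x1F; extra = 1; }
    else if (s[0] < 0xF0) { cp = s[0] & 0x0F; extra = 2; }
    else                  { cp = s[0] & 0x07; extra = 3; }
    s++;
    /* Stop early on NUL so a truncated sequence cannot run past the end. */
    while (extra-- > 0 && *s != 0)
        cp = (cp << 6) | (uint32_t)(*s++ & 0x3F);
    *p = s;
    return cp;
}

/* Hypothetical helper: convert NUL-terminated UTF-8 to UTF-16.
   Returns the number of 16-bit units required, terminator included.
   Units are stored only while they fit in cap, so passing dst = NULL
   and cap = 0 just queries the size (the same two-call pattern that
   MultiByteToWideChar uses). */
size_t utf8_to_utf16(const char *src, uint16_t *dst, size_t cap) {
    const unsigned char *p = (const unsigned char *)src;
    size_t n = 0;
    assert(src != NULL);
    while (*p != 0) {
        uint32_t cp = utf8_next(&p);
        if (cp >= 0x10000) {
            /* Outside the BMP: encode as a surrogate pair. */
            cp -= 0x10000;
            if (n + 1 < cap) {
                dst[n]     = (uint16_t)(0xD800 | (cp >> 10));
                dst[n + 1] = (uint16_t)(0xDC00 | (cp & 0x3FF));
            }
            n += 2;
        } else {
            if (n < cap) dst[n] = (uint16_t)cp;
            n += 1;
        }
    }
    if (n < cap) dst[n] = 0;
    return n + 1;
}
```

The point of the two-call pattern is that the destination size is expressed in 16-bit units, terminator included, and is only known after a first pass over the input. Sizing the buffer from strlen of the UTF-8 input leaves no room for the terminator.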

But the unzOpen call expects a const char* value and not a multi-byte one.

I am not very familiar with this library, but I thought that one could use unzOpen64 in this case.

This unzOpen() ? The last line of the comment seems pertinent.

extern unzFile ZEXPORT unzOpen OF((const char *path));
extern unzFile ZEXPORT unzOpen64 OF((const void *path));
  Open a Zip file. path contain the full pathname (by example,
     on a Windows XP computer "c:\\zlib\\zlib113.zip" or on an Unix computer
     If the zipfile cannot be opened (file don't exist or in not valid), the
       return value is NULL.
     Else, the return value is a unzFile Handle, usable with other function
       of this unzip package.
     the "64" function take a const void* pointer, because the path is just the
       value passed to the open64_file_func callback.
     Under Windows, if UNICODE is defined, using fill_fopen64_filefunc, the path
       is a pointer to a wide unicode string (LPCTSTR is LPCWSTR), so const char*
       does not describe the reality
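
The idea in that comment, an untyped path that is only interpreted by the open callback, can be illustrated with a small sketch. Everything here (open_cb, open_archive, open_narrow, open_wide) is hypothetical and is not the minizip API; it only shows why const void* is the honest type when the callback decides whether the path is narrow or wide.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical callback type, loosely modelled on the idea behind
   open64_file_func: only the callback knows how to read `path`. */
typedef size_t (*open_cb)(const void *path);

/* Hypothetical stand-in for an "open" entry point. It forwards the
   pointer untouched and never inspects the bytes behind it. */
size_t open_archive(const void *path, open_cb cb) {
    assert(cb != NULL);
    return cb(path);
}

/* This callback treats the pointer as a narrow (char) string. */
size_t open_narrow(const void *path) {
    return strlen((const char *)path);
}

/* This callback treats the same kind of pointer as a wide string. */
size_t open_wide(const void *path) {
    return wcslen((const wchar_t *)path);
}
```

Calling open_archive("a.zip", open_narrow) and open_archive(L"a.zip", open_wide) both go through the same entry point; the caller is responsible for pairing the path with a callback that reads it in the right encoding.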