UNICODE support in Objective CAML runtime system

NavnathKumbhar · January 9, 2020, 12:42pm

Hello,

I am using the String_val(v) macro from mlvalues.h to extract the contents of a string received from OCaml code. It works fine for ordinary strings, but does it also support Unicode strings?

Thank you in advance.

octachron · January 9, 2020, 1:14pm

Yes, because “Unicode string” is not really a meaningful concept.

You are mixing up a container (string, i.e. an array of bytes) with a standard for describing human scripts (Unicode), whose text can be stored inside strings using a specific encoding such as UTF-8 or UTF-16.

The OCaml FFI is only concerned with the container type string, not the interpretation of its contents.
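To make the container/encoding distinction concrete, here is a minimal sketch in plain C. The helper name utf8_code_points is hypothetical and not part of the OCaml runtime; it assumes the bytes are well-formed UTF-8. The byte length is what the container knows, while the code point count depends on how the bytes are interpreted.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>   /* strlen, used to compare against the byte length */

/* Hypothetical helper: counts Unicode code points in a NUL-terminated
   UTF-8 string by skipping continuation bytes (bit pattern 10xxxxxx).
   Assumes the input is well-formed UTF-8. */
static size_t utf8_code_points(const char *s)
{
    size_t n = 0;
    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;   /* this byte starts a new code point */
    }
    return n;
}
```

For the UTF-8 bytes of "été" ("\xC3\xA9t\xC3\xA9"), strlen reports 5 bytes while the helper reports 3 code points; a macro such as String_val only hands over the bytes.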

XVilka · January 10, 2020, 4:16am

For working with Unicode check out these:

Camomile library
Uutf/Uunf/Uuseg/Uucp/Uucd libraries

NavnathKumbhar · January 10, 2020, 5:36am

Hello XVilka,
Thank you for these suggestions, but I do not want to use any external libraries like these.

NavnathKumbhar · January 17, 2020, 7:16am

Perhaps I did not specify my requirement clearly.
The string that is sent from my OCaml code to the C interface is encoded in UTF-8.

The string is a path of a zip file and I want to extract it using zlib library functions.

Previously I was using

unzOpen((String_val(zip_path)));

to extract the zip file contents.

However, it was not working with the path which is encoded in UTF-16

I resolved this issue with unzOpen2 and a callback as below:

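/* Declared here because it is used before its definition below. */
voidpf ZCALLBACK open_zip_file_callback(voidpf opaque, const char* filename, int mode);

CAMLprim value ml_minizip_unzopen(value path) {
    CAMLparam1(path);
    CAMLlocal1(result);   /* register the result with the GC */
    unzFile uzf;
    zlib_filefunc_def zfd;
    fill_fopen_filefunc(&zfd);
    /* Replace the default opener with one that accepts UTF-8 paths. */
    zfd.zopen_file = &open_zip_file_callback;
    uzf = unzOpen2(String_val(path), &zfd);
    if (uzf == NULL)
        caml_failwith("Minizip.unz_open: failure");   /* does not return */
    result = caml_alloc_custom(&unzfile_ops, sizeof(unzFile), 0, 1);
    ML_UNZFILE(result) = uzf;
    CAMLreturn(result);
}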

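/* Requires <stdlib.h> (malloc/free) and <windows.h> (MultiByteToWideChar). */
voidpf ZCALLBACK open_zip_file_callback(voidpf opaque, const char* filename, int mode) {
    FILE* file = NULL;
    const wchar_t* mode_fopen = NULL;
    if ((mode & ZLIB_FILEFUNC_MODE_READWRITEFILTER) == ZLIB_FILEFUNC_MODE_READ)
        mode_fopen = L"rb";
    else if (mode & ZLIB_FILEFUNC_MODE_EXISTING)
        mode_fopen = L"r+b";
    else if (mode & ZLIB_FILEFUNC_MODE_CREATE)
        mode_fopen = L"wb";

    if ((filename != NULL) && (mode_fopen != NULL)) {
        /* First call: required size in wchar_t units, including the final NUL. */
        int len = MultiByteToWideChar(CP_UTF8, 0, filename, -1, NULL, 0);
        if (len > 0) {
            wchar_t* ws = malloc((size_t)len * sizeof(wchar_t));
            if (ws != NULL) {
                /* The last argument is a count of wchar_t, not of bytes. */
                if (MultiByteToWideChar(CP_UTF8, 0, filename, -1, ws, len) > 0)
                    file = _wfopen(ws, mode_fopen);
                free(ws);
            }
        }
    }
    return file;
}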

Now, this is working fine.

octachron · January 17, 2020, 8:41am

If your issue was transcoding to the Windows UTF-16 encoding, you could have used the functions described in the “Interfacing with Windows Unicode APIs” section of the manual.

NavnathKumbhar · January 17, 2020, 12:36pm

I tried that too. But it did not work for me.

octachron · January 17, 2020, 1:28pm

That’s interesting to know. Do you remember what your issue was?

nojb · January 18, 2020, 10:53am

If I understand correctly you are compiling your code with UNICODE defined, in which case the function unzOpen expects a “wide” (UTF-16) encoded string. As you have observed, String_val will “typically” return UTF-8-encoded strings (by design).

So indeed, you need to recode the string data before you call unzOpen; however the callback is not necessary. You can simply set up your MultiByteToWideChar logic before the call to unzOpen. You can also use the function caml_stat_strdup_to_os, as suggested by @octachron.
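As a rough illustration of what such a recoding step does, here is a portable sketch in plain C. The function utf8_to_utf16 is a hypothetical helper written for this example; on Windows the same job is done by MultiByteToWideChar or by the runtime helper mentioned above. It rejects truncated sequences and encoded surrogates but, as a simplification, does not reject overlong encodings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: converts NUL-terminated UTF-8 into UTF-16 code units.
   `cap` is the capacity of `dst` in code units, terminator included.
   Returns the number of units written (terminator excluded), or -1 on
   malformed input or insufficient space. */
static long utf8_to_utf16(const char *src, uint16_t *dst, size_t cap)
{
    const unsigned char *s = (const unsigned char *)src;
    size_t n = 0;
    assert(src != NULL && dst != NULL);
    while (*s != 0) {
        uint32_t cp;
        int extra;
        if (*s < 0x80)                { cp = *s;        extra = 0; }
        else if ((*s & 0xE0) == 0xC0) { cp = *s & 0x1F; extra = 1; }
        else if ((*s & 0xF0) == 0xE0) { cp = *s & 0x0F; extra = 2; }
        else if ((*s & 0xF8) == 0xF0) { cp = *s & 0x07; extra = 3; }
        else return -1;                /* invalid lead byte */
        s++;
        while (extra-- > 0) {
            if ((*s & 0xC0) != 0x80) return -1;  /* also catches an early NUL */
            cp = (cp << 6) | (uint32_t)(*s++ & 0x3F);
        }
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return -1;
        if (cp < 0x10000) {
            if (n + 1 >= cap) return -1;         /* keep room for the terminator */
            dst[n++] = (uint16_t)cp;
        } else {
            if (n + 2 >= cap) return -1;
            cp -= 0x10000;                       /* encode as a surrogate pair */
            dst[n++] = (uint16_t)(0xD800 | (cp >> 10));
            dst[n++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        }
    }
    if (n >= cap) return -1;
    dst[n] = 0;
    return (long)n;
}
```

The converted buffer is what a wide-character API would then receive; the OCaml string itself stays UTF-8 throughout.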

NavnathKumbhar · January 20, 2020, 5:32am

But the unzOpen call expects a const char* value, not a wide string.

nojb · January 20, 2020, 6:38am

I am not very familiar with this library, but I thought that one could use unzOpen64 in this case.

Chet_Murthy · January 20, 2020, 6:56am

This unzOpen() ? The last line of the comment seems pertinent.

extern unzFile ZEXPORT unzOpen OF((const char *path));
extern unzFile ZEXPORT unzOpen64 OF((const void *path));
/*
  Open a Zip file. path contain the full pathname (by example,
     on a Windows XP computer "c:\\zlib\\zlib113.zip" or on an Unix computer
     "zlib/zlib113.zip".
     If the zipfile cannot be opened (file don't exist or in not valid), the
       return value is NULL.
     Else, the return value is a unzFile Handle, usable with other function
       of this unzip package.
     the "64" function take a const void* pointer, because the path is just the
       value passed to the open64_file_func callback.
     Under Windows, if UNICODE is defined, using fill_fopen64_filefunc, the path
       is a pointer to a wide unicode string (LPCTSTR is LPCWSTR), so const char*
       does not describe the reality
*/
