Printing Unicode Characters on Different Platforms

It seems that OCaml handles Unicode characters in strings using the UTF-8 encoding. If I write

"\u{03BB}"

then the string contains the Greek letter lambda encoded in UTF-8.
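One way to check this is to look at the bytes of the literal. The sketch below assumes OCaml 4.07 or later (for `String.to_seq`; the `\u{...}` escape itself needs 4.06). The helper name `hex_bytes` is made up for illustration.

```ocaml
(* The compiler expands \u{03BB} into the UTF-8 encoding of U+03BB,
   so the string holds two bytes rather than one "character". *)
let lambda = "\u{03BB}"

(* hex_bytes is a hypothetical helper: it renders each byte of a
   string as two uppercase hex digits, separated by spaces. *)
let hex_bytes (s : string) : string =
  String.to_seq s
  |> Seq.map (fun c -> Printf.sprintf "%02X" (Char.code c))
  |> List.of_seq
  |> String.concat " "

let () =
  (* prints: length = 2, bytes = CE BB *)
  Printf.printf "length = %d, bytes = %s\n"
    (String.length lambda) (hex_bytes lambda)
```

Note that `String.length` counts bytes, not code points, which is why it reports 2 here.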

What happens if I print this string to a file on a Windows platform using a channel? On Windows, files are encoded in UTF-16 by default. Is the string written in UTF-16 encoding?

Strings sent to a channel are passed through as-is. If the channel goes to the console, the OCaml runtime will have already called SetConsoleOutputCP(CP_UTF8), so it’ll print correctly. If it goes to a file, the bytes are written to the file unchanged, so the character from the escape sequence ends up in the file as UTF-8.
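As a minimal sketch of the file case, the following writes a string through an output channel and reads the raw bytes back. The helper name `write_and_read_back` is made up, and the temporary file prefix is arbitrary. Binary mode is used on both sides so that no newline translation gets in the way.

```ocaml
(* write_and_read_back is a hypothetical helper: it writes [s] to a
   temporary file through a channel and returns the bytes found on disk. *)
let write_and_read_back (s : string) : string =
  let path = Filename.temp_file "lambda" ".txt" in
  (* Write the string through a channel; the channel does no re-encoding. *)
  let oc = open_out_bin path in
  output_string oc s;
  close_out oc;
  (* Read the file back as raw bytes. *)
  let ic = open_in_bin path in
  let contents = really_input_string ic (in_channel_length ic) in
  close_in ic;
  Sys.remove path;
  contents

let () =
  let on_disk = write_and_read_back "\u{03BB}" in
  (* The file holds the two UTF-8 bytes CE BB, not a UTF-16 code unit. *)
  String.iter (fun c -> Printf.printf "%02X " (Char.code c)) on_disk;
  print_newline ()
```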

It isn’t really true that files on Windows are encoded in UTF-16 by default. While Windows API arguments and console output used to require UTF-16 for Unicode (the -W suffixed APIs), file encoding was mostly left to the application, and most applications used to default to the system-wide codepage, usually a single-byte codepage such as Windows-1252 (roughly ISO-8859 based), not UTF-16.

But .NET for example has defaulted to UTF-8 for file IO ever since 1.0, and Microsoft tools and editors have been slowly moving to UTF-8 as well[1]. Nowadays, even Notepad defaults to UTF-8 files, and you can force the UTF-8 codepage at the application level to use it with the -A suffixed Windows APIs[2] as well.


  1. Although Microsoft sometimes insists on using a BOM to help detect the encoding.

  2. This makes UTF-8 usable for almost all Windows APIs, with the exception of some specific features only available to UTF-16 file path APIs.

Files on Windows (and most other systems in use today) do not have a mandated encoding (otherwise it would be complicated to store binary data on disk!); they are just a sequence of bytes. Whatever you pass to the output channel will be stored as-is on disk(*). The question is whether programs on Windows will be able to read files encoded in UTF-8. As @debugnik mentioned, this is the case for most programs nowadays.

(*) This is not, strictly speaking, true, because Windows has a notion of opening a file in “text mode” (which OCaml can also do), which means that LF will be translated to CRLF on writing and vice versa when reading. But other bytes will still be stored as-is.
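To see the text-mode caveat in isolation, this sketch writes the same string with `open_out` (text mode) and `open_out_bin` (binary mode) and compares the resulting file sizes. The helper name `size_written_with` is made up. On Windows the text-mode file should come out one byte longer because of the inserted CR; on Unix-like systems the two modes behave identically.

```ocaml
(* size_written_with is a hypothetical helper: it writes [s] to a
   temporary file using [opener] and returns the file size in bytes. *)
let size_written_with (opener : string -> out_channel) (s : string) : int =
  let path = Filename.temp_file "mode" ".txt" in
  let oc = opener path in
  output_string oc s;
  close_out oc;
  (* Measure in binary mode so the bytes on disk are counted untranslated. *)
  let ic = open_in_bin path in
  let n = in_channel_length ic in
  close_in ic;
  Sys.remove path;
  n

let () =
  let s = "\u{03BB}\n" in (* 3 bytes in memory: CE BB 0A *)
  Printf.printf "binary mode: %d bytes\n" (size_written_with open_out_bin s);
  Printf.printf "text mode:   %d bytes\n" (size_written_with open_out s)
```

Either way the two bytes of the lambda are the same in both files; only the line ending differs.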

Cheers,
Nicolas

Thanks for the information.