Printing Unicode Characters on Different Platforms

hbr · January 8, 2024, 10:30am

It seems that OCaml handles Unicode characters in strings using UTF-8 encoding. If I write

"\u{03BB}"

then the string contains the Greek letter lambda in UTF-8 encoding.
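One way to check this is to look at the bytes of the literal directly. The following is a minimal sketch, assuming OCaml 4.07 or later (for `String.to_seq`); `hex_of_string` is a helper name made up for this illustration, not a standard library function.

```ocaml
(* Hypothetical helper: render each byte of a string as two hex digits. *)
let hex_of_string s =
  String.to_seq s
  |> Seq.map (fun c -> Printf.sprintf "%02X" (Char.code c))
  |> List.of_seq
  |> String.concat " "

(* The escape is expanded by the compiler into the UTF-8 bytes of U+03BB. *)
let lambda = "\u{03BB}"

let () =
  (* String.length counts bytes, not characters. *)
  Printf.printf "length = %d, bytes = %s\n"
    (String.length lambda) (hex_of_string lambda)
```

The string is two bytes long (CE BB), which is the UTF-8 encoding of U+03BB.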

What happens if I print this string to a file on a Windows platform using a channel? On Windows, files are encoded in UTF-16 by default. Is the string written in UTF-16 encoding?

amongonz · January 8, 2024, 12:58pm

The strings sent to the channel are piped as is. If the channel goes to the console, the OCaml runtime will have already called SetConsoleOutputCP(CP_UTF8) so it’ll print correctly. If it goes to a file, it will be written to the file as is, so the escape sequence will be encoded as UTF-8.

It isn’t really true that files on Windows are encoded in UTF-16 by default. While Windows API arguments and console output used to require UTF-16 for Unicode (the -W suffixed APIs), file encoding was mostly left to the application, and most applications used to default to the system-wide codepage, usually something ISO-8859 based, not UTF-16.

But .NET, for example, has defaulted to UTF-8 for file I/O ever since 1.0, and Microsoft tools and editors have been slowly moving to UTF-8 as well [1]. Nowadays, even Notepad defaults to UTF-8 files, and you can force the UTF-8 codepage at the application level so that it is used by the -A suffixed Windows APIs [2] as well.

[1] Although Microsoft sometimes insists on using a BOM to help detect the encoding.
[2] This makes UTF-8 usable for almost all Windows APIs, with the exception of some specific features only available to UTF-16 file path APIs.
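
If a consuming tool does expect a BOM, one can be written explicitly before the contents. This is only a sketch under that assumption; `write_with_bom` is a hypothetical helper name, and `Fun.protect` requires OCaml 4.08 or later.

```ocaml
(* U+FEFF encoded as UTF-8 is the three bytes EF BB BF. *)
let utf8_bom = "\u{FEFF}"

(* Hypothetical helper: write [contents] to [path], preceded by a UTF-8 BOM.
   Binary mode is used so that no newline translation takes place. *)
let write_with_bom path contents =
  let oc = open_out_bin path in
  Fun.protect
    ~finally:(fun () -> close_out oc)
    (fun () ->
      output_string oc utf8_bom;
      output_string oc contents)
```

Whether a BOM is wanted depends on the program that will read the file; many tools work without one.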

nojb · January 8, 2024, 2:28pm

Files on Windows (and most other systems in use today) do not have a mandated encoding (otherwise it would be complicated to store binary data on disk!); they are just a sequence of bytes. Whatever you pass to the output channel will be stored as-is on disk(*). The question is whether programs on Windows will be able to read files encoded in UTF-8. As @amongonz mentioned, this is the case for most programs nowadays.

(*) This is not, strictly speaking, true, because Windows has a notion of opening a file in “text mode” (which OCaml can also do), which means that LF will be translated to CRLF on writing and vice versa when reading. But other bytes will still be stored as-is.
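
The difference between the two modes can be observed by writing the same string through `open_out` (text mode) and `open_out_bin` (binary mode) and reading the raw bytes back. A minimal sketch, assuming OCaml 4.08 or later for `Fun.protect`; `round_trip` is a name invented for this example.

```ocaml
(* Hypothetical helper: write [s] to a temporary file using [opener]
   (open_out or open_out_bin), then read the file back as raw bytes. *)
let round_trip opener s =
  let path = Filename.temp_file "lambda" ".txt" in
  Fun.protect
    ~finally:(fun () -> Sys.remove path)
    (fun () ->
      let oc = opener path in
      Fun.protect
        ~finally:(fun () -> close_out oc)
        (fun () -> output_string oc s);
      (* Always read in binary mode so we see exactly what is on disk. *)
      let ic = open_in_bin path in
      Fun.protect
        ~finally:(fun () -> close_in ic)
        (fun () -> really_input_string ic (in_channel_length ic)))

let () =
  let text = "\u{03BB}\n" in
  (* %S prints the string with non-printable bytes escaped. *)
  Printf.printf "binary mode: %S\n" (round_trip open_out_bin text);
  Printf.printf "text mode:   %S\n" (round_trip open_out text)
```

In binary mode the file holds exactly the bytes of the string on every platform. In text mode the two UTF-8 bytes of the lambda are unchanged as well; only the trailing LF is expected to become CRLF on Windows, while on Unix-like systems both modes give the same result.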

Cheers,
Nicolas

hbr · January 9, 2024, 8:55am

Thanks for the information.
