File encoding issue - open_in_bin does not read in binary mode on a Mac, or does it ? [SOLVED]

I have a somewhat old HTML file and I want to rewrite it in UTF-8 encoding. My idea is to load the whole file contents into a string in an OCaml toplevel, apply suitable find/replaces to the few characters that need it, then write the re-encoded string back to the file.

I have my homemade read_whole_file function as below, and since I use open_in_bin in it, I naively imagined that (as the manual puts it) “no translation takes place” during reads, so that what I get is the raw, lowest-level representation of the contents of the html as a mere array of bytes.

I was wrong ! As I learned the hard way, all the non-ASCII characters in the HTML file are rendered by open_in_bin as \239\191\189, aka “I couldn’t read that character, sorry”.

The OCaml manual also says, however, that “On operating systems that do not distinguish between text mode and binary mode, open_in_bin behaves like open_in”. My operating system is macOS Mojave 10.14.6; I don’t know whether its Unix distinguishes between text mode and binary mode. I would have thought it does.

How can I deal with this, using the OCaml Stdlib (or if that’s not possible, with a suitable external OCaml tool)?

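(* Read the whole file as raw bytes: open_in_bin performs no translation. *)
let read_whole_file filename =
  let janet = open_in_bin filename in
  let n = in_channel_length janet in
  let b = Buffer.create n in
  (* Copy exactly n bytes from the channel into the buffer. *)
  let () = Buffer.add_channel b janet n in
  let () = close_in janet in
  Buffer.contents b;;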

Don’t rely on what the toplevel gives you back: it’s a rendering, and it depends on a lot of things (e.g. your terminal encoding).

open_in_bin does what it says: it reads bytes.

If your file is in Latin-1, what you can do is read it byte by byte (input_char), apply Uchar.of_char to each byte, and add the results to your buffer with Buffer.add_utf_8_uchar.
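The approach above could be sketched as follows. This is only an illustration, assuming the input really is intact ISO-8859-1 (Windows-1252 differs in the 0x80..0x9F range); the names utf_8_of_latin1 and convert_file are hypothetical helpers, not Stdlib functions. For brevity it converts a whole string at once rather than calling input_char in a loop.

```ocaml
(* Re-encode a Latin-1 (ISO-8859-1) byte string as UTF-8.
   Each Latin-1 byte denotes the Unicode code point of the same value. *)
let utf_8_of_latin1 (s : string) : string =
  let b = Buffer.create (String.length s * 2) in
  String.iter (fun c -> Buffer.add_utf_8_uchar b (Uchar.of_char c)) s;
  Buffer.contents b

(* Hypothetical helper: read [src] as raw bytes and write the
   UTF-8 re-encoding to [dst]. *)
let convert_file ~src ~dst =
  let ic = open_in_bin src in
  let raw = really_input_string ic (in_channel_length ic) in
  close_in ic;
  let oc = open_out_bin dst in
  output_string oc (utf_8_of_latin1 raw);
  close_out oc
```

For example, utf_8_of_latin1 "caf\233" gives "caf\195\169", since é is byte 0xE9 in Latin-1 and the two bytes 0xC3 0xA9 in UTF-8. ASCII bytes pass through unchanged.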

A simpler way is to use the iconv CLI tool with iconv -f ISO-8859-1 -t UTF-8. Alternatively, if you use Emacs, open your file and use M-x set-buffer-file-coding-system.

I feel there has been some kind of misunderstanding, so let me clarify by adding one more snippet :

In the code above, the first line renders the text string in a terminal-dependent way that shouldn’t be trusted; so far we completely agree. Similarly, the text_length variable does not count the number of characters in the text, but the number of bytes in some encoding.
But what about the last line? Are you also saying that the two sequences of 239,191,189s there are “terminal-dependent, unreliable output”? I thought these were the raw bytes.

Doesn’t work for me: my web browsers all render the non-ASCII chars as “¿½” in the new file (even though I changed the <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"> in the header to <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">).

Hi,

Are you sure those bytes are not in the original file? It is pretty much impossible for the stdlib input routines to do anything but give you the raw bytes in the file (modulo text mode translation, which only happens on Windows).

Cheers,
Nicolás

These are certainly the raw bytes. But I don’t know how you produced your string in the first place and what they are supposed to be.

Are you reading these over the file:// protocol or over http:// ? In the latter case you need to check that your webserver is not adding headers itself, these will override those of the file.

In any case hexdumping your files should be able to confirm if you have the right sequences at the file level.
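If a hexdump tool is not at hand, a rough equivalent can be written in OCaml itself. The sketch below lists only the non-ASCII bytes of a string with their offsets; non_ascii_bytes and print_non_ascii are hypothetical names introduced for illustration.

```ocaml
(* Return the non-ASCII bytes of [s] as (offset, two-digit hex) pairs. *)
let non_ascii_bytes (s : string) : (int * string) list =
  let acc = ref [] in
  String.iteri
    (fun i c ->
      if Char.code c >= 0x80 then
        acc := (i, Printf.sprintf "%02X" (Char.code c)) :: !acc)
    s;
  List.rev !acc

(* Print one line per non-ASCII byte. *)
let print_non_ascii (s : string) : unit =
  List.iter
    (fun (i, h) -> Printf.printf "offset %d: 0x%s\n" i h)
    (non_ascii_bytes s)
```

In an intact Latin-1 file, é shows up as the single byte E9; in UTF-8 it is the pair C3 A9. A run of EF BF BD means the file already contains U+FFFD at that position.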

Well you should, because I spent some effort in my initial post to explain how I obtain the file’s content using a short read_whole_file function of mine, which uses open_in_bin.

To be clear, I’m not only interested in finding a solution that works, I also want to know where I went wrong in my first trial.

Are we agreed, then, that my function read_whole_file using open_in_bin is expected to produce a string that will contain the raw bytes without any interpretation or “encoding” ? Interpretation will occur only if, for example, I ask the terminal to display me the string.

If so, it is still beyond me how the DIFFERENT non-ASCII characters in the HTML file (web browsers render them fine) can all appear in that string as the SAME sequence ‘\239\191\189’, as though open_in_bin made some attempt at translation after all, even though it’s not expected to.

Those are completely local files on my home Mac, so the file:// protocol is used.

If I tell you I don’t understand something, please don’t tell me I should; I’m taking my time to try to help you.

I can’t infer that fact from the screenshot; I cannot guess what you are doing.

I’m afraid I cannot make sense out of that paragraph. You said the characters do not render fine. It would also likely help if you could spell out which characters you are talking about and what you do more precisely, what you expect and what you get.

In any case I think we can all guarantee you that open_in_bin will never, ever transform the bytes you are reading from a file.
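One way to convince yourself of this is a round trip through a temporary file. The sketch below writes a string with open_out_bin and reads it back with open_in_bin; round_trip is a hypothetical name used only for this illustration.

```ocaml
(* Write [s] to a temporary file in binary mode, read it back in
   binary mode, delete the file, and return what was read. *)
let round_trip (s : string) : string =
  let path = Filename.temp_file "roundtrip" ".bin" in
  let oc = open_out_bin path in
  output_string oc s;
  close_out oc;
  let ic = open_in_bin path in
  let read_back = really_input_string ic (in_channel_length ic) in
  close_in ic;
  Sys.remove path;
  read_back

let () =
  (* A string holding every byte value from 0 to 255 exactly once. *)
  let all_bytes = String.init 256 Char.chr in
  assert (round_trip all_bytes = all_bytes)
```

If the assertion holds on your machine, every byte value survives the write and the read unchanged, so any \239\191\189 sequence in the result of read_whole_file was already in the file.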

Apologies if that seemed offensive to you, and thank you so much for your patience.
I belatedly realized where the problem really came from, and that it had nothing to do with OCaml: to edit my HTML file I used the Atom text-editing app. Since Atom happened to be unable to find the correct encoding in the HTML file, doing a minuscule edit and saving from Atom was enough to transform all the non-ASCII characters into \239\191\189 characters.

Absolutely. That was my main doubt as this thread developed, and it’s now fully resolved.

Aha, yes, I should have looked that up: these are the decimal escapes for the UTF-8 encoding of the Unicode replacement character U+FFFD, which is used for decoding errors or when you transcode something to Unicode and cannot represent it.
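The correspondence can also be checked from OCaml directly. The sketch below assumes OCaml 4.06 or later (for Uchar.rep and Buffer.add_utf_8_uchar); replacement_utf_8 and count_replacements are hypothetical names introduced for illustration.

```ocaml
(* UTF-8 encoding of U+FFFD REPLACEMENT CHARACTER. *)
let replacement_utf_8 : string =
  let b = Buffer.create 3 in
  Buffer.add_utf_8_uchar b Uchar.rep;
  Buffer.contents b

(* The decimal escapes shown by the toplevel and the hexadecimal
   escapes denote the same three bytes. *)
let () =
  assert (replacement_utf_8 = "\239\191\189");
  assert (replacement_utf_8 = "\xEF\xBF\xBD")

(* Count how many times the encoded replacement character occurs in [s]. *)
let count_replacements (s : string) : int =
  let n = String.length s in
  let rec go i acc =
    if i + 3 > n then acc
    else if String.sub s i 3 = replacement_utf_8 then go (i + 3) (acc + 1)
    else go (i + 1) acc
  in
  go 0 0
```

A non-zero count on the raw contents of a file means the original characters have already been replaced at the file level, and no re-encoding step can recover them.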

You can check these things with ucharinfo, distributed with uucp. It does not support decimal escapes, though; maybe it should, since that’s what OCaml gives us (I thought this was changed at some point):

> ucharinfo \xef\xbf\xbd
Name: REPLACEMENT CHARACTER
Uchar: U+FFFD
Age: 1.1
Block: Specials
Script: Zyyy
Script_extensions: Zyyy
General_Category: So
Decomposition_Mapping: � (U+FFFD)
UTF-8: \xEF\xBF\xBD
UTF-16BE: \xFF\xFD
UTF-16LE: \xFD\xFF