Is input_byte not the right tool to retrieve the binary content of a Unix text file?

Summary of issue:

I need to get the raw binary content of a Unix file, and I thought that input_byte was exactly the tool for this, but it seems not, see below. So I’m asking for the correct way to do this in OCaml if it is possible.

Context and details:

I have downloaded a static copy of an old web app for my offline private use. By default the encoding in my Firefox is “Unicode”, which makes the pages display incorrectly. They display correctly when I manually set the encoding to “Western”. Since that manual correction is a drag to redo every time, I’m trying to convert the sources so that they are encoded in Unicode. Since all those pages have a <charset="windows-1252"> in their HTML, I first tried iconv -f WINDOWS-1252 -t UTF-8, but the output was still incorrect, just in a different way.

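So I tried something like this in OCaml:

(* my_badly_encoded_file is the path of the file to inspect *)
let my_channel = open_in_bin my_badly_encoded_file;;
let n = in_channel_length my_channel;;
let accu = ref [];;
(* read the n bytes one at a time, accumulating them in reverse order *)
for k = 1 to n do accu := input_byte my_channel :: !accu done;;
close_in my_channel;;
(* restore file order *)
let raw_bytes = List.rev !accu;;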

But when I look at the contents of raw_bytes, I see that the non-ASCII characters are not distinguished from one another: they are all represented by the single combination \239-\191-\189. So input_byte seems to be acting here like a browser misreading the characters and displaying them all with some ad hoc single character.
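
For comparison, here is a minimal sketch of a more compact way to get the same list of byte values, reading the whole file in one call instead of looping over input_byte. The helper name read_raw_bytes is hypothetical, and the sketch assumes OCaml 4.08 or later (for Fun.protect) and a file small enough to hold in memory.

```ocaml
(* read_raw_bytes is a hypothetical helper: it returns the bytes of the
   file at [path] as a list of integers in the range 0..255. *)
let read_raw_bytes path =
  let ic = open_in_bin path in
  Fun.protect
    ~finally:(fun () -> close_in ic)
    (fun () ->
      let n = in_channel_length ic in
      (* really_input_string reads exactly n bytes, with no decoding *)
      let s = really_input_string ic n in
      List.init n (fun i -> Char.code s.[i]))
```

It should produce the same values as raw_bytes above, since neither approach interprets the bytes in any way.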

Which platform are you using? You say it is a Unix text file - should I assume you are on Linux?

  1. Use hexdump on the file.
  2. Instead of doing the dance with accu, just print the bytes, in hex, as you read them…
  3. See if you see any interesting difference. If not, perhaps the problem isn’t what you think it is.
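
Step 2 could be sketched roughly as follows. The function name dump_hex and the 16-bytes-per-line layout are assumptions made for illustration; the return value (the number of bytes read) is only there to make the result easy to check.

```ocaml
(* dump_hex is a hypothetical helper: it prints every byte of the file at
   [path] in hexadecimal as it is read, 16 bytes per line, and returns
   the number of bytes read. *)
let dump_hex path =
  let ic = open_in_bin path in
  let count = ref 0 in
  (try
     while true do
       let b = input_byte ic in          (* raises End_of_file at the end *)
       Printf.printf "%02x " b;
       incr count;
       if !count mod 16 = 0 then print_newline ()
     done
   with End_of_file -> ());
  if !count mod 16 <> 0 then print_newline ();
  close_in ic;
  !count
```

The printed values can then be compared by eye with the output of hexdump -C on the same file.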

No, it’s a Mac (10.13.2) I’m using.

Not meaningfully different for these purposes.

You’re right, input_byte gives the “same” output as hexdump, so it seems that input_byte is not the culprit. I’ve now found that using latin-9 as the source encoding in iconv works, so I no longer need OCaml to solve my problem. Sorry to have bothered you with a question that turned out not to be about OCaml, and thanks for the help anyway.