Summary of issue:
I need to get the raw binary content of a Unix file. I thought that input_byte was exactly the tool for this, but apparently it is not (see below). So I'm asking what the correct way to do this in OCaml is, if it is possible at all.
Context and details:
I have downloaded a static copy of an old web app for my private offline use. By default the text encoding in my Firefox is "Unicode", which makes the pages display incorrectly. They display correctly when I manually set the encoding to "Western". Since redoing that manual correction every time is a drag, I'm trying to edit the sources so that they are encoded in Unicode. Since all those pages have a <charset="windows-1252"> in their HTML, I first tried iconv -f WINDOWS-1252 -t UTF-8, but the output was still incorrect, just in a different way.
So I tried something like this in OCaml:
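(* Open in binary mode so that no newline translation is applied. *)
let my_channel = open_in_bin my_badly_encoded_file;;
let n = in_channel_length my_channel;;
(* Accumulate the bytes (as ints in 0..255) in reverse order. *)
let accu = ref [];;
for k = 1 to n do accu := (input_byte my_channel) :: !accu done;;
close_in my_channel;;
(* Restore file order. *)
let raw_bytes = List.rev !accu;;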
But when I look at the contents of raw_bytes, I see that the non-ASCII characters are not distinguished from one another: they are all represented by the same single combination \239-\191-\189. So input_byte seems to be acting here like a browser that misreads the characters and displays them all with some ad hoc single character.
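
One thing that may be worth ruling out: the bytes 239, 191, 189 are 0xEF 0xBF 0xBD, which is the UTF-8 encoding of U+FFFD, the Unicode replacement character. The following is only a minimal sketch of how one might check whether that sequence is already stored in the file on disk. It assumes OCaml 4.08 or later (for Fun.protect), and the names read_raw and count_replacement_chars are made up for this illustration.

```ocaml
(* Read a whole file as a string of raw bytes; no decoding takes place. *)
let read_raw (path : string) : string =
  let ic = open_in_bin path in
  Fun.protect
    ~finally:(fun () -> close_in_noerr ic)
    (fun () -> really_input_string ic (in_channel_length ic))

(* Count occurrences of the byte sequence 239 191 189 (0xEF 0xBF 0xBD),
   i.e. the UTF-8 encoding of U+FFFD REPLACEMENT CHARACTER. *)
let count_replacement_chars (s : string) : int =
  let n = String.length s in
  let rec go i acc =
    if i + 2 >= n then acc
    else if s.[i] = '\239' && s.[i + 1] = '\191' && s.[i + 2] = '\189'
    then go (i + 3) (acc + 1)
    else go (i + 1) acc
  in
  go 0 0
```

Calling count_replacement_chars (read_raw my_badly_encoded_file) gives the number of such sequences. If it is nonzero, that would suggest the replacement characters are part of the saved file itself rather than something introduced while reading it, although this sketch alone cannot say how they got there.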