Input_byte not the tool to use to retrieve binary content of Unix text file?

jonathandoyle · April 21, 2018, 10:14am

Summary of issue:

I need to get the raw binary content of a Unix file, and I thought that input_byte was exactly the tool for this, but it seems not, see below. So I’m asking for the correct way to do this in OCaml if it is possible.

Context and details:

I have downloaded a static copy of an old web app for my offline private use. By default the encoding in my Firefox is “Unicode”, which makes the pages display incorrectly. They display correctly when I manually set the encoding to “Western”. Since that manual correction is a drag to redo every time, I’m trying to edit the sources so that they are written in Unicode. Since all those pages have a <charset="windows-1252"> in their HTML, I first tried iconv -f WINDOWS-1252 -t UTF-8, but the output was still incorrect, just in a different way.

So I tried something like this in OCaml:

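(* my_badly_encoded_file is the path to the file, as a string *)
let my_channel = open_in_bin my_badly_encoded_file;;
let n = in_channel_length my_channel;;
let accu = ref [];;
(* read the file one byte at a time; each byte is an int in 0..255 *)
for _k = 1 to n do accu := input_byte my_channel :: !accu done;;
close_in my_channel;;
(* the bytes were accumulated in reverse, so restore file order *)
let raw_bytes = List.rev !accu;;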

But when I look at the contents of raw_bytes, I see that the non-ASCII characters are not distinguished from one another: they are all represented by the single combination \239-\191-\189. So input_byte seems to be acting here like a browser that misreads the characters and displays them all as the same ad hoc placeholder character.
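For reference, the bytes 239, 191, 189 (0xEF 0xBF 0xBD) are the UTF-8 encoding of U+FFFD, the Unicode replacement character. If that sequence really is in the file on disk, it may have been written by an earlier lossy conversion rather than produced by input_byte. A minimal sketch for checking this, where read_file and count_replacement_chars are hypothetical helper names and not part of the standard library:

```ocaml
(* Read a whole file as raw bytes; no decoding is performed. *)
let read_file path =
  let ic = open_in_bin path in
  let s = really_input_string ic (in_channel_length ic) in
  close_in ic;
  s

(* Count occurrences of the byte sequence EF BF BD, i.e. the UTF-8
   encoding of U+FFFD (the Unicode replacement character). *)
let count_replacement_chars s =
  let n = String.length s in
  let rec go i acc =
    if i + 2 >= n then acc
    else if s.[i] = '\xEF' && s.[i + 1] = '\xBF' && s.[i + 2] = '\xBD'
    then go (i + 3) (acc + 1)
    else go (i + 1) acc
  in
  go 0 0
```

Calling count_replacement_chars (read_file my_badly_encoded_file) would then report how many such sequences the file contains. A non-zero count would suggest that the original non-ASCII bytes are already gone from that copy of the file.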

thomas_ridge · April 22, 2018, 10:19pm

Which platform are you using? You say it is a Unix text file - should I assume you are on Linux?

perry · April 23, 2018, 1:23am

Use hexdump on the file.
Instead of doing the dance with accu, just print the bytes, in hex, as you read them…
See if you see any interesting difference. If not, perhaps the problem isn’t what you think it is.
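A minimal sketch of that suggestion, assuming the goal is simply to compare against hexdump output by eye. The names hex_of_string and dump_hex are hypothetical helpers; only open_in_bin, input_byte and close_in come from the standard library:

```ocaml
(* Render each byte of a string as two lowercase hex digits,
   separated by single spaces. *)
let hex_of_string s =
  String.concat " "
    (List.map
       (fun c -> Printf.sprintf "%02x" (Char.code c))
       (List.init (String.length s) (String.get s)))

(* Print every byte of the file at [path] in hex as it is read,
   without accumulating anything in memory. *)
let dump_hex path =
  let ic = open_in_bin path in
  (try
     while true do
       Printf.printf "%02x " (input_byte ic)
     done
   with End_of_file -> ());
  print_newline ();
  close_in ic
```

Running dump_hex my_badly_encoded_file and comparing the result with hexdump -C on the same file should show whether OCaml sees the same bytes as the command-line tool.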

jonathandoyle · April 23, 2018, 9:10am

No, it’s a Mac (10.13.2) I’m using.

perry · April 23, 2018, 1:30pm

Not meaningfully different for these purposes.

jonathandoyle · April 23, 2018, 5:52pm

You’re right, input_byte gives the “same” output as hexdump, so it seems that input_byte is not the culprit. I’ve now found that using latin-9 as the source encoding in iconv works, so I no longer need OCaml to solve my problem. Sorry to have bothered you with a question that turned out not to be about OCaml, and thanks for the help anyway.
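For anyone who does want to do the same conversion in OCaml, here is a rough sketch, assuming the input really is Latin-9 (ISO-8859-15) and OCaml 4.06 or later for Buffer.add_utf_8_uchar. Latin-9 matches Latin-1 except at eight byte values, which is what the hypothetical helper latin9_to_code_point encodes. utf8_of_latin9 is likewise an illustrative name, not a library function:

```ocaml
(* Map a Latin-9 (ISO-8859-15) byte to its Unicode code point.
   Only eight positions differ from Latin-1, where the byte value
   equals the code point. *)
let latin9_to_code_point b =
  match b with
  | 0xA4 -> 0x20AC (* euro sign *)
  | 0xA6 -> 0x0160 (* S with caron *)
  | 0xA8 -> 0x0161 (* s with caron *)
  | 0xB4 -> 0x017D (* Z with caron *)
  | 0xB8 -> 0x017E (* z with caron *)
  | 0xBC -> 0x0152 (* OE ligature *)
  | 0xBD -> 0x0153 (* oe ligature *)
  | 0xBE -> 0x0178 (* Y with diaeresis *)
  | _ -> b

(* Re-encode a Latin-9 byte string as UTF-8. *)
let utf8_of_latin9 s =
  let buf = Buffer.create (String.length s + 16) in
  String.iter
    (fun c ->
      let cp = latin9_to_code_point (Char.code c) in
      Buffer.add_utf_8_uchar buf (Uchar.of_int cp))
    s;
  Buffer.contents buf
```

This only illustrates the kind of mapping that iconv -f latin-9 -t UTF-8 performs. For real use iconv itself is the safer choice.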
