I have a somewhat old HTML file and I want to rewrite it using a UTF-8 encoding. My idea is to put the whole file contents into a string in an OCaml toplevel, use suitable find/replaces for the few characters that need it, then put the re-encoded string back into the file.
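As a rough sketch of the re-encoding step, assuming the old file is in ISO-8859-1 (Latin-1), which nothing above confirms: in that encoding every byte value equals the Unicode code point of the character, so the whole string can be converted in one pass instead of by per-character find/replace. The name latin1_to_utf8 is hypothetical, and Buffer.add_utf_8_uchar requires OCaml 4.06 or later.

```ocaml
(* Sketch: convert a Latin-1 (ISO-8859-1) byte string to UTF-8.
   Assumption: the input really is Latin-1. If it is Windows-1252,
   bytes 0x80-0x9F would need a separate mapping table. *)
let latin1_to_utf8 (s : string) : string =
  (* Each Latin-1 byte becomes at most two UTF-8 bytes. *)
  let b = Buffer.create (2 * String.length s) in
  String.iter
    (fun c -> Buffer.add_utf_8_uchar b (Uchar.of_int (Char.code c)))
    s;
  Buffer.contents b

let () =
  (* "\233" is e-acute in Latin-1; its UTF-8 encoding is the bytes 195 169. *)
  assert (latin1_to_utf8 "caf\233" = "caf\195\169");
  (* Pure ASCII text is left unchanged. *)
  assert (latin1_to_utf8 "<p>hello</p>" = "<p>hello</p>")
```

ASCII bytes pass through untouched, so only the accented characters grow to two bytes each.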
I have my homemade read_whole_file function as below, and since I use open_in_bin in it, I naively imagined that (as the manual puts it) “no translation takes place” during reads, so that what I get is the raw, lowest-level representation of the contents of the HTML file as a mere array of bytes.
I was wrong! As I learned the hard way, all the non-ASCII characters in the HTML file are rendered as \239\191\189 (the UTF-8 encoding of U+FFFD, the replacement character), aka “I couldn’t read that character, sorry”.
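One possibility worth ruling out is that the three bytes \239\191\189 are already stored in the file on disk, for example because an editor or an earlier conversion replaced the original characters before OCaml ever read them. A minimal sketch for counting them in a string follows; the names replacement and count_substring are hypothetical helpers, not part of the Stdlib.

```ocaml
(* U+FFFD REPLACEMENT CHARACTER in UTF-8: bytes 0xEF 0xBF 0xBD. *)
let replacement = "\239\191\189"

(* Count non-overlapping occurrences of [sub] in [s]. *)
let count_substring (sub : string) (s : string) : int =
  let m = String.length sub and n = String.length s in
  let rec go i acc =
    if m = 0 || i + m > n then acc
    else if String.sub s i m = sub then go (i + m) (acc + 1)
    else go (i + 1) acc
  in
  go 0 0

let () =
  (* Illustrative sample only: a string in which two characters were lost. *)
  let sample = "caf\239\191\189 cr\239\191\189me" in
  assert (count_substring replacement sample = 2);
  (* A raw Latin-1 byte is not a replacement character. *)
  assert (count_substring replacement "caf\233" = 0)
```

If the count is non-zero for the string returned by a binary read, the original characters are probably gone from that copy of the file, and the conversion would have to start from an uncorrupted copy.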
The OCaml manual also says, however, that “On operating systems that do not distinguish between text mode and binary mode, open_in_bin behaves like open_in”. My operating system is macOS Mojave 10.14.6. I don’t know whether its Unix distinguishes between text mode and binary mode; I would have thought it does.
How can I deal with this, using the OCaml Stdlib (or, if that’s not possible, with a suitable external OCaml tool)?
let read_whole_file filename =
  let janet = open_in_bin filename in
  let n = in_channel_length janet in
  let b = Buffer.create n in
  Buffer.add_channel b janet n;
  close_in janet;
  Buffer.contents b;;
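To check whether binary reads on a given system really return the bytes unchanged, a small round-trip test can help. The sketch below uses hypothetical helpers read_bytes and write_bytes (the latter also covers the "put the string back into the file" step); it assumes OCaml 4.02 or later for really_input_string and for exception cases in match.

```ocaml
(* Hypothetical helper: read a whole file as raw bytes.
   The channel is closed whether or not the read succeeds. *)
let read_bytes (filename : string) : string =
  let ic = open_in_bin filename in
  match really_input_string ic (in_channel_length ic) with
  | s -> close_in ic; s
  | exception e -> close_in_noerr ic; raise e

(* Hypothetical helper: overwrite a file with the given raw bytes. *)
let write_bytes (filename : string) (s : string) : unit =
  let oc = open_out_bin filename in
  match output_string oc s with
  | () -> close_out oc
  | exception e -> close_out_noerr oc; raise e

let () =
  (* Round trip through a temporary file: a Latin-1 byte and a CRLF
     line ending should both come back exactly as written. *)
  let path = Filename.temp_file "roundtrip" ".html" in
  let original = "<p>caf\233</p>\r\n" in
  write_bytes path original;
  let back = read_bytes path in
  Sys.remove path;
  assert (back = original)
```

If this assertion holds, the channel functions are not altering the data, and any \239\191\189 sequences seen in the string would have to come from the file itself.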