I know there’s a glaring flaw in my mental model for how OCaml on Cygwin (i.e., on Windows but with Cygwin-provided OCaml) handles newlines and I’m hoping someone can elucidate the right way to think about this.
If I just open a file and write a string to it, OCaml will replace LFs (Unix line endings, \n) with CRLFs (Windows line endings, \r\n):
let file = "foo.txt" ;;
let chan = open_out file;;
output_string chan "a\nb\nc\nd!";;
close_out chan;;
This file’s hex dump confirms this. Notice the 0d0a (\r\n, the CRLF Windows line ending) where the original string had only \n:
$ xxd foo.txt
00000000: 610d 0a62 0d0a 630d 0a64 21 a..b..c..d!
This is reasonable.
My confusion starts though with this idiom for reading such a file (which I saw in the Semgrep source):
let chan = open_in file ;;
let len = in_channel_length chan;;
let res = really_input_string chan len;;
close_in chan;;
This throws an End_of_file exception, because in_channel_length chan is 11 (the number of bytes on disk: five characters plus three CRLF pairs), while really_input_string’s argument appears to be the number of characters to put in the resulting string. That string should have only 8 characters, since OCaml seems to replace CRLFs with LFs on input, so the following works:
let chan = open_in file ;;
let res = really_input_string chan (len - 3);;
close_in chan;;
I get # val res : string = "a\nb\nc\nd!" when I ask really_input_string not for the number of bytes on disk but for the number of characters those bytes will be converted to.
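Rather than hard-coding len - 3, I suppose a reader could avoid in_channel_length altogether and just read until end of file, so the count of translated bytes never matters. A minimal sketch of what I mean (read_all is a name I made up, the 4096-byte chunk size is arbitrary, and I’m assuming OCaml 4.08+ for Fun.protect):

```ocaml
(* Hypothetical helper: read everything from a text-mode channel without
   relying on in_channel_length, which counts bytes on disk. *)
let read_all (path : string) : string =
  let chan = open_in path in
  let buf = Buffer.create 4096 in
  let chunk = Bytes.create 4096 in
  (* input returns 0 only at end of file, so loop until then. *)
  let rec loop () =
    let n = input chan chunk 0 (Bytes.length chunk) in
    if n > 0 then begin
      Buffer.add_subbytes buf chunk 0 n;
      loop ()
    end
  in
  Fun.protect
    ~finally:(fun () -> close_in_noerr chan)
    (fun () -> loop (); Buffer.contents buf)
```

This still leaves the newline translation to the platform; it only sidesteps the length mismatch.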
I’m tempted to think this is a bug? On macOS, I can run really_input_string chan (in_channel_length chan) on the same file with CRLF endings and it works fine, loading the entire string with CRLFs intact. I would think the same should work for Cygwin?
Or perhaps this is intended behavior and we’re just using an incorrect, non-cross-platform way to read files from disk? For example, I can fix the code above by either opening the channel in binary mode (open_in_bin instead of open_in) or setting the channel to binary mode after it’s opened (via set_binary_mode_in). In both cases the resulting string contains CRLFs. Is reading files in binary mode the recommended cross-platform approach?
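To be concrete, this is roughly the binary-mode variant I have in mind. It is only a sketch: read_file is my own name for it, and Fun.protect needs OCaml 4.08 or later.

```ocaml
(* Hypothetical helper: read a whole file as raw bytes.
   open_in_bin disables newline translation, so in_channel_length
   (bytes on disk) matches the number of characters read. *)
let read_file (path : string) : string =
  let chan = open_in_bin path in
  Fun.protect
    ~finally:(fun () -> close_in_noerr chan)
    (fun () -> really_input_string chan (in_channel_length chan))
```

The catch is that any CRLFs on disk come back as-is, so the caller has to be prepared to see \r\n.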
(A third possibility might be that when we write the file, we should write in binary mode to avoid OCaml’s platform-specific translation? This is challenging because in the Semgrep example above, the file is written with Common2.with_tmp_file, which abstracts away the channel.)
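If the channel were under my control, I imagine the write side would look something like this sketch (write_file is a hypothetical name of mine, not something from Semgrep or Common2):

```ocaml
(* Hypothetical helper: write a string to disk byte-for-byte.
   open_out_bin disables newline translation, so "\n" stays a single LF. *)
let write_file (path : string) (contents : string) : unit =
  let chan = open_out_bin path in
  match output_string chan contents with
  | () -> close_out chan
  | exception e ->
    (* Close without masking the original error, then re-raise it. *)
    close_out_noerr chan;
    raise e
```

With this, I would expect the 8-character string from my example to occupy 8 bytes on disk rather than 11.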
I entirely place myself at the mercy of your greater experience. It feels like a bug that in Cygwin, when I use this natural-seeming idiom of open_out, output_string, open_in, and really_input_string, an exception is thrown, but I’m happy to be told that’s how it’s supposed to be. Thank you!
N.B. This happens both in plain Cygwin and with OCaml for Windows, which makes sense since the latter is built on top of Cygwin.