In Cygwin, OCaml outputs CRLF line endings but inputs only LFs, leading to End_of_file

I know there’s a glaring flaw in my mental model for how OCaml on Cygwin (i.e., on Windows but with Cygwin-provided OCaml) handles newlines and I’m hoping someone can elucidate the right way to think about this.

If I just open a file and write a string to it, OCaml will replace LFs (Unix line endings, \n) with CRLFs (Windows line endings, \r\n):

let file = "foo.txt" ;;

let chan = open_out file;;
output_string chan "a\nb\nc\nd!";;
close_out chan;;

This file’s hex dump confirms this: notice the 0d0a (\r\n, the CRLF Windows line ending) wherever the original string had only \n:

$ xxd foo.txt
00000000: 610d 0a62 0d0a 630d 0a64 21              a..b..c..d!

This is reasonable.

My confusion starts, though, with this idiom for reading such a file (which I saw in the Semgrep source):

let chan = open_in file ;;
let len = in_channel_length chan;;
let res = really_input_string chan len;;
close_in chan;;

This throws an End_of_file exception because in_channel_length chan is 11 (the number of bytes on disk: five characters plus three CRLF pairs), while really_input_string’s length argument appears to count the characters of the string to fill in. The resulting string should have only 8 characters, since OCaml seems to replace CRLFs with LFs on input, so the following works:

let chan = open_in file ;;
let res = really_input_string chan (len - 3);;
close_in chan;;

I get # val res : string = "a\nb\nc\nd!" when I ask really_input_string for not the number of bytes on disk but the number of characters those bytes will be converted to.

I’m tempted to think this is a bug? On macOS, I can run really_input_string chan (in_channel_length chan) on the same file with CRLF endings and it works fine, and loads the entire string with CRLFs. I would think the same should work for Cygwin?

Or perhaps this is intended behavior and we’re just using an incorrect, non-cross-platform way to read files from disk? For example, I can fix the code above either by opening the channel in binary mode (open_in_bin instead of open_in) or by setting the channel to binary mode after it’s opened (via set_binary_mode_in); in both cases the resulting string contains CRLFs. Is reading files in binary mode the recommended cross-platform approach?
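For concreteness, here is a minimal sketch of those two binary-mode variants. The function names are my own and purely illustrative; both should return the bytes exactly as they are on disk, CRLFs included:

```ocaml
(* Variant 1: open the channel in binary mode from the start. *)
let read_with_open_in_bin file =
  let chan = open_in_bin file in
  let len = in_channel_length chan in
  let res = really_input_string chan len in
  close_in chan;
  res

(* Variant 2: open in text mode, then switch to binary mode
   before reading anything from the channel. *)
let read_with_set_binary_mode file =
  let chan = open_in file in
  set_binary_mode_in chan true;
  let len = in_channel_length chan in
  let res = really_input_string chan len in
  close_in chan;
  res
```

In both variants no end-of-line translation takes place, so in_channel_length and really_input_string agree on the byte count.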

(A third possibility might be, when we write the file we should write in binary mode to avoid OCaml’s platform-specific translation? This is challenging because in the Semgrep example above, the file is written with Common2.with_tmp_file which abstracts away the channel.)
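A sketch of that third option, assuming we control the writer (which, as noted, is not the case with Common2.with_tmp_file). The helper name write_file_bin is hypothetical:

```ocaml
(* Write [contents] to [file] without any end-of-line translation:
   a '\n' in the string stays a single LF byte on disk on every platform. *)
let write_file_bin file contents =
  let chan = open_out_bin file in
  Fun.protect
    ~finally:(fun () -> close_out chan)
    (fun () -> output_string chan contents)
```

With this writer, "a\nb\nc\nd!" should occupy 8 bytes on disk, so the length reported by in_channel_length matches the string that was written.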

I entirely place myself at the mercy of your greater experience—it feels like a bug that in Cygwin, when I use this natural-seeming idiom of open_out, output_string, open_in, and really_input_string, an exception is thrown, but I’m happy to be told that’s how it’s supposed to be. Thank you!

N.B. This happens both in plain Cygwin and with OCaml for Windows, which makes sense since the latter is built on top of Cygwin.

I think that’s how it is supposed to work. If you are going to use really_input_string you should open your channel in binary mode. In general “text mode” is a challenging concept: even if convenient for some uses, byte-level operations often do not work the way one would expect them to on channels opened in this way.
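If text-mode translation is actually what you want, one possible sketch is to avoid in_channel_length altogether and read until end of file. The name read_all_text and the 4096-byte chunk size are arbitrary choices for illustration:

```ocaml
(* Read a whole text-mode channel without trusting in_channel_length:
   keep calling [input] until it returns 0, which signals end of file. *)
let read_all_text filename =
  let ic = open_in filename in
  Fun.protect
    ~finally:(fun () -> close_in ic)
    (fun () ->
       let buf = Buffer.create 4096 in
       let chunk = Bytes.create 4096 in
       let rec loop () =
         let n = input ic chunk 0 (Bytes.length chunk) in
         if n > 0 then begin
           Buffer.add_subbytes buf chunk 0 n;
           loop ()
         end
       in
       loop ();
       Buffer.contents buf)
```

Because the loop only counts the characters actually delivered by the channel, it does not depend on the on-disk byte count matching the translated length.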

See “Windows: bad interaction between eol translation and channel_length/seek” (Issue #9868 in ocaml/ocaml on GitHub) for some related recent discussion on this issue.

Cheers,
Nicolás

I think you found the bug. This should use open_in_bin. It’s not reasonable to expect a generic “read file contents” function to modify the file contents depending on the platform. Or at least not in the context of semgrep or pfff. The fewer differences due to the platform, the better. All parsers should be able to work properly with either CRLF or LF as line terminators.

We should probably discuss this elsewhere, but I recommend changing the code in pfff so that the same function or block of code both opens and closes the file. I don’t see an advantage of leaving a channel open and using seek to move to the beginning to do something else with it.

I’d change the interface to “take a file name and return its length” so we control the opening mode.

The code I recommend would be:

let read_all filename =
  let ic = open_in_bin filename in
  Fun.protect
    ~finally:(fun () -> close_in ic)
    (fun () ->
       let len = in_channel_length ic in
       really_input_string ic len
    )

(it’s a little sad that we have to implement this ourselves, but minimalism has its benefits)

This is interesting and reminds me of a suggestion I made recently, i.e. if we define a let-operator:

let ( let& ) ch fn =
  Fun.protect ~finally:(fun () -> close_in ch) begin fun () ->
    fn ch
  end

Then we can express this as:

let read_all filename =
  let& ic = open_in_bin filename in
  let len = in_channel_length ic in
  really_input_string ic len