How different is the behaviour of `read_line` and `Scanf.scanf "%[^\n]"`?

Hello guys.
Recently, my hobby is solving algorithm problems with OCaml.
I usually use BOJ and Hacker Rank just because they have good supports for OCaml.

While solving 1152 in BOJ, I found something strange: when receiving string input, using Scanf.scanf "%[^\n]" works fine but using read_line gives me a runtime error on the certain test case!

For those who cannot read Korean (since the problem is described in Korean), this problem is to count the number of words in the input sentence that contains spaces. The max length of the input sentence is 1,000,000 and there are no successive spaces. Also, all words are separated by a single space.

I want to simply solve this problem as follow:

let count_words str =
  let str = String.trim str in
  if str = "" then 0 else (
  let words = String.split_on_char ' ' str in
  List.length words)

let () = 
  let s = read_line () in
  print_int (count_words s)

But this gives me a runtime error!
However, just changing the input method from read_line to Scanf.scanf as follow worked just fine!

let count_words str =
  let str = String.trim str in
  if str = "" then 0 else (
  let words = String.split_on_char ' ' str in
  List.length words)

let () = 
  Scanf.scanf "%[^\n]" (fun s -> 
  print_int (count_words s)
)

Since the site does not provide the whole test cases nor the runtime error messages, so I have no clue about which input caused this error or what happened during the execution. But in my humble opinion, both methods must work fine. Am I right?
In the description about read_line function of Pervasives documentation (whose version is the same as used for grading), it says:

Flush standard output, then read characters from standard input until a newline character is encountered. Return the string of all characters read, without the newline character at the end.

Why does read_line give runtime error, while Scanf.scanf works just fine?
Thank you.

PS: I know that there are more efficient and faster solutions for this problem, and I already have 28ms of satisfactory another solution as well. So please do not argue with the complexity of the algorithm used here. I just want to know why read_line gives me a runtime error!

1 Like

It looks like read_line raises End_of_file if it encounters the end of file immediately, instead of returning an empty string: "". Perhaps one of the inputs is an empty file?

Thanks for your suggestion!
I’ve submitted with the code below that tries to catch an exception that read_line would raise, but it still won’t pass and raised runtime exception. :cry:

let count_words str =
  let str = String.trim str in
  if str = "" then 0 else (
  let words = String.split_on_char ' ' str in
  List.length words)

let () = 
  try
  (let s = read_line () in
  print_int (count_words s))
  with _ -> ()

I went down a pretty deep rabbit hole here. This code definitely shouldn’t raise an exception, so I looked into the implementation of read_line.

So, I’m looking for segfaults. There is a call to bytes_unsafe_to_string at the end, but it looks fine because the function never touches the bytes after it calls that. Yet, if you modify the code to copy the string returned by read_line, the answer appears to be accepted, e.g.:

let count_words str =
  let str = String.trim str in
  if str = "" then 0 else (
  let words = String.split_on_char ' ' str in
  List.length words)

let () = 
  try
  (let s = String.copy (read_line ()) in
  print_int (count_words s))
  with _ -> ()

Curiously, if I add a line before the main program to force a GC cycle:

let () = Gc.full_major ()

Then the crash doesn’t occur. Even checking the GC allocation policy appears to perturb something enough that the crash doesn’t occur:

let () = assert ((Gc.get ()).allocation_policy = 0)

I don’t really understand why this is the case. FWIW, the version of OCaml used by the website is 4.07.0, and it’s using the byte code compiler instead of native code. 4.07.0 has a known crash with the first-fit allocation policy, but that doesn’t appear to be in play (see the assertion added above).

I would try fuzzing, but I don’t think the OCaml fuzzing support works in bytecode mode, so I don’t know if it would matter much.

Perhaps it might be worth reaching out to the site admins to get more details about how the code is run, and perhaps which test case crashes. This certainly seems like a very strange segfault.

2 Likes

, my hobby is solving algorithm problems with OCaml.

I like doing this a lot also. I have actually done this with ocaml, scala, haskell, and java, so far. Consider trying Rosalind as well.

I like doing the simple problems as well, and rewriting the answers with different library functions and styles until I like the design.

1 Like

@bcc32 Wow, thank you so much for your kind explanation and solution!!! This really perfectly solved my question!
Indeed, I found the bug fix for OCaml 4.07.0 regarding GC crashes, so even if you said that this doesn’t appear to be in play, but I think this might be related in some way. Because, including your perturbation about Gc module, just calling any Gc library made this crash disappear. Like ignore (Gc.get ()) or ignore (Gc.stat ()) or ignore (Gc.minor ()), … etc. So it is clear that how the read_line function interoperates with garbage collector makes this very strange segfault (which I don’t know why deeply :sweat_smile:)

I’ve already requested to change compilation as native code and upgrade the version of OCaml to the admin. So I will wait for the update!
Many thanks!!!

1 Like

@kanishka
Thanks to you, I get to know this good site called Rosalind. It seems they don’t require any code submission, just output data, so it is really the sweet spot for OCaml problem solving!

I also like to rewrite the solution that I’ve wrote in other languages like c++ or python into OCaml referring to well known open source designs. Good to have someone with a hobby like me!