Help Review the new "File Manipulation" tutorial on OCaml.org

Hey everyone,

there’s a new version of the “File Manipulation” tutorial on

https://staging.ocaml.org/docs/file-manipulation

For comparison, the old version of this tutorial is here: File Manipulation · OCaml Tutorials.

Thanks for taking a look and giving feedback and suggestions for revising this! :slight_smile:


Thanks a lot for writing such tutorials! :pray:

As someone who started learning OCaml only a few months ago, I find articles like this especially helpful, since I can learn OCaml bit by bit :100:

Great content. And this version is much better than the previous one! :clap:


It looks very good, great job :+1:

It has a “cookbook” feel which I quite like.

Since I know you’re open to suggestions, I thought I’d point out that a dedicated “cookbook” section could make a great addition to ocaml.org.


I’m afraid I don’t have the time to make extensive remarks, as it would amount to a full rewrite of the tutorial. But basically:

  1. It is not a good idea to mostly eschew error handling and correct resource handling (i.e. use of Fun.protect ~finally). It’s precisely in this kind of tutorial that good habits should be established. There are too many functions here that people will cut and paste and that can leak file descriptors, including the first read_from_file, which claims to handle errors, and the second read_from_file, which is plain wrong (racy, to be precise). If you want to show incorrect usage, you should at least put a comment in the code.

  2. Don’t promote the use of Sys.file_exists. It’s a broken function that will only lead to head-scratching (or pure anger on a bad day) for both programmers and end users: it turns permission errors into false. Do a mkdir bla && touch bla/file.txt && chmod ug-x bla and enjoy the result of

    if Sys.file_exists "bla/file.txt" then
      print_endline "File exists."
    else
      print_endline "File does not exist."
    

    to get an idea.
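For what it’s worth, here is a minimal sketch of the kind of resource handling the first point asks for. The name read_whole_file is made up for illustration and is not taken from the tutorial. It assumes OCaml 4.08 or later for Fun.protect. It opens the file directly instead of checking for existence first, so there is no check-then-open race, and the channel is closed whether or not reading raises.

```ocaml
(* Hypothetical helper: read a whole file, closing the channel on every path. *)
let read_whole_file (path : string) : string =
  (* open_in_bin raises Sys_error if the file cannot be opened. *)
  let ic = open_in_bin path in
  Fun.protect
    ~finally:(fun () -> close_in_noerr ic)
    (fun () -> really_input_string ic (in_channel_length ic))

(* Hypothetical caller: report the error instead of hiding it. *)
let print_file (path : string) : unit =
  match read_whole_file path with
  | contents -> print_string contents
  | exception Sys_error msg -> prerr_endline ("cannot read file: " ^ msg)
```

With the bla/file.txt setup above, print_file should report the Sys_error raised by the failed open rather than claiming the file does not exist.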


My understanding of the overall sentiment here (and from the feedback on the PR itself) is this:

  1. cookbook-style presentation of different recipes / things people may want to do is helpful
  2. actual code examples show some bad practices - these must be fixed to promote good practices

Tangent: If Sys.file_exists is a broken function…

  1. is there any way to fix that in the long-term?
  2. Is anyone keeping a list of such “broken Stdlib functions”?
  3. Is there a stable package that provides a safer API to the file system?

The problem is that the function currently never raises. IIRC it turns all Unix errors into false. The correct fix would be to make it raise Sys_error on Unix errors, but some programs may rely on the fact that Sys.file_exists never raises.

The Unix module.
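As a rough illustration of what the Unix module allows here, the sketch below uses Unix.stat and keeps “missing” separate from “could not tell”. The type existence and the function check_exists are hypothetical names invented for this example, and the program has to be linked against the unix library.

```ocaml
(* Requires the unix library (for example, the findlib package "unix"). *)

(* Hypothetical result type: a third case for errors such as EACCES. *)
type existence =
  | Exists
  | Missing
  | Unknown of Unix.error

(* Hypothetical helper: like Sys.file_exists, but it does not turn
   permission errors into "does not exist". *)
let check_exists (path : string) : existence =
  match Unix.stat path with
  | (_ : Unix.stats) -> Exists
  | exception Unix.Unix_error ((Unix.ENOENT | Unix.ENOTDIR), _, _) -> Missing
  | exception Unix.Unix_error (err, _, _) -> Unknown err
```

A caller can then decide what to do with Unknown err, for instance printing Unix.error_message err.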


Thank you, Daniel. I opened an issue: Stdlib: Make `Sys.file_exists` raise `Sys_error` in error cases, instead of returning `false` · Issue #12393 · ocaml/ocaml · GitHub

The Unix module

It looks like Windows support will be improving, and that - in the long term - we need to provide practical-minded documentation that works for both Windows and Unix users.


Don’t let yourself be fooled by the name: the Unix module is ill-named. It’s a carefully implemented OS abstraction library that emulates most POSIX functions on Windows (those that are not emulated are documented on this page).


Ah sorry about that… :+1: reminds me of the only two hard things in computer science: naming things and cache invalidation.


If you find that someone is keeping such a list, can you ask them to add something I have mentioned before and learnt the hard way: the Unix.execv* functions in the Unix library are not thread-safe, even though we now have domains and have had the Thread module for some time. This is not documented in the OCaml reference, even though users will assume otherwise by analogy with the underlying C functions.

This is not academic. The Lwt authors assumed them thread-safe: Lwt now automatically starts threads by default when encountering potentially blocking operations but uses Unix.execve in its Lwt_process module. This has the unfortunate feature that it will work under glibc but occasionally blow up under musl.

I am not suggesting that the code should necessarily be rewritten (maybe it can’t be) but it should be documented.


Out of curiosity, what is the usual workflow for updating or proposing a tutorial on OCaml.org?


Sorry for the naïve question, but can you explain what you mean by thread-safe in this context?

Thanks,
Nicolas

By “thread-safe” I mean that it complies with POSIX and does not give rise to random lockups. (It conditionally calls malloc, which is not allowed in the child process of a multi-threaded program.)

I’m not sure I follow: which child process are we talking about? The result of calling exec* is to replace the current process with a different executable; no new process is created. Regardless, thread-safety problems in the unix library should be reported upstream: https://github.com/ocaml/ocaml/issues.

Cheers,
Nicolas

I also don’t follow. Exec is applied after a fork in any real world program. In what circumstances would a program apply exec except after a fork? Lwt_process is a typical example. Can you set out the case to which you refer?

This is getting off-topic for the present thread, but just to round things up:

As mentioned, the issue of Unix.exec* allocating memory in multi-threaded programs should be reported upstream so that at least it can be documented.

On a related note, Unix.create_process was reimplemented on top of posix_spawn precisely to avoid this issue: https://github.com/ocaml/ocaml/pull/9573. Even when posix_spawn is not available, the fallback code in that function takes care not to allocate in the child process.
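For readers who have not used it, a minimal sketch of spawning a child with Unix.create_process and waiting for it could look like the following. The helper name run_and_wait is hypothetical, the child simply inherits the parent’s standard descriptors, and the program has to be linked against the unix library.

```ocaml
(* Requires the unix library. *)

(* Hypothetical helper: start [prog] with [args], wait for it to finish,
   and return its exit status. [prog] is looked up in the PATH. *)
let run_and_wait (prog : string) (args : string list) : Unix.process_status =
  (* By convention argv.(0) is the program name. *)
  let argv = Array.of_list (prog :: args) in
  let pid =
    Unix.create_process prog argv Unix.stdin Unix.stdout Unix.stderr
  in
  (* Retry if the wait is interrupted by a signal. *)
  let rec wait () =
    match Unix.waitpid [] pid with
    | _, status -> status
    | exception Unix.Unix_error (Unix.EINTR, _, _) -> wait ()
  in
  wait ()
```

Since no OCaml code runs between the fork and the exec here, this avoids the pattern discussed above of calling Unix.fork followed by Unix.execv from a multi-threaded program.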

Cheers,
Nicolas


I took the liberty of filing a bug report myself: Unix.create_process_env might not be multi-thread safe · Issue #12395 · ocaml/ocaml · GitHub.

Cheers,
Nicolas


My brief notes as an OCaml learner (but experienced dev).

  • “doesn’t return a file descriptor…instead…a channel” - the practical difference being what? Is it just that I can’t open a file in read/write mode?
  • There is quite a lot of “chat” before you get to examples / bullet-points. Maybe if you aren’t familiar with reading/writing files in other languages it gives vital context though.
  • Start with the “with_” example, because it is (a) shorter and (b) closes the channel correctly.
  • Follow the “with_” example with the steps it performs including the try/catch. This way learners will see why they want to use the short version if they can.
  • You need examples of writing/reading a file line-by-line immediately after that - it is likely the most common task a learner would attempt. It is unfortunate that there is no stdlib wrapper to make this less complex, but that example needs to be there.
  • No links to the relevant modules in the stdlib docs!
  • The examples for “Error Handling” don’t seem to close channels in the event of an exception. If I’ve understood correctly this could leak channels? You might need to move “Remembering to close channels” above this section to give context.
  • Maybe a note (near the bottom) about whether garbage-collection closes a channel or not?
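To make the “with_” and line-by-line points concrete, here is a minimal sketch of what such examples might look like. It assumes OCaml 4.14 or later for the In_channel and Out_channel modules. The names write_lines and read_lines are made up for this example and are not taken from the tutorial.

```ocaml
(* Requires OCaml >= 4.14 (In_channel / Out_channel). *)

(* Hypothetical helper: write each string on its own line.
   with_open_text closes the channel even if the callback raises. *)
let write_lines (path : string) (lines : string list) : unit =
  Out_channel.with_open_text path (fun oc ->
      List.iter
        (fun line ->
          Out_channel.output_string oc line;
          Out_channel.output_char oc '\n')
        lines)

(* Hypothetical helper: read a file line by line into a list. *)
let read_lines (path : string) : string list =
  In_channel.with_open_text path (fun ic ->
      let rec loop acc =
        match In_channel.input_line ic with
        | Some line -> loop (line :: acc)
        | None -> List.rev acc (* end of file reached *)
      in
      loop [])
```

Both helpers let Sys_error propagate to the caller, so a missing or unreadable file is reported rather than silently ignored.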

HTH


I would also mention that writing directly to the final file is a bad pattern, especially when overwriting an existing valid file (e.g. settings modified by the user from a UI, persistent state, etc.): if the filesystem runs out of space or an unfortunate kernel crash happens in the middle of the write, the user is left with a partially written or empty file.
The proper way to do atomic file (over)writes is to write to a temporary file in the same directory, fsync, close, and rename it to the final path, e.g. Devkit.Files.save_as.
And no, this is not a theoretical problem; it is actually a widespread mistake in many popular programs (ask me how I know).
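A minimal sketch of the write-to-temporary-then-rename pattern, using only the standard library, might look like this. The name save_atomically is hypothetical and this is not how Devkit.Files.save_as is implemented. To stay within the standard library it leaves out the fsync step, so it protects against partially written files but not against data loss on a crash. A durable version would also call Unix.fsync on the underlying descriptor before closing.

```ocaml
(* Hypothetical helper: replace [path] with [contents] via a temporary
   file in the same directory, so readers see either the old file or the
   complete new one. The fsync step is deliberately omitted (see above). *)
let save_atomically ~(path : string) (contents : string) : unit =
  let tmp, oc =
    Filename.open_temp_file ~mode:[ Open_binary ]
      ~temp_dir:(Filename.dirname path) "save-" ".tmp"
  in
  try
    Fun.protect
      ~finally:(fun () -> close_out_noerr oc)
      (fun () ->
        output_string oc contents;
        (* Flush explicitly so that write errors are raised here. *)
        flush oc);
    Sys.rename tmp path
  with e ->
    (* Clean up the temporary file, then re-raise the original error. *)
    (try Sys.remove tmp with Sys_error _ -> ());
    raise e
```

Creating the temporary file in the same directory matters because a rename is only atomic within one filesystem.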


Think of a tail call. A launcher program prepares arguments and environment, then launches another program as its last action. You can find examples in “the real world”, whatever that means.
