A Proper OCaml I/O Layer

Continuing the discussion from Actual Performance Costs of OOP Objects:

Indeed, I think this is the place where OOP could have been most helpful to OCaml, but unfortunately it’s just not performant enough IMO to entrust with such an often-used functionality.

But it’s not enough. A flexible I/O layer can stream from/to a string, or a BigArray. In that case, the I/O cost disappears.

There are costs to multi-layer buffering and it’s generally not a good idea. Buffering is a nice option to have, but it should not be a default tool to get around language performance issues.

I’d love to see that.

This is indeed an issue. It’s not just types though – in async code, non-blocking rather than blocking calls are made. Nevertheless, it would still be nice to have an I/O library that somehow supports both (though I’m not sure how it would work).

I think objects just have to be avoided for a high performance functionality such as this. First class modules are much higher performance, but I’m not sure what can be done with them. Perhaps @c-cube can help here as he’s tried them out for this purpose.

Are you on an ideological crusade against objects? I mean, I understand, we spent years teaching people to avoid them… but eio rationale.md does a great job at explaining the tradeoffs here.

If we only focus on the end-user experience, not the runtime details, how would you design the Flow interface? I can think of these solutions:

(I’ll be using a simplified interface to keep it short)

Objects à la eio

class type source = object
  method read : string
end

class type sink = object
  method write : string -> unit
end

class type two_way = object
  inherit source
  inherit sink
end

class type close = object
  method close : unit
end

val read : source -> string
val write : sink -> string -> unit
val close : close -> unit

Pro:

  • two_way can be implicitly cast into sink and source
  • close is an optional method
  • it’s obvious how to implement the different types by creating a new object
  • methods with default implementations can be added to objects, just like type classes in Haskell

Cons:

  • objects?

Stdlib-like

module Source : sig
  type t
  val read : t -> string

  val make : (unit -> string) -> t
end

module Sink : sig
  type t
  val write : t -> string -> unit

  val make : (string -> unit) -> t
end

module Two_way : sig
  type t
  val to_source : t -> Source.t
  val to_sink : t -> Sink.t

  (* alternatively *)
  include module type of Source with type t := t
  include module type of Sink with type t := t

  val make : Source.t -> Sink.t -> t
end

(* what about close? *)

Pro:

  • standard ocaml

Cons:

  • must explicitly cast with Two_way.to_*
  • must provide make functions or expose the t type definitions?
  • it doesn’t scale if close is optional, at least 2x the number of modules with _closeable and _noncloseable… but we can just pretend that all are closeable?
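A minimal sketch of how the Stdlib-like signatures could be implemented, assuming each abstract type is simply a closure. The `loopback` helper and the "read returns everything written so far" behaviour are hypothetical, for illustration only:

```ocaml
(* Hypothetical implementation: each abstract type hides a closure. *)
module Source : sig
  type t
  val make : (unit -> string) -> t
  val read : t -> string
end = struct
  type t = unit -> string
  let make f = f
  let read t = t ()
end

module Sink : sig
  type t
  val make : (string -> unit) -> t
  val write : t -> string -> unit
end = struct
  type t = string -> unit
  let make f = f
  let write t s = t s
end

module Two_way : sig
  type t
  val make : Source.t -> Sink.t -> t
  val to_source : t -> Source.t
  val to_sink : t -> Sink.t
end = struct
  type t = Source.t * Sink.t
  let make source sink = (source, sink)
  let to_source = fst
  let to_sink = snd
end

(* Hypothetical in-memory flow: what is written can be read back once. *)
let loopback () : Two_way.t =
  let buf = Buffer.create 16 in
  let source =
    Source.make (fun () ->
      let s = Buffer.contents buf in
      Buffer.clear buf;
      s)
  in
  let sink = Sink.make (Buffer.add_string buf) in
  Two_way.make source sink
```

Every use of a `Two_way.t` as a source or a sink goes through `Two_way.to_source` or `Two_way.to_sink`, which is the explicit cast listed in the cons.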

Variants

type _ t =
  | Sink   : (string -> unit) -> [< `sink] t
  | Source : (unit -> string) -> [< `source] t
  | Two_way : (string -> unit) * (unit -> string) -> [< `sink | `source] t
  (* ... what about close? *)

val read : [> `source] t -> string
val write : [> `sink] t -> string -> unit
val close : [> `close] t -> unit

Pro:

  • subtyping works again!

Cons:

  • but subtyping is finicky, you can’t define an alias type source = [> `source] t (and forgetting the > causes composition problems)
  • should we expose the GADT or hide it behind make_* functions?
  • optional close is still an issue
  • non-standard use of extensional variants, we redefined a limited form of objects with their dual

(I guess you can also replace the variants with module types, but then subtyping doesn’t work so it’s pointless.)


So, which one is best for the end-users? Or did you have another design in mind? :slight_smile:

(I don’t think we would have this conversation if objects were called extensional records or some fancy name. The API is all about subtyping here!)


But, again, the number of calls to the I/O layer is not going to be the number of bytes in the in-memory object, so it’s not clear to me that the cost of a method call is actually prohibitive. When I do I/O, I’m reading/writing in the range of multiple kilobytes at a time. It’s quite likely that all the other work I’m doing to build up/process those multiple kilobytes will be significantly more than the cost of a method call.

Right. This brings me back to not knowing what the actual costs of objects are. At what point are they prohibitively expensive? Are they even prohibitively expensive, given the general cost of abstractions in OCaml? I’d like to see @talex5’s measurements without any buffering to get a sense of what we’re talking about.

  • two_way can be implicitly cast into sink and source

Not implicitly, no. It’ll be easy to cast but no value-level cast in OCaml is implicit, as far as I know.
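To make the difference concrete, here is a small sketch using the simplified class types from above. The `loopback` object and the `read_open` function are hypothetical names introduced for illustration:

```ocaml
class type source = object
  method read : string
end

class type sink = object
  method write : string -> unit
end

class type two_way = object
  inherit source
  inherit sink
end

(* Closed object type: the argument must have exactly type [source]. *)
let read (s : source) : string = s#read

(* Open object type: any object with at least a [read] method is accepted. *)
let read_open (s : < read : string ; .. >) : string = s#read

(* Hypothetical in-memory two-way flow, for illustration only. *)
let loopback () : two_way =
  let buf = Buffer.create 16 in
  object
    method read =
      let s = Buffer.contents buf in
      Buffer.clear buf;
      s
    method write s = Buffer.add_string buf s
  end

let tw = loopback ()
let () = tw#write "abc"
let a = read (tw :> source)  (* explicit coercion; [read tw] is a type error *)
let () = tw#write "def"
let b = read_open tw         (* row polymorphism; no coercion needed *)
```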

A missing possibility: first-class modules with hidden state, which are basically records built with functions closing over the hidden state. Downcasting is explicit and allocates, but is doable, unlike with records.

module type SOURCE = sig
  val read : unit -> string
end

let source_of_string (s:string) : (module SOURCE) = …

module type SINK = sig
  val write : string -> unit
end

let sink_of_buffer (buf:Buffer.t) : (module SINK) = …

module type TWO_WAY = sig 
  include SINK
  include SOURCE
end

let two_way_of_socket (sock:Unix.file_descr) : (module TWO_WAY) = …

In my experience that seems to yield decent performance, even with functions like get_char : unit -> char called for framing/metadata purposes, leaving bulk operations to operate on bytes slices as always. I haven’t used an equivalent of TWO_WAY though.
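A sketch of what the elided bodies could look like, assuming an empty string signals the end of a source. That convention, together with `copy`, `two_way_of_buffer` and `source_of_two_way`, is an assumption made for illustration and is not taken from any existing library:

```ocaml
module type SOURCE = sig
  val read : unit -> string
end

module type SINK = sig
  val write : string -> unit
end

module type TWO_WAY = sig
  include SINK
  include SOURCE
end

(* Hypothetical body: yields the whole string once, then "" (assumed end marker). *)
let source_of_string (s : string) : (module SOURCE) =
  let consumed = ref false in
  (module struct
    let read () =
      if !consumed then "" else begin
        consumed := true;
        s
      end
  end : SOURCE)

(* The buffer is hidden state captured by the closure inside the module. *)
let sink_of_buffer (buf : Buffer.t) : (module SINK) =
  (module struct
    let write s = Buffer.add_string buf s
  end : SINK)

(* Hypothetical two-way flow backed by a buffer. *)
let two_way_of_buffer (buf : Buffer.t) : (module TWO_WAY) =
  (module struct
    let write s = Buffer.add_string buf s
    let read () =
      let s = Buffer.contents buf in
      Buffer.clear buf;
      s
  end : TWO_WAY)

(* Explicit cast: the module is unpacked and repacked at the smaller signature. *)
let source_of_two_way (tw : (module TWO_WAY)) : (module SOURCE) =
  let module T = (val tw) in
  (module T : SOURCE)

(* Copy a source into a sink until the source returns "". *)
let copy (src : (module SOURCE)) (dst : (module SINK)) : unit =
  let module Src = (val src) in
  let module Dst = (val dst) in
  let rec loop () =
    match Src.read () with
    | "" -> ()
    | chunk -> Dst.write chunk; loop ()
  in
  loop ()
```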


The other thread seemed to suggest FCM may be even slower than objects (!!). I haven’t tested it myself yet.
How does downcasting look when using this?

EDIT: Forget it, you already answered there.

I would not design the module-based version the way you did (note that I’m not arguing against objects in this design case; indeed, they seem to be a good solution, but I would do it differently with modules).

module type Source = sig
  type t
  val read : t -> string
end

module type Sink = sig
  type t
  val write : t -> string -> unit
end

module type Two_way = sig
  include Source
  include Sink with type t := t
end

(* only if we want the supertype of all these typed objects *)
module Source : sig
  type t
  val make : (module Source with type t = 'a) -> 'a -> t
  val read : t -> string
end

module Sink : sig
  type t
  val make : (module Sink with type t = 'a) -> 'a -> t
  val write : t -> string -> unit
end

module Two_way : sig
  type t
  val make : (module Two_way with type t = 'a) -> 'a -> t
  val read : t -> string
  val write : t -> string -> unit
end

val read : (module Source with type t = 'a ) -> 'a -> string
val write : (module Sink with type t = 'a) -> 'a -> string -> unit

Here you also have implicit subtyping if you use read or write with a module that implements Two_way; these two functions work like Haskell type classes but with an explicit methods argument. And you only need the Source, Sink or Two_way modules if you want a container for this kind of values (for instance a list); these modules are equivalent to the code with objects.
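A compilable sketch of the two dictionary-passing functions, assuming the module types above. `Buffer_flow` is a hypothetical instance, not part of any library; it has both `read` and `write`, so the same module can be passed wherever a `Source` or a `Sink` is expected:

```ocaml
module type Source = sig
  type t
  val read : t -> string
end

module type Sink = sig
  type t
  val write : t -> string -> unit
end

(* The operations are looked up in the module passed as first argument. *)
let read (type a) (module S : Source with type t = a) (x : a) : string =
  S.read x

let write (type a) (module S : Sink with type t = a) (x : a) (s : string) : unit =
  S.write x s

(* Hypothetical instance: a Buffer.t used as an in-memory two-way flow. *)
module Buffer_flow = struct
  type t = Buffer.t

  let read b =
    let s = Buffer.contents b in
    Buffer.clear b;
    s

  let write b s = Buffer.add_string b s
end

let roundtrip (msg : string) : string =
  let b = Buffer.create 16 in
  (* Buffer_flow is checked against Sink here... *)
  write (module Buffer_flow) b msg;
  (* ...and against Source here, with no wrapper value involved. *)
  read (module Buffer_flow) b
```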

The tradeoffs between objects and modules are: with objects, subtyping is a no-op at runtime but method access is slower; with modules, subtyping allocates at runtime but method access is faster (and the compiler may have more optimisation opportunities).

Note: the types Source.t, Sink.t and Two_way.t can be implemented with objects and their definitions exposed in the interface if we want.


Oh you’re right indeed, you have to write (tw :> source) Thanks for the correction! I had in mind < read : etc ; .. > which doesn’t require the explicit cast :slight_smile:

For completeness, I also forgot to mention:

type source = unit -> string
type sink = string -> unit
type close = unit -> unit

val read : source * _ * _ -> string
val write : _ * sink * _ -> string -> unit
val close : _ * _ * close -> unit

val source_of_string : string -> source * unit * unit

or with a GADT:

type yes = YES and no = NO

type (_, _) opt =
  | Unsupported : (_, no) opt
  | Is : 'a -> ('a, yes) opt ;;

type ('read, 'write, 'close) t =
  { read  : (unit -> string, 'read)  opt
  ; write : (string -> unit, 'write) opt
  ; close : (unit -> unit,   'close) opt
  }
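A sketch of how this record could be used, under the assumption that each operation demands `yes` in the matching index. The `read`, `write` and `source_of_string` definitions below are hypothetical additions for illustration:

```ocaml
type yes = YES
type no = NO

type (_, _) opt =
  | Unsupported : (_, no) opt
  | Is : 'a -> ('a, yes) opt

type ('read, 'write, 'close) t =
  { read  : (unit -> string, 'read)  opt
  ; write : (string -> unit, 'write) opt
  ; close : (unit -> unit,   'close) opt
  }

(* Only flows whose [read] slot is indexed by [yes] are accepted, so the
   [Unsupported] case cannot occur and a single pattern is exhaustive. *)
let read (flow : (yes, _, _) t) : string =
  match flow.read with
  | Is f -> f ()

let write (flow : (_, yes, _) t) (s : string) : unit =
  match flow.write with
  | Is f -> f s

(* Hypothetical constructor: a read-only, non-closeable flow over a string.
   It yields the string once, then "" (an assumed end-of-input convention). *)
let source_of_string (s : string) : (yes, no, no) t =
  let consumed = ref false in
  { read = Is (fun () -> if !consumed then "" else (consumed := true; s))
  ; write = Unsupported
  ; close = Unsupported
  }
```

With these definitions, `write (source_of_string "x") "y"` is rejected at compile time because `no` does not unify with `yes`.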

You forgot one aspect: allocation is much more expensive with objects. If, for some reason, you are in a situation where you often create an object, call a few methods, then discard the object the allocation cost could dwarf everything else. Everything else, on the other hand, has low-cost allocation that can be reduced to zero for short-lived values (if the compiler can track the value during all its lifetime).

Indeed, it will depend on how your design is intended to be used. But if the code is highly structured around subtyping, objects could be interesting. We have 4 subtyping relations: two between modules and two between base types.

(* subtyping between modules *)
Two_way <: Source
Two_way <: Sink

(* subtyping between base types *)
Two_way.t <: Source.t
Two_way.t <: Sink.t

If we mostly use the type-class-like functions read and write, I guess that modules will be more efficient (we use subtyping between modules)… But if we also want to use the subtyping relation between base types and containers like list, I guess that objects will be more efficient. Suppose you have a function foo : Source.t list -> foo and you want to use it with l : Two_way.t list; then you have to do:

let l' : Source.t list =
  let cast : Two_way.t -> Source.t = Source.make (module Two_way) in
  List.map cast l
in foo l'

Hence, before applying foo to l, you have to upcast l to a Source.t list using map: you will traverse l twice and allocate a fresh list just to use foo, and the values in the resulting list will carry redundant information. But with objects (for instance, if we implement Source.t and Two_way.t with objects and expose them in the interface), we just have to write:

foo (l :> Source.t list)

here the upcasting is a no-op.

I realized that we don’t need structural subtyping between objects if we want to avoid unnecessary allocation, even with a covariant functor like list. Indeed, the function foo above is just a particular application of this more generic, type-class-like function:

val foo_gen : (module Source with type t = 'a) -> 'a list -> foo

(* and the `foo` function in the previous comment is just *)
let foo l = foo_gen (module Source) l

This way if you have l : Two_way.t list you can just do:

foo_gen (module Two_way) l

and if, for instance, you have a list of strings (that are sources) you don’t need to pack them in an object-like value, but you can use them directly with:

foo_gen (module String) [s1; s2; s3]

(* idem with sockets that are two_way values *)
foo_gen (module Socket) [sock1; sock2; sock3]

Finally, the concrete implementation of Source.t, Sink.t or Two_way.t can be left as an implementation detail and the type kept abstract in the interface without, I guess, any real loss of efficiency if you need subtyping between core types.
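A self-contained sketch of `foo_gen`, assuming the result type `foo` is `string` and that the function concatenates one read per element. `String_source` and `Counter` are hypothetical instances made up for the example:

```ocaml
module type Source = sig
  type t
  val read : t -> string
end

(* Type-class-like function: the "dictionary" is a first-class module. *)
let foo_gen (type a) (module S : Source with type t = a) (l : a list) : string =
  String.concat "" (List.map S.read l)

(* Hypothetical instance: a plain string is a source that yields itself. *)
module String_source = struct
  type t = string
  let read (s : t) = s
end

(* Hypothetical instance with an extra [write] operation: it still matches
   [Source], so no wrapping or list copy is needed to call [foo_gen]. *)
module Counter = struct
  type t = int ref
  let read c = incr c; string_of_int !c
  let write c s = c := int_of_string s
end

let from_strings = foo_gen (module String_source) ["a"; "b"; "c"]
let from_counters = foo_gen (module Counter) [ref 0; ref 10]
```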


Indeed! But then a limitation of type-classes is that you can’t build a list of different sources without a wrapper:

type source = Source : (module Source with type t = 'a) * 'a -> source
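A short sketch of that wrapper in use, assuming the `Source` module type from earlier in the thread. `String_source` and `Int_source` are hypothetical instances:

```ocaml
module type Source = sig
  type t
  val read : t -> string
end

(* Existential wrapper: pairs a value with the module that can read it. *)
type source = Source : (module Source with type t = 'a) * 'a -> source

let read (s : source) : string =
  match s with
  | Source ((module S), x) -> S.read x

(* Hypothetical instances, for illustration only. *)
module String_source = struct
  type t = string
  let read (s : t) = s
end

module Int_source = struct
  type t = int
  let read = string_of_int
end

(* A list mixing sources of different underlying types. *)
let sources : source list =
  [ Source ((module String_source), "n=")
  ; Source ((module Int_source), 42)
  ]

let all = String.concat "" (List.map read sources)
```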

For sure, but it’s already the case with objects. That’s the whole purpose of the type Source.t: it’s the supertype of all kinds of sources, hence it allows you to build a list of different sources by upcasting any of them to it. But you can also define it in any of the following ways:

type source = < read : unit -> string >
type source = unit -> string
type source = Source : (module Source with type t = 'a) * 'a -> source

All these definitions are equivalent. The question then was: should we choose the object definition and expose it in the interface if we want to use subtyping between source and two_way? And the answer is: no. :wink: The only case where, if you keep the type abstract (even if its concrete implementation is an object), you could have some unnecessary allocation is this one:

(* you have a socket *)
val socket : Socket.t

(* but at some point you wrapped it in the supertype `Two_way.t` *)
let tw = Two_way.make (module Socket) socket

(* on the other side you have a source *)
val source : Source.t

(* and now you want to mix them in a Source.t list *)
let l : Source.t list = [source; Source.make (module Two_way) tw]

(* the above code, that allocates to upcast `tw`, is conceptually
   identical to this one which doesn't  *)
let l : Source.t list = [source; (tw :> Source.t)]

Your solution would really benefit from modular implicits! :slight_smile:

Even modular explicits would help reduce verbosity. I find all these (module Source with type t = 'a) annotations painful. With the explicits part of the modular proposal, you’d have:

module Source : sig
  type t
  val make : {S : Source} -> S.t -> t
  val read : t -> string
end

I find this syntax cleaner and easier to parse, especially if you have more than one module parameter. Even when you want to write the make function, it’s painful:

let make (type a) (module S : Source with type t = a) s = Source ((module S), s)

(* instead of *)
let make {S : Source} s = Source ({S}, s)

(* usually I end up with defining a type alias *)
type 'a meth = (module Source with type t = 'a)

let make (type a) (m : a meth) s = Source (m, s)

(* and to define functions like foo, I prefer to unpack the first-class module *)
let foo (type a) (m : a meth) l =
  let module M = (val m) in
  String.concat "" (List.map M.read l)