Module 'Stream' removed from 5.0 standard library?

Gopiandcode · July 30, 2022, 6:09am

I’m curious what the difficulties with fold are exactly? From some initial experiments, it seems like with an approriate type alias, fold forms a fairly simple interface much like iter:

type ('acc, 'vl) fold = 'acc -> ('acc -> 'vl -> 'acc) -> 'acc
(** [('acc, 'vl) fold] represents a fold of an accumulator ['acc] over a sequence of values of type ['b] *)

let list_to_fold (ls: 'b list) : ('a, 'b) fold =
  fun init f -> List.fold_left f init ls

let rev_list_of_fold: (_, 'b) fold -> 'b list = fun f ->
  f [] (fun acc vl -> vl :: acc)
  
let array_to_fold (ls: 'b array) : ('a, 'b) fold =
  fun init f -> Array.fold_left f init ls

let rev_seq_of_fold: (_, 'b) fold -> 'b Seq.t = fun f ->
  f Seq.empty (fun acc vl -> Seq.cons vl acc)

let seq_to_fold: 'b Seq.t -> (_, 'b) fold = fun seq ->
  fun init f -> Seq.fold_left f init seq

module Map (O: Map.OrderedType) = struct
  include Map.Make(O)

  let map_of_fold (f: (_, (key * 'a)) fold) : 'a t = 
    f empty (fun acc (key, vl) -> add key vl acc)

  let map_to_fold (map: 'a t) : (_, (key * 'a)) fold =
    fun init f -> fold (fun k vl acc -> f acc (k,vl)) map init

end

Maybe I’m missing something?

gasche · July 30, 2022, 2:50pm

I think that the version of Map.of_fold that @raphael-proust had in mind is not your version that repeatedly adds for O(n log n) complexity, but a version that assumes that the input fold is sorted, and builds a balanced map/tree from it in linear time. My intuition tells me that this is (not obvious but) easier to write if you control the consumption of the sequence than if you need to reify your own state as a fold accumulator.

hhugo · July 31, 2022, 7:14am

To illustrate @gasche 's reply, here is some code found in base that construct a map in O(n) and has control over the consumption of the input data. base/map.ml at a678184f2c989e8c3a1ecae8b1b46ec51a61bcb3 · janestreet/base · GitHub

dbuenzli · July 31, 2022, 8:41am

Right, but how often you need that in practice remains to be shown.

I still think Seq was a rather unfortunate addition to the Stdlib and that we should rather lure people into folds for the use case of “data structure transfers”. More than often there’s no need to allocate all these cons cells.

bluddy · July 31, 2022, 9:00am

What we want is a format that allows us to iterate over elements of a data structure without allocating the whole data structure at each point i.e. effective deforestation. This permits us to use functional basic operations such as map rather than complex folds, groupby operations etc.

Rather than having

massive_list
|> List.map (fun x -> x + 1)
|> List.map (fun x -> x * 3)
...

we can just use Seq. That’s the idea, anyway. In reality, you want to use Iter whenever possible to do this and get maximum efficiency. If you can’t use Iter (e.g. for more complex needs), Seq is a decent replacement, except that the allocations happen anyway (but not necessarily all of them if you only take some of the list).

dbuenzli · July 31, 2022, 9:45am

Maybe that’s what you want :–) But then again iterating over the elements of a data structure is a rather narrow use case for which folds and iters will mostly do (and in the rare case when you are too bothered by the inversion of control you will with 5.0 be able to reinvert it)

What I want is iterate, transform and combine the elements of streaming ressources at which point you can’t:

So yes, still unconvinced the Seq addition was a good idea.

rizo · July 31, 2022, 10:33am

One small problem with just pure folds is that they require exceptions for early termination (for functions like take). This is normally resolved by annotating the acc with a “Done” or “Continue” variant:

type 'acc reduced =
  | Done of 'acc
  | Continue of 'acc

type ('acc, 'vl) fold =
  'acc -> ('acc -> 'vl -> 'acc reduced) -> 'acc

Of course this results in more allocations.

By the way, this approach is a starting point for implementing transducers from Clojure.

rizo · July 31, 2022, 10:38am

Producer-driven iterators like iter/fold can only ensure safety for input resources. The simple versions of these models will not work for terminating downstream or intermediate resources.

dbuenzli · July 31, 2022, 12:01pm

Sure, I didn’t claim folds and iters do it for that. My stance was always that something is needed in the stdlib for that but that Seq it’s not, since folds and iters can mostly be used for what people advertise Seq for and moreover it has no good story for ressources.

Mind you, I also went a few times over the whole iteratee saga…

geoffder · July 31, 2022, 4:14pm

I see this kind of example a lot, but why would it be so desirable to write it this way rather than composing one function and mapping/folding once?

c-cube · July 31, 2022, 5:23pm

Readability. In a more realistic example, operations won’t be as trivial, and map |> filter_map |> flat_map reads a lot better than a nested fold imho.

c-cube · July 31, 2022, 5:41pm

Sure, I didn’t claim folds and iters do it for that. My stance was always that something is needed in the stdlib for that but that Seq it’s not

And you failed to propose any alternative since 2016 besides “just use fold” (which could already be done — I still use Iter often when it fits). All you had to do was to propose a resource-safe iterator abstraction between 2016 and 2018, even just a convincing code snippet in the PRs for Seq.

In the end, Seq is a simple abstraction that shipped because, while it’s not great, it’s good enough for a broad range of use cases (broader than fold/iter: see re for a concrete example). Waiting for typed effects like you suggested in 2016 still hasn’t materialized in 2022. In a sense it’s an instance of “worse is better” in the stdlib (just like list).

That aside, if we really wished to have a resource-safe iterator abstraction, what would it even look like today?
My understanding is that even in Haskell they have trouble finding a great functional iterator abstraction that works with resources and IOs, and they have more powerful tools than us for that kind of thing. Maybe @rizo can comment on that?

bluddy · August 1, 2022, 5:52am

A part of functional programming is the clarity. A complicated fold that does a whole bunch of things (perhaps including modification of global state), or a similar recursive function, is much harder to understand than operations that do one thing at a time. Breaking down operations also helps with reuse.

geoffder · August 1, 2022, 6:54am

I feel like that’s a bit of a straw man, as you can write a multi step function f to give to a singular fold (or map if no filtering is required) that is written in a clean way, without side effects, perhaps even composed of reusable functions.

bluddy · August 1, 2022, 7:40am

Ultimately we’re getting into a discussion of loops vs. functional constructs. You could do everything with while and for loops, or with iter. So why don’t you? Well, we like to have more constrained functions in functional programming. Having small functions that are very constrained in their abilities (like map) is considered better because they can only do so much by default. fold is one step away from a loop: it’s very powerful, but generally accesses the accumulator for whatever computation it wants to produce. But a fold is also both more powerful and more cumbersome compared to a map: it has to access the correct sub-member of the accumulator. If you later need to accumulate more things, you now need to rearrange it so every part of the fold function reads the correct element in the accumulator tuple. Also if you need to rebuild the same data structure as the one you read, you have to put in effort into rebuilding the appropriate part of the accumulator. All of this comes for free with a map, plus a map is more constrained, meaning that we know from a quick glance what it does. So if we can express an algorithm via a few maps, maybe a filter, and perhaps one tiny, easily understood fold with a simple accumulator, we prefer to do that in general: every stage is constrained by its function and is easily understood. e.g. Remove the filter, and you understand immediately how the algorithm is affected

However, there’s a price to be paid for this in the form of allocation of data after every application of a transformation. In Haskell, they use deforestation to get around this cost. We don’t live in a pure world and cannot do that, but we can use iterators such as Iter and Seq to get a very similar result.

raphael-proust · August 1, 2022, 8:10am

It is a fairly simple interface indeed.

But you forgot the type existential because I might be using the same fold for different types of 'acc.

Also, I might be switching between the different types during the traversal (e.g., doing normal processing but after an error just logging the unprocessed elements to inform the user) so can you put the existential at each iteration?

dbuenzli · August 1, 2022, 3:50pm

Ah I didn’t know that was my duty and all I had to do I should have known better :–)

I’m afraid but I only make proposals for the stdlib when I’m certain about their design.

The dual of this is that I do not support additions to the stdlib unless they are really fit for purpose or solve a clear eco-system interoperability problem.

It was quite clear Seq didn’t tick these boxes but that doesn’t mean I have to propose something else. Seq’s impact on the ecosystem is largely anedoctical and it could perfectly have lived as a standalone library. It just added a little bit more official noise on the broken options you have on your hands to do streaming processing.

geoffder · August 1, 2022, 4:03pm

I suppose I left it out of my first messages by mistake, but I would certainly use filter/filter_map instead of fold if not in need of an accumulator. I just don’t think that it is some violation of functional principles to compose your functions before giving them to filter_map rather than composing via a pipeline of traversals.

c-cube · August 1, 2022, 4:25pm

If you want to keep whining about it without any constructive proposal? I think it’s fair, yes.

Blockquote Seq’s impact on the ecosystem is largely anedoctical and it could perfectly have lived as a standalone library

Except for the part where it needs to have access to each of Stdlib’s type definition, to implement Map.S.to_seq and the likes. I think we need better data to assess the impact on the ecosystem though. I think someone has a very nice semantic grep on opam, forgot the name, but if they could run it to see how many packages use Seq somewhere I’m curious.

dbuenzli · August 1, 2022, 4:44pm

Ah people always fail to see that doing nothing is also a constructive proposal, especially when there’s no fundamental problem.

In any case not sure where you saw whining. Maybe you think pointing out defects in the things you propose or warning people about the inadequacy of the things they use is whining.

Then so be it, but it also means there’s little point for me in having any discussion with you. I’m interested in technical excellence not people who cling on their little ideas.

Topic		Replies	Views
Is support from 4.05.0 enough these days? Ecosystem	1	556	January 26, 2022
OCaml 4.14.0 is released Ecosystem	14	2966	April 1, 2022
Accessing recent additions to the stdlib (`Seq` 4.14) from older versions of OCaml Learning opam , seq	7	587	July 11, 2023
OCaml 5.0.0 is out! Ecosystem	8	35451	January 10, 2023
Camlp5 and OCaml 4.08 Community	14	3531	March 11, 2021

Module 'Stream' removed from 5.0 standard library?

Related topics