How to sort a string in Ocaml?

What is the proper way to sort a string alphabetically in Ocaml? Is there a standard function to do it? For example: sort “cba” => “abc”

One way is to create an array with the characters of the string and sort that:

let sort s =
  let n = String.length s in
  let a = Array.init n (fun i -> s.[i]) in
  Array.sort Char.compare a;
  String.init n (fun i -> a.(i))

Regard with @nojb’s answer, you might also use to_seq of_seq to convert between them

let sort s = 
  String.to_seq s |> List.of_seq |> List.sort Char.compare |> List.to_seq |> String.of_seq;;


let sort s = 
  let a = s |> String.to_seq |> Array.of_seq in
    Array.sort Char.compare a;
    a |> Array.to_seq |> String.of_seq;;
1 Like

Nojb and Arbipher, thank you for your help. I am just curious why something like it is not a part of a standard library? Do you know the reason? Thanks again.

Can you explain your use case for this? It doesn’t seem like the sort of generally-useful function that would go into a standard library.

why something like it is not a part of a standard library? Do you know the reason?

I think a compelling and general need for the function should be demonstrated before adding something to a standard library. It’s less that there is a reason why the function is not there, and more that there has not yet been a reason to add it in the first place.

Ocaml has Array.sort and List.sort but no String.sort (or something similar). Every modern language has some kind of string sort - or at least I thought they have until I’ve met Ocaml :slight_smile: I honestly believed that I just couldn’t find it (it’s hidden somewhere, uses diff name, etc)

Every modern language has some kind of string sort

I wasn’t sure about this, so I checked a couple of popular languages: JavaScript (no), Ruby (no), Java (no), C++ (yes, for all “iterator-based” containers), C (yes, because strings are just arrays), Haskell (yes, because strings are just lists), Python (yes, but sorted returns a list instead of a string).

Notably, with C/C++ and Python, the string sorting isn’t specialized to strings, but is instead a consequence of a general-purpose sorting functionality for “containers”/“iterables”.

OCaml hasn’t had very strong standard-library support for generic container functions until recently (with the addition of the Seq module). Perhaps a Seq.sort function would be useful?

1 Like

I think Bytes.sort could be proposed for integration upstream, by analogy to Array.sort, but it is not a very common function so am not sure it will be accepted for inclusion.

A string and a collection (like list or array) of chars are subtly identical or not depends on languages (more subtly if the encoding is involved e.g. strings, bytes, runes and characters in Go.

To be honest I am not very familiar with OCaml’s design philosophy on this but I agree with String.sort will be convenient.

As far as I can see, ocaml’s Seq module implements a lazy (non-memoizing) list, rather than providing generic container functions. Seq’s type 'a t is of type unit -> 'a node (a thunk).

1 Like

bcc32@ asked what your use-case is, and I’d lke to understand it, too. If you’re asking for it basically because it seems like it’s missing, and should be included for orthogonality reasons, that’s a pretty weak argument.

bcc32@ noted that the reason some modern languages have a “sort” on strings, is that strings are implemented in a manner that makes them arrays or sequences by default. In short, the “inheritance” comes via object-orientation, or literal representation (as in C arrays). But in ML-family languages, there is no such inheritance, and while it might be “nice” to have it, it sure isn’t settled common opinion that that’s the way to go.

This is why you’ll see fancy add-on libraries that (for instance) make list-maps, hashtables, avl trees, etc, all look like the same sort of associative-lookup datastructure, but you won’t find that in the lowest-level base libraries.

But again, I think it would be useful to understand your use-case.

I don’t have any special case. I just started learning Ocaml and I am going through some sample exercises. Some of them requires string sorting (for example anagram puzzle) and I was surprised that I couldn’t find any solution/discussion how to do it in Ocaml. I would like to thank everyone for their answers, Its’s being a great help!

A very old-school way to solve this problem would be to first write a quicksort implementation with arguments supplying the compare and swap operations.

String sorting seems somewhat pointless unless you specify the exact semantics and then you run into lots of questionable cases, like “what is a character”? I wouldn’t want e.g. my South Africa flag :south_africa: to be “sorted” to turn into the Azerbaijan flag :azerbaijan:. And then, you also need to have locale-aware sorting, since the glyph Ä is treated differently in Swedish (where it is a proper letter and placed at the end of the alphabet) and German (where it is arguable what it exactly is and where it would end up when sorting). And this is just common cases in Western European languages, there surely are many more issues on how to sensibly sort.

Or you can just don’t care and do it incorrectly. I would think adding a naïve implementation of string sorting into the standard library is not a good idea.

5 Likes

This brings up a hidden assumption of this thread–OCaml strings are not actually Unicode (UTF-8-encoded) strings. They are actually byte arrays. So things like ‘sort’, ‘uppercase’, or even ‘character at index n’ don’t work as you would expect for Unicode strings.

All UTF-8 encoded strings in any programming language are byte arrays, in the sense that they are composed of a sequence of 8-bit code units. Some programming languages may provide iterators to iterate over such sequences by whole unicode code points but in my experience these are of very limited utility: you cannot for example mutate a UTF-8 string because substituting one unicode code point for another may result in the string having a different length in code units. Converting between upper and lower case is more widely supported and useful but that can itself produce surprising results - for example, lower case German esszet (ß) is one code point in lower case but two code points in upper case (SS or SZ) by traditional orthography.

The original poster’s suggestion was that the internal sorting of a unicode string (“cba” => “abc”) would be useful. That seems to me to be unlikely. On the other hand, sorting a sequence of strings (say, a list of strings) can have real utility, but it has its own difficulties. I know you know this but for the sake of the original poster, even to compare unicode code points (let alone different glyphs) you have to take account of different representations of the same character, including ligatures and combining characters, which requires prior normalization of the strings (unicode provides NFD, NFC, NFKD and NFKC normalization forms to test equivalence). In any event, it is not as simple as “cba” => “abc”.

3 Likes