Simplify roman utf8

Right, the difficulty is to find a coherent way to treat all cases, and at some point it is a matter of choice. I probably want to stick to the original use case, which was searching for names in a list. If I have an Icelandic name in my list and a US keyboard, I will likely type “d” when searching for “ð”. Or maybe “th”, I don’t know (according to this Wikipedia page, “d” is a common transliteration).
So for this purpose, the UTR-30 list is not enough.
For Latin-Greek letters, it’s difficult to choose because I don’t know the logic behind them.
For Danish words like Aarhus (which was spelled Århus until quite recently): of course, if I were Danish and searching for a name like this, I would try “Aarhus”. But suppose I’m from the US and searching for someone whose name is “Århus”; I would probably type “Arhus”, no?

I think I will add an optional parameter to choose between preferring transliterations, like “th”, “aa”, etc., and a strict single-char version, using “d” (or “t”), “a”, and so on.
(I actually already have a “uchar_to_char” function in the library that does this – in a basic way – for a Uchar character.)
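For illustration, here is a rough sketch of what such a parameter could look like; the “simplify” name, the `Translit / `Char variants and the sample mappings are mine, not the library’s actual interface:

let simplify ?(strategy = `Translit) (u : Uchar.t) : string =
  match Uchar.to_int u, strategy with
  | 0x00F0 (* ð *), `Translit -> "th"
  | 0x00F0 (* ð *), `Char -> "d"
  | 0x00E5 (* å *), `Translit -> "aa"
  | 0x00E5 (* å *), `Char -> "a"
  | n, _ when n < 128 -> String.make 1 (Char.chr n)  (* plain ASCII: keep *)
  | _ -> ""  (* anything else: fall back to a full replacement table *)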

Exactly what I said! :wink:

I suppose you would be expected to type “Aarhus”, since this is the international spelling of the city (this is how airlines spell it). Suggesting “Aarhus” when “Arhus” is typed would be more a matter of spelling correction based on edit distance. But you are right, ultimately this is a question of choices and I won’t bother you any further.

It seems that these are used only for the International Phonetic Alphabet.

I have implemented most of your suggestions :wink:

I extracted code from GeneWeb into a standalone library: https://github.com/geneweb/unidecode

I just saw that in the meantime you created your own library as well. The interfaces will be very different, so I guess they can live side by side. I am trying to design unidecode to fit the needs of GeneWeb, which means it needs to process millions of words in batch and be as fast as possible.

If this constraint should disappear in the future, I might consider switching to your lib.

Great to see your library. In this matter (removing accents) there are many personal choices to make, so I’m sure it won’t behave exactly the same way as mine on some unusual Unicode characters. Yours is certainly better suited to you, since you designed it.
On the speed side, I’d be interested in testing. My impression is that basechar should be quite fast, since it relies on uutf folding, with replacements done via an integer Map.
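To make that concrete, here is a minimal sketch of the approach, assuming the uutf package; the table contents are just examples, not ubase’s actual data:

module IMap = Map.Make (Int)

(* Replacement table keyed by Unicode code point. *)
let table =
  IMap.of_seq (List.to_seq [ 0x00E9 (* é *), "e"; 0x00E5 (* å *), "a" ])

let basechar s =
  let b = Buffer.create (String.length s) in
  Uutf.String.fold_utf_8
    (fun () _pos -> function
      | `Uchar u ->
        (match IMap.find (Uchar.to_int u) table with
         | r -> Buffer.add_string b r
         | exception Not_found -> Uutf.Buffer.add_utf_8 b u (* keep as is *))
      | `Malformed _ -> ())
    () s;
  Buffer.contents b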

I plan to add benchmarks very soon, and I will definitely add your library to my benchmark suite so we will be able to compare. I’ll keep you informed when it’s done.

Note that character support in my lib is quite incomplete. For instance, there is no support for Vietnamese characters yet, since I need to extract a PR from GeneWeb which will handle it. As long as the supported character sets are not the same, benchmarks will be skewed.

I renamed the library “Ubase” (as in “utf base strings”) because I noticed there are too many opam packages starting with “base”… So it’s now here: https://github.com/sanette/ubase

I finally ran some benchmarks (using the same input as those found in your unit tests): https://github.com/geneweb/unidecode/blob/master/test/bench.ml

French
              Rate     Ubase Unidecode
    Ubase  75497/s        --      -57%
Unidecode 176846/s      134%        --

Vietnamese
              Rate     Ubase Unidecode
    Ubase 144824/s        --      -64%
Unidecode 404758/s      179%        --

Unidecode appears to be faster than ubase on these inputs, but I do not know about other differences between the two libraries (language support, memory footprint, …).

That’s impressive; thanks for the tests.

Be careful: Unidecode doesn’t seem to work with NFD-normalized input:

# Unidecode.decode_string "Vũ Ngọc Phan";;
- : string = "Vu\204\131 Ngo\204\163c Phan"
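The combining marks (U+0303, U+0323) pass through untouched because the input is NFD-decomposed. A possible workaround, assuming the uunf package, is to recompose to NFC before decoding (untested sketch):

let decode_nfc s =
  Unidecode.decode_string (Uunf_string.normalize_utf_8 `NFC s)

(* decode_nfc "Vũ Ngọc Phan" should then give "Vu Ngoc Phan",
   provided the precomposed characters are in Unidecode's tables. *)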

Trying to see what made my code “slow”, I just changed Map.find to Map.find_opt and immediately gained 30%–50% in performance! (It’s now comparable with Unidecode: faster for French, slower for Vietnamese.)
I guess I will think twice now before using exceptions :wink:
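For concreteness, the change boils down to this (names are illustrative, on a hypothetical integer-keyed table):

module IMap = Map.Make (Int)

(* Before: every miss raises and catches Not_found. *)
let lookup_with_find table code =
  try Some (IMap.find code table) with Not_found -> None

(* After: Map.find_opt returns an option directly, avoiding the
   exception machinery on the (frequent) miss path. *)
let lookup_with_find_opt table code = IMap.find_opt code table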

With Map.find:

French
              Rate     Ubase Unidecode
    Ubase  65299/s        --      -57%
Unidecode 153583/s      135%        --

Vietnamese
              Rate     Ubase Unidecode
    Ubase 123110/s        --      -65%
Unidecode 349022/s      184%        --

With Map.find_opt:

French
              Rate Unidecode     Ubase
Unidecode 153732/s        --      -19%
    Ubase 189538/s       23%        --

Vietnamese
              Rate     Ubase Unidecode
    Ubase 243077/s        --      -31%
Unidecode 354489/s       46%        --

Interesting. Let’s see how both libs evolve and how memory usage / binary size varies, but I might consider switching to ubase then.

About Å, there are two of them.

A with a ring above (the ordinary “Å”), and the unit Ångström, which also looks like “Å”. (They are actually the same letter with different uses. Is it the same with “µ”?)

They are really the same letter. Wikipedia says “Unicode also has encoded U+212B Å ANGSTROM SIGN. However, that is canonically equivalent to the ordinary letter Å. The duplicate encoding at U+212B is due to round-trip mapping compatibility with an East-Asian character encoding, but is otherwise not to be used.”
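The canonical equivalence is easy to check with uunf, by the way (a quick sketch, assuming OCaml ≥ 4.06 for the \u{...} escapes):

let () =
  (* NFC maps U+212B ANGSTROM SIGN to the ordinary U+00C5 Å. *)
  assert (Uunf_string.normalize_utf_8 `NFC "\u{212B}" = "\u{00C5}")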

@steinuil, @dbuenzli is there such a way? I’d like to port the Go snippet

transform.Chain(norm.NFD, transform.RemoveFunc(func(r rune) bool {
	return unicode.Is(unicode.Mn, r) // Mn: nonspacing marks
}), norm.NFC)

that does exactly that to OCaml, preferably using uucp and friends.
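I imagine something along these lines with uunf, uutf and uucp, though this is untested and “strip_marks” is just a placeholder name:

let strip_marks s =
  (* NFD-decompose so accents become separate combining code points. *)
  let nfd = Uunf_string.normalize_utf_8 `NFD s in
  let b = Buffer.create (String.length s) in
  Uutf.String.fold_utf_8
    (fun () _pos -> function
      | `Uchar u when Uucp.Gc.general_category u = `Mn -> () (* drop marks *)
      | `Uchar u -> Uutf.Buffer.add_utf_8 b u
      | `Malformed _ -> Uutf.Buffer.add_utf_8 b Uutf.u_rep)
    () nfd;
  (* Recompose what is left, mirroring the final norm.NFC step. *)
  Uunf_string.normalize_utf_8 `NFC (Buffer.contents b)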

Two years later, in a discussion of 37 messages, I’m not sure which way you are talking about.

I don’t think that should pose any problem.

Thank you, I’ll work my way through it.

The lynx text web browser also performs this kind of character translation.
You could have a look at how it does this: just create an HTML page with the characters to translate and open it with lynx. You can also dump the result to a text file with lynx -dump.
