Simplify roman utf8

Right, the difficulty is to find a coherent way to treat all cases, and at some point it is a matter of choice. I probably want to stick to the original use case, which was searching for names in a list. If I have an Icelandic name in my list and a US keyboard, I will likely type “d” when searching for “ð”. Or maybe “th”, I don’t know (according to this Wikipedia page, “d” is a common transliteration).
So for this purpose, the UTR-30 list is not enough.
For Latin-Greek letters, it’s difficult to choose because I don’t know the logic behind them.
For Danish words like Aarhus (which was spelled Århus until quite recently): of course, if I were Danish and searching for a name like this, I would try “Aarhus”. But suppose I’m from the US and searching for someone whose name is “Århus”; I would probably type “Arhus”, no?

I think I will add an optional parameter to choose between preferring transliterations, like “th”, “aa”, etc., and a strict single-char version, using “d” (or “t”), “a”, and so on.
(I actually already have a “uchar_to_char” function in the library that does this – in a basic way – for a Uchar character.)
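For illustration, here is a rough sketch of what such a parameter could look like; the “simplify” name, the `Translit / `Char variants and the sample mappings are mine, not the library’s actual interface:

let simplify ?(strategy = `Translit) (u : Uchar.t) : string =
  match Uchar.to_int u, strategy with
  | 0x00F0 (* ð *), `Translit -> "th"
  | 0x00F0 (* ð *), `Char -> "d"
  | 0x00E5 (* å *), `Translit -> "aa"
  | 0x00E5 (* å *), `Char -> "a"
  | n, _ when n < 128 -> String.make 1 (Char.chr n)  (* plain ASCII: keep *)
  | _ -> ""  (* anything else: fall back to a full replacement table *)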

Exactly what I said! :wink:

I suppose you would be expected to type “Aarhus”, since this is the international spelling of the city (this is how airlines spell it). Suggesting “Aarhus” when “Arhus” is typed would be more a matter of spelling correction based on edit distance. But you are right, ultimately this is a question of choices and I won’t bother you any further.

It seems that these are used only for the International Phonetic Alphabet.

I have implemented most of your suggestions :wink:

I extracted code from GeneWeb into a standalone library: https://github.com/geneweb/unidecode

I just saw that in the meantime you created your own library as well. The interfaces will be very different, so I guess they can live side by side. I am trying to design unidecode to fit the needs of GeneWeb, which means it needs to process millions of words in batch and be as fast as possible.

If this constraint should disappear in the future, I might consider switching to your lib.

Great to see your library. In this matter (removing accents) there are many personal choices to make, so I’m sure it won’t behave exactly the same way as mine on some unusual Unicode characters. Yours is certainly better suited to you, since you designed it.
On the speed side, I’d be interested in testing. My impression is that basechar should be quite fast, since it relies on uutf folding, with replacements done via an integer Map.
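To make that concrete, here is a minimal sketch of the approach, assuming the uutf package; the table contents are just examples, not ubase’s actual data:

module IMap = Map.Make (Int)

(* Replacement table keyed by Unicode code point. *)
let table =
  IMap.of_seq (List.to_seq [ 0x00E9 (* é *), "e"; 0x00E5 (* å *), "a" ])

let basechar s =
  let b = Buffer.create (String.length s) in
  Uutf.String.fold_utf_8
    (fun () _pos -> function
      | `Uchar u ->
        (match IMap.find (Uchar.to_int u) table with
         | r -> Buffer.add_string b r
         | exception Not_found -> Uutf.Buffer.add_utf_8 b u (* keep as is *))
      | `Malformed _ -> ())
    () s;
  Buffer.contents b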

I plan to add benchmarks very soon, and I will definitely add your library to my benchmark suite so we will be able to compare. I’ll keep you informed when it’s done.

Note that character support in my lib is quite incomplete. For instance, there is no support for Vietnamese characters yet, since I need to extract a PR from GeneWeb which will handle it. As long as the supported character sets are not the same, benchmarks will be skewed.

I renamed the library “Ubase” (as in “utf base strings”) because I noticed there are too many opam packages starting with “base”… So it’s now here: https://github.com/sanette/ubase

I finally ran some benchmarks (using the same input as those found in your unit tests): https://github.com/geneweb/unidecode/blob/master/test/bench.ml

French
              Rate     Ubase Unidecode
    Ubase  75497/s        --      -57%
Unidecode 176846/s      134%        --

Vietnamese
              Rate     Ubase Unidecode
    Ubase 144824/s        --      -64%
Unidecode 404758/s      179%        --

Unidecode appears to be faster than ubase on these inputs, but I do not know about other differences between the two libraries (language support, memory footprint, …).

That’s impressive; thanks for the tests.

Be careful: Unidecode doesn’t seem to work with NFD-normalized input:

# Unidecode.decode_string "Vũ Ngọc Phan";;
- : string = "Vu\204\131 Ngo\204\163c Phan"
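The combining marks (U+0303, U+0323) pass through untouched because the input is NFD-decomposed. A possible workaround, assuming the uunf package, is to recompose to NFC before decoding (untested sketch):

let decode_nfc s =
  Unidecode.decode_string (Uunf_string.normalize_utf_8 `NFC s)

(* decode_nfc "Vũ Ngọc Phan" should then give "Vu Ngoc Phan",
   provided the precomposed characters are in Unidecode's tables. *)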

Trying to see what made my code “slow”, I just changed Map.find to Map.find_opt and immediately gained 30%–50% in performance! (It’s now comparable with Unidecode: faster for French, slower for Vietnamese.)
I guess I will think twice now before using exceptions :wink:
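For concreteness, the change boils down to this (names are illustrative, on a hypothetical integer-keyed table):

module IMap = Map.Make (Int)

(* Before: every miss raises and catches Not_found. *)
let lookup_with_find table code =
  try Some (IMap.find code table) with Not_found -> None

(* After: Map.find_opt returns an option directly, avoiding the
   exception machinery on the (frequent) miss path. *)
let lookup_with_find_opt table code = IMap.find_opt code table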

With Map.find:

French
              Rate     Ubase Unidecode
    Ubase  65299/s        --      -57%
Unidecode 153583/s      135%        --

Vietnamese
              Rate     Ubase Unidecode
    Ubase 123110/s        --      -65%
Unidecode 349022/s      184%        --

With Map.find_opt:

French
              Rate Unidecode     Ubase
Unidecode 153732/s        --      -19%
    Ubase 189538/s       23%        --

Vietnamese
              Rate     Ubase Unidecode
    Ubase 243077/s        --      -31%
Unidecode 354489/s       46%        --

Interesting. Let’s see how both libs evolve and how memory usage / binary size varies, but I might consider switching to ubase then.

About Å, there are two of them.

A with a ring above (the ordinary “Å”), and the unit Ångström, which also looks like “Å”. (They are actually the same letter with different uses. Is it the same with “µ”?)

They are really the same letter. Wikipedia says “Unicode also has encoded U+212B Å ANGSTROM SIGN. However, that is canonically equivalent to the ordinary letter Å. The duplicate encoding at U+212B is due to round-trip mapping compatibility with an East-Asian character encoding, but is otherwise not to be used.”
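The canonical equivalence is easy to check with uunf, by the way (a quick sketch, assuming OCaml ≥ 4.06 for the \u{...} escapes):

let () =
  (* NFC maps U+212B ANGSTROM SIGN to the ordinary U+00C5 Å. *)
  assert (Uunf_string.normalize_utf_8 `NFC "\u{212B}" = "\u{00C5}")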

@steinuil, @dbuenzli is there such a way? I’d like to port the Go snippet

transform.Chain(norm.NFD, transform.RemoveFunc(func(r rune) bool {
	return unicode.Is(unicode.Mn, r) // Mn: nonspacing marks
}), norm.NFC)

that does exactly that to OCaml, preferably using uucp and friends.
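I imagine something along these lines with uunf, uutf and uucp, though this is untested and “strip_marks” is just a placeholder name:

let strip_marks s =
  (* NFD-decompose so accents become separate combining code points. *)
  let nfd = Uunf_string.normalize_utf_8 `NFD s in
  let b = Buffer.create (String.length s) in
  Uutf.String.fold_utf_8
    (fun () _pos -> function
      | `Uchar u when Uucp.Gc.general_category u = `Mn -> () (* drop marks *)
      | `Uchar u -> Uutf.Buffer.add_utf_8 b u
      | `Malformed _ -> Uutf.Buffer.add_utf_8 b Uutf.u_rep)
    () nfd;
  (* Recompose what is left, mirroring the final norm.NFC step. *)
  Uunf_string.normalize_utf_8 `NFC (Buffer.contents b)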

Two years later, in a discussion of 37 messages, I’m not sure which way you are talking about.

I don’t think that should pose any problem.

Thank you, I’ll work my way through it.

The lynx text web browser also performs this kind of character translation.
You could have a look at how it does this: just create an HTML page with the characters to translate and open it with lynx. You can also dump the result to a text file with lynx -dump.
