[ANN] Ubase 0.03

I’m happy to announce the release of ubase, a tiny library whose only purpose is to remove diacritics (accents, etc.) from UTF-8 encoded strings that use the Latin alphabet.

It was created after the discussion: Simplify roman utf8.

It’s now available from opam:

opam install ubase

This also installs an executable that you may use in a shell, for instance:

$ ubase "et grønt træ"
et gront trae

$ ubase Anh xin lỗi các em bé vì đã đề tặng cuốn sách này cho một ông người lớn.
Anh xin loi cac em be vi da de tang cuon sach nay cho mot ong nguoi lon.

More info here.

Why do you want to do that? I am curious.

@UnixJunkie : quoting from the docs:

Please don’t use this library to store your strings without
accents! On the contrary, store them in full UTF8 encoding, and use
this library to simplify searching and comparison.

Indeed, removing diacritics is very useful for searching in databases. In my case, I am using this to search in lists of names. (Using https://github.com/sanette/ufind .)

Most search engines, like Google, implement something similar: this is evident, for instance, when you type accented words in the search bar and look at the autocomplete suggestions.
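To make the "store in full UTF8, simplify only for searching" advice concrete, here is a minimal sketch in Python. It is not ubase and does not use its API: `search_key`, `find` and the `NAMES` list are hypothetical, and the key is built with the standard `unicodedata` module (canonical decomposition, then dropping combining marks) as a stand-in for what the library does.

```python
import unicodedata

def search_key(text):
    """Accent-insensitive, case-insensitive comparison key (stand-in for ubase)."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold()

# Names are stored with their full UTF-8 spelling (hypothetical sample data).
NAMES = ["Sjöstrand", "Müller", "Nguyễn", "Dupont"]

def find(query):
    """Return the stored names whose simplified form contains the simplified query."""
    key = search_key(query)
    return [name for name in NAMES if key in search_key(name)]

print(find("sjostrand"))  # ['Sjöstrand']
print(find("nguyen"))     # ['Nguyễn']
```

The stored strings are never modified; only the temporary keys lose their accents, so the results are still displayed with the original spelling.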

I’m curious why “et grønt træ” becomes “et gront trae” and not “et groent trae” or “et gront tra”.

Hi @reynir, it’s because of the way the letters are described in the official Unicode database. Here, ø is described as “LATIN SMALL LETTER O WITH STROKE”, while æ is “LATIN SMALL LETTER AE”, so it’s considered a full letter on its own.

The primary goal of the library being to remove diacritics, when something like “LATIN SMALL LETTER O WITH STROKE” is encountered, it’s clear that it should output “o”.
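The name-based reasoning above can be checked with Python's standard `unicodedata` module. This is only a sketch of the idea, not the script that generates ubase: the `base_letter` helper and its regular expression are assumptions made for illustration.

```python
import re
import unicodedata

# Matches names such as "LATIN SMALL LETTER O WITH STROKE".
_WITH = re.compile(r"^LATIN (SMALL|CAPITAL) LETTER ([A-Z]) WITH ")

def base_letter(ch):
    """Return the base letter given in the Unicode name, or None if there is none."""
    m = _WITH.match(unicodedata.name(ch, ""))
    if m is None:
        return None
    case, letter = m.groups()
    return letter.lower() if case == "SMALL" else letter

print(unicodedata.name("ø"))  # LATIN SMALL LETTER O WITH STROKE
print(unicodedata.name("æ"))  # LATIN SMALL LETTER AE
print(base_letter("ø"))       # o
print(base_letter("æ"))       # None
```

For “æ” the name gives no single base letter, which is exactly the case where a choice has to be made by hand.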

Another goal of the library being to obtain ASCII approximations, for the special cases of “letters with a name”, I realised that a (more or less) “educated” choice has to be made, and there is a special list of them in the script that generates the library. For instance, I have chosen to replace ð (“LATIN SMALL LETTER ETH”) by “d” instead of its name “ETH”. For “æ”, since “ae” is commonly found in transliterations, I have chosen to use it. Of course this part is debatable. I’m really not a linguist, and I tried to decide by reading the Wikipedia pages of these letters, so any help is welcome.
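As a rough illustration of such a hand-made list, here is a Python sketch where an override table is consulted first and generic diacritic removal is the fallback. The `SPECIAL` table contains only the three choices mentioned in this thread and is not the library's actual list; the `simplify` function is hypothetical and relies on the standard `unicodedata` module rather than on ubase.

```python
import unicodedata

# Hand-picked replacements for letters without an obvious base letter.
# Only the choices mentioned in this thread; not the library's actual list.
SPECIAL = {"ð": "d", "æ": "ae", "ß": "ss"}

def simplify(text):
    out = []
    for ch in text:
        if ch in SPECIAL:
            # An explicit, "educated" choice takes priority.
            out.append(SPECIAL[ch])
        else:
            # Fallback: canonical decomposition, then drop combining marks.
            # Letters with no decomposition (e.g. ø) pass through unchanged here.
            decomposed = unicodedata.normalize("NFD", ch)
            out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(out)

print(simplify("Guðrún"))  # Gudrun
print(simplify("Straße"))  # Strasse
print(simplify("træ"))     # trae
```

Keeping the debatable choices in one small table makes them easy to review and to change if someone with more linguistic knowledge disagrees.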

But replacing ø with oe is as common in Danish as replacing ü with ue in German. Replacing them with the base letter is essentially unhelpful and might yield different words with different meanings in the worst case.

So some additional approximations need to be added, like ø -> oe, å -> aa, ü -> ue, ä -> ae, ö -> oe, ß -> ss (which I believe are common asciifications in Danish, Norwegian, Swedish and German).

It’s not correct that replacing ö with o is unhelpful. If you type “sjostrand” in Google you will find Sjöstrand, and we need to be aware of this when writing search engines. When you encounter a name in a language you don’t know, and try to search for it in a list, it makes sense to just remove the diacritics. That’s what ubase was written for (note that ‘base’ stands for “base letter”). It’s only when no base letter is clear that we need to make a sensible choice (btw: ß -> ss is already there). This philosophy is somewhat similar to what can be found in other (computer) languages, e.g. https://metacpan.org/pod/Text::Unaccent::PurePerl

I agree that replacing “ö” by “oe” is also useful, and it’s easy to add these as custom rules in the library if you wish. But it’s a different goal, and then we are entering an endless (or at least quite long :wink: ) list of possible transliterations: when do we stop? What if different languages have different transliterations? (Maybe this should be the role of language plugins, but that won’t work when you’re facing a list of names of many origins.) In fact, even some transliteration libraries won’t do this, see https://metacpan.org/pod/Text::Unidecode
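To illustrate the problem with per-language rules, here is a small Python sketch. Everything in it is hypothetical (the `RULES` table, the language codes, the `transliterate` function) and it does not show how custom rules are actually declared in ubase; the entries are simply the ones suggested earlier in this thread, lowercase only.

```python
import unicodedata

# Hypothetical per-language rule sets (lowercase letters only, for brevity).
RULES = {
    "de": {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"},
    "da": {"ø": "oe", "å": "aa", "æ": "ae"},
}

def strip_marks(text):
    """Generic fallback: canonical decomposition, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def transliterate(text, lang=None):
    """Apply the rules of `lang` (if any) first, then remove remaining diacritics."""
    rules = RULES.get(lang, {})
    composed = unicodedata.normalize("NFC", text)  # so that the table keys match
    replaced = "".join(rules.get(ch, ch) for ch in composed)
    return strip_marks(replaced)

print(transliterate("Sjöstrand"))        # Sjostrand
print(transliterate("Sjöstrand", "de"))  # Sjoestrand
print(transliterate("blå", "da"))        # blaa
```

The same stored name yields two different keys depending on the language assumed, so in a list of names of mixed origin the query and the stored entry only match if both sides happen to use the same rules. Plain base-letter removal avoids having to make that guess.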