[ANN] Ubase 0.03

I’m happy to announce the release of ubase, a tiny library whose only purpose is to remove diacritics (accents, etc.) from UTF-8 encoded strings that use the Latin alphabet.

It was created after the discussion: Simplify roman utf8.

It’s now available from opam:

opam install ubase

This also installs an executable that you may use in a shell, for instance:

$ ubase "et grønt træ"
et gront trae

$ ubase Anh xin lỗi các em bé vì đã đề tặng cuốn sách này cho một ông người lớn.
Anh xin loi cac em be vi da de tang cuon sach nay cho mot ong nguoi lon.

More info here.

Why do you want to do that? I am curious.

@UnixJunkie : quoting from the docs:

Please don’t use this library to store your strings without
accents! On the contrary, store them in full UTF8 encoding, and use
this library to simplify searching and comparison.

Indeed, removing diacritics is very useful for searching in databases. In my case, I am using this to search in lists of names. (Using https://github.com/sanette/ufind .)

Most search engines, such as Google, implement something similar. This is evident, for instance, when you type accented words in the search bar and look at the autocomplete suggestions.
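
To make the "store with accents, search without" idea concrete, here is a minimal Python sketch. It is not ubase or ufind code: it only uses Python's standard `unicodedata` module, and the helper names `strip_marks` and `search` are made up for this illustration. It assumes that decomposing to NFD and dropping combining marks is good enough, which is only part of what a base-letter table does.

```python
import unicodedata

def strip_marks(s):
    """Decompose to NFD, then drop combining marks (accents)."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def search(query, names):
    """Return the stored names (unchanged) whose simplified form contains the query."""
    key = strip_marks(query).casefold()
    return [n for n in names if key in strip_marks(n).casefold()]

# The stored list keeps its accents; only the comparison keys are simplified.
names = ["Sj\u00f6strand", "Strand", "Ren\u00e9e"]
print(search("sjostrand", names))
print(search("renee", names))
```

Note that letters such as ø have no canonical decomposition, so this sketch leaves them untouched; handling them needs an explicit table.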

I’m curious why “et grønt træ” becomes “et gront trae” and not “et groent trae” or “et gront tra”.

Hi @reynir, it’s because of the way the letters are described in the official Unicode database. Here, ø is described as “LATIN SMALL LETTER O WITH STROKE”, while æ is “LATIN SMALL LETTER AE”, so it’s considered a full letter in its own right.

The primary goal of the library being to remove diacritics, when something like “LATIN SMALL LETTER O WITH STROKE” is encountered, it’s clear that it should output “o”.

Another goal of the library being to obtain ASCII approximations, for the special cases of “letters with a name”, I realised that a (more or less) “educated” choice has to be made, and there is a special list of them in the script that generates the library. For instance, I have chosen to replace ð (“LATIN SMALL LETTER ETH”) by “d” instead of its name “ETH”. For “æ”, since “ae” is commonly found in transliterations, I have chosen to use it. Of course this part is debatable. I’m really not a linguist, and I tried to decide by reading the Wikipedia pages of these letters, so any help is welcome.
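
The rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the script that generates ubase: the function names `base_letter` and `simplify` and the small `NAMED_LETTERS` table are hypothetical, and the table holds just the choices mentioned in this thread.

```python
import unicodedata

# Hypothetical hand-picked table for "letters with a name", i.e. letters
# whose Unicode name has no "WITH ..." part to strip.
NAMED_LETTERS = {"æ": "ae", "Æ": "AE", "ð": "d", "Ð": "D", "ß": "ss"}

PREFIXES = (
    ("LATIN SMALL LETTER ", str.lower),
    ("LATIN CAPITAL LETTER ", str.upper),
)

def base_letter(ch):
    """Map one character to its base letter, or return it unchanged."""
    if ch in NAMED_LETTERS:
        return NAMED_LETTERS[ch]
    name = unicodedata.name(ch, "")
    for prefix, convert in PREFIXES:
        if name.startswith(prefix) and " WITH " in name:
            # "LATIN SMALL LETTER O WITH STROKE" -> "O"
            letter = name[len(prefix):].split(" WITH ")[0]
            if len(letter) == 1:  # skip multi-letter names such as "DZ"
                return convert(letter)
    return ch

def simplify(s):
    return "".join(base_letter(c) for c in s)

print(simplify("et gr\u00f8nt tr\u00e6"))  # et gront trae
```

The sketch expects precomposed (NFC) input, since it looks at one code point at a time; decomposed input would need normalizing first.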

But replacing ø with oe is as common in Danish as replacing ü with ue in German. Replacing them with the base letter is essentially unhelpful and might yield different words with different meanings in the worst case.

So some additional approximations need to be added like ø -> oe, å -> aa, ü -> ue, ä -> ae, ö -> oe, ß -> ss (which I would believe is a common asciification in Danish, Norwegian, Swedish and German).

It’s not correct that replacing ö with o is not helpful. If you type “sjostrand” in Google you will find Sjöstrand, and we need to be aware of this when writing search engines. When you encounter a name in a language you don’t know, and try to search for it in a list, it makes sense to just remove the diacritics. That’s what ubase was written for (note that ‘base’ stands for “base letter”). It’s only when no base letter is clear that we need a sensible choice (btw: ß -> ss is already there). This philosophy is somewhat similar to what can be found in other (computer) languages, e.g. https://metacpan.org/pod/Text::Unaccent::PurePerl

I agree that replacing “ö” by “oe” is also useful, and it’s easy to add these as custom rules in the library if you wish. But it’s a different goal, and then we are entering an endless (or at least quite long :wink: ) list of possible transliterations: when do we stop? What if different languages have different transliterations? (Maybe this should be the role of language plugins, but that won’t work when you’re facing a list of names of many origins.) In fact, even some transliteration libraries won’t do this; see https://metacpan.org/pod/Text::Unidecode

The Wikipedia page on Ø says that the transliteration ‘oe’ is correct but that ‘o’ is common:

In other languages that do not have the letter as part of the regular alphabet, or in limited character sets such as ASCII, ø may correctly be replaced with the digraph “oe”, although in practice it is often replaced with just an “o”, e.g. in email addresses.

@Leonidas has a point that ‘o’ is not a good transliteration. It can easily change the meaning of the word, for example “mør” meaning tender becomes “mor” meaning mom. It’s not difficult to think of more examples.

Interestingly, it seems ubase doesn’t handle œ currently.

Thanks for reporting! Indeed, “ɶ” is recognized but not “Œ”… I think I know why…

Maybe it was not clear in my previous answer that the main goal of the library is to remove accents, not to transliterate.

Perhaps the real goal is to facilitate a fuzzy comparison. In that case, both “oe” and “o” should be considered as equivalent to “ø” and “ö”, no? Does this library help with both cases?

You’re right, that’s an important goal.
In fact, fuzzy searching is not implemented directly in ubase but in ufind.
(I haven’t made an official announcement for it because I find it somewhat too basic right now, although I’m already using it for my job.)
ufind will easily compare ‘ø’ and ‘ö’ to ‘o’ because it’s just “removing an accent”, which is precisely what ubase does. As you pointed out, the downside is that (with the current ufind) if you type “oe” to search for a word that contains ‘ø’ or ‘ö’, it will not work very well. Of course I plan to improve this, but currently I find it easier to tell users that when searching for names, they just have to type them “with accents removed”.
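
For what it’s worth, one way to accept both “o” and “oe” is to index each name under several keys. The Python sketch below is not how ufind works; `keys`, `matches`, and the two tables are hypothetical names, and the digraph list is just the one suggested earlier in this thread.

```python
import unicodedata

# Assumed tables, taken from the examples in this thread.
DIGRAPHS = {"ø": "oe", "ö": "oe", "ü": "ue", "ä": "ae", "å": "aa"}
BASE = {"ø": "o", "æ": "ae"}  # letters that NFD cannot decompose

def strip_marks(s):
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def keys(text):
    """Return the set of ASCII-ish keys under which a text can be found."""
    s = unicodedata.normalize("NFC", text).casefold()
    plain = strip_marks("".join(BASE.get(c, c) for c in s))
    digraph = strip_marks("".join(DIGRAPHS.get(c, BASE.get(c, c)) for c in s))
    return {plain, digraph}

def matches(query, name):
    """True when the query and the name share at least one key."""
    return bool(keys(query) & keys(name))

print(matches("sjostrand", "Sj\u00f6strand"))   # True
print(matches("sjoestrand", "Sj\u00f6strand"))  # True
print(matches("mor", "m\u00f8r"))               # True
```

The last line shows the cost already discussed above: once “o” is accepted for ø, “mor” and “mør” become indistinguishable.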

Since you mentioned Unicode, are you using (or planning to use) something like this as a guide? https://www.w3.org/TR/charmod-norm/

I’m mainly using this: http://www.unicode.org/versions/Unicode13.0.0/
in particular the parts concerning Unicode normalization and casefolding.
It seems to cover the document you refer to, but thanks for the link; it’s probably easier to read.
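
As a small illustration of why normalization and casefolding matter for comparison, here is a Python sketch using only the standard `unicodedata` module. It has nothing to do with ubase's internals, and `same_text` is a made-up helper name.

```python
import unicodedata

def same_text(a, b):
    """Caseless comparison: NFD, casefold, then NFD again on both sides."""
    def fold(s):
        nfd = unicodedata.normalize("NFD", s)
        return unicodedata.normalize("NFD", nfd.casefold())
    return fold(a) == fold(b)

composed = "\u00e9"     # é as a single code point
decomposed = "e\u0301"  # e followed by a combining acute accent

print(composed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
print(same_text("\u00c9", decomposed))                       # True
print(same_text("Stra\u00dfe", "STRASSE"))                   # True
```

The two spellings of é look identical on screen but only compare equal after normalization, which is why a diacritics remover has to handle both forms.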

Yeah, I think it’s more specific to your purpose :slight_smile:

This is now corrected in ubase 0.04. Thanks again, @reynir, for reporting.