Simplify roman utf8

Hello

Is there a library (or a simple way) to “remove accents” from a string containing a UTF-8 encoded word from a language that uses the Roman alphabet plus diacritics?
I mean: removing the diacritics to obtain ASCII.
For instance: “é” -> “e”, “ø” -> “o” (or “oe”), “ọ” -> “o”, “ñ” -> “n”, etc.

thanks

I’m sure it can be done with Uucd, but I wouldn’t mind a hint.
In the meantime I wrote a script to (brutally) extract an association list from
http://www.fileformat.info/info/charset/UTF-8/list.htm

Here it is

[("À", "A");
("Á", "A");
("Â", "A");
("Ã", "A");
("Ä", "A");
("Å", "A");
("Æ", "AE");
("Ç", "C");
("È", "E");
("É", "E");
("Ê", "E");
("Ë", "E");
("Ì", "I");
("Í", "I");
("Î", "I");
("Ï", "I");
("Ð", "ETH");
("Ñ", "N");
("Ò", "O");
("Ó", "O");
("Ô", "O");
("Õ", "O");
("Ö", "O");
("Ø", "O");
("Ù", "U");
("Ú", "U");
("Û", "U");
("Ü", "U");
("Ý", "Y");
("Þ", "THORN");
("ß", "s");
("à", "a");
("á", "a");
("â", "a");
("ã", "a");
("ä", "a");
("å", "a");
("æ", "ae");
("ç", "c");
("è", "e");
("é", "e");
("ê", "e");
("ë", "e");
("ì", "i");
("í", "i");
("î", "i");
("ï", "i");
("ð", "eth");
("ñ", "n");
("ò", "o");
("ó", "o");
("ô", "o");
("õ", "o");
("ö", "o");
("ø", "o");
("ù", "u");
("ú", "u");
("û", "u");
("ü", "u");
("ý", "y");
("þ", "thorn");
("ÿ", "y");
("Ā", "A");
("ā", "a");
("Ă", "A");
("ă", "a");
("Ą", "A");
("ą", "a");
("Ć", "C");
("ć", "c");
("Ĉ", "C");
("ĉ", "c");
("Ċ", "C");
("ċ", "c");
("Č", "C");
("č", "c");
("Ď", "D");
("ď", "d");
("Đ", "D");
("đ", "d");
("Ē", "E");
("ē", "e");
("Ĕ", "E");
("ĕ", "e");
("Ė", "E");
("ė", "e");
("Ę", "E");
("ę", "e");
("Ě", "E");
("ě", "e");
("Ĝ", "G");
("ĝ", "g");
("Ğ", "G");
("ğ", "g");
("Ġ", "G");
("ġ", "g");
("Ģ", "G");
("ģ", "g");
("Ĥ", "H");
("ĥ", "h");
("Ħ", "H");
("ħ", "h");
("Ĩ", "I");
("ĩ", "i");
("Ī", "I");
("ī", "i");
("Ĭ", "I");
("ĭ", "i");
("Į", "I");
("į", "i");
("İ", "I");
("ı", "i");
("Ĵ", "J");
("ĵ", "j");
("Ķ", "K");
("ķ", "k");
("ĸ", "kra");
("Ĺ", "L");
("ĺ", "l");
("Ļ", "L");
("ļ", "l");
("Ľ", "L");
("ľ", "l");
("Ŀ", "L");
("ŀ", "l");
("Ł", "L");
("ł", "l");
("Ń", "N");
("ń", "n");
("Ņ", "N");
("ņ", "n");
("Ň", "N");
("ň", "n");
("ʼn", "n");
("Ŋ", "ENG");
("ŋ", "eng");
("Ō", "O");
("ō", "o");
("Ŏ", "O");
("ŏ", "o");
("Ő", "O");
("ő", "o");
("Ŕ", "R");
("ŕ", "r");
("Ŗ", "R");
("ŗ", "r");
("Ř", "R");
("ř", "r");
("Ś", "S");
("ś", "s");
("Ŝ", "S");
("ŝ", "s");
("Ş", "S");
("ş", "s");
("Š", "S");
("š", "s");
("Ţ", "T");
("ţ", "t");
("Ť", "T");
("ť", "t");
("Ŧ", "T");
("ŧ", "t");
("Ũ", "U");
("ũ", "u");
("Ū", "U");
("ū", "u");
("Ŭ", "U");
("ŭ", "u");
("Ů", "U");
("ů", "u");
("Ű", "U");
("ű", "u");
("Ų", "U");
("ų", "u");
("Ŵ", "W");
("ŵ", "w");
("Ŷ", "Y");
("ŷ", "y");
("Ÿ", "Y");
("Ź", "Z");
("ź", "z");
("Ż", "Z");
("ż", "z");
("Ž", "Z");
("ž", "z");
("ſ", "s");
("ƀ", "b");
("Ɓ", "B");
("Ƃ", "B");
("ƃ", "b");
("Ƅ", "SIX");
("ƅ", "six");
("Ɔ", "O");
("Ƈ", "C");
("ƈ", "c");
("Ɖ", "D");
("Ɗ", "D");
("Ƌ", "D");
("ƌ", "d");
("ƍ", "delta");
("Ǝ", "E");
("Ə", "SCHWA");
("Ɛ", "E");
("Ƒ", "F");
("ƒ", "f");
("Ɠ", "G");
("Ɣ", "GAMMA");
("ƕ", "hv");
("Ɩ", "IOTA");
("Ɨ", "I");
("Ƙ", "K");
("ƙ", "k");
("ƚ", "l");
("ƛ", "lambda");
("Ɯ", "M");
("Ɲ", "N");
("ƞ", "n");
("Ɵ", "O");
("Ơ", "O");
("ơ", "o");
("Ƣ", "OI");
("ƣ", "oi");
("Ƥ", "P");
("ƥ", "p");
("Ƨ", "TWO");
("ƨ", "two");
("Ʃ", "ESH");
("ƫ", "t");
("Ƭ", "T");
("ƭ", "t");
("Ʈ", "T");
("Ư", "U");
("ư", "u");
("Ʊ", "UPSILON");
("Ʋ", "V");
("Ƴ", "Y");
("ƴ", "y");
("Ƶ", "Z");
("ƶ", "z");
("Ʒ", "EZH");
("Ƹ", "EZH");
("ƹ", "ezh");
("ƺ", "ezh");
("Ƽ", "FIVE");
("ƽ", "five");
("DŽ", "DZ");
("Dž", "D");
("dž", "dz");
("LJ", "LJ");
("Lj", "L");
("lj", "lj");
("NJ", "NJ");
("Nj", "N");
("nj", "nj");
("Ǎ", "A");
("ǎ", "a");
("Ǐ", "I");
("ǐ", "i");
("Ǒ", "O");
("ǒ", "o");
("Ǔ", "U");
("ǔ", "u");
("Ǖ", "U");
("ǖ", "u");
("Ǘ", "U");
("ǘ", "u");
("Ǚ", "U");
("ǚ", "u");
("Ǜ", "U");
("ǜ", "u");
("ǝ", "e");
("Ǟ", "A");
("ǟ", "a");
("Ǡ", "A");
("ǡ", "a");
("Ǣ", "AE");
("ǣ", "ae");
("Ǥ", "G");
("ǥ", "g");
("Ǧ", "G");
("ǧ", "g");
("Ǩ", "K");
("ǩ", "k");
("Ǫ", "O");
("ǫ", "o");
("Ǭ", "O");
("ǭ", "o");
("Ǯ", "EZH");
("ǯ", "ezh");
("ǰ", "j");
("DZ", "DZ");
("Dz", "D");
("dz", "dz");
("Ǵ", "G");
("ǵ", "g");
("Ƕ", "HWAIR");
("Ƿ", "WYNN");
("Ǹ", "N");
("ǹ", "n");
("Ǻ", "A");
("ǻ", "a");
("Ǽ", "AE");
("ǽ", "ae");
("Ǿ", "O");
("ǿ", "o");
("Ȁ", "A");
("ȁ", "a");
("Ȃ", "A");
("ȃ", "a");
("Ȅ", "E");
("ȅ", "e");
("Ȇ", "E");
("ȇ", "e");
("Ȉ", "I");
("ȉ", "i");
("Ȋ", "I");
("ȋ", "i");
("Ȍ", "O");
("ȍ", "o");
("Ȏ", "O");
("ȏ", "o");
("Ȑ", "R");
("ȑ", "r");
("Ȓ", "R");
("ȓ", "r");
("Ȕ", "U");
("ȕ", "u");
("Ȗ", "U");
("ȗ", "u");
("Ș", "S");
("ș", "s");
("Ț", "T");
("ț", "t");
("Ȝ", "YOGH");
("ȝ", "yogh");
("Ȟ", "H");
("ȟ", "h");
("Ƞ", "N");
("ȡ", "d");
("Ȣ", "OU");
("ȣ", "ou");
("Ȥ", "Z");
("ȥ", "z");
("Ȧ", "A");
("ȧ", "a");
("Ȩ", "E");
("ȩ", "e");
("Ȫ", "O");
("ȫ", "o");
("Ȭ", "O");
("ȭ", "o");
("Ȯ", "O");
("ȯ", "o");
("Ȱ", "O");
("ȱ", "o");
("Ȳ", "Y");
("ȳ", "y");
("ȴ", "l");
("ȵ", "n");
("ȶ", "t");
("ȷ", "j");
("ȸ", "db");
("ȹ", "qp");
("Ⱥ", "A");
("Ȼ", "C");
("ȼ", "c");
("Ƚ", "L");
("Ⱦ", "T");
("ȿ", "s");
("ɀ", "z");
("Ɂ", "STOP");
("ɂ", "stop");
("Ƀ", "B");
("Ʉ", "U");
("Ʌ", "V");
("Ɇ", "E");
("ɇ", "e");
("Ɉ", "J");
("ɉ", "j");
("Ɋ", "Q");
("ɋ", "q");
("Ɍ", "R");
("ɍ", "r");
("Ɏ", "Y");
("ɏ", "y");
("ɐ", "a");
("ɑ", "alpha");
("ɒ", "alpha");
("ɓ", "b");
("ɔ", "o");
("ɕ", "c");
("ɖ", "d");
("ɗ", "d");
("ɘ", "e");
("ə", "schwa");
("ɚ", "schwa");
("ɛ", "e");
("ɜ", "e");
("ɝ", "e");
("ɞ", "e");
("ɟ", "j");
("ɠ", "g");
("ɡ", "script");
("ɣ", "gamma");
("ɤ", "rams");
("ɥ", "h");
("ɦ", "h");
("ɧ", "heng");
("ɨ", "i");
("ɩ", "iota");
("ɫ", "l");
("ɬ", "l");
("ɭ", "l");
("ɮ", "lezh");
("ɯ", "m");
("ɰ", "m");
("ɱ", "m");
("ɲ", "n");
("ɳ", "n");
("ɵ", "barred");
("ɷ", "omega");
("ɸ", "phi");
("ɹ", "r");
("ɺ", "r");
("ɻ", "r");
("ɼ", "r");
("ɽ", "r");
("ɾ", "r");
("ɿ", "r");
("ʂ", "s");
("ʃ", "esh");
("ʄ", "j");
("ʅ", "squat");
("ʆ", "esh");
("ʇ", "t");
("ʈ", "t");
("ʉ", "u");
("ʊ", "upsilon");
("ʋ", "v");
("ʌ", "v");
("ʍ", "w");
("ʎ", "y");
("ʐ", "z");
("ʑ", "z");
("ʒ", "ezh");
("ʓ", "ezh");
("ʚ", "e");
("ʞ", "k");
("ʠ", "q");
("ʣ", "dz");
("ʤ", "dezh");
("ʥ", "dz");
("ʦ", "ts");
("ʧ", "tesh");
("ʨ", "tc");
("ʩ", "feng");
("ʪ", "ls");
("ʫ", "lz");
("ʮ", "h");
("ʯ", "h")]
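A minimal sketch of how such a table can be applied, using only the standard library (OCaml ≥ 4.14 for `String.get_utf_8_uchar`). Only a tiny subset of the table above is shown, and it assumes the input is in NFC, so that each accented letter is a single scalar value:

```ocaml
(* Tiny excerpt of the table above, for illustration only. *)
let table = [ ("é", "e"); ("ø", "o"); ("ñ", "n"); ("À", "A") ]

let simplify s =
  let buf = Buffer.create (String.length s) in
  let i = ref 0 in
  while !i < String.length s do
    let d = String.get_utf_8_uchar s !i in
    let n = Uchar.utf_decode_length d in
    let c = String.sub s !i n in
    (* keep the character as-is when it has no entry in the table *)
    Buffer.add_string buf (Option.value ~default:c (List.assoc_opt c table));
    i := !i + n
  done;
  Buffer.contents buf
```

In practice the input could first be brought to NFC with `Uunf_string.normalize_utf_8 `NFC`, for the reason explained below.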

Note that if you want to apply that list, you need to make sure the keys in the list and your data are in the same known Unicode normal form, otherwise this is going to fail brutally.
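To make this caveat concrete, here is a stdlib-only illustration: the two standard encodings of “café” are different strings byte-wise, so a table keyed on one form silently misses the other.

```ocaml
(* "é" as the single scalar U+00E9 (the NFC form), versus "e" followed
   by U+0301 COMBINING ACUTE ACCENT (the NFD form). *)
let nfc = "caf\u{00E9}"  (* "café", precomposed *)
let nfd = "cafe\u{0301}" (* "café", decomposed  *)

let () = assert (nfc <> nfd)  (* visually identical, byte-wise different *)
let () = assert (String.length nfc = 5 && String.length nfd = 6)
```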

It really depends on what you are doing exactly, but one way to go about this would be to convert the string to a decomposed normal form (likely NFKD if you want to get rid of ligatures like ﬁ) and filter out the scalar values for which Uucp.Func.is_diacritic is true. However, note that this wouldn’t work for some of the things that are in your list (e.g. Đ decomposes to itself; see ucharinfo U+0110 if you have uucp installed).
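A toy stdlib-only version of this approach, assuming the input is already decomposed (in practice you would run it through Uunf first) and approximating `Uucp.Func.is_diacritic` with just the Combining Diacritical Marks block (U+0300–U+036F), which is narrower than the real predicate:

```ocaml
(* Drop combining marks from an already-decomposed UTF-8 string.
   The range check is a crude stand-in for Uucp.Func.is_diacritic. *)
let strip_marks s =
  let buf = Buffer.create (String.length s) in
  let i = ref 0 in
  while !i < String.length s do
    let d = String.get_utf_8_uchar s !i in
    let u = Uchar.utf_decode_uchar d in
    let n = Uchar.to_int u in
    if not (n >= 0x0300 && n <= 0x036F) then Buffer.add_utf_8_uchar buf u;
    i := !i + Uchar.utf_decode_length d
  done;
  Buffer.contents buf
```

As noted above, Đ (U+0110) passes through untouched, since it carries no combining mark to strip.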


I have a list of names, and I would like to search it in a case-insensitive, diacritic-insensitive (other-insensitive?) way.

I’m sure it’s a universal problem that must have been solved somewhere, but I couldn’t find it (at least from a quick googling). For instance, the search engine “Qwant” clearly knows how to do it: if you type “Bøǹĵöůɍ” in the search bar, the search suggestions that appear are “bonjour madame”, “bonjour tout le monde”, etc. Google is stricter: it works with “Bønĵöůr” but not with “Bøǹĵöůɍ”.

From searching I found this example in C#, which normalizes to Form D and then uses UnicodeCategory.NonSpacingMark, which maps to the general category Mn, to filter diacritics.

It’s a universal problem whose answer highly depends on your context (see precision and recall), so you are unlikely to find a definitive answer to it.

You might consider converting your queries and searched atoms to DUCET collation keys and doing the comparison on those at level 1, then seeing if that satisfies your needs (a look at the definition of the default collation table seems to indicate that level 1 sorts, for example, U+0110 and U+0044 as the same). Camomile has support for collation (for an old version of Unicode, something like 3 or 4).

Other than that, browsing a bit, Lucene mentions UTR #30, which apparently never made it to a Unicode standard but has this data file.


@steinuil that won’t handle “Đ” -> “D”:

# Uunf_string.normalize_utf_8 `NFD "Đ" = "Đ";;
- : bool = true

Great find, thanks!

Finally I ended up using only Uucp.Name.name and Uucp.Script.script plus some ad-hoc filtering of the name. It’s maybe not very elegant, but I think I obtain a list that is more complete (for Latin only) than the one in the UTR-30 data file.
[And I use NFC normalization.]
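A toy reconstruction of what such name-based filtering might look like. In the real code Uucp.Name.name would supply the character name; here it is passed in directly, and the heuristic (keep the last word before “WITH”, lowercased for SMALL letters) is a guess at the idea, not the actual implementation:

```ocaml
(* From a Unicode character name such as "LATIN SMALL LETTER E WITH
   ACUTE", keep the base letter and drop the "WITH ..." part. *)
let base_of_name name =
  let words = String.split_on_char ' ' name in
  let rec before_with = function
    | [] | "WITH" :: _ -> []
    | w :: ws -> w :: before_with ws
  in
  match List.rev (before_with words) with
  | base :: _ ->
      if List.mem "SMALL" words then String.lowercase_ascii base else base
  | [] -> ""
```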

I just have a remark as a user (it might not even apply to your case): most of the time I find it annoying when search boxes don’t behave that way: you get the accent wrong (very common when typing French, mixing up é and è for instance) and just because of that there is no match. Typically my contacts list on my phone does this and I hate it. Or you enter the thing correctly, but the database only knows about the ASCII version (typically, place names).

However, it has happened to me a few times that a search box did this and I didn’t want it to, because I entered the thing correctly and the variant without diacritics is a super common word. Because of that, I think the proper behaviour would be to prioritize exact matches and only then search for the variant without diacritics. Google gets this right, for instance: searching for “thé” and “the” gets you different results.

This is a very sensible remark. It should not be too difficult to implement a ‘distance’ function that increases with the number of substituted diacritics/accents.

And indeed, in this respect, Google is better than Qwant.
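A crude way to sketch this ranking: exact matches first, then diacritic-insensitive ones. `simplify` stands for whatever accent-removal function is used (a table-based one, for example):

```ocaml
(* Rank 0 for exact matches, 1 for matches up to simplification,
   drop everything else, and return names in rank order. *)
let search simplify names query =
  let rank name =
    if String.equal name query then Some (0, name)
    else if String.equal (simplify name) (simplify query) then Some (1, name)
    else None
  in
  names |> List.filter_map rank |> List.sort compare |> List.map snd
```

For example, with `String.lowercase_ascii` standing in for `simplify`, `search String.lowercase_ascii ["The"; "the"; "cat"] "the"` returns `["the"; "The"]`.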

I have a list of names, and I would like to search it in a case-insensitive, diacritic-insensitive (other-insensitive?) way.

This is what is expected in geneweb as well.

See lib/util/name.ml#L45 and lib/util/name.ml#L438

Also see this PR https://github.com/geneweb/geneweb/pull/763, which improves support for Cyrillic characters.

I plan to extract this part of geneweb and release it as a standalone library after some refactoring.

This is hand-written code (not mine) and it definitely lacks a lot of cases, but I expect it to be pretty fast in comparison with other UTF-8 libraries.

Hi! Thanks for the links. If you do make a standalone library, I would definitely give it a try.

OK, so I have written my own library, because it was fun. But the best list online is, I think:

They don’t explain how they build their list. I suspect it’s partly by hand.

If you are interested, I have released the library on github:

PRs welcome!


Thanks for this tiny but helpful library. I have several questions and suggestions.

  1. Can’t you rely more on Unicode normal forms for simplification? For example, why not use a decomposed normal form and then simply filter out the combining codepoints? [After re-reading I realize this was what @dbuenzli and @steinuil were suggesting.]

  2. In your list automatically extracted from fileformat.info, you have conversions for letters which are not Latin letters with diacritics, for example:

    "ð/Ð" → "eth/ETH" (Icelandic)
    "α" → "alpha"     (Greek)
    "ƛ" → "lambda"    (phonetic transcription of some Amerindian languages)
    

    Those are not transliterations but the names of the base characters. I believe this does not make much sense. If your goal is simplifying strings for stuff-insensitive comparisons, you probably want Greek letters to remain Greek letters, Cyrillic letters to remain Cyrillic letters, and so on. The UTR-30 list does the right thing about this. If your goal is to compile to ASCII, maybe a more sensible output would be e.g. “Raðljóst” → “Ra{eth}ljost”.

    (In the case of the Scandinavian languages, you should ask an Icelandic/Danish/Norwegian speaker, but I believe the following Latin transliterations might be acceptable:)

    "δ/Ð" → "dh/Dh"
    "þ/Þ" → "th/Th"
    "ø/Ø" → "o/O" (already in your list)
    "å/Å" → "aa/Aa" (in your list you transliterate it to a/A)
    
  3. Since you cared about handling typographic characters such as apostrophes, you’ll probably want to transliterate non-breaking spaces (U+00A0 and U+202F at least) to simple spaces. And, why not, various types of dashes (U+2010 the “true” hyphen, U+2212 the “true” minus sign, U+2013 the en dash, U+2014 the em dash…). Oh, and what about U+2026, the ellipsis? :wink:

  4. Some applications (such as IRC clients) support a mixed mode where bytes which do not form valid UTF-8 code points are re-interpreted as Windows-1252 (a superset of Latin-1) codes. This is to accommodate broken input coming from other applications or users. Now, I have mixed feelings about this mixed mode, as it hides problems, but would it be possible to implement this behavior in your library? Does Uutf give such flexibility?

  5. Moving your very long list to a separate file would make reading and editing the source code much easier.
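A sketch of the mapping suggested in point 3 above, using only the standard library; the table is illustrative, not exhaustive, and the replacements are one possible choice, not the library’s actual behavior:

```ocaml
(* A few typographic characters one might map to ASCII, keyed by
   Unicode code point. *)
let typographic = [
  (0x00A0, " ");   (* NO-BREAK SPACE *)
  (0x202F, " ");   (* NARROW NO-BREAK SPACE *)
  (0x2010, "-");   (* HYPHEN *)
  (0x2013, "-");   (* EN DASH *)
  (0x2014, "-");   (* EM DASH *)
  (0x2212, "-");   (* MINUS SIGN *)
  (0x2026, "..."); (* HORIZONTAL ELLIPSIS *)
]

let simplify_typographic s =
  let buf = Buffer.create (String.length s) in
  let i = ref 0 in
  while !i < String.length s do
    let d = String.get_utf_8_uchar s !i in
    let u = Uchar.utf_decode_uchar d in
    (match List.assoc_opt (Uchar.to_int u) typographic with
     | Some r -> Buffer.add_string buf r
     | None -> Buffer.add_utf_8_uchar buf u);
    i := !i + Uchar.utf_decode_length d
  done;
  Buffer.contents buf
```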

Hi @Maelan, thanks a lot for your suggestions. I think they are very useful, and I will try to implement them when I have the time. Remember that PRs are welcome.

Some quick comments for tonight:

  1. About using a decomposed normal form: some libraries do this, but as @dbuenzli remarked, it will not work for Đ (U+0110), which is NOT the same character as “Ð = eth” (U+00D0), even though they are graphically identical. Maybe the same occurs for other letters (I have not checked), but because of this I decided not to follow this approach.

  2. The important thing is that the map is generated automatically, obviously because it’s faster, but also because it’s more fun to do. Of course a manual map would certainly be more accurate (and I should probably accept manual modifications if time permits or if I get some help). I have relied on the “name” field of the letters, which is why “eth” occurs. Concerning “alpha”, it’s more complicated than what you present: this alpha, U+0251, IS a Latin letter alpha (whatever that means!); its official name is LATIN SMALL LETTER ALPHA. The Greek alpha is U+03B1. It appears like this in my map due to the automatic extraction I do, based on the LATIN keyword. You can notice that most other Greek letters don’t occur (though some of them do; I don’t have an explanation for this, and if someone can share one I’d be happy to know).

For “ð/Ð” → “dh/Dh”, “þ/Þ” → “th/Th” and “å/Å” → “aa/Aa”: thank you, I will modify these.

  3. Thank you for the suggestions! Please PR; I have a special “manual list” in my code which is meant for these.

  4. I don’t know right now; I have to investigate this.

  5. Good idea, that was actually my plan too at some point.

No, the devil is in the details. The UTR-30 list has Đ (U+0110) = LATIN CAPITAL LETTER D WITH STROKE → D,
but it does not have Ð (U+00D0) = LATIN CAPITAL LETTER ETH.

Ah, my bad, I hadn’t spotted that there was a “Latin letter D with stroke”, a “Latin letter alpha”, and so on.

But isn’t that what we want? In my understanding, U+0110 (LATIN CAPITAL LETTER D WITH STROKE) indeed is the Latin letter D with a diacritic, but U+00D0 (LATIN CAPITAL LETTER ETH) is a letter per se, the letter Ð, which (I guess) is not regarded by Icelandic orthography as a decorated variant of another letter. An Icelandic-speaking person could tell us whether this letter should be simplified to D for searching (is substituting D for Ð a common misspelling?) or collating.

The same may apply to the Latin duplicates of the Greek letters, like this Latin alpha (whatever that means, as you said) or this Latin lambda, and I guess that’s why they don’t appear in the UTR-30 dataset either.

I didn’t PR you because I had several unrelated remarks anyway. Maybe I will at some point.

There was an error in my tentative transliteration of the Scandinavian letters: according to Wikipedia, ð is transliterated to “th” but þ is transliterated to “d”. As for Danish, I am pretty sure that Danes use “å” and “aa” interchangeably, at least in an international context (see Aarhus).

according to Wikipedia, ð is indeed transliterated to “th” but þ is transliterated to “d”.

Umm, according to this Wikipedia link, ð is transliterated to “d” and þ to “th”!