[ANN] Bilingual word equivalences

This message is to announce the “bilingual-eq” website, an attempt to create an equivalent of the website linguee.com

This website has been built from the free content of Wiktionary and Wikisource.

I extract word equivalences from the Wiktionary dumps:
https://dumps.wikimedia.org/enwiktionary/latest/enwiktionary-latest-pages-articles.xml.bz2

and I provide the context with bilingual books that I have created from public domain books found on wikisource.

The link above points to version 2, which contains 5000 word equivalences from 5 bilingual books. Version 3 contains 7000 word equivalences made from 9 bilingual books.

I lost the source code of version 3 in a computer crash, but I still have the backup of version 2, and I now have something like 15 bilingual books to feed the script.

But even 7000 word equivalences is only a very small percentage of what linguee provides online.
If I’m not mistaken they seem to provide 2 million for French-English. So even if I could reach 20000, it would be only 1% of what linguee provides for French and English.

It took something like 3 days to generate version 3.

The OCaml scripts that generated this website are provided on the “about” page.

Any comments are welcome.


Is this something to do with OCaml?

Yes, the code is mentioned in Bilingual Word Equivalence

Yes, all the source code that generates this project is written in OCaml.

Also, the final script takes several days to generate the result. I would welcome any advice on making it faster.

Regarding the final script, not sure what

Strings.replace p2 (pat word)

does (the generation of multiple patterns with or without <b> or \n seems wasteful), but it seems that for each word of your multilingual dictionary, you search for it across your entire set of books. This is O(n²) where n is the number of words in the books. How about an initial step where you go through the books linearly, and construct a table mapping words to the indices of paragraphs where they can be found? This would make the algorithm O(n) with suitable data structures.

It replaces a string by another one in p2, and if no replacement was made it returns None.
So if it returns Some, it means there is a match for the string word.

This function is probably not very efficient, but I don’t know of a library in OCaml for pure string matching and replacement that is not based on regular expressions.
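
For reference, a plain substring search and replacement can be written with the standard library alone. The sketch below is not the actual Strings.replace from the project: find_sub and replace_all are hypothetical names, and replacing every occurrence (rather than only the first) is an assumption. It only mirrors the behaviour described above, returning None when nothing matched.

```ocaml
(* Hypothetical helper: index of the first occurrence of [sub] in [s]
   at or after position [from], or None. Naive scan, no regular expressions. *)
let find_sub ?(from = 0) s sub =
  let n = String.length s and m = String.length sub in
  let rec matches_at i j =
    j = m || (s.[i + j] = sub.[j] && matches_at i (j + 1))
  in
  let rec scan i =
    if i + m > n then None
    else if matches_at i 0 then Some i
    else scan (i + 1)
  in
  scan (max 0 from)

(* Hypothetical helper: replace every occurrence of [sub] by [by].
   Returns None when no replacement was made (assumed behaviour). *)
let replace_all s ~sub ~by =
  if sub = "" then None
  else
    let buf = Buffer.create (String.length s) in
    let rec loop pos replaced =
      match find_sub ~from:pos s sub with
      | None ->
        (* Copy the remaining tail of the string. *)
        Buffer.add_substring buf s pos (String.length s - pos);
        if replaced then Some (Buffer.contents buf) else None
      | Some i ->
        (* Copy the text before the match, then the replacement. *)
        Buffer.add_substring buf s pos (i - pos);
        Buffer.add_string buf by;
        loop (i + String.length sub) true
    in
    loop 0 false
```

The scan is quadratic in the worst case for a single call, which is usually fine for short words and paragraph-sized strings. It does not check word boundaries, so "cat" would also match inside "category".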

It’s indeed O(n²) where n is both the number of words in the dictionary (more than 50_000) and the number of paragraphs in all the bilingual books.

I will try your idea of indices to see if it makes a difference, but I don’t think that it will magically make the script O(n).

This idea made me think of this task on rosettacode: https://rosettacode.org/wiki/Word_frequency#OCaml
I can probably reuse it here.

No magic, just combinatorics. If the table mapping words to paragraph indices is a hash table, adding and searching it is amortized O(1). Then constructing the mapping, and using it to generate your website, are roughly linear operations. Probably O(n log n) to account for sorting at some point. In any case much better than quadratic for big values of n.
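
A minimal sketch of that table, assuming the paragraphs are already available as an array of strings. The names words_of, build_index and paragraphs_with are made up for illustration, and the tokenizer only understands ASCII letters, so real French text would need a Unicode-aware splitter.

```ocaml
(* Split a paragraph into lowercase words. Anything that is not an ASCII
   letter is treated as a separator (an assumption made for brevity:
   accented letters would need a Unicode-aware tokenizer). *)
let words_of paragraph =
  let buf = Buffer.create 16 and acc = ref [] in
  let flush () =
    if Buffer.length buf > 0 then begin
      acc := Buffer.contents buf :: !acc;
      Buffer.clear buf
    end
  in
  String.iter
    (fun c ->
      match Char.lowercase_ascii c with
      | 'a' .. 'z' as l -> Buffer.add_char buf l
      | _ -> flush ())
    paragraph;
  flush ();
  List.rev !acc

(* One linear pass over the books: map each word to the indices of the
   paragraphs that contain it (stored most recent first). *)
let build_index paragraphs =
  let index = Hashtbl.create 1024 in
  Array.iteri
    (fun i p ->
      List.iter
        (fun w ->
          match Hashtbl.find_opt index w with
          | Some (j :: _) when j = i -> () (* already recorded for this paragraph *)
          | Some l -> Hashtbl.replace index w (i :: l)
          | None -> Hashtbl.add index w [ i ])
        (words_of p))
    paragraphs;
  index

(* Lookup for one dictionary word: paragraph indices in increasing order. *)
let paragraphs_with index word =
  match Hashtbl.find_opt index (String.lowercase_ascii word) with
  | Some l -> List.rev l
  | None -> []
```

With the index built once, each dictionary word costs one hash lookup instead of a scan over every paragraph. Multi-word dictionary entries are not handled here; one option would be to look up the first word and then check only the candidate paragraphs.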