Hi,
This message is to highlight the “bilingual-eq” website, an attempt to create an equivalent of linguee.com: http://decapode314.free.fr/bilingual-eq/
This website was built from the free contents of Wiktionary and Wikisource.
I extract word equivalences from the Wiktionary dumps:
https://dumps.wikimedia.org/enwiktionary/latest/enwiktionary-latest-pages-articles.xml.bz2
and I provide the context with bilingual books that I have created from public domain books found on wikisource.
The link above is version 2; it contains 5000 word equivalences from 5 bilingual books, while version 3 contains 7000 word equivalences made from 9 bilingual books.
I lost the source code of version 3 in a computer crash, but I still have a backup of version 2, and I now have something like 15 bilingual books to feed the script.
But even 7000 word equivalences is only a very small percentage of what Linguee provides online.
If I’m not mistaken they seem to provide 2 million for French-English, so even if I could reach 20000 it would still be only 1% of what Linguee provides for that language pair.
Generating version 3 took something like 3 days.
The OCaml scripts that generated this website are provided on the “about” page.
I haven’t looked closely at everything the script does (the generation of multiple patterns with or without <b> or \n seems wasteful), but it seems that for each word of your multilingual dictionary, you search for it across your entire set of books. This is O(n²) where n is the number of words in the books. How about an initial step where you go through the books linearly, and construct a table mapping words to the indices of paragraphs where they can be found? This would make the algorithm O(n) with suitable data structures.
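The suggested index could be sketched like this in OCaml. This is only an illustration, not code from the actual scripts: it assumes the books are already split into an array of paragraph strings and that words are space-separated, which the real tokenization surely handles more carefully.

```ocaml
(* Sketch of the suggested inverted index: one linear pass over the
   paragraphs builds a hash table from each word to the list of
   paragraph indices containing it. *)
let build_index (paragraphs : string array) : (string, int list) Hashtbl.t =
  let index = Hashtbl.create 50_000 in
  Array.iteri
    (fun i para ->
       String.split_on_char ' ' para
       |> List.iter (fun word ->
            if word <> "" then
              let prev = try Hashtbl.find index word with Not_found -> [] in
              match prev with
              | j :: _ when j = i -> ()  (* paragraph already recorded *)
              | _ -> Hashtbl.replace index word (i :: prev)))
    paragraphs;
  index

(* Looking up one dictionary word is then amortized O(1). *)
let paragraphs_containing index word =
  try Hashtbl.find index word with Not_found -> []
```

Note that the indices come out in reverse order of insertion; they can be sorted once per word if the contexts must appear in book order.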
This function is probably not very efficient, but I don’t know of an OCaml library for pure string matching and replacement that doesn’t go through regular expressions.
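For what it’s worth, a literal (non-regex) search-and-replace can be written with just the standard library; a naive sketch, not tuned for speed:

```ocaml
(* Replace every literal occurrence of [pattern] in [s] with [with_],
   using only String and Buffer from the standard library. *)
let replace_all ~pattern ~with_ s =
  let plen = String.length pattern in
  if plen = 0 then s
  else begin
    let buf = Buffer.create (String.length s) in
    let i = ref 0 in
    while !i <= String.length s - plen do
      if String.sub s !i plen = pattern then begin
        Buffer.add_string buf with_;
        i := !i + plen
      end else begin
        Buffer.add_char buf s.[!i];
        incr i
      end
    done;
    (* copy the tail that is shorter than the pattern *)
    Buffer.add_substring buf s !i (String.length s - !i);
    Buffer.contents buf
  end
```

The repeated `String.sub` allocates a fresh string at every position, so a real implementation would compare characters in place, but the shape is the same.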
It’s indeed quadratic: roughly O(m·n), where m is the number of words in the dictionary (more than 50_000) and n the number of paragraphs in all the bilingual books.
I will try your idea of an index to see if it makes a difference, but I don’t think it will magically make the script O(n).
No magic, just combinatorics. If the table mapping words to paragraph indices is a hash table, adding and searching it is amortized O(1). Then constructing the mapping, and using it to generate your website, are roughly linear operations. Probably O(n log n) to account for sorting at some point. In any case much better than quadratic for big values of n.
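Concretely, once the word-to-paragraph table exists, generating the contexts for the whole dictionary is a single pass with amortized O(1) lookups; a sketch, assuming the table has already been built (the sorting step is the O(n log n) part mentioned above):

```ocaml
(* Given a word -> paragraph-index table, produce for each dictionary
   word the sorted list of paragraphs where it occurs. One lookup per
   word instead of a full scan of all books per word. *)
let contexts_for_dictionary
    (index : (string, int list) Hashtbl.t)
    (dictionary : string list) : (string * int list) list =
  List.map
    (fun word ->
       let hits = try Hashtbl.find index word with Not_found -> [] in
       (* sort so contexts appear in book order *)
       (word, List.sort compare hits))
    dictionary
```

With m dictionary words and n paragraphs, the whole pipeline is then about O(n + m log n) rather than O(m·n).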