Hi,
This message is to highlight the “bilingual-eq” website, an attempt to create an equivalent of linguee.com: http://decapode314.free.fr/bilingual-eq/
This website was built from the free contents of Wiktionary and Wikisource.
I extract word equivalences from the Wiktionary dumps:
https://dumps.wikimedia.org/enwiktionary/latest/enwiktionary-latest-pages-articles.xml.bz2
and I provide the context with bilingual books that I have created from public domain books found on wikisource.
The link above is version 2; it contains 5000 word equivalences from 5 bilingual books, while version 3 contains 7000 word equivalences made from 9 bilingual books.
I lost the source code of version 3 in a computer crash, but I still have a backup of version 2, and I now have something like 15 bilingual books to feed the script.
But even 7000 word equivalences is only a very small percentage of what Linguee provides online.
If I’m not mistaken they seem to provide 2 million for French-English, so even if I could reach 20000 it would still be only 1% of what Linguee provides for that language pair.
Generating version 3 took something like 3 days.
The OCaml scripts that generated this website are provided on the “about” page.
I haven’t looked closely at everything the script does (the generation of multiple patterns with or without <b> or \n seems wasteful), but it seems that for each word of your multilingual dictionary, you search for it across your entire set of books. This is O(n²) where n is the number of words in the books. How about an initial step where you go through the books linearly, and construct a table mapping words to the indices of paragraphs where they can be found? This would make the algorithm O(n) with suitable data structures.
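The suggested index could be sketched like this in OCaml. This is only an illustration, not code from the actual scripts: it assumes the books are already split into an array of paragraph strings and that words are space-separated, which the real tokenization surely handles more carefully.

```ocaml
(* Sketch of the suggested inverted index: one linear pass over the
   paragraphs builds a hash table from each word to the list of
   paragraph indices containing it. *)
let build_index (paragraphs : string array) : (string, int list) Hashtbl.t =
  let index = Hashtbl.create 50_000 in
  Array.iteri
    (fun i para ->
       String.split_on_char ' ' para
       |> List.iter (fun word ->
            if word <> "" then
              let prev = try Hashtbl.find index word with Not_found -> [] in
              match prev with
              | j :: _ when j = i -> ()  (* paragraph already recorded *)
              | _ -> Hashtbl.replace index word (i :: prev)))
    paragraphs;
  index

(* Looking up one dictionary word is then amortized O(1). *)
let paragraphs_containing index word =
  try Hashtbl.find index word with Not_found -> []
```

Note that the indices come out in reverse order of insertion; they can be sorted once per word if the contexts must appear in book order.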
This function is probably not very efficient, but I don’t know of an OCaml library for pure string matching and replacement that doesn’t go through regular expressions.
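For what it’s worth, a literal (non-regex) search-and-replace can be written with just the standard library; a naive sketch, not tuned for speed:

```ocaml
(* Replace every literal occurrence of [pattern] in [s] with [with_],
   using only String and Buffer from the standard library. *)
let replace_all ~pattern ~with_ s =
  let plen = String.length pattern in
  if plen = 0 then s
  else begin
    let buf = Buffer.create (String.length s) in
    let i = ref 0 in
    while !i <= String.length s - plen do
      if String.sub s !i plen = pattern then begin
        Buffer.add_string buf with_;
        i := !i + plen
      end else begin
        Buffer.add_char buf s.[!i];
        incr i
      end
    done;
    (* copy the tail that is shorter than the pattern *)
    Buffer.add_substring buf s !i (String.length s - !i);
    Buffer.contents buf
  end
```

The repeated `String.sub` allocates a fresh string at every position, so a real implementation would compare characters in place, but the shape is the same.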
It’s indeed quadratic: roughly O(m·n), where m is the number of words in the dictionary (more than 50_000) and n the number of paragraphs in all the bilingual books.
I will try your idea of an index to see if it makes a difference, but I don’t think it will magically make the script O(n).
No magic, just combinatorics. If the table mapping words to paragraph indices is a hash table, adding and searching it is amortized O(1). Then constructing the mapping, and using it to generate your website, are roughly linear operations. Probably O(n log n) to account for sorting at some point. In any case much better than quadratic for big values of n.
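Concretely, once the word-to-paragraph table exists, generating the contexts for the whole dictionary is a single pass with amortized O(1) lookups; a sketch, assuming the table has already been built (the sorting step is the O(n log n) part mentioned above):

```ocaml
(* Given a word -> paragraph-index table, produce for each dictionary
   word the sorted list of paragraphs where it occurs. One lookup per
   word instead of a full scan of all books per word. *)
let contexts_for_dictionary
    (index : (string, int list) Hashtbl.t)
    (dictionary : string list) : (string * int list) list =
  List.map
    (fun word ->
       let hits = try Hashtbl.find index word with Not_found -> [] in
       (* sort so contexts appear in book order *)
       (word, List.sort compare hits))
    dictionary
```

With m dictionary words and n paragraphs, the whole pipeline is then about O(n + m log n) rather than O(m·n).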