Given two sparse integer vectors A and B, we want to compute:
J_w(A,B) = |A∩B| / (|A| + |B| - |A∩B|)
With |A| = sum(a_i), i.e. the sum of all feature counts.
A feature is an index in the sparse vector; what I call its feature count is the integer value at that index.
All feature counts are integers >= 0.
|A∩B| = sum(min(a_i, b_i)) over all i.
By the way, this other formula is also valid, since a_i + b_i - min(a_i, b_i) = max(a_i, b_i) term by term:
J_w(A,B) = sum(min(a_i, b_i)) / sum(max(a_i, b_i)), over all i.
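For concreteness, here is a minimal C sketch of that second formula on dense count arrays (the function name is mine, not from any library):

```c
#include <assert.h>

/* Weighted Jaccard on two dense count vectors of length n:
   J_w = sum(min(a_i, b_i)) / sum(max(a_i, b_i)). */
double jaccard_w_dense(const int *a, const int *b, int n) {
    long num = 0, den = 0;
    for (int i = 0; i < n; i++) {
        num += a[i] < b[i] ? a[i] : b[i];
        den += a[i] > b[i] ? a[i] : b[i];
    }
    return den == 0 ? 0.0 : (double)num / (double)den;
}
```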
I am open to all suggestions. Let’s say that we are free to choose whichever data structure / memory representation will make the computation faster.
Currently, I have a program which spends ~66% of its time doing J_w evaluations.
So I would be very interested in anything that makes this go faster.
I already use a bisector tree to prune the search space.
But in the end, I need to compute J_w for all vectors returned by a query.
Um, this is very off-the-cuff, but switching to type t = { ks : int array; vs : int array } should be a decent improvement. I could be wildly off-base, but at least in my experience, whenever you have heavy read operations that iterate down a list, it can pay to keep the data in an array instead.
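In C terms, that two-parallel-arrays record becomes a struct, and J_w falls out of a single merge pass over the two sorted index arrays. A sketch under my own naming (`sparse_t` etc. are made up, and I'm assuming `ks` is kept sorted ascending):

```c
#include <assert.h>

/* Sparse vector: ks[] holds the nonzero feature indices in ascending
   order, vs[] the corresponding positive counts. */
typedef struct { const int *ks; const int *vs; int len; } sparse_t;

/* One merge pass over the two sorted index arrays.
   A feature present in only one vector contributes 0 to the min-sum
   and its full count to the max-sum. */
double jaccard_w_sparse(sparse_t a, sparse_t b) {
    long num = 0, den = 0;
    int i = 0, j = 0;
    while (i < a.len && j < b.len) {
        if (a.ks[i] < b.ks[j])       den += a.vs[i++];
        else if (a.ks[i] > b.ks[j])  den += b.vs[j++];
        else {
            int x = a.vs[i++], y = b.vs[j++];
            num += x < y ? x : y;
            den += x > y ? x : y;
        }
    }
    while (i < a.len) den += a.vs[i++];
    while (j < b.len) den += b.vs[j++];
    return den == 0 ? 0.0 : (double)num / (double)den;
}
```

The point of the merge is that the cost is O(|nonzeros of A| + |nonzeros of B|) with purely sequential memory traffic, which is exactly where arrays beat linked lists.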
Once you’ve switched to an array-based representation, you should be able to use SMP/multicore parallelism to speed things up, by implementing the core operation in C/C++ and then arranging to release the OCaml runtime lock (its analogue of a GIL) while you’re doing the computation. [This has to be done with care: you have to ensure that the GC knows about your array; I forget the details b/c it’s been so long since I did it…]
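(In an OCaml stub the bracketing calls would be `caml_release_runtime_system()` / `caml_acquire_runtime_system()` from `<caml/threads.h>`.) Independent of the FFI details, the parallel core can be sketched with plain pthreads; all the names and the fixed sizes here are my own invention:

```c
#include <pthread.h>
#include <assert.h>

#define NFEAT 4
#define NTHREADS 2

/* One (query, candidate) pair of dense count vectors plus its result slot. */
typedef struct { const int *a, *b; double out; } job_t;

typedef struct { job_t *jobs; int lo, hi; } slice_t;

static double jaccard_w(const int *a, const int *b, int n) {
    long num = 0, den = 0;
    for (int i = 0; i < n; i++) {
        num += a[i] < b[i] ? a[i] : b[i];
        den += a[i] > b[i] ? a[i] : b[i];
    }
    return den == 0 ? 0.0 : (double)num / (double)den;
}

/* Each worker owns a disjoint slice of the job array, so no locking
   is needed around the writes to out. */
static void *worker(void *arg) {
    slice_t *s = arg;
    for (int i = s->lo; i < s->hi; i++)
        s->jobs[i].out = jaccard_w(s->jobs[i].a, s->jobs[i].b, NFEAT);
    return NULL;
}

/* Split njobs evenly across NTHREADS workers and wait for them all. */
void run_parallel(job_t *jobs, int njobs) {
    pthread_t tid[NTHREADS];
    slice_t slices[NTHREADS];
    int chunk = (njobs + NTHREADS - 1) / NTHREADS;
    for (int t = 0; t < NTHREADS; t++) {
        slices[t].jobs = jobs;
        slices[t].lo = t * chunk < njobs ? t * chunk : njobs;
        slices[t].hi = (t + 1) * chunk < njobs ? (t + 1) * chunk : njobs;
        pthread_create(&tid[t], NULL, worker, &slices[t]);
    }
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);
}
```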
That might be worth another small factor on appropriate hardware.
In addition, you might consider, instead of two arrays, a single byte-array of pairs of varints, where the first int is the -distance- to the next nonzero entry, and the second is the value of that entry. In the case (which I noticed in the Wikipedia entry) where your counts are either zero or one, you could then dispense with the second varint. Either way, this should yield memory savings, and that should improve perf, esp. if you’re using multiple cores in parallel.
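To make the varint idea concrete, here is a sketch of LEB128-style (gap, count) encoding; the layout and names are my own invention, not a standard format:

```c
#include <assert.h>
#include <stddef.h>

/* Append v as a varint: 7 payload bits per byte, high bit set on all
   but the last byte. Returns the number of bytes written. */
static size_t varint_put(unsigned char *buf, unsigned int v) {
    size_t n = 0;
    while (v >= 0x80) { buf[n++] = (v & 0x7F) | 0x80; v >>= 7; }
    buf[n++] = (unsigned char)v;
    return n;
}

/* Decode one varint into *v. Returns the number of bytes consumed. */
static size_t varint_get(const unsigned char *buf, unsigned int *v) {
    size_t n = 0;
    int shift = 0;
    *v = 0;
    do { *v |= (unsigned int)(buf[n] & 0x7F) << shift; shift += 7; }
    while (buf[n++] & 0x80);
    return n;
}

/* Encode a sparse vector (sorted indices ks[], counts vs[]) as
   (gap-to-next-nonzero-index, count) varint pairs.
   Returns the encoded length in bytes. */
size_t encode_sparse(const int *ks, const int *vs, int len, unsigned char *out) {
    size_t n = 0;
    int prev = -1;
    for (int i = 0; i < len; i++) {
        n += varint_put(out + n, (unsigned int)(ks[i] - prev));
        n += varint_put(out + n, (unsigned int)vs[i]);
        prev = ks[i];
    }
    return n;
}
```

Since feature gaps and counts are usually small, most pairs fit in two bytes, versus sixteen for two boxed ints per entry.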
Um, are you a C++ jock? B/c before I did a lot of this in OCaml, I’d go do it in C++, where I could have more-precise control over layout, parallelism, etc.
Esp. in GCed languages, “allocation avoidance” and “memory-traversal avoidance” are enormously effective. I did something like this once in Coq, which is why I remembered the trick.