What is the fastest way to compute the weighted Jaccard index in OCaml?

optimization
sparse
integer-vectors
jaccard-index
generalized-tanimoto
#1

Hello,

What is the most efficient way to compute this:

given two sparse integer vectors A and B, we want to compute:

J_w(A,B) = |AnB| / (|A| + |B| - |AnB|)

With |A| being sum(a_i); i.e. the sum of all feature counts.
A feature is an index in the sparse vector.
What I call feature count is the integer value at that index.
All feature counts are integers, and >= 0.

|AnB| = sum(min(a_i, b_i)) over all i.
By the way, this other formula is also valid:
J_w(A,B) = sum(min(a_i, b_i)) / sum(max(a_i, b_i)); over all i.

I am open to all propositions. Let’s say that we are free to choose whichever datastructure/memory representation will make the computation faster.

Currently, I have a program which spends ~66% of its time doing J_w evaluations.
So, I am very interested if this could go faster.
I already use a bisector tree to prune the search space.
But in the end, I need to compute J_w for all vectors returned by a query.

Here are some sample data, if you want to play:

Thanks,
Francois.

#2

Hmm … um, can you point at what the recognized best data-structure/algorithm is in C++?

#3

Honestly, I have no idea.
In OCaml, I tried hashtbl, int IntMap.t and even bitstrings.
But now, I am down to this:

type elt = { k: int; v: int }
type t = elt list

And I make sure that the list is sorted by keys, at construction time.

#4

I could send my current code, but I would prefer to not influence people in any direction.

#5

I wonder if this is not the occasion to draw out the BER MetaOCaml sword. :slight_smile:

#6

Um, this is very off-the-cuff, but it should be a decent improvement, to switch to type t = { ks: int array; vs : int array }. I could be wildly off-base, but at least in my experience, whenever you have heavy read-operations that iterate down a list, it can pay to try to keep it in the form of an array.

2 Likes
#7

Interesting, I will give it a try.

#8

Once you’ve switched to an array-based representation, it should be possible to use SMP/multicore parallelism to speed things up, by implementing the core operation in C/C++ and then arranging to release the GIL while you’re doing the computation. [this has to be done with care: you have to ensure that the GC knows about your array; I forget the details b/c it’s been so long since I did it …]

That might be worth another small factor on appropriate hardware.

In addition, you might consider, instead of two arrays, having a single byte-array of pairs of varints, were the first int is the -distance- to the next nonzero entry, and the second is the value of that entry. In the case (which I noticed in the Wikipedia entry) where your counts are either zero or one, you could then dispense with the second varint. Either way, this should yield memory savings, and that should improve perf, esp. if you’re using multiple cores in parallel.

Um, are you a C++ jock? B/c before I did a lot of this in Ocaml, I’d go do it in C++, where I could have more-precise control over layout, parallelism, etc.

#9

I am currently trying with an int array, with keys and values alternating.

#10

Apparently, the single int array accelerated computation from 30min
to ~3min !

#11

Esp. in GCed languages, “allocation avoidance” and “memory-traversal avoidance” are enormously effective. I did something like this once in Coq, which is why I remembered the trick.

#12

What’s your benchmark? I’m in the middle of writing some C++ to do this, so I’ll need a way of comparing …

#13

For instance, would computing the index for each pair (#1, #2), (#3, #4) of vectors from your data-set be suitable?

#14

Are you speaking about an inverted index?
I think that’s what people use in search engines, so this might be relevant.

#15

Thanks to you, my code really looks like some C now. :smiley:
Maybe tomorrow, I’ll share my previous and current code.

#16

This is my final version, in OCaml.
It is fast enough now.

type t = int array

(* tani(A,B) = |inter(A,B)| / |union(A,B)|
             = sum(min_i) / sum(max_i) *)
let tanimoto (m1: t) (m2: t): float =
  let icard = ref 0 in
  let ucard = ref 0 in
  let len1 = A.length m1 in
  let len2 = A.length m2 in
  let i = ref 0 in
  let j = ref 0 in
  while !i < len1 && !j < len2 do
    (* unsafe *)
    let k1 = A.unsafe_get m1 !i in
    let v1 = A.unsafe_get m1 (!i + 1) in
    let k2 = A.unsafe_get m2 !j in
    let v2 = A.unsafe_get m2 (!j + 1) in
    (* process keys in increasing order *)
    if k1 < k2 then
      (ucard := !ucard + v1;
       i := !i + 2)
    else if k2 < k1 then
      (ucard := !ucard + v2;
       j := !j + 2)
    else (* k1 = k2 *)
    if v1 <= v2 then
      (icard := !icard + v1;
       ucard := !ucard + v2;
       i := !i + 2;
       j := !j + 2)
    else
      (icard := !icard + v2;
       ucard := !ucard + v1;
       i := !i + 2;
       j := !j + 2)
  done;
  incr i; (* go to value *)
  while !i < len1 do (* finish m1; unsafe *)
    ucard := !ucard + (A.unsafe_get m1 !i);
    i := !i + 2
  done;
  incr j; (* go to value *)
  while !j < len2 do (* finish m2; unsafe *)
    ucard := !ucard + (A.unsafe_get m2 !j);
    j := !j + 2
  done;
  if !ucard = 0 then 0.0
  else (float !icard) /. (float !ucard)

It looks like some C code…