Given two sparse integer vectors A and B, we want to compute:
J_w(A,B) = |A∩B| / (|A| + |B| - |A∩B|)
With |A| = sum(a_i), i.e. the sum of all feature counts.
A feature is an index in the sparse vector; what I call its feature count is the integer value at that index.
All feature counts are integers >= 0.
|A∩B| = sum(min(a_i, b_i)) over all i.
By the way, this other formula is also valid, since a_i + b_i - min(a_i, b_i) = max(a_i, b_i) term by term:
J_w(A,B) = sum(min(a_i, b_i)) / sum(max(a_i, b_i)), over all i.
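For concreteness, here is a minimal C sketch of that second formula on dense count arrays (the function name is mine, not from any library):

```c
#include <assert.h>

/* Weighted Jaccard on two dense count vectors of length n:
   J_w = sum(min(a_i, b_i)) / sum(max(a_i, b_i)). */
double jaccard_w_dense(const int *a, const int *b, int n) {
    long num = 0, den = 0;
    for (int i = 0; i < n; i++) {
        num += a[i] < b[i] ? a[i] : b[i];
        den += a[i] > b[i] ? a[i] : b[i];
    }
    return den == 0 ? 0.0 : (double)num / (double)den;
}
```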
I am open to all suggestions. Let’s say that we are free to choose whichever data structure / memory representation will make the computation faster.
Currently, I have a program which spends ~66% of its time doing J_w evaluations.
So I would be very interested in anything that makes this go faster.
I already use a bisector tree to prune the search space.
But in the end, I need to compute J_w for all vectors returned by a query.
Um, this is very off-the-cuff, but switching to type t = { ks : int array; vs : int array } should be a decent improvement. I could be wildly off-base, but at least in my experience, whenever you have heavy read operations that iterate down a list, it can pay to keep the data in an array instead.
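In C terms, that two-parallel-arrays record becomes a struct, and J_w falls out of a single merge pass over the two sorted index arrays. A sketch under my own naming (`sparse_t` etc. are made up, and I'm assuming `ks` is kept sorted ascending):

```c
#include <assert.h>

/* Sparse vector: ks[] holds the nonzero feature indices in ascending
   order, vs[] the corresponding positive counts. */
typedef struct { const int *ks; const int *vs; int len; } sparse_t;

/* One merge pass over the two sorted index arrays.
   A feature present in only one vector contributes 0 to the min-sum
   and its full count to the max-sum. */
double jaccard_w_sparse(sparse_t a, sparse_t b) {
    long num = 0, den = 0;
    int i = 0, j = 0;
    while (i < a.len && j < b.len) {
        if (a.ks[i] < b.ks[j])       den += a.vs[i++];
        else if (a.ks[i] > b.ks[j])  den += b.vs[j++];
        else {
            int x = a.vs[i++], y = b.vs[j++];
            num += x < y ? x : y;
            den += x > y ? x : y;
        }
    }
    while (i < a.len) den += a.vs[i++];
    while (j < b.len) den += b.vs[j++];
    return den == 0 ? 0.0 : (double)num / (double)den;
}
```

The point of the merge is that the cost is O(|nonzeros of A| + |nonzeros of B|) with purely sequential memory traffic, which is exactly where arrays beat linked lists.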
Once you’ve switched to an array-based representation, you should be able to use SMP/multicore parallelism to speed things up, by implementing the core operation in C/C++ and then arranging to release the OCaml runtime lock (its analogue of a GIL) while you’re doing the computation. [This has to be done with care: you have to ensure that the GC knows about your array; I forget the details b/c it’s been so long since I did it…]
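(In an OCaml stub the bracketing calls would be `caml_release_runtime_system()` / `caml_acquire_runtime_system()` from `<caml/threads.h>`.) Independent of the FFI details, the parallel core can be sketched with plain pthreads; all the names and the fixed sizes here are my own invention:

```c
#include <pthread.h>
#include <assert.h>

#define NFEAT 4
#define NTHREADS 2

/* One (query, candidate) pair of dense count vectors plus its result slot. */
typedef struct { const int *a, *b; double out; } job_t;

typedef struct { job_t *jobs; int lo, hi; } slice_t;

static double jaccard_w(const int *a, const int *b, int n) {
    long num = 0, den = 0;
    for (int i = 0; i < n; i++) {
        num += a[i] < b[i] ? a[i] : b[i];
        den += a[i] > b[i] ? a[i] : b[i];
    }
    return den == 0 ? 0.0 : (double)num / (double)den;
}

/* Each worker owns a disjoint slice of the job array, so no locking
   is needed around the writes to out. */
static void *worker(void *arg) {
    slice_t *s = arg;
    for (int i = s->lo; i < s->hi; i++)
        s->jobs[i].out = jaccard_w(s->jobs[i].a, s->jobs[i].b, NFEAT);
    return NULL;
}

/* Split njobs evenly across NTHREADS workers and wait for them all. */
void run_parallel(job_t *jobs, int njobs) {
    pthread_t tid[NTHREADS];
    slice_t slices[NTHREADS];
    int chunk = (njobs + NTHREADS - 1) / NTHREADS;
    for (int t = 0; t < NTHREADS; t++) {
        slices[t].jobs = jobs;
        slices[t].lo = t * chunk < njobs ? t * chunk : njobs;
        slices[t].hi = (t + 1) * chunk < njobs ? (t + 1) * chunk : njobs;
        pthread_create(&tid[t], NULL, worker, &slices[t]);
    }
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);
}
```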
That might be worth another small factor on appropriate hardware.
In addition, you might consider, instead of two arrays, a single byte-array of pairs of varints, where the first int is the -distance- to the next nonzero entry, and the second is the value of that entry. In the case (which I noticed in the Wikipedia entry) where your counts are either zero or one, you could then dispense with the second varint. Either way, this should yield memory savings, and that should improve perf, esp. if you’re using multiple cores in parallel.
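To make the varint idea concrete, here is a sketch of LEB128-style (gap, count) encoding; the layout and names are my own invention, not a standard format:

```c
#include <assert.h>
#include <stddef.h>

/* Append v as a varint: 7 payload bits per byte, high bit set on all
   but the last byte. Returns the number of bytes written. */
static size_t varint_put(unsigned char *buf, unsigned int v) {
    size_t n = 0;
    while (v >= 0x80) { buf[n++] = (v & 0x7F) | 0x80; v >>= 7; }
    buf[n++] = (unsigned char)v;
    return n;
}

/* Decode one varint into *v. Returns the number of bytes consumed. */
static size_t varint_get(const unsigned char *buf, unsigned int *v) {
    size_t n = 0;
    int shift = 0;
    *v = 0;
    do { *v |= (unsigned int)(buf[n] & 0x7F) << shift; shift += 7; }
    while (buf[n++] & 0x80);
    return n;
}

/* Encode a sparse vector (sorted indices ks[], counts vs[]) as
   (gap-to-next-nonzero-index, count) varint pairs.
   Returns the encoded length in bytes. */
size_t encode_sparse(const int *ks, const int *vs, int len, unsigned char *out) {
    size_t n = 0;
    int prev = -1;
    for (int i = 0; i < len; i++) {
        n += varint_put(out + n, (unsigned int)(ks[i] - prev));
        n += varint_put(out + n, (unsigned int)vs[i]);
        prev = ks[i];
    }
    return n;
}
```

Since feature gaps and counts are usually small, most pairs fit in two bytes, versus sixteen for two boxed ints per entry.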
Um, are you a C++ jock? B/c before I did a lot of this in OCaml, I’d go do it in C++, where I could have more-precise control over layout, parallelism, etc.
Esp. in GCed languages, “allocation avoidance” and “memory-traversal avoidance” are enormously effective. I did something like this once in Coq, which is why I remembered the trick.