I’m trying to understand the performance differences between different ways of constructing a hash table and adding elements to it. Using the following code I measured the time-taken to insert 100000 integers into a hash table.
module IntHash =
struct
type t = int
let equal i j = i=j
let hash (i: int) = i land max_int
end
module IntHashtbl = Hashtbl.Make(IntHash)
let insert_into_inthashmap lst =
let tbl = IntHashtbl.create (List.length lst) in
List.iter (fun x -> IntHashtbl.add tbl x x) lst
let insert_into_hashmap lst =
let tbl = Hashtbl.create (List.length lst) in
List.iter (fun x -> Hashtbl.add tbl x x) lst
let insert_into_base_hashmap lst =
let tbl = Base.Hashtbl.create (module Base.Int) ~size:(List.length lst) in
List.iter (fun x -> Base.Hashtbl.set tbl ~key:x ~data:x) lst
let l = let open Base in List.range 0 100000
let () =
let open Core_bench in
Core.Command.run (Bench.make_command [
Bench.Test.create ~name:"insert_into_inthashmap" (fun () ->
ignore (insert_into_inthashmap l));
Bench.Test.create ~name:"insert_into_hashmap" (fun () ->
ignore (insert_into_hashmap l));
Bench.Test.create ~name:"insert_into_base_hashmap" (fun () ->
ignore (insert_into_base_hashmap l));
])
Below are the results I got,
Name Time/Run
insert_into_inthashmap 3.83ms
insert_into_hashmap 5.52ms
insert_into_base_hashmap 6.32ms
What causes the differences in execution speed between the different versions? Is it related to the hash functions that are used?
When you use the Hashtbl.Make functor, the code will use the equal/hash
functions you provide.
The polymorphic hashtable uses = and Hashtbl.hash, which
must go to C, inspect the value, etc.
The functorized one uses = specialized on integers, which compiles to a
few assembly instructions (because ocamlopt knows it’s only for
integers), and the hash function (which is very bad, note) is also a few
instructions.
I have no idea what’s up with the base hashtable, it’s a different
implementation.
Thanks for the response - that makes sense! The reason I had originally looked at the different options was because I noticed that I was struggling to match the speed achieved by using Python. For example, the following Python snippet runs faster than any of the OCaml versions (e.g. the version using the polymorphic hash table takes ~1.5x as long to insert 1 million elements),
def insert_into_dict(lst):
d = {}
for x in lst:
d[x] = x
I guess this seemed surprising and got me wondering if I was missing something.
You’re measuring fairly minor operations here, making it very easy to measure noise. For example, List.length itself is a slow operation on 100,000 items since the entire list needs to be iterated. Just use Hashtbl.create 100000. Additionally, List.range needs to allocate that entire list, which is slow as well. Python’s lists are actually vectors, so they’re more efficient. Finally, very little actual python code is being run here – it’s almost all C function calls underneath.