Best way to calculate hash values for all substrings in a string?

HoraDeComer · May 4, 2020, 10:58pm

I got this code:

let a = "ABCDEFGHIJ";;

What is the best and efficient way to calculate hash values for all substrings in a string?
Some substrings of string “a”.
A → 65
AB → 23
BC → 324
ABC → 3234
…

bobbypriambodo · May 5, 2020, 12:18am

Since the hash values are shown in the examples, is the hash function already provided as well?

Let me see if I can answer this. Looking at the problem, since you need to calculate “all substrings”, you will end up with at least an O(n^2) time complexity for getting the start and end indexes of each substring (with a simple nested loop), because there will be n^2 substrings for a string of length n.

After we get the indexes, the next step is to calculate the hash function. A naive approach would be to calculate the hash for the whole substring in each iteration, which may give us an O(n) time complexity if we need to go through all the characters of the substring from start to end indexes again. This gives the overall time complexity of O(n^3), and O(n^2) space complexity to store the final result.

We may be able to further optimize this if the hash function is such that for a string “ABCD”, the value of hash("ABCD") can be achieved by doing constant math calculation of hash("ABC") and hash("D") (i.e., the hash value of a string is calculated from the hash value of the substring up to the second last character and and the hash value of the last character, or let hash x = f(hash(x[start:end-1]), hash(x[end])) for a constant function f).

With this, we can calculate the hash values constantly during the nested loop, store the result in some kind of map with start and end index as key, and use that result for the next calculation, which gives us the final time complexity of O(n^2). It still has the O(n^2) space complexity for the final result and the temporary map.

The hash function may be derivable from the examples, e.g. since A -> 65 then it might be related to a calculation with the ASCII representation of the characters.

HoraDeComer · May 5, 2020, 1:11am

I need some function more efficient than Hashtbl.hash “ABCDE”;;. Some function using lxor

Splingush · May 5, 2020, 6:44am

https://en.wikipedia.org/wiki/Rolling_hash might be relevant, though I’m not hugely familiar with it.

HoraDeComer · May 5, 2020, 11:44am

The string only have chars ‘A’, ‘C’, ‘T’ and’G’. Example AATG
I try this code:

let a = 1;;
let c = 2;;
let g = 3;;
let t = 4;;
let base = 100003;;
let f1((str:string)) : int =
  let hs = ref 0 in
  let lstr = String.length str in
  for i = 0 to lstr-1 do
    if String.get str i = 'A' then
      hs := (!hs + a) mod lstr
    else if String.get str i = 'C' then
      hs := (!hs + c) mod lstr
    else if String.get str i = 'G' then
      hs := (!hs + g) mod lstr
    else
      hs := (!hs + t) mod lstr
  done;
  !hs;
;;

but for the strings with length high (>100000), this function f1 it’s slower than Hashtbl.hash “SameString”;;. Any suggestions?

314eter · May 6, 2020, 2:49am

Since Hashtbl.hash is highly optimized C, it’s almost impossible to write a more efficient hash function in OCaml without using some kind of rolling hash or @bobbypriambodo’s suggestion.

The bottleneck in your hash function is probably the mod, which is a slow operation, and maybe the fact that you’re repeatedly doing String.get str i for the same i.

Apart from that, your hash function is very weak. All permutations of a string (e.g. ACGA, GACA, AACG, CGAA, CAAG, …) will hash to the same value.

Topic		Replies	Views
[ANN] Digestif 0.7.1 Community announce , hash	0	1240	November 22, 2018
How to find the key of the largest element in a hash table Learning	5	1080	May 17, 2022
Why there is no easy pattern matching for strings? Learning language-design	6	8469	May 28, 2017
OCaml runtime: hash cons‘ing for strings? Learning	3	794	May 29, 2020
Understanding performance differences using Hashtbl Learning	4	1225	November 27, 2021

Best way to calculate hash values for all substrings in a string?

Related topics