When is hashtable fast or slow? #2395
Replies: 11 comments 4 replies
-
One important thing that I learned is to not nest things too deeply, because the hashcode only uses the first 3 levels of nesting:

! 3 levels deep, fast
10000 <iota> [ 1array 1array 1array ] map
H{ } clone [
    [ swapd set-at ] curry each-index
] time
! Running time: 0.006891168 seconds

! 4 levels deep, slow
10000 <iota> [ 1array 1array 1array 1array ] map
H{ } clone [
    [ swapd set-at ] curry each-index
] time
! Running time: 2.424302447 seconds
! slow because every set-at triggers a hash collision

You can see it like this:
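The depth cutoff can be sketched outside Factor. Below is a Python toy model of the behavior described above (an assumption about the mechanism, not Factor's actual code; `limited_hash` is a made-up name): a sequence hash that stops recursing after three levels, so every singleton nested four levels deep hashes to the same value.

```python
# Toy sequence hash that only looks 3 levels deep (a Python sketch of the
# behavior described above, not Factor's actual implementation).
def limited_hash(obj, depth=3):
    if not isinstance(obj, (list, tuple)):
        return hash(obj)
    if depth == 0:
        return 0  # past the cutoff, every sequence hashes the same
    h = 0
    for x in obj:
        h = h * 31 + limited_hash(x, depth - 1)
    return h

three_deep = {limited_hash([[[i]]]) for i in range(10000)}
four_deep = {limited_hash([[[[i]]]]) for i in range(10000)}
print(len(three_deep), len(four_deep))  # 10000 distinct vs. 1 distinct
```

With 10000 identical hashcodes, every `set-at` in the 4-levels-deep benchmark lands on the same bucket, which is why it is hundreds of times slower.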
-
Well, what's described in the issue is that the hashing algorithm for sequences has a tendency to output the same hashes for sequences that contain similar numbers, so you get collisions often. For example, for pairs of numbers: if you hash all pairs between 0 and 100 (10,000 pairs), you get about 3,000 distinct hashes, so 30%. If you hash all pairs between 0 and 1000 (1M pairs), you get about 30k distinct hashes, so about 3%. The tuple hashing algorithm is different, and produces more varied hashes.

Choosing a hashing algorithm is a tradeoff between the speed of the hashing and the distribution of the hashes, I guess: the sequence hashcode is a little faster, although not much. The ideal solution here would be to know what tradeoffs were considered when choosing the sequence hashcode algorithm, and decide if we want to make a new decision. Also, I have never implemented a production-grade hashing algorithm, so there are probably more things to consider!
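The 30% / 3% figures can be reproduced with a small Python sketch, assuming a multiply-and-add sequence hash of the form h = h*31 + element (an assumption about the shape of the algorithm under discussion, not a copy of Factor's code):

```python
from itertools import product

# Assumed multiply-and-add sequence hash (h = h*31 + element); a sketch of
# the family of algorithms discussed, not Factor's exact implementation.
def seq_hash(seq):
    h = 0
    for x in seq:
        h = h * 31 + x
    return h

def distinct_ratio(n):
    # Fraction of distinct hashes over all n*n pairs (a, b) with 0 <= a, b < n.
    hashes = {seq_hash(p) for p in product(range(n), repeat=2)}
    return len(hashes) / (n * n)

print(f"{distinct_ratio(100):.1%}")   # about 30% distinct
print(f"{distinct_ratio(1000):.1%}")  # about 3% distinct
```

The ratio shrinks as the range grows because h = 31*a + b only spreads the pairs over roughly 31*n values, while there are n*n pairs.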
-
Quadratically scanning the array, actually!
I changed the probing from linear to quadratic to greatly speed up the ant benchmark, I think.
… On Dec 16, 2020, at 6:01 AM, Jon Harper ***@***.***> wrote:
I see. So, in the case of such deeply nested objects, the hash codes will be the same, causing collisions and thus slowing down the operation, right?
Yes. Then it behaves more or less like a linked list. (The hashtable is backed by an array; when an item is already present at the slot computed from the hashcode of a new item, it places the new item in the first free slot after it, by scanning the array. The same thing happens when getting an item: if the key at the computed slot is not the key you have, you scan the array for your key.)
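That open-addressing scheme can be sketched in a few lines of Python (a toy model of the description above with simple linear probing, not Factor's real hashtable; `ProbingTable` and its method names are made up):

```python
# Toy open-addressing hashtable with linear probing, mirroring the
# description above (a sketch, not Factor's real implementation).
# Note: no resizing, so capacity must exceed the number of items stored.
class ProbingTable:
    def __init__(self, capacity):
        self.slots = [None] * capacity  # each slot holds None or (key, value)

    def _probe(self, key):
        # Start at the slot named by the hash, then scan until we find the
        # key or a free slot. When many keys share a hashcode, this walk is
        # what makes the table degrade toward a linked-list-like scan.
        i = hash(key) % len(self.slots)
        while self.slots[i] is not None and self.slots[i][0] != key:
            i = (i + 1) % len(self.slots)
        return i

    def set_at(self, key, value):
        self.slots[self._probe(key)] = (key, value)

    def at(self, key):
        entry = self.slots[self._probe(key)]
        return entry[1] if entry is not None else None
```

If every key hashes to the same slot, each `set_at` and `at` must walk past all earlier entries, which is exactly the slowdown seen in the nested-array benchmark above.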
-
Hash code collisions seem to be something to consider in the problem I'm working on. I have created 100 symbol words and used their combinations as keys in a hash table.

men get .
=> V{
    the-man-No.001
    the-man-No.002
    the-man-No.003
    ...
    the-man-No.098
    the-man-No.099
    the-man-No.100
}

V{ } clone
men get dup length -rot
[ hashcode ] map
[ over adjoin ] each length swap / 100.0 * "unique hashcode: %6.2f%% \n" printf
=> unique hashcode: 100.00%

However, if I create an array of pairs of all the combinations and compute a hash code for each, it appears that about half of them are not unique.

V{ } clone
men get dup cartesian-product concat >array dup length -rot
[ hashcode ] map
[ over adjoin ] each length swap / 100.0 * "unique hashcode: %6.2f%% \n" printf
=> unique hashcode: 49.21%

In this test, each array has two elements, but when used in practice, the number of elements is arbitrary. I hope there is a better solution.
-
Using a tuple to hold your pairs brings it back to 100%... maybe we can improve

<<
100 [1,b] [
    "the-man-No.%03d" sprintf create-word-in define-symbol
] each
>>

CONSTANT: men $[ 100 [1,b] [ "the-man-No.%03d" sprintf search ] map ]

TUPLE: foo x y ;

V{ } clone
men dup cartesian-product concat >array dup length -rot
[ first2 foo boa hashcode ] map
[ over adjoin ] each length swap / 100.0 * "unique hashcode: %6.2f%% \n" printf
=> unique hashcode: 100.00%
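The same contrast can be sketched in Python, with an assumed weak multiply-and-add sequence hash standing in for the sequence hashcode and Python's built-in tuple hash standing in for a better-mixing tuple hashcode (neither is Factor's exact algorithm): moving the pair into a type whose hash mixes the fields better recovers uniqueness.

```python
from itertools import product

# Assumed weak sequence hash (h = h*31 + element), contrasted with a
# stronger mixer; both are sketches, not Factor's actual algorithms.
def weak_hash(seq):
    h = 0
    for x in seq:
        h = h * 31 + x
    return h

pairs = list(product(range(100), repeat=2))
weak = len({weak_hash(p) for p in pairs})
strong = len({hash(p) for p in pairs})
print(weak, strong)  # the weak mix yields far fewer distinct hashes
```

The weak version folds nearby pairs onto the same values, while a mixer that scrambles each field keeps nearly all 10,000 hashes distinct.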
-
I'm working today on making it better, let me see... having a bit of trouble bootstrapping with a new hashcode for some reason, need to see why
-
But isn't it still faster even though there are more collisions, according to the example above?
-
Note to self: you can change all the other
-
As long as you make sure that the string hashcode uses the current algorithm, you can change sequence-hashcode.
The best way is probably to copy the current algorithm into strings.factor.
Then change to your heart's content!
… On Dec 18, 2020, at 5:16 AM, kusumotonorio ***@***.***> wrote:
I would like to change sequence-hashcode-step, but then Factor gets stuck.
-
Specifically, put this in `core/strings/strings.factor`:

Then you should be able to bootstrap just fine.
-
Actually, I'll go ahead and do that in 5e6e838, and then we can explore
-
In a problem I'm working on, I think the bottleneck is storing to and retrieving from a hash table.
I'm wondering if there's a way to make it faster, and I'm curious about this issue I came across earlier.
What is the reason for such a difference? What should I keep in mind when using hashtables?