When is hashtable fast or slow? #2395
Replies: 11 comments 4 replies
-
One important thing that I learned is to not nest things too deeply, because the hashcode only uses the first 3 levels of nesting:

! 3 levels deep, fast
10000 <iota> [ 1array 1array 1array ] map
H{ } clone [
    [ swapd set-at ] curry each-index
] time
! Running time: 0.006891168 seconds

! 4 levels deep, slow
10000 <iota> [ 1array 1array 1array 1array ] map
H{ } clone [
    [ swapd set-at ] curry each-index
] time
! Running time: 2.424302447 seconds
! slow because every set-at triggers a hash collision

You can see it like this:
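The depth cutoff can be sketched outside Factor. Below is a Python toy model of the behavior described above (an assumption about the mechanism, not Factor's actual code; `limited_hash` is a made-up name): a sequence hash that stops recursing after three levels, so every singleton nested four levels deep hashes to the same value.

```python
# Toy sequence hash that only looks 3 levels deep (a Python sketch of the
# behavior described above, not Factor's actual implementation).
def limited_hash(obj, depth=3):
    if not isinstance(obj, (list, tuple)):
        return hash(obj)
    if depth == 0:
        return 0  # past the cutoff, every sequence hashes the same
    h = 0
    for x in obj:
        h = h * 31 + limited_hash(x, depth - 1)
    return h

three_deep = {limited_hash([[[i]]]) for i in range(10000)}
four_deep = {limited_hash([[[[i]]]]) for i in range(10000)}
print(len(three_deep), len(four_deep))  # 10000 distinct vs. 1 distinct
```

With 10000 identical hashcodes, every `set-at` in the 4-levels-deep benchmark lands on the same bucket, which is why it is hundreds of times slower.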
-
Well, what's described in the issue is that the hashing algorithm for sequences has a tendency to output the same hashes for sequences that contain similar numbers, so you get collisions often. For example, for pairs of numbers: if you hash all pairs between 0 and 100 (10,000 pairs), you get about 3,000 distinct hashes, so 30%. If you hash all pairs between 0 and 1000 (1M pairs), you get about 30k distinct hashes, so about 3%. The tuple hashing algorithm is different, and produces more varied hashes.

Choosing a hashing algorithm is a tradeoff between the speed of the hashing and the distribution of the hashes, I guess: the sequence hashcode is a little faster, although not much. The ideal solution here would be to know what tradeoffs were considered when choosing the sequence hashcode algorithm, and decide if we want to make a new decision. Also, I have never implemented a production-grade hashing algorithm, so there are probably more things to consider!
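The 30% / 3% figures can be reproduced with a small Python sketch, assuming a multiply-and-add sequence hash of the form h = h*31 + element (an assumption about the shape of the algorithm under discussion, not a copy of Factor's code):

```python
from itertools import product

# Assumed multiply-and-add sequence hash (h = h*31 + element); a sketch of
# the family of algorithms discussed, not Factor's exact implementation.
def seq_hash(seq):
    h = 0
    for x in seq:
        h = h * 31 + x
    return h

def distinct_ratio(n):
    # Fraction of distinct hashes over all n*n pairs (a, b) with 0 <= a, b < n.
    hashes = {seq_hash(p) for p in product(range(n), repeat=2)}
    return len(hashes) / (n * n)

print(f"{distinct_ratio(100):.1%}")   # about 30% distinct
print(f"{distinct_ratio(1000):.1%}")  # about 3% distinct
```

The ratio shrinks as the range grows because h = 31*a + b only spreads the pairs over roughly 31*n values, while there are n*n pairs.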
-
Quadratically scanning the array, actually!
I changed the probing from linear to quadratic to greatly speed up the ant benchmark, I think.
… On Dec 16, 2020, at 6:01 AM, Jon Harper ***@***.***> wrote:
I see. So, in the case of such deeply nested objects, the hash codes will be the same, causing collisions and thus slowing down the operation, right?
Yes. Then it behaves more or less like a linked list. (The hashtable is backed by an array; when an item is already present at the slot computed from the hashcode of a new item, it places the new item in the first free slot after it, by scanning the array. The same thing happens when getting an item: if the key at the computed slot is not the key you have, you scan the array for your key.)
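That open-addressing scheme can be sketched in a few lines of Python (a toy model of the description above with simple linear probing, not Factor's real hashtable; `ProbingTable` and its method names are made up):

```python
# Toy open-addressing hashtable with linear probing, mirroring the
# description above (a sketch, not Factor's real implementation).
# Note: no resizing, so capacity must exceed the number of items stored.
class ProbingTable:
    def __init__(self, capacity):
        self.slots = [None] * capacity  # each slot holds None or (key, value)

    def _probe(self, key):
        # Start at the slot named by the hash, then scan until we find the
        # key or a free slot. When many keys share a hashcode, this walk is
        # what makes the table degrade toward a linked-list-like scan.
        i = hash(key) % len(self.slots)
        while self.slots[i] is not None and self.slots[i][0] != key:
            i = (i + 1) % len(self.slots)
        return i

    def set_at(self, key, value):
        self.slots[self._probe(key)] = (key, value)

    def at(self, key):
        entry = self.slots[self._probe(key)]
        return entry[1] if entry is not None else None
```

If every key hashes to the same slot, each `set_at` and `at` must walk past all earlier entries, which is exactly the slowdown seen in the nested-array benchmark above.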
-
Hash code collisions seem to be something to consider in the problem I'm working on. I have created 100 symbol words and used their combinations as keys in a hash table.

men get .
=> V{
    the-man-No.001
    the-man-No.002
    the-man-No.003
    ...
    the-man-No.098
    the-man-No.099
    the-man-No.100
}

V{ } clone
men get dup length -rot
[ hashcode ] map
[ over adjoin ] each length swap / 100.0 * "unique hashcode: %6.2f%% \n" printf
=> unique hashcode: 100.00%

However, if I create an array of pairs of all the combinations and compute a hash code for each, it appears that about half of them are not unique.

V{ } clone
men get dup cartesian-product concat >array dup length -rot
[ hashcode ] map
[ over adjoin ] each length swap / 100.0 * "unique hashcode: %6.2f%% \n" printf
=> unique hashcode: 49.21%

In this test, each array has two elements, but when used in practice, the number of elements is arbitrary. I hope there is a better solution.
-
Using a tuple to hold your pairs brings it back to 100%... maybe we can improve

<<
100 [1,b] [
    "the-man-No.%03d" sprintf create-word-in define-symbol
] each
>>

CONSTANT: men $[ 100 [1,b] [ "the-man-No.%03d" sprintf search ] map ]

TUPLE: foo x y ;

V{ } clone
men dup cartesian-product concat >array dup length -rot
[ first2 foo boa hashcode ] map
[ over adjoin ] each length swap / 100.0 * "unique hashcode: %6.2f%% \n" printf
=> unique hashcode: 100.00%
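The same contrast can be sketched in Python, with an assumed weak multiply-and-add sequence hash standing in for the sequence hashcode and Python's built-in tuple hash standing in for a better-mixing tuple hashcode (neither is Factor's exact algorithm): moving the pair into a type whose hash mixes the fields better recovers uniqueness.

```python
from itertools import product

# Assumed weak sequence hash (h = h*31 + element), contrasted with a
# stronger mixer; both are sketches, not Factor's actual algorithms.
def weak_hash(seq):
    h = 0
    for x in seq:
        h = h * 31 + x
    return h

pairs = list(product(range(100), repeat=2))
weak = len({weak_hash(p) for p in pairs})
strong = len({hash(p) for p in pairs})
print(weak, strong)  # the weak mix yields far fewer distinct hashes
```

The weak version folds nearby pairs onto the same values, while a mixer that scrambles each field keeps nearly all 10,000 hashes distinct.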
-
I'm working today on making it better, let me see... having a bit of trouble bootstrapping with a new hashcode for some reason, need to see why
-
But isn't it still faster even though there are more collisions, according to the example above?
-
Note to self: you can change all the other
-
As long as you make sure that the string hashcode uses the current algorithm, you can change sequence-hashcode.
The best way is probably to copy the current algorithm into strings.factor.
Then change to your heart's content!
… On Dec 18, 2020, at 5:16 AM, kusumotonorio ***@***.***> wrote:
I would like to change sequence-hashcode-step, but then Factor gets stuck.
-
Specifically, put this in `core/strings/strings.factor`:

Then you should be able to bootstrap just fine.
-
Actually, I'll go ahead and do that in 5e6e838, and then we can explore
-
In a problem I'm working on, I think the bottleneck is storing to and retrieving from a hash table.
I'm wondering if there's a way to make it faster, and I'm curious about this issue I came across earlier.
What is the reason for such a difference? What should I keep in mind when using hashtables?