
Discrete-Incremental Hashing #49

Open
avaneev opened this issue Nov 11, 2022 · 28 comments · Fixed by #50
Labels
enhancement New feature or request

Comments

avaneev commented Nov 11, 2022

I've noticed that your Java implementation of komihash uses 4-byte values when hashing a string (hashCharsToLong). This looks like a slow approach. Why is that? You can hash using 8-byte values.

But anyway, please note that I've found and tested an "incremental" hashing approach with komihash; it's described on the project page. You may consider switching your hash streamer to this incremental approach. It does not require pre-buffering and is thus considerably more efficient for strings, and even for small values like char, int, and long. Also note that for register-usage efficiency it's preferable to store the r1l, r2l, r3l, r4l multiplication results in the corresponding Seed1..4 values and then XOR them with the Seed5..8 values.

To adopt "incremental" hashing, you'll need to implement a single "static" hashing function that does not load or store internal state. You'll just need to keep a single "lastHash" value. This will likely be much more efficient than your current implementation.
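A minimal sketch of this scheme in Java (komihashOneShot is a hypothetical stand-in for a complete one-shot komihash(data, seed) implementation, not an existing API):

final class DiscreteIncrementalHasher {

    // The only state carried between inputs is the previous 64-bit hash
    // value; no pre-buffering of input data is required.
    private long lastHash;

    DiscreteIncrementalHasher(long seed) {
        this.lastHash = seed;
    }

    // Each data item is hashed in one shot, seeded with the previous hash.
    void put(byte[] data) {
        lastHash = komihashOneShot(data, lastHash);
    }

    long getHash() {
        return lastHash;
    }

    // Placeholder for a complete one-shot komihash(data, seed) function.
    private static long komihashOneShot(byte[] data, long seed) {
        throw new UnsupportedOperationException("stand-in for komihash");
    }
}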

avaneev added the enhancement label Nov 11, 2022
oertl commented Nov 11, 2022

I've noticed that your Java implementation of komihash uses 4-byte values when hashing a string (hashCharsToLong). This looks like a slow approach. Why is that? You can hash using 8-byte values.

This is due to Java's UTF-16-like in-memory representation of strings, which uses 2 bytes per character.

oertl commented Nov 11, 2022

Also note that for register-usage efficiency it's preferable to store the r1l, r2l, r3l, r4l multiplication results in the corresponding Seed1..4 values and then XOR them with the Seed5..8 values.

I will run some benchmarks to check whether this makes a difference. This is an optimization that the Java compiler could probably do automatically.

avaneev commented Nov 11, 2022

Thanks. I didn't notice that the char values are UTF-16.

oertl commented Nov 11, 2022

The incremental approach might be an interesting alternative. If I find the time, I will write an alternative implementation and look at the benchmarks.

I am not an expert in designing hash functions, but I wonder whether this incremental approach might lead to worse hash quality. My gut feeling is that it is better to serialize an object and then apply a hash function once than to apply the hash function recursively on chunks. After all, there must be a reason why most hash functions use an internal state larger than 8 bytes.

avaneev commented Nov 11, 2022

The quality is fine. Komihash has 128-bit "state space", so collisions do not happen. Anyway, I've tested the approach in SMHasher without issues.

You can optimize incremental hashing for char, byte, short, and long inputs quite a bit: instead of calling a single static hash function, you can add "truncated" implementations for specific input lengths.
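As a rough illustration of such a "truncated" path (this is not komihash's actual round function; the constant is arbitrary, and Math.multiplyHigh requires Java 9+), a dedicated 8-byte variant can skip the generic length and tail handling entirely:

// Hypothetical specialized round for a single long input.
static long hashLongValue(long value, long lastHash) {
    long a = value ^ lastHash;
    long b = 0x243F6A8885A308D3L; // arbitrary odd constant for this sketch
    long lo = a * b;                   // low 64 bits of the 128-bit product
    long hi = Math.multiplyHigh(a, b); // high 64 bits (Java 9+)
    return lo ^ hi;                    // multiply-and-fold primitive
}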

oertl commented Nov 11, 2022

What I mean is that only 64 bits of information are transferred from one chunk to the next with the incremental approach. If everything is hashed at once, there is much more information that can interact, due to the larger internal state of the hash function.

avaneev commented Nov 11, 2022

Well, look at the hash value as a "starting point". The next hashing round moves this "starting point" to another position. This may not work with other hashes, but it works with komihash due to its qualities. Whether you hash in a single pass or split the input into separate incremental byte-wide inputs, the hash quality remains the same; I've tested this.

oertl linked a pull request Nov 14, 2022 that will close this issue
oertl commented Nov 14, 2022

@avaneev, I have implemented your suggestions (see #50) but could not see any performance improvement in my benchmarks. I assume that the Java compiler does this kind of optimization automatically.

avaneev commented Nov 14, 2022

Understood. Anyway, I think it looks cleaner that way.

oertl closed this as completed in #50 Nov 14, 2022
oertl commented Dec 7, 2022

Reopening this issue, as it was unintentionally closed.

oertl reopened this Dec 7, 2022
oertl commented Dec 7, 2022

@avaneev, after implementing the streaming variant (see avaneev/komihash#8), do you think the incremental approach is still worth considering? I am inclined to keep the current approach, which is also safer for other hash algorithms that do not have the properties required for the incremental approach.

avaneev commented Dec 7, 2022

The discrete-incremental approach is very useful for database hashing, not only because it is faster, but also because it implicitly separates the data items. For example, if you hash the two string pairs "aaa"+"bbbb" and "aaab"+"bbb" in the usual streamed manner, you'll get two identical hashes. Discrete-incremental hashing produces two different hashes in this case.
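To illustrate with the hypothetical komihashOneShot from the earlier sketch (StandardCharsets from java.nio.charset assumed imported): both pairs concatenate to the identical byte sequence "aaabbbb", but the round boundaries differ.

// Streamed hashing sees the same bytes for both pairs; discrete-incremental
// hashing makes the item boundary part of the computation.
static void boundaryDemo(long seed) {
    long h1 = komihashOneShot("bbbb".getBytes(StandardCharsets.UTF_8),
            komihashOneShot("aaa".getBytes(StandardCharsets.UTF_8), seed));
    long h2 = komihashOneShot("bbb".getBytes(StandardCharsets.UTF_8),
            komihashOneShot("aaab".getBytes(StandardCharsets.UTF_8), seed));
    assert h1 != h2; // differs with overwhelming probability
}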

oertl commented Dec 7, 2022

For example, if you hash the two string pairs "aaa"+"bbbb" and "aaab"+"bbb" in the usual streamed manner, you'll get two identical hashes.

This is why hash4j provides methods that append the lengths of variable-length fields to make hashes unique. See for example:

/**
* Adds a string to the hash computation.
*
* <p>This method includes the length information. In this way,
*
* <p>{@code hashSink.putString}{@code ("AB").putString}{@code ("C")}
*
* <p>and
*
* <p>{@code hashSink.putString}{@code ("A").putString}{@code ("BC")}
*
* <p>will be different contributions to the hash value computation.
*
* <p>Equivalent to <br>
* {@code putChars(s).putInt(s.length());}
*
* @param s the string
* @return this
*/
HashSink putString(String s);

avaneev commented Dec 7, 2022

Yes, I understand. But the approach is not ideal: because the length is a binary value, in some cases a collision may still happen if you append an independent binary value after the string.

oertl commented Dec 7, 2022

Using the binary representation after serializing the object avoids hash collisions by definition. Appending the lengths of all variable-length fields of an object is equivalent to serialization, since the object can be reconstructed from this information. Therefore, the hash4j approach should be fine.

avaneev commented Dec 7, 2022

I think this leaves room for programming errors. Besides that, appending the string's length is unnecessary overhead compared to the discrete-incremental approach.

oertl commented Dec 7, 2022

For nested data structures like a list of lists (e.g. List<List<Integer>>), the incremental approach is not straightforward either. Either one needs another hasher instance, or one has to add the lengths of the inner lists.

avaneev commented Dec 7, 2022

For nested structures, it is enough to add a zero-length hashing round (Hash = komihash( &tmp, 0, Hash );) for each nesting level. It's a data-less round, and it's very fast, at least in C.

oertl commented Dec 7, 2022

Do you mean that

{{"ab", "cd"}, {"ef", "gh"}}

would be hashed as

komihash << "ab" << "cd" << "" << "ef" << "gh"

where the <<-operator denotes a hashing round and "" indicates a zero-length hashing round?

But then

{{"ab", "cd", "", "ef", "gh"}}

with an empty string in the middle would give a hash collision.

avaneev commented Dec 8, 2022

It should be more like
komihash << "" << "ab" << "cd" << "" << "ef" << "gh"

But you are right, this is not perfect. Unfortunately, there seems to be no good solution: if you put array lengths instead of empty strings, the same kind of counter-example can be given.

avaneev commented Dec 8, 2022

So a sub-hasher is probably needed, which would look like:
komihash << (komihash << "ab" << "cd") << (komihash << "ef" << "gh")
From a performance standpoint, this is as fast as adding zero lengths. But a counter-example can still be given; it would just involve random numbers, so it won't look obvious.

avaneev commented Dec 8, 2022

In my proposal I assumed zero-terminated strings, so their lengths are non-zero even when they are empty. But that's also not perfect, because individual zero bytes can also occur in a hash stream.

oertl commented Dec 8, 2022

But you are right, this is not perfect. Unfortunately, there seems to be no good solution: if you put array lengths instead of empty strings, the same kind of counter-example can be given.

When lengths (including those of the strings) are always appended, it should be fine:

{{"ab", "cd"}, {"ef", "gh"}}

would lead to a binary representation corresponding to

['a', 'b', 2, 'c', 'd', 2, 2, 'e', 'f', 2, 'g', 'h', 2, 2, 2]

whereas

{{"ab", "cd", "", "ef", "gh"}}

would give

['a', 'b', 2, 'c', 'd', 2, 0, 'e', 'f', 2, 'g', 'h', 2, 5, 1].

Of course, for tiny strings there is a lot of overhead, but in general it is a safe strategy that works with just a single hasher instance.

So, a sub-hasher is probably needed that will look like:
komihash << (komihash << "ab" << "cd") << (komihash << "ef" << "gh")

This is what I meant by "another hasher instance". This strategy requires creating further instances, which also comes with some overhead (especially in Java). Hashing a nested data structure requires as many hasher instances as there are nesting levels.

avaneev commented Dec 8, 2022

I do not think you need "another instance". A simple hash value stack will suffice, because the hashing function uses only local variables (see the sketch below).

Your example above collides with
{"ab", "cd", 2, "ef", "gh", 2, 2}, so it is comparably problematic, even if only in theory.

oertl commented Dec 8, 2022

I do not think you need "another instance". A simple hash value stack will suffice, because the hashing function uses only local variables.

Agree, but this could be difficult to realize in Java without additional overhead or object allocations.

Your example above collides with
{"ab", "cd", 2, "ef", "gh", 2, 2}, so it is comparably problematic, even if only in theory.

This is different, as this is a list of some polymorphic type (able to represent both integers and strings). If you hash a polymorphic object, you always have to include the type information. Then there will be no collision.

avaneev commented Dec 10, 2022

I think the additional overhead of a hash value stack would be much lower than the overhead of storing string lengths and types (for polymorphic objects), and of buffering, of course. The discrete-incremental capability may be a new word in high-performance database hashing, since it is fast, universal, and type-agnostic.

oertl commented Dec 10, 2022

I think type information is also necessary for the incremental approach in the case of polymorphic types. Since, as you mentioned, Komihash might be the only algorithm that has the properties required for this incremental approach, I still think it is better to use the current approach by default. However, I see the potential and will consider adding the incremental approach as an alternative in the future. I will keep this issue open in the meantime.

You mentioned that there are statistical tests showing that Komihash produces equally good hashes when run incrementally (e.g., when only one byte is added in each hashing round). Are such tests already part of SMHasher?

avaneev commented Dec 10, 2022

No, it is not in SMHasher. Here's the code you can try if you have SMHasher: simply change the komihash() call to komihash_di() in the SMHasher code (its hash identifier differs, though).

static inline uint64_t komihash_di( const void* const Msg, const size_t MsgLen,
	const uint64_t UseSeed )
{
	if( MsgLen == 0 )
	{
		// Data-less round: hash a zero-length input to advance the seed.
		char tmp;
		return( komihash( &tmp, 0, UseSeed ));
	}

	uint64_t Hash = UseSeed;

	// Feed the message one byte at a time; each round's output becomes
	// the seed of the next round.
	for( size_t i = 0; i < MsgLen; i++ )
	{
		Hash = komihash( (const uint8_t*) Msg + i, 1, Hash );
	}

	return( Hash );
}

avaneev changed the title from "Incremental hashing" to "Discrete-Incremental Hashing" on Dec 26, 2023