fix: platform accordance while calculating murmur3 #1636
base: master
Welcome to GitGitGadget
Hi @cdegree, and welcome to GitGitGadget, the GitHub App to send patch series to the Git mailing list from GitHub Pull Requests. Please make sure that your Pull Request has a good description, as it will be used as cover letter. You can CC potential reviewers by adding a footer to the PR description with the following syntax:
Also, it is a good idea to review the commit messages one last time, as the Git project expects them in a quite specific form:
It is in general a good idea to await the automated test ("Checks") in this Pull Request before contributing the patches, e.g. to avoid trivial issues such as unportable code.
Contributing the patches
Before you can contribute the patches, your GitHub username needs to be added to the list of permitted users. Any already-permitted user can do that, by adding a comment to your PR of the form Both the person who commented An alternative is the channel
Once on the list of permitted usernames, you can contribute the patches to the Git mailing list by adding a PR comment If you want to see what email(s) would be sent for a After you submit, GitGitGadget will respond with another comment that contains the link to the cover letter mail in the Git mailing list archive. Please make sure to monitor the discussion in that thread and to address comments and suggestions (while the comments and suggestions will be mirrored into the PR by GitGitGadget, you will still want to reply via mail). If you do not want to subscribe to the Git mailing list just to be able to respond to a mail, you can download the mbox from the Git mailing list archive (click the curl -g --user "<EMailAddress>:<Password>" \
--url "imaps://imap.gmail.com/INBOX" -T /path/to/raw.txt To iterate on your change, i.e. send a revised patch or patch series, you will first want to (force-)push to the same branch. You probably also want to modify your Pull Request description (or title). It is a good idea to summarize the revision by adding something like this to the cover letter (read: by editing the first comment on the PR, i.e. the PR description):
To send a new iteration, just add another PR comment with the contents:
Need help?
New contributors who want advice are encouraged to join git-mentoring@googlegroups.com, where volunteers who regularly contribute to Git are willing to answer newbie questions, give advice, or otherwise provide mentoring to interested contributors. You must join in order to post or view messages, but anyone can join. You may also be able to find help in real time in the developer IRC channel,
There are issues in commit 0981c03:
There are issues in commit 69d01cc:
438799f
to
a87bcd4
Compare
/preview
Error: User cdegree is not yet permitted to use GitGitGadget
d0169bf
to
f64a3ef
Compare
2208317
to
ddf766c
Compare
It is known that whether the highest bit is extended when char cast to uint32, depends on CPU architecture, which will lead different hash value. This is a fix to accord all architecture behaviour. Signed-off-by: Chen Xuewei <316403398@qq.com>
/allow
User cdegree is now allowed to use GitGitGadget.
/submit
Submitted as pull.1636.git.git.1704376606625.gitgitgadget@gmail.com To fetch this version into
To fetch this version to local tag
On the Git mailing list, Taylor Blau wrote:
Hi Chen,
On Thu, Jan 04, 2024 at 01:56:46PM +0000, Chen Xuewei via GitGitGadget wrote:
> From: Chen Xuewei <316403398@qq.com>
>
> It is known that whether the highest bit is extended when char cast to
> uint32, depends on CPU architecture, which will lead different hash
> value. This is a fix to accord all architecture behaviour.
Thanks for your patch. A similar fix is being pursued in [1], part of
which includes [2], which I believe is functionally equivalent to your
patch here.
> Others
> ======
>
> after fixed the bug, the historical bloom_filter data stored in
> commit-graph need to be updated. because the path's hash value is
> already calculated through a bad way. so we need to update it. this need
> to be done in repository
We would not want to impose that burden on all users upon upgrading to
the latest Git version. In [1] we are pursuing an approach where:
- The Bloom data is stored with a version identifier, meaning that we
can still use the existing/non-murmur3 Bloom filters after
upgrading.
- When the user decides to upgrade from v1 -> v2 Bloom filters, we
reuse the existing Bloom filter data when possible, namely when all
paths within a tree have no non-ASCII characters.
If you have thoughts on the approach in [1], they would be most welcome.
Thanks,
Taylor
[1]: https://lore.kernel.org/git/cover.1697653929.git.me@ttaylorr.com/
[2]: https://lore.kernel.org/git/f6ab427ead86bc82284b2c721f3c177947ece3c9.1697653929.git.me@ttaylorr.com/
On the Git mailing list, Junio C Hamano wrote:
"Chen Xuewei via GitGitGadget" <gitgitgadget@gmail.com> writes:
> From: Chen Xuewei <316403398@qq.com>
>
> It is known that whether the highest bit is extended when char cast to
> uint32, depends on CPU architecture, which will lead different hash
> value. This is a fix to accord all architecture behaviour.
>
> Signed-off-by: Chen Xuewei <316403398@qq.com>
> ---
Jonathan and Taylor, isn't this what you two were working together
on? How would we want to proceed?
Chen, using the right implementation of the hash function to be used
after the next rebuild of the Bloom data has so far been treated as
only one part of the solution (the others include "how to deal with
the existing data? do we have a way to tell our binary to safely
ignore the Bloom data using a wrong hash?" and "how to make sure old
binaries do not get confused by the Bloom data using the right/new
hash?").
Jonathan and Taylor's (stalled) effort is here
https://lore.kernel.org/git/cover.1697653929.git.me@ttaylorr.com
Thanks.
On the Git mailing list, Taylor Blau wrote:
On Thu, Jan 04, 2024 at 10:12:42AM -0800, Junio C Hamano wrote:
> Jonathan and Taylor, isn't this what you two were working together
> on? How would we want to proceed?
They are indeed similar. I think that Jonathan and my series would
supersede this effort.
But I would appreciate if Chen took a look at the approach in that
series to make sure that we're all on the same page and that Jonathan
and I aren't missing anything.
Thanks,
Taylor
Short Description
fix: platform accordance while calculating murmur3
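Whether the highest bit is sign-extended when a char is cast to uint32
depends on whether plain char is signed on the platform, which differs
between CPU architectures (for example x86 and ARM). As a result, the same
path can hash to different values on different machines. This fix makes the
hash consistent across architectures.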
Problem background:
We run git log --max-count=1 <commit> -- <path> in a mixed-CPU cluster,
with both ARM and x86 machines serving as one Git service, where <path> contains Chinese or other characters
whose bytes have the highest bit set to 1.
All machines share the same repository disk.
Sometimes the searched file is found in the commit and sometimes it is not.
Conditions
The Git server cluster has two or more machines, with at least one ARM and at least one x86 machine.
The commit-graph's bloom_filter feature is enabled.
Reason
The Bloom filter stores each file path as hash values computed with the murmur3 function.
Suppose an ARM machine computes the filter. There, the char's highest bit is not sign-extended when it is cast to uint32.
For example:
on arm, char(11100110) to uint32(00000000 00000000 00000000 11100110)
on x86, char(11100110) to uint32(11111111 11111111 11111111 11100110)
With the murmur3 function that Git currently uses,
the two machines therefore compute different hash values for the same path.
If the filter is written and queried on machines of the same CPU architecture, everything works.
If it is written on one CPU architecture and queried on another,
the query hashes do not match the stored ones and the searched file is not found.
For example,
if the bloom_filter's hash set is computed on ARM and queried from x86,
the x86 hash values are not in the filter, so the searched file is missed.
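The difference in widening can be reproduced on a single machine. The sketch below is only an illustration: the function names are hypothetical and it is not code from Git. It uses int8_t to model a platform where plain char is signed and unsigned char to model one where it is not.

```c
#include <stdint.h>
#include <string.h>

/*
 * Models a platform where plain char is signed (as in the x86 example):
 * the byte is reinterpreted as a signed 8-bit value and then widened,
 * which sign-extends the highest bit.
 */
uint32_t widen_as_signed_char(unsigned char byte)
{
	int8_t s;

	memcpy(&s, &byte, 1); /* copy the bit pattern unchanged */
	return (uint32_t)s;
}

/*
 * Models a platform where plain char is unsigned (as in the ARM example):
 * the upper 24 bits are always zero after widening.
 */
uint32_t widen_as_unsigned_char(unsigned char byte)
{
	return (uint32_t)byte;
}
```

For the byte 11100110 (0xE6) the first function yields 0xFFFFFFE6 and the second 0x000000E6. For a byte below 0x80, such as an ASCII letter, both yield the same value, which is why only paths with high-bit bytes are affected.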
Solution
Whatever the upper 24 bits are after a char is cast to uint32, the murmur3 function only needs the char itself, that is, the lowest 8 bits. Applying & 0xFF (11111111) to the cast uint32 value keeps only those 8 bits, so every platform feeds the same value into the hash.
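A minimal sketch of the masking idea, written as a standalone 32-bit murmur3 routine. The names murmur3_masked and byte_at are hypothetical, and this is not the patch itself or Git's own function. It only shows where the & 0xFF would sit when the input arrives as plain char.

```c
#include <stddef.h>
#include <stdint.h>

static uint32_t rotate_left(uint32_t value, unsigned int count)
{
	return (value << count) | (value >> (32 - count));
}

/* Widen one byte and drop any sign-extension bits (the & 0xFF fix). */
static uint32_t byte_at(const char *data, size_t pos)
{
	return ((uint32_t)data[pos]) & 0xFF;
}

uint32_t murmur3_masked(uint32_t seed, const char *data, size_t len)
{
	const uint32_t c1 = 0xcc9e2d51;
	const uint32_t c2 = 0x1b873593;
	size_t nblocks = len / 4;
	const char *tail = data + nblocks * 4;
	uint32_t k;
	size_t i;

	/* Body: mix in one little-endian 4-byte block at a time. */
	for (i = 0; i < nblocks; i++) {
		k = byte_at(data, 4 * i) |
		    (byte_at(data, 4 * i + 1) << 8) |
		    (byte_at(data, 4 * i + 2) << 16) |
		    (byte_at(data, 4 * i + 3) << 24);
		k *= c1;
		k = rotate_left(k, 15);
		k *= c2;
		seed ^= k;
		seed = rotate_left(seed, 13) * 5 + 0xe6546b64;
	}

	/* Tail: the remaining 1 to 3 bytes, if any. */
	k = 0;
	switch (len & 3) {
	case 3:
		k ^= byte_at(tail, 2) << 16;
		/* fallthrough */
	case 2:
		k ^= byte_at(tail, 1) << 8;
		/* fallthrough */
	case 1:
		k ^= byte_at(tail, 0);
		k *= c1;
		k = rotate_left(k, 15);
		k *= c2;
		seed ^= k;
		break;
	}

	/* Finalization mix. */
	seed ^= (uint32_t)len;
	seed ^= seed >> 16;
	seed *= 0x85ebca6b;
	seed ^= seed >> 13;
	seed *= 0xc2b2ae35;
	seed ^= seed >> 16;
	return seed;
}
```

Because every byte goes through byte_at, the result does not depend on whether plain char is signed. For instance, murmur3_masked(0, "\xff\xff\xff\xff", 4) evaluates to 0x76293b50 on either kind of platform.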
Others
After this fix, the historical bloom_filter data stored in the commit-graph needs to be updated,
because the path hash values in it were already computed the wrong way.
This needs to be done in the repository.
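Bytes below 0x80 widen to the same uint32 value whether or not char is signed, so a path made only of such bytes hashes the same way before and after the fix. A hypothetical helper (not part of this patch; the name path_has_high_bit is an assumption) for spotting paths whose hashes can differ might look like this:

```c
#include <stddef.h>

/*
 * Returns 1 if any byte of the path has its highest bit set, meaning its
 * hash can differ between signed-char and unsigned-char platforms when the
 * bytes are widened without masking. Returns 0 otherwise.
 */
int path_has_high_bit(const char *path, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++)
		if ((unsigned char)path[i] & 0x80)
			return 1;
	return 0;
}
```

An ASCII-only path such as "src/main.c" returns 0, while a path containing UTF-8 encoded Chinese characters returns 1.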
cc: Taylor Blau me@ttaylorr.com