fix: platform accordance while calculating murmur3 #1636

Open: wants to merge 1 commit into base: master

@cdegree cdegree commented Jan 3, 2024

Short Description

fix: platform accordance while calculating murmur3

Whether the highest bit is sign-extended when a char is cast to
uint32 depends on the CPU architecture, which leads to different hash
values. This fix makes the behaviour consistent across architectures.

Problem background:

When running git log --max-count=1 <commit> -- <path> against a Git service backed by a mixed-CPU cluster
(both ARM and x86 machines), where <path> contains Chinese or other characters
whose bytes have the highest bit set to 1,
and all machines share the same repository disk,
the result is inconsistent: sometimes the searched file is found in the commit, and sometimes it is not.

Conditions

  1. The file path includes Chinese characters or other characters whose bytes have the highest bit set to 1.
  2. The Git service runs on a cluster with mixed CPU architectures.

Reason

The problem occurs when a Git server cluster has two or more machines, including at least one ARM and one x86 machine,
and the commit-graph's Bloom filter feature is enabled.

The Bloom filter stores each file path as hash values computed with the murmur3 function.
Suppose an ARM machine computes the hash this time; there, the char's highest bit is not sign-extended.
For example:
on ARM, char(11100110) becomes uint32(00000000 00000000 00000000 11100110)
on x86, char(11100110) becomes uint32(11111111 11111111 11111111 11100110)
With the murmur3 function that Git currently uses,
the two machines therefore calculate different hash values.
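
A minimal sketch of the widening described above. The helper names widen_signed and widen_unsigned are hypothetical and exist only for illustration; explicit signed char and unsigned char types stand in for the two platforms' plain char so the result does not depend on the machine compiling it.

```c
#include <stdint.h>

/* Hypothetical helper: models a platform where plain char is signed.
 * A byte with the highest bit set is negative, so the cast to uint32_t
 * fills the upper 24 bits with ones. */
static uint32_t widen_signed(signed char c)
{
	return (uint32_t)c;
}

/* Hypothetical helper: models a platform where plain char is unsigned.
 * The upper 24 bits are always zero. */
static uint32_t widen_unsigned(unsigned char c)
{
	return (uint32_t)c;
}
```

For the byte 11100110 (0xE6), widen_unsigned yields 0x000000E6 while widen_signed yields 0xFFFFFFE6, which matches the two bit patterns listed above.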

If the hash values are written and queried on machines of the same CPU architecture, everything works.
However, when the hash values are written on one CPU architecture and queried on another,
the searched file cannot be found.
For example, suppose the
Bloom filter's hash set is calculated on ARM and queried on x86.
The query hash then does not match the stored one, so the searched file is missed.
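
The mismatch can be reproduced with a small sketch modelled on the 32-bit murmur3 algorithm. This is not Git's code: murmur3_sketch, widen, rotl32 and the char_is_signed flag are hypothetical names, and the flag only emulates how each platform would widen a byte.

```c
#include <stddef.h>
#include <stdint.h>

static uint32_t rotl32(uint32_t v, int r)
{
	return (v << r) | (v >> (32 - r));
}

/* Emulate widening one byte on a platform with signed or unsigned char. */
static uint32_t widen(unsigned char b, int char_is_signed)
{
	if (char_is_signed && (b & 0x80))
		return 0xFFFFFF00u | b; /* sign extension */
	return b;
}

/* Sketch of a seeded 32-bit murmur3-style hash over len bytes. */
static uint32_t murmur3_sketch(uint32_t seed, const char *data, size_t len,
			       int char_is_signed)
{
	const unsigned char *p = (const unsigned char *)data;
	const uint32_t c1 = 0xcc9e2d51, c2 = 0x1b873593;
	size_t nblocks = len / 4, rem = len & 3, i;
	uint32_t k;

	/* Body: combine four bytes per block, then mix into the seed. */
	for (i = 0; i < nblocks; i++) {
		k = widen(p[4 * i], char_is_signed) |
		    (widen(p[4 * i + 1], char_is_signed) << 8) |
		    (widen(p[4 * i + 2], char_is_signed) << 16) |
		    (widen(p[4 * i + 3], char_is_signed) << 24);
		k *= c1;
		k = rotl32(k, 15);
		k *= c2;
		seed ^= k;
		seed = rotl32(seed, 13) * 5 + 0xe6546b64;
	}

	/* Tail: up to three remaining bytes. */
	p += 4 * nblocks;
	k = 0;
	if (rem == 3)
		k ^= widen(p[2], char_is_signed) << 16;
	if (rem >= 2)
		k ^= widen(p[1], char_is_signed) << 8;
	if (rem >= 1) {
		k ^= widen(p[0], char_is_signed);
		k *= c1;
		k = rotl32(k, 15);
		k *= c2;
		seed ^= k;
	}

	/* Finalization. */
	seed ^= (uint32_t)len;
	seed ^= seed >> 16;
	seed *= 0x85ebca6b;
	seed ^= seed >> 13;
	seed *= 0xc2b2ae35;
	seed ^= seed >> 16;
	return seed;
}
```

With a pure ASCII input such as "abcd", both settings of char_is_signed return the same hash. With the four bytes 0xE6 'a' 'b' 'c', the first block differs (0x636261E6 versus 0xFFFFFFE6), so the two hashes differ and a filter written with one setting is probed at the wrong positions with the other.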

Solution

Whatever the highest 24 bits turn out to be when a char is cast to uint32, the murmur3 function only cares about the char part, which is the lowest 8 bits. We can therefore apply & 0xFF (binary 11111111) to the cast uint32 value to keep only the lowest 8 bits.
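
A minimal sketch of the masking idea, not the actual patch. The helper names masked_signed, masked_unsigned and load_block are hypothetical; the first two use explicit signed and unsigned types to show that the mask gives the same result on both kinds of platform.

```c
#include <stdint.h>

/* Hypothetical helpers: widen a byte and keep only its lowest 8 bits. */
static uint32_t masked_signed(signed char c)
{
	return (uint32_t)c & 0xFF;
}

static uint32_t masked_unsigned(unsigned char c)
{
	return (uint32_t)c & 0xFF;
}

/* Hypothetical helper: combine four path bytes into one 32-bit block,
 * masking each byte so the result does not depend on char signedness. */
static uint32_t load_block(const char *data)
{
	return ((uint32_t)data[0] & 0xFF) |
	       (((uint32_t)data[1] & 0xFF) << 8) |
	       (((uint32_t)data[2] & 0xFF) << 16) |
	       (((uint32_t)data[3] & 0xFF) << 24);
}
```

Both masked_signed and masked_unsigned return 0xE6 for the byte 11100110, and load_block returns 0x636261E6 for the bytes 0xE6 'a' 'b' 'c' regardless of whether plain char is signed.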

Others

After this bug is fixed, the historical Bloom filter data stored in the commit-graph needs to be updated,
because the existing path hash values were calculated incorrectly.
This has to be done in the repository itself.

cc: Taylor Blau me@ttaylorr.com

Welcome to GitGitGadget

Hi @cdegree, and welcome to GitGitGadget, the GitHub App to send patch series to the Git mailing list from GitHub Pull Requests.

Please make sure that your Pull Request has a good description, as it will be used as cover letter. You can CC potential reviewers by adding a footer to the PR description with the following syntax:

CC: Revi Ewer <revi.ewer@example.com>, Ill Takalook <ill.takalook@example.net>

Also, it is a good idea to review the commit messages one last time, as the Git project expects them in a quite specific form:

  • the lines should not exceed 76 columns,
  • the first line should be like a header and typically start with a prefix like "tests:" or "revisions:" to state which subsystem the change is about, and
  • the commit messages' body should be describing the "why?" of the change.
  • Finally, the commit messages should end in a Signed-off-by: line matching the commits' author.

It is in general a good idea to await the automated test ("Checks") in this Pull Request before contributing the patches, e.g. to avoid trivial issues such as unportable code.

Contributing the patches

Before you can contribute the patches, your GitHub username needs to be added to the list of permitted users. Any already-permitted user can do that, by adding a comment to your PR of the form /allow. A good way to find other contributors is to locate recent pull requests where someone has been /allowed:

Both the person who commented /allow and the PR author are able to /allow you.

An alternative is the channel #git-devel on the Libera Chat IRC network:

<newcontributor> I've just created my first PR, could someone please /allow me? https://github.com/gitgitgadget/git/pull/12345
<veteran> newcontributor: it is done
<newcontributor> thanks!

Once on the list of permitted usernames, you can contribute the patches to the Git mailing list by adding a PR comment /submit.

If you want to see what email(s) would be sent for a /submit request, add a PR comment /preview to have the email(s) sent to you. You must have a public GitHub email address for this. Note that any reviewers CC'd via the list in the PR description will not actually be sent emails.

After you submit, GitGitGadget will respond with another comment that contains the link to the cover letter mail in the Git mailing list archive. Please make sure to monitor the discussion in that thread and to address comments and suggestions (while the comments and suggestions will be mirrored into the PR by GitGitGadget, you will still want to reply via mail).

If you do not want to subscribe to the Git mailing list just to be able to respond to a mail, you can download the mbox from the Git mailing list archive (click the (raw) link), then import it into your mail program. If you use GMail, you can do this via:

curl -g --user "<EMailAddress>:<Password>" \
    --url "imaps://imap.gmail.com/INBOX" -T /path/to/raw.txt

To iterate on your change, i.e. send a revised patch or patch series, you will first want to (force-)push to the same branch. You probably also want to modify your Pull Request description (or title). It is a good idea to summarize the revision by adding something like this to the cover letter (read: by editing the first comment on the PR, i.e. the PR description):

Changes since v1:
- Fixed a typo in the commit message (found by ...)
- Added a code comment to ... as suggested by ...
...

To send a new iteration, just add another PR comment with the contents: /submit.

Need help?

New contributors who want advice are encouraged to join git-mentoring@googlegroups.com, where volunteers who regularly contribute to Git are willing to answer newbie questions, give advice, or otherwise provide mentoring to interested contributors. You must join in order to post or view messages, but anyone can join.

You may also be able to find help in real time in the developer IRC channel, #git-devel on Libera Chat. Remember that IRC does not support offline messaging, so if you send someone a private message and log out, they cannot respond to you. The scrollback of #git-devel is archived, though.

There are issues in commit 0981c03:
Update bloom.c
Commit not signed off

@cdegree cdegree changed the title Update bloom.c fix bug in bloom.c Jan 3, 2024

There are issues in commit 69d01cc:
fix bug in bloom.c
Commit not signed off

@cdegree cdegree force-pushed the master branch 3 times, most recently from 438799f to a87bcd4 Compare January 3, 2024 12:10

cdegree commented Jan 3, 2024

/preview

Error: User cdegree is not yet permitted to use GitGitGadget

@cdegree cdegree changed the title fix bug in bloom.c fix: platform accordance in bloom.c Jan 3, 2024
@cdegree cdegree changed the title fix: platform accordance in bloom.c fix: platform accordance while calculating murmur3 in bloom.c Jan 3, 2024
@cdegree cdegree changed the title fix: platform accordance while calculating murmur3 in bloom.c fix: platform accordance while calculating murmur3 Jan 3, 2024
@cdegree cdegree force-pushed the master branch 2 times, most recently from d0169bf to f64a3ef Compare January 3, 2024 13:20
@cdegree cdegree closed this Jan 3, 2024
@cdegree cdegree reopened this Jan 3, 2024
@cdegree cdegree force-pushed the master branch 2 times, most recently from 2208317 to ddf766c Compare January 3, 2024 13:28
It is known that whether the highest bit is extended when char cast to
uint32, depends on CPU architecture, which will lead different hash
value. This is a fix to accord all architecture behaviour.

Signed-off-by: Chen Xuewei <316403398@qq.com>

rimrul commented Jan 3, 2024

/allow

User cdegree is now allowed to use GitGitGadget.

cdegree commented Jan 4, 2024

/submit

Submitted as pull.1636.git.git.1704376606625.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git/ pr-git-1636/cdegree/master-v1

To fetch this version to local tag pr-git-1636/cdegree/master-v1:

git fetch --no-tags https://github.com/gitgitgadget/git/ tag pr-git-1636/cdegree/master-v1

On the Git mailing list, Taylor Blau wrote (reply to this):

Hi Chen,

On Thu, Jan 04, 2024 at 01:56:46PM +0000, Chen Xuewei via GitGitGadget wrote:
> From: Chen Xuewei <316403398@qq.com>
>
> It is known that whether the highest bit is extended when char cast to
> uint32, depends on CPU architecture, which will lead different hash
> value. This is a fix to accord all architecture behaviour.

Thanks for your patch. A similar fix is being pursued in [1], part of
which includes [2], which I believe is functionally equivalent to your
patch here.

>     Others
>     ======
>
>     after fixed the bug, the historical bloom_filter data stored in
>     commit-graph need to be updated. because the path's hash value is
>     already calculated through a bad way. so we need to update it. this need
>     to be done in repository

We would not want to impose that burden on all users upon upgrading to
the latest Git version. In [1] we are pursuing an approach where:

  - The Bloom data is stored with a version identifier, meaning that we
    can still use the existing/non-murmur3 Bloom filters after
    upgrading.

  - When the user decides to upgrade from v1 -> v2 Bloom filters, we
    reuse the existing Bloom filter data when possible, namely when all
    paths within a tree have no non-ASCII characters.

If you have thoughts on the approach in [1], they would be most welcome.

Thanks,
Taylor

[1]: https://lore.kernel.org/git/cover.1697653929.git.me@ttaylorr.com/
[2]: https://lore.kernel.org/git/f6ab427ead86bc82284b2c721f3c177947ece3c9.1697653929.git.me@ttaylorr.com/
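
The reuse condition in the reply above (existing filter data can be kept when no path contains non-ASCII characters) follows from the fact that sign extension only changes bytes whose highest bit is set. A minimal sketch of such a check, assuming NUL-terminated paths; the name path_is_ascii_only is hypothetical and is not taken from the series in [1]:

```c
/* Hypothetical helper: returns 1 if no byte of the NUL-terminated path
 * has its highest bit set, in which case signed and unsigned widening
 * produce identical values and the old hash stays valid. */
static int path_is_ascii_only(const char *path)
{
	const unsigned char *p = (const unsigned char *)path;

	for (; *p; p++)
		if (*p & 0x80)
			return 0;
	return 1;
}
```

A path such as "src/bloom.c" passes the check, while any path containing UTF-8 encoded Chinese characters fails it, because every byte of a multi-byte UTF-8 sequence has the highest bit set.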

User Taylor Blau <me@ttaylorr.com> has been added to the cc: list.

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Chen Xuewei via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Chen Xuewei <316403398@qq.com>
>
> It is known that whether the highest bit is extended when char cast to
> uint32, depends on CPU architecture, which will lead different hash
> value. This is a fix to accord all architecture behaviour.
>
> Signed-off-by: Chen Xuewei <316403398@qq.com>
> ---

Jonathan and Taylor, isn't this what you two were working together
on?  How would we want to proceed?

Chen, using the right implementation of the hash function to be used
after the next rebuild of the Bloom data has so far been treated as
only one part of the solution (the others include "how to deal with
the existing data?  do we have a way to tell our binary to safely
ignore the Bloom data using a wrong hash?" and "how to make sure old
binaries do not get confused by the Bloom data using the right/new
hash?").

Jonathan and Taylor's (stalled) effort is here

    https://lore.kernel.org/git/cover.1697653929.git.me@ttaylorr.com


Thanks.


On the Git mailing list, Taylor Blau wrote (reply to this):

On Thu, Jan 04, 2024 at 10:12:42AM -0800, Junio C Hamano wrote:
> Jonathan and Taylor, isn't this what you two were working together
> on?  How would we want to proceed?

They are indeed similar. I think that Jonathan and my series would
supersede this effort.

But I would appreciate if Chen took a look at the approach in that
series to make sure that we're all on the same page and that Jonathan
and I aren't missing anything.

Thanks,
Taylor
