Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simple source code to integrate #550

Open
Cyan4973 opened this issue Aug 9, 2021 · 23 comments
Open

Simple source code to integrate #550

Cyan4973 opened this issue Aug 9, 2021 · 23 comments
Milestone

Comments

@Cyan4973
Copy link
Owner

Cyan4973 commented Aug 9, 2021

One of the strong points of xxhash is its integration story : it's enough to grab a single file xxhash.h and there you have the entire library available in your program.

However, this only applies to people who trust the source code. In some cases, when a programmer feels he should inspect the source code before integration, the large source size of xxhash.h can feel daunting by itself.

One idea could be to split the source code into more dedicated units, resulting in more focused code bases. If someone is only interested in XXH32() or XXH64(), they could integrate, or even copy/paste, just this code.
A good example of this strategy is Stefan Brumme's simple xxHash C++ libraries.
To reach the same level of clarity,XXH3 would have to be split even further, alongside its vector variants.
And of course, one-shot and streaming variants would preferably be separated too.

This could work. But it would also result in a more complex multi-files code base, making the integration story more difficult.

A potential work-around could be to split the source code into these dedicated units, and then have an automated build system which regenerates the "complete" amalgamated xxhash.h library, so that current user base can still benefit from the 1-file integration story.

Splitting is not too difficult, though requires quite some work, hence some available time.
The amalgamation process can be a bit more tricky, especially if an important objective is that xxhash.h must remain an approachable source file which can be read and debugged, as readability experience can be hurt by a dumb amalgamation process.

Supersedes #543.

@t-mat
Copy link
Contributor

t-mat commented Aug 27, 2021

A potential work-around could be to split the source code into these dedicated units, and then have an automated build system which regenerates the "complete" amalgamated xxhash.h library, so that current user base can still benefit from the 1-file integration story.

I personally +1 for amalgamation and relatively modular code base.

I also suppose its "root header" may look like this:

/* xxhash-root.h */
/* Here's copyright notice, etc. */
#if defined (__cplusplus)
extern "C" {
#endif

#define XXH_VERSION_MAJOR    0
#define XXH_VERSION_MINOR    8
#define XXH_VERSION_RELEASE  1
#define XXH_VERSION_NUMBER  (XXH_VERSION_MAJOR *100*100 + XXH_VERSION_MINOR *100 + XXH_VERSION_RELEASE)
#include "xxhash-inline-all.h"          /* INLINE mode */

#ifndef XXHASH_H_5627135585666179
#  define XXHASH_H_5627135585666179 1
#  include "xxhash-api.h"               /* Stable API */
#  include "xxhash-xxh32-decl.h"        /* 32-bit hash */
#  ifndef XXH_NO_LONG_LONG
#    include "xxhash-xxh64-decl.h"
#    include "xxhash-xxh3-decl.h"
#  endif
#endif

#if defined(XXH_STATIC_LINKING_ONLY) && !defined(XXHASH_H_STATIC_13879238742)
#  define XXHASH_H_STATIC_13879238742
#  include "xxhash-xxh32-static.h"      /* XXH32_state_s */
#  ifndef XXH_NO_LONG_LONG
#    include "xxhash-xxh64-static.h"
#    include "xxhash-xxh3-static.h"
#  endif
#  if defined(XXH_INLINE_ALL) || defined(XXH_PRIVATE_API)
#    define XXH_IMPLEMENTATION
#  endif
#endif

#if ( defined(XXH_INLINE_ALL) || defined(XXH_PRIVATE_API) || defined(XXH_IMPLEMENTATION) ) && !defined(XXH_IMPLEM_13a8737387)
#  define XXH_IMPLEM_13a8737387
#  include "xxhash-impl.h"              /* Tuning parameters, Compiler specific options, etc */
#  include "xxhash-xxh32-impl.h"
#  ifndef XXH_NO_LONG_LONG
#    include "xxhash-xxh64-impl.h"
#    ifndef XXH_NO_XXH3
#      include "xxhash-xxh3-impl.h"
#    endif
#  endif
#endif

#if defined (__cplusplus)
}
#endif

Amalgamation process might be done by the following command. It "expands" every #include "..." files of xxhash-root.h and output the result to stdout.

xxh-amalgum-util.py xxhash-root.h > xxhash.h
  • Q. xxh-amalgum-util emits #file and #line of original file? Or xxhash.h is always "original" source?

Also root Makefile may have entry for it.

The unknown point is when amalgamation will occur. For example,

  • At (1 commit before) every release. (In this case, how xxhsum and test executables will be built? They include "xxhash-root.h"?)
  • By CI for every single commit.
  • Every committer must do it before commit on their side. (By hand or git hook)

@Cyan4973
Copy link
Owner Author

Cyan4973 commented Aug 27, 2021

Yes, that's exactly that.
The full story goes a bit farther, as the final goal would allow selective generation, such as "just" xxh32.h , or "just" xxh64.h",
within these variants, the ability to build "just" the one-shot variants (to keep the code small),
and in the case of XXH3, the ability to select "just" one vectorized implementation.
But the full story is for later.
I think the first step would be exactly the one you suggest, and then we could build upon and improve it.

The unknown point is when amalgamation will occur.

That's a good point.
Initially, my first idea was that this amalgamation process would be triggered by make all, or by something like make xxhash.h.
Thus the committer would just have to invoke make all before submitting his PR.
I don't expect most external committers to think about that, but as long as it's done once in a while, it should not matter (too) much. Of course, it really matters that it's done before every release, and such condition should probably be ensured by a test.

The other idea (ensure that every PR has rebuilt xxhash.h) seems worthwhile too. It's not a "big" effort for committer (as long as building xxhash.h is fast enough), this would make the consistency test compulsory, instead of occasional. Finally, current tests are already built around an up to date xxhash.h, so if they aren't updated, they would not test the "last commit".
So maybe it's a better process after all.

Q. xxh-amalgum-util emits #file and #line of original file? Or xxhash.h is always "original" source?

In which cases does it make a difference ?

@t-mat
Copy link
Contributor

t-mat commented Aug 27, 2021

Q. xxh-amalgum-util emits #file and #line of original file? Or xxhash.h is always "original" source?
In which cases does it make a difference ?

I thought that old version of amalgamed SQLite contained #file and #line of original part of files to support issue report and debugging.
But today I check the code and it doesn't contain any #file and #line. It might be memory lapse.

Anyway, my question doesn't make sense. I must rephrase it.

  • When (external) developers #include amalgamed xxhash.h, how they find actual line of source code during their investigation/contribution?
  • How issue reporting will work with amalgamed xxhash.h?

I don't want to object to the amalgamation. But I'd like to know typical scenario.


The unknown point is when amalgamation will occur.
Initially, my first idea was that this amalgamation process would be triggered by make all, or by something like make xxhash.h.

As for test, I think we can add explicit rebuilding of xxhash.h as a part of checkout procedure.

# Checkout procedure
... git checkout ...
rm xxhash.h
make xxhash.h  # Rebuild

# Actual test
make all test-all
...

We might also have test for real xxhash.h in the repo.

# Checkout procedure
... git checkout ...
# do not remove/rebuild xxhash.h #

# Actual test
make all test-all
...

As for PR/commit process, I suppose typical scenario is:

(0) We always have amalgamed xxhash.h in the repo. (Right?)
(1) Contributors edit separated xxHash (xxhash-xx32-decl.h, etc).
(2) They do make all to check their own change.
(3) make all updates xxhash.h.
(4-1) When they send a PR, it always contains some change for xxhash.h.
(4-2) If someone forget to send amalgamed xxhash.h, other person can revamp it by invoking make xxhash.h.
(4-3) If we need, GA for PR/push is able to force to update xxhash.h by rejecting older xxhash.h. We can implement this logic by comparing xxhash.h in the repo and artifact of make xxhash.h.

@Cyan4973
Copy link
Owner Author

When (external) developers #include amalgamed xxhash.h, how they find actual line of source code during their investigation/contribution?

I currently presume that issues will be reported using line numbers in the file actually included in user's project. aka, generally xxhash.h. In the future, presuming multiple amalgamated versions are possible, it could be xxh32.h, or xxh64.h, etc.
As long as the name of the file is unique, it doesn't really matter if the file is actually "source" or if it's the outcome of amalgamation process.

As for test, I think we can add explicit rebuilding of xxhash.h as a part of checkout procedure.

Good idea, could be attempted.
The only issue to pay attention to is that amalgamation should be a fast process to not burden git checkout.

(0) We always have amalgamed xxhash.h in the repo. (Right?)

Right

(1) Contributors edit separated xxHash (xxhash-xx32-decl.h, etc).

Yes, this one is more difficult.
It will likely require educating contributors,
and a good upfront documentation of how it's supposed to work.

(2) They do make all to check their own change.
(3) make all updates xxhash.h.

Yes

(4-1) When they send a PR, it always contains some change for xxhash.h.

Except if they modify something else, like the CLI code for example.

(4-3) If we need, GA for PR/push is able to force to update xxhash.h by rejecting older xxhash.h. We can implement this logic by comparing xxhash.h in the repo and artifact of make xxhash.h.

Yes

@t-mat
Copy link
Contributor

t-mat commented Aug 28, 2021

I made a pair of small script for PoC.

Demo

  • amalgum-separator.py : Separate xxhash.h into multiple headers.
  • amalgum-expander.py : Expand all #include statements in xxhash-root.h. And output the result to stdout.
  • xxhash.h : Modified xxhash.h which contains special comment /* --amalgum--... */ for amalgum-separator.py.
cd
mkdir xxh-amalgum
cd xxh-amalgum
curl -LOJ https://gist.githubusercontent.com/t-mat/0614c9ed6ef7b35dcf00cd022eacfa92/raw/f0a54688f30d5ce0516d8cc7523a8c6a859afee1/amalgum-expander.py
curl -LOJ https://gist.githubusercontent.com/t-mat/0614c9ed6ef7b35dcf00cd022eacfa92/raw/f0a54688f30d5ce0516d8cc7523a8c6a859afee1/amalgum-separator.py
curl -LOJ https://gist.githubusercontent.com/t-mat/8414155918bb7595d8c9c43dfadd6b4f/raw/4c12d56530d49c0185b2e4402d1114ee5a560081/xxhash.h

# Check special comments
grep "amalgum" xxhash.h

# split xxhash.h into multiple headers.  base file is xxhash-root.h
python3 ./amalgum-separator.py
ls -l *.h

# expand include statement in xxhash-root.h
python3 ./amalgum-expander.py > test.h
diff xxhash.h test.h

What's the point?

  • Since these scripts provide bidirection operations (split / expand), we can start working with xxhash.h.
  • With these scripts, we can experiment how splitted headers works while keep current structure of xxHash.

amalgum-separator.py

This script recognizes two special comments in xxhash.h

      /* --amalgum--:begin:FILENAME */
      /* --amalgum--:end:FILENAME */

Each begin .. end block is extracted into FILENAME.
Also each block is replaced by the following #include

   #include "FILENAME"

Rest of the content in xxhash.h (outside of begin .. end block) and above #include are written into xxhash-root.h (root_filename in the script).

amalgum-expander.py

This script reads xxhash-root.h and expands (replaces) all #include statement with its file content.

@gzm55
Copy link
Contributor

gzm55 commented Aug 28, 2021

how about using a reverse amalgum way. that is we still only maintain all functions a whole xxhash.h, and do not trigger any addition actions at stages of commit, pr, and ci. but we can provide method/tools to generate some single-function pieces, such as xxhash-32.h, xxhash-64.h, etc, which can be #include as needed by any order.

@Cyan4973
Copy link
Owner Author

Cyan4973 commented Aug 28, 2021

how about using a reverse amalgum way

It's an interesting idea,
and I believe this is what @t-mat 's amalgum-separator.py does.

It offers the advantages of a gradual expansion of build scripts able to produce smaller (limited) versions of the library,
without impacting the baseline (complete) source code, thus preserving current development model.

In a way, it answers #543 quite directly.

However, separating the code into multiple small headers also adds maintenance convenience to the list of benefits.
It seems that having smaller, more targeted, header files would likely help readability of said code, by improving local visibility, and likely controlling dependencies in a more formal and granular way. So once this level of granularity is achieved, it seems it would be the preferable code base to work on.

Bidirectional changes feels like an interesting property, though it could also introduce the question: "what's the source of truth?". Note that, if both versions are required to be exactly interchangeable at each commit, I presume it's a moot point at review stage. It's more an issue for the committer, which must decide which version he wants to modify, and update the other one with the appropriate automatic generator. It also impacts make all, which can no longer decide which version is the source of truth, thus bringing the other into conformity (unless a "source of truth" is officially selected).
There is also a risk of increased difficulty in maintaining the bidirectional change ability. I suspect that a single-direction change is probably easier to maintain.

Once again, cutting the source code into multiple smaller files looks like a great boon to maintenance, and for certain deployments, so it feels like the right end-objective. But the resulting amalgamation must still produce a readable xxhash.h, since it's the file that will generally be included into other projects, so if there are questions, or maintenance issues, programmers will want the ability to read, understand and debug this code. A "too dumb" amalgamation process, resulting in an unreadable source file, would kill this property, and with it would impact the willingness to include it in 3rd-party projects.
I guess the bidirectional change ability protects us from such risk, since it's always possible to work on xxhash.h directly, thus ensuring it remains readable.

@Cyan4973
Copy link
Owner Author

Cyan4973 commented Aug 28, 2021

@t-mat : I've looked at your PoC demo, and ran it locally.
It's very good, quite simply.
Notably, it preserves the clean bijection between single-header and multiple-focused-headers.
I believe it's a great starting point.

Now, the resulting files from the splitting process are not yet meaningful enough to become reference.
No need to rush, this can be improved progressively.
For example, they would need a license header each.
More critically, they generally depend on other #include in order to make sense of some of their symbols.
I presumed this could be taken care of by expanding the marking capabilities.

@t-mat
Copy link
Contributor

t-mat commented Aug 28, 2021

I'm glad it seems we find promising direction. It also brings many further questions/issues though.

"what's the source of truth?"

If we choose separation/amalgamation method, I think this is a core question.
And I believe answer is consequence of gradual transition process. We can divide this transition into some steps:

  • (step-0) Current structure of xxHash.
  • (step-1) Alpha phase. Introduce special comment to xxhash.h. But xxhash.h still represents the "trunk" of the project. We can do some experiment to find good separation and regrouping macros, declarations in xxhash.h. Also these experiments may require improvement of amalgum utility scripts.
  • (step-2) Beta phase. Introduce separation / amalgamation procedure in GA. This GA procedure always do amalgum-separator and amalgum-expander for each commit. But this GA procedure never fail. We can safely observe how amalgamation works. xxhash.h still represents the "trunk".
  • (step-3) Release phase. Separated headers become "trunk" of the project. GA forces amalgamation procedure.

@Cyan4973
Copy link
Owner Author

Yes, that's a good plan.

@t-mat
Copy link
Contributor

t-mat commented Aug 28, 2021

As for license, my idea is adding prologue block to amalgum utility.

We introduce special comment --amalgum--:begin-prologue and --amalgum--:end-prologue.

// xxhash.h

/* --amalgum--:begin-prologue */
/*
 * xxHash - Extremely Fast Hash algorithm
 * ...
 */
/* --amalgum--:end-prologue */

Each separated header contains prologue block at the beginning of the file.

@t-mat
Copy link
Contributor

t-mat commented Aug 28, 2021

As for dependency, we might introduce dedicated root header file for each variant.

/* xxhash-xxh32-root.h which produces specific variant file xxhash-xxh32.h */
#if defined (__cplusplus)
extern "C" {
#endif

#include "xxhash-inline-mode.h"

#ifndef XXHASH_H_5627135585666179
#  define XXHASH_H_5627135585666179 1
#  include "xxhash-decl.h"
#  include "xxhash-xxh32-decl.h"
#endif /* XXHASH_H_5627135585666179 */

#if defined(XXH_STATIC_LINKING_ONLY) && !defined(XXHASH_H_STATIC_13879238742)
#  define XXHASH_H_STATIC_13879238742
#  include "xxhash-xxh32-static.h"
#endif  /* defined(XXH_STATIC_LINKING_ONLY) && !defined(XXHASH_H_STATIC_13879238742) */

#if ( defined(XXH_INLINE_ALL) || defined(XXH_PRIVATE_API) \
   || defined(XXH_IMPLEMENTATION) ) && !defined(XXH_IMPLEM_13a8737387)
#  define XXH_IMPLEM_13a8737387
#  include "xxhash-tuning.h"
#  include "xxhash-impl.h"
#  include "xxhash-xxh32-impl.h"
#endif  /* XXH_IMPLEMENTATION */

#if defined (__cplusplus)
}
#endif

I don't say this is the best approach since it still contains some unnecessary complexity of #ifs.
But this root header always easily alignes with other variants.

@Cyan4973
Copy link
Owner Author

Cyan4973 commented Aug 28, 2021

As for license, my idea is adding prologue block to amalgum utility.

Looks good to me.

As for dependency, we might introduce dedicated root header file for each variant.

That's one way of solving it,
though I guess it also means each root header of each variant should be made (and maintained) manually.

A remaining issue is that, someone opening xxhash-xxh32-impl.h would end up with a file which is not "meaningful" by itself, confusing IDE code analyzers.
This file only makes sense after including xxhash-decl.h, xxhash-decl.h, xxhash-xxh32-static.h, etc. Yet this #include order is not present within xxhash-xxh32-impl.h itself, and only happens from within the root header xxhash-xxh32-root.h.

It's not a big deal as long as xxhash.h is the "source of truth" to edit, but becomes more of a problem later, when the goal is to edit the smaller specific files directly.

One potential solution could be to extend /* --amalgum--:begin:FILENAME */,
for example :

/* --amalgum--:begin:FILENAME 
 * #include "DEPENDENCY1"
 * #include "DEPENDENCY2"
 */

with obviously the intention to add the commented #include in the resulting small file.
One issue with that is that the process to rebuild xxhash.h from the small file should be lossless, aka these #include should end up back into the --amalgum-- comment section. This may require additional mark comments in the small files too.

@t-mat
Copy link
Contributor

t-mat commented Aug 28, 2021

I've updated amalgum scripts and example xxhash.h which contains/supports --amalgum--:begin-prologue block.

demo
cd
mkdir xxh-amalgum
cd xxh-amalgum
curl -LOJ https://gist.githubusercontent.com/t-mat/0614c9ed6ef7b35dcf00cd022eacfa92/raw/f0a54688f30d5ce0516d8cc7523a8c6a859afee1/amalgum-expander.py
curl -LOJ https://gist.githubusercontent.com/t-mat/0614c9ed6ef7b35dcf00cd022eacfa92/raw/f0a54688f30d5ce0516d8cc7523a8c6a859afee1/amalgum-separator.py
curl -LOJ https://gist.githubusercontent.com/t-mat/8414155918bb7595d8c9c43dfadd6b4f/raw/12298ac6d1c0386f26eef469b5963f39b221014e/xxhash.h

# Check special comments
grep "amalgum" xxhash.h

# split xxhash.h into multiple headers.  base file is xxhash-root.h
python3 ./amalgum-separator.py
ls -l *.h

# expand include statement in xxhash-root.h
python3 ./amalgum-expander.py > test.h
diff xxhash.h test.h

Now,

  • xxhash.h has --amalgum--:begin-prologue block.
  • each separated header contains --amalgum--:begin-prologue block which is copied from xxhash.h.
  • amalgum-expander.py ignores --amalgum--:begin-prologue block in separeted header.

@t-mat
Copy link
Contributor

t-mat commented Aug 28, 2021

A remaining issue is that, someone opening xxhash-xxh32-impl.h would end up with a file which is not "meaningful" by itself, confusing IDE code analyzers.

As for separated headers, your concern is right.

Perhaps we may rename these headers as *.inl or *.inc which are popular extension to represent these file must be included by other file. (e.g. Explicit inline implementation of C++ methods, functions)

Other solution is requiring every single separated header must be valid C language header.

To satisfy this requirement, separated header might look like this:

/* xxhash-xxh32-impl.h */

/* --amalgum--:begin-ignore */
#include "xxhash-decl.h"
#include "xxhash-xxh32-decl.h"
#include "xxhash-xxh32-static.h"
#include "xxhash-tuning.h"
#include "xxhash-impl.h"
/* --amalgum--:end-ignore */

... actual xxh32 implementation ...

Here, let's say --amalgum--:begin-ignore block will be ignored/removed by amalgumation script.

Since xxhash-xxh32-root.h is also regular/valid C header file, this requirement needs #pragma once or guard macro for each separeted file.

I suppose it might work. But I'm not sure how we can make harmony with INLINE mode or not.

If INLINE mode works fine, I personally prefer to this format. Because every single contributor and IDEs understand them easily.

@Cyan4973
Copy link
Owner Author

Cyan4973 commented Aug 28, 2021

this requirement needs #pragma once or guard macro for each separeted file.

Yes, guard macro per header file are needed

I'm not sure how we can make harmony with INLINE mode or not.

Indeed. INLINE mode starts by #undef all these header guards, since it has to reload all definitions (and implementations) with a different prefix.
I presume a similar strategy would work, even with multiple header files.
But there is an order. The part related to INLINE should be loaded first.

Also, this order only matters for the root file, which produces the selective library.
Regarding individual header files, the only requirement is that they should be parsable (valid) C code.
They are not supposed to be included by a 3rd party project individually.

@t-mat
Copy link
Contributor

t-mat commented Aug 28, 2021

I added special notation for dependency: /* --amalgum--:include:FILENAME */.
This comment must be placed within begin ... end block.

/* --amalgum--:begin:xxhash-xxh32-impl.h */
/* --amalgum--:include:xxhash-inline-mode.h */
/* --amalgum--:include:xxhash-decl.h */
/* --amalgum--:include:xxhash-xxh32-decl.h */
/* --amalgum--:include:xxhash-xxh32-static.h */
/* --amalgum--:include:xxhash-tuning.h */
/* --amalgum--:include:xxhash-impl.h */

It may be just a bit bulky. To reduce number of :include:, we can add automatic dependency resolution functionality.

/* --amalgum--:begin:xxhash-xxh32-impl.h */
/* --amalgum--:include:xxhash-impl.h */ /* all other dependencies are resolved automatically */

But since it may prevent to tuning of dependency and order by hand, so far I'd like to leave it as primitive form.

I also added it to xxhash.h. We can test it with the following commands:

test
cd
mkdir xxh-amalgum
cd xxh-amalgum
curl -LOJ https://gist.githubusercontent.com/t-mat/0614c9ed6ef7b35dcf00cd022eacfa92/raw/0062dcbd710ba541dbd4c1a91e195029c5976bd8/amalgum-expander.py
curl -LOJ https://gist.githubusercontent.com/t-mat/0614c9ed6ef7b35dcf00cd022eacfa92/raw/0062dcbd710ba541dbd4c1a91e195029c5976bd8/amalgum-separator.py
curl -LOJ https://gist.githubusercontent.com/t-mat/8414155918bb7595d8c9c43dfadd6b4f/raw/9899dea05aa09f72e5caafed3354fc521f30faf3/xxhash.h

# Check special comments
grep "amalgum" xxhash.h

# split xxhash.h into multiple headers.  base file is xxhash-root.h
python3 ./amalgum-separator.py
ls -l *.h

# expand include statement in xxhash-root.h
python3 ./amalgum-expander.py > test.h
diff xxhash.h test.h

I still don't have conclusion about guard macro for each fragment.

  • They should have plain guard macro in xxhash.h?
  • Or amalgum util should have some functionality?
/* --amalgum--:begin:xxhash-xxh32-impl.h */
/* --amalgum--:guard:XXHASH_XXH32_IMPL */
/* --amalgum--:include:xxhash-inline-mode.h */
...

becomes

/* --amalgum--:begin-prologue */
...
/* --amalgum--:end-prologue */
#ifndef XXHASH_XXH32_IMPL                    /* --amalgum--:guard-line */
#define XXHASH_XXH32_IMPL 1                  /* --amalgum--:guard-line */
/* --amalgum--:begin:xxhash-xxh32-impl.h */
#include "xxhash-inline-mode.h"              /* --amalgum--:include-line */
...
#endif /* XXHASH_XXH32_IMPL */               /* --amalgum--:guard-line */

@t-mat
Copy link
Contributor

t-mat commented Aug 29, 2021

Added --amalgam--:guard:SYMBOL.

/* --amalgam--:begin:xxhash-decl.h */
/* --amalgam--:guard:XXHASH_DECL_H */

becomes

#ifndef XXHASH_DECL_H                        /* --amalgam--:guard-line */
#define XXHASH_DECL_H 1                      /* --amalgam--:ignore-line */
...
#endif /* XXHASH_DECL_H */                   /* --amalgam--:ignore-line */
cd
mkdir xxh-amalgam
cd xxh-amalgam
curl -LOJ https://gist.githubusercontent.com/t-mat/675417703f9956ee73ce0f2b3721edca/raw/f34bdbd4037e6287c9972f39fe5ab3eaf8d8476a/xxhash.h
curl -LOJ https://gist.githubusercontent.com/t-mat/6b289ec10f6130e1a34841b69237a1cd/raw/0e769e68fc800c242ca52ae011da1d425f0ebedb/amalgam-separator.py
curl -LOJ https://gist.githubusercontent.com/t-mat/6b289ec10f6130e1a34841b69237a1cd/raw/0e769e68fc800c242ca52ae011da1d425f0ebedb/amalgam-expander.py

# Check special comments
grep "amalgam" xxhash.h

# split xxhash.h into multiple headers.  base file is xxhash-root.h
python3 ./amalgam-separator.py
ls -l *.h

# expand include statement in xxhash-root.h
python3 ./amalgam-expander.py > test.h
diff xxhash.h test.h

(edit) note : rename amalgum to amalgam.

@tansy
Copy link

tansy commented Dec 29, 2021

There is something I wanted to point out with xxhash. It popped into my head when I was reading about non/bsd hash result convention (#645) and about simple source code to integrate (#550).
I do not understand this inclination to put everything in one, single source file. There are, de facto, two distinct kinds, genera (I would say), of hashes within xxhash. It's xxh and xxh3. I know they are similar in concept, developed by same person/s, similar in working principle, but they are not the same. I like to think of it like in taxonomy:

taxonomy1

xxhash is family of hashes, xxh and and xxh3 are different genera and xxh32, xxh64, xxh128; xxh3-64, xxh3-128 different species within genus.
They are different, separate species nevertheless.


As I don't want to create another issue but have question that is associated with the subject:

Is there any sensible way to compile single hash (by single hash I mean xxh32 or xxh64 or xxh3-128, and so on) other than meticulous, manual carving xxh32 functions from the source?
Imagine I wanted to use only specific hash species* (I mean xxh32 or xxh64) but I didn't want to include 130k of code when I can do with 13k. It's actually what happened when I tried to test new version of xxhash with lz4 and lizard as speedup in hash would most likely improve de/compression speed. The one included in source contain xxh32 and xxh64 and weighs 13k compiled, and xxhash-0.8.1 is 130k compiled. It's more than whole program.

@Cyan4973
Copy link
Owner Author

This is a listed objective, and will require time to achieve.

In the meantime, you can control the scope of the binary by using the following build macros :
for XXH32 only : define XXH_NO_LONG_LONG,
for XXH32 and XXH64 (but not XXH3 nor XXH128) : define XXH_NO_XXH3 .

This is explained and documented here :
https://github.com/Cyan4973/xxHash#build-modifiers

@tansy
Copy link

tansy commented Dec 29, 2021

Thanks, it worked.

@tansy
Copy link

tansy commented Jan 13, 2022

This level of complexity is unnecessary. All we need, in my opinion, is to split it into xxh.h, xxh3.h, and xxhash.h. With xxh.h for be xxh-hash, xxh3.h for xxh3-hash, xxhash.h for basic, common defines and stuff. There is no need to split it into hundreds of files.

I actually tried to split it but due to all these optimizations it's so complicated that I got stuck, also because of construction like that, which suggests it uses XXH- code (XXH32: XXH_readLE32, XXH64: XXH_readLE64):

XXH_FORCE_INLINE XXH_PUREF XXH64_hash_t XXH3_len_4to8_64b(...)

and

XXH3_len_4to8_64b(const xxh_u8* input, size_t len, const xxh_u8* secret, XXH64_hash_t seed)
    (...)
    xxh_u32 const input1 = XXH_readLE32(input);

Maybe some of these functions has to be "duplicated" and have separate XXH and XXH3 "versions".

All I achieved was to remove everything enclosed within XXH_NO_XXH3 define (xxh__NO_XXH3.h) and moved it to separate file (xxh3__YES_XXH3.h), if it's of any help.

Ed: corrected xxh.h, xxh3.h; I left #ifndef XXH_NO_XXH3 (...) #endif clauses so it can be seen what was re/moved.

@Cyan4973
Copy link
Owner Author

Coming back to this topic,
I still believe that having smaller files to edit would be helpful for xxhash.
It would also make conditional amalgamation a real possibility,
resulting in smaller produced binaries, and smaller source code,
which would better fit user's use cases.

I know this topic has already been attempted, though not completed, likely due to complexity.

While I don't remember all the details, I suspect one issue which impacted complexity was an attempt to have a both-ways conversion, splitting the initial xxhash.h into smaller components, then re-integrating them into an amalgamated file, and trying to preserve parity during round-trip. This would have allowed continued development on the "big" amalgamated file during the transition period.

I now suspect that this objective directly lead to increased complexity of the solution, which in turn meant it was not completed.

I'm now considering a simpler approach, which would be strictly one-direction, aka from small files to amalgamated xxhash.h only, without the reverse splitting operation. I presume it would be easier to execute.

Of course, there are a few drawbacks too, and it would be better to underline them first.

To begin with, it must be clear that contributors should no longer edit the main xxhash.h file at root, but rather the more specific ones in a lib subdirectory.
This could be done by adding some code documentation into xxhash.h, making it clear that it is an automatically generated artifact, and therefore should not be edited.

Another issue is interface documentation, since the content of xxhash.h might become messier, and therefore more difficult to read by users.
It might be possible to redirect readers towards the small files, which would be more focused, and as a consequence likely easier to read.
Another mitigation is to have a carefully managed amalgamation process, which ensures that user-level API are clearly delimited at the top of the produced xxhash.h file.

Finally, there is also the question of the "transition period", during which it's not easy to integrate new changes while the source code get dismantled. There is also an inherent complexity in deciding how to split xxhash.h.
On the first point, I think we have reached a stage where the nb of changes is no longer that frequent, so it seems it would be possible to achieve that objective during a "calm" period.
On the second point, maybe the first splitting does not need to be perfect, and one could start with just a "coarse" split. For example, as suggested earlier, we could start by just splitting xxh3 from simpler xxh32 and xxh64. Further splitting refinements could then be executed later.

Note that I don't expect to start such an effort "immediately".
Rather, I would prefer to complete the release of v0.8.2 before considering this next stage. It could be large enough to deserve a v0.9.0 tagging.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants