Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add pristine-lfs support #79

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

andrewshadura
Copy link
Contributor

This pull request adds pristine-lfs support in a similar way to the existing pristine-tar support. Since pristine-lfs has a Python interface, I use that instead of the command-line interface, but in a fail-safe way, providing a plug raising an exception when pristine-lfs cannot be imported.

Unlike pristine-tar, pristine-lfs supports looking up tarballs by a version, so this implementation uses that instead of picking tarballs manually.

If both pristine-lfs and pristine-tar are enabled, tarballs are only imported with pristine-lfs; when checking out, pristine-lfs is preferred, but pristine-tar processed anyway if pristine-lfs is missing the necessary tarball.

This PR is still work in progress, it’s missing documentation and tests. Should I maybe add a subcommand for pristine-lfs too?

I’m also unsure what exceptions to raise depending on the error situation.

@agx
Copy link
Owner

agx commented Apr 24, 2021

@andrewshadura thanks! This looks like a useful addition. Thinks i'm unsure about:

This PR is still work in progress, it’s missing documentation and tests. Should I maybe add a subcommand for pristine-lfs too?

Hmm...would it make sense to use gbp pristine-tar for that instead of having the user worry about which tool to use?

I’m also unsure what exceptions to raise depending on the error situation.

That does't look too bad from a quick glance.

@andrewshadura
Copy link
Contributor Author

* prefer lfs over tar or the other way around (and why)? (popcon seems to prefer `pristine-tar`)
  https://qa.debian.org/popcon-graph.php?packages=pristine-tar+pristine-lfs&show_installed=on&want_legend=on&want_ticks=on&from_date=&to_date=&hlght_date=&date_fmt=%25Y-%25m&beenhere=1
* fallback mechanism: what if `pristine-lfs` is better recreating a tarball than `pristine-tar` (or the other way around) when e.g. there's new compressors etc.

So, pristine-lfs and pristine-tar are built on different principles.
What is stored:

  • pristine-tar records the hash of the Git tree the tarball corresponds to, the hash of the tarball itself, and a binary delta between its Git export and the actual tarball
  • pristine-lfs records the hash of the tarball only; the actual tarball is stored and transferred out-of-band using Git LFS mechanisms.
    How tarballs are recreated:
  • pristine-tar only needs the Git repository itself: it uses the Git tree referenced in its metadata and the binary delta to recreate the tarball. It needs to explicitly support each compressor and its quirks to be able to efficiently store the tarball and precisely recreate it
  • pristine-lfs needs the Git repository itself, but it also needs to download binary blobs for each tarball separately. They’re stored inside .git but not as Git objects. When a tarball is checked out, it’s basically downloaded if it’s missing and then copied from the internal storage to the desired location. This means that for off-line operation all tarballs need to be cached beforehand.
    Conclusions:
  • for a lot of tarballs especially with tricky compressors, when you need just one, the bandwidth to recreate it will be less for pristine-lfs than pristine-tar since pristine-lfs branch only contains text metadata but no tarball data — with pristine-tar the branch will contain all binary deltas for all tarballs, whereas pristine-lfs only downloads tarballs on-demand. (On the other hand, the tarball data is already on the upstream code branch.)
  • when you need all tarball ever imported, pristine-tar will probably save you a lot of bandwidth
  • pristine-lfs doesn’t care about compressors or the tarball format
  • pristine-tar doesn’t require any server-side support, pristine-lfs needs a Git LFS server.

Given the above I think the first try should be pristine-lfs (if available) and then pristine-tar as a fallback.

* should there be a simple way to do both on import (pristine-tar and pristine-lfs)

I agree: --pristine-lfs --pristine-tar on import should do both, not just one.

@agx
Copy link
Owner

agx commented Dec 31, 2021

Thanks ore spelling out the details. If we make import-orig smart enough a separate gbp prstine-lfs sounds more useful than somehow folding this into gbp pristine-tar. The only bit missing then is enough config to tell import-orig what to do on import (pristine-tar, pristine-lfs, both) and for export-orig what to prefer/how to fall back. Does that make sense?

@andrewshadura
Copy link
Contributor Author

andrewshadura commented Dec 14, 2022

@agx, I finally found time for this pull request.

Given how long it’s been around — and I’ve been using it all that time! — I think we should merge the minimal dumb version of it without extra smartness, as described in the last message: import to either pristine-tar or pristine-lfs only when told to, when both are enabled, import to both. Export checks both, prefers pristine-lfs. I haven’t added any configuration for that, but I can’t really think about many situations when this might be necessary — unless someone commits different tarballs to both pristine-lfs and pristine-tar.

I have described the approach and the rationale in the commit message for 531971c.

Add pristine-lfs support similar to the existing pristine-tar support.
Since pristine-lfs has a Python interface, use that instead of the
command-line interface, but use it in a fail-safe way.

Unlike pristine-tar, pristine-lfs supports looking up tarballs by a
version, so use that instead of picking tarballs manually.

If both pristine-lfs and pristine-tar are enabled, import tarballs
into both; when checking out, prefer pristine-lfs, but also
do pristine-tar.

The rationale for the above behaviour:

pristine-lfs and pristine-tar are built on different principles.

What is stored:

 * pristine-tar records the hash of the Git tree the tarball corresponds
   to, the hash of the tarball itself, and a binary delta between its Git
   export and the actual tarball
 * pristine-lfs records the hash of the tarball only; the actual tarball
   is stored and transferred out-of-band using Git LFS mechanisms.

How tarballs are recreated:

 * pristine-tar only needs the Git repository itself: it uses the Git
   tree referenced in its metadata and the binary delta to recreate the
   tarball. It needs to explicitly support each compressor and its quirks
   to be able to efficiently store the tarball and precisely recreate it
 * pristine-lfs needs the Git repository itself, but it also needs to
   download binary blobs for each tarball separately. They’re stored
   inside .git but not as Git objects. When a tarball is checked out,
   it’s basically downloaded if it’s missing and then copied from the
   internal storage to the desired location. This means that for off-line
   operation all tarballs need to be cached beforehand.

Conclusions:

 * for a lot of tarballs especially with tricky compressors, when you
   need just one, the bandwidth to recreate it will be less for pristine-lfs
   than pristine-tar since pristine-lfs branch only contains text metadata
   but no tarball data — with pristine-tar the branch will contain all
   binary deltas for all tarballs, whereas pristine-lfs only downloads
   tarballs on-demand. (On the other hand, the tarball data is already on
   the upstream code branch.)
 * when you need all tarball ever imported, pristine-tar will probably
   save you a lot of bandwidth
 * pristine-lfs doesn’t care about compressors or the tarball format
 * pristine-tar doesn’t require any server-side support, pristine-lfs
   needs a Git LFS server.

Given the above, it makes sense that during a checkout, the first try
should be pristine-lfs (if available) and then pristine-tar as a fallback.

During import, --pristine-lfs --pristine-tar on import should do both,
not just one.

This approach may be revisited in future.

Signed-off-by: Andrej Shadura <andrew.shadura@collabora.co.uk>
Signed-off-by: Andrej Shadura <andrew.shadura@collabora.co.uk>
Signed-off-by: Andrej Shadura <andrew.shadura@collabora.co.uk>
@andrewshadura andrewshadura changed the title [WIP] Add pristine-lfs support Add pristine-lfs support Dec 14, 2022
@andrewshadura
Copy link
Contributor Author

By the way, I’ve just noticed the RPM test data repo is gone — is that intentional?

@andrewshadura
Copy link
Contributor Author

@agx, is there any chance to get this merged before the freeze?

@agx
Copy link
Owner

agx commented Feb 3, 2023

Seems github ate my reply: Can we have tests for the new options (enabled via GBP_NETWORK_TESTS) that e.g. take an URL and have a small python-gitlab helper that sets up such a repo on salsa. We'd rather give people an easy way to check if things are still working after making changes when introducing a new way of reproducing tarballs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants