zstd / conda format not enforcing file order in inner archives? #181

wolfv · 2022-12-19T06:50:55Z

Checklist

I added a descriptive title
I searched open reports and couldn't find a duplicate

What happened?

I think it would be good to enforce some file order in the inner archives of the .conda format (could be alphabetical or the same weird hash scheme of the .tar.bz2 files) in order to guarantee better reproducibility.

Conda Info

No response

Conda Config

No response

Conda list

No response

Additional Context

No response

The text was updated successfully, but these errors were encountered:

dholth · 2022-12-19T14:39:41Z

Alphabetical seems reasonable. The only requirement for .tar.bz2 is that the info directory should come first. But the old sorting code was not well tested. The current maintainers won't have any attachment to confusing old regex's. See also #172

The new transmute code, if conda-build produces .tar.bz2 instead of directly producing .conda, preserves the .tar.bz2 order.

wolfv · 2022-12-19T14:48:52Z

Great! How about:

sorting info files first (inner sort by file size)
sort the rest of the files alphabetically (using US locale)

And get rid of that wonky sorting code?

dholth · 2022-12-19T14:59:13Z

Why not all alphabetical

Does regular Python string compare use locale?

wolfv · 2022-12-19T15:03:21Z

Yep, we could do that. I think before we wanted small files first to be able to access them at the beginning of the tarball (at least for the info files).
I don't think with todays internet speeds it matters much.

I would need to educate myself about Python string sorting & potential locale issues. I mentioned it because of this page: https://reproducible-builds.org/docs/stable-inputs/

dholth · 2022-12-19T15:34:43Z

It would be interesting to look into the info order, conda index cares about it here https://github.com/conda/conda-index/blob/main/conda_index/index/sqlitecache.py#L237 , https://github.com/conda/conda-index/blob/main/conda_index/index/sqlitecache.py#L28. It makes a difference when the whole info/ is first in the tarball, but the order of the individual info/'s only makes a difference if you can detect you've seen all interesting info.

mbargull · 2022-12-20T10:34:15Z

Making the archive creation more stable/deterministic and in consequence reproducible is a good thing that I support!
However, we should make sure to not unnecessarily constrain ourselves by early decisions here.
Meaning, if you go with a "sort files by size and name/path" strategy here, please don't set it in stone just yet.

Apart from reproducibility, there are other factors, e.g., compression efficiency, to optimize for.
IMO, we should explore (stable) alternatives to the (recently removed) binsort part first. Looking at the similarity hashes and general strategy from https://github.com/mhx/dwarfs#overview would be a good candidate. (I.e., come up with something stable and also performant, which the way binsort was used didn't provide.)

dholth · 2022-12-20T12:22:25Z

It's still reproducible even if you sort it in a weird way, as long as someone else can run the same code on their machine.

mbargull · 2022-12-20T12:31:09Z

It's still reproducible even if you sort it in a weird way, as long as someone else can run the same code on their machine.

Right. What I meant to express was: Let's order the data in a reproducible, yet (compression-wise) efficient way.

wolfv · 2022-12-20T12:43:04Z

I am not very educated on this but do you think that really makes a big difference? Isn't the compression algorithm supposed to take care of that?

wolfv · 2022-12-22T14:48:02Z

I googled and researched a bit -- there is two hashes I found that would work:

simhash and minhash. There is a linux program (simhash) that can be used to order files by similarity.
The binsort utility to order files by similarity (http://neoscientists.org/~tmueller/binsort/)

Some blog article: https://morimori.tokyo/2017/11/sorting-files-for-better-compression/

A paper comparing the two: http://proceedings.mlr.press/v33/shrivastava14.pdf

I think in the best case it could give a 10-20% file size improvement (but that's with older compression algorithms). Not sure if it's worth the added complexity.

dholth · 2023-01-05T19:17:58Z

zstd has a larger window size than bz2 (depending on the compression level and the details are not in the main documentation), if the window size is larger than the entire uncompressed data then order shouldn't significantly affect compression?

dholth · 2023-10-18T21:09:24Z

https://github.com/conda/conda-package-handling/blob/main/src/conda_package_handling/tarball.py#L16-L30 is a pretty strange decision. It should be changed to info first then alphabetical/lexicographical.

Looks like we don't use any of that logic when building .conda thankfully.

wolfv added the type::bug describes erroneous operation, use severity::* to classify the type label Dec 19, 2022

mbargull mentioned this issue Dec 20, 2022

Using built-in Python hash for sorting goes against reproducibility #178

Closed

2 tasks

mbargull added type::feature request for a new feature or capability and removed type::bug describes erroneous operation, use severity::* to classify the type labels Dec 20, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

zstd / conda format not enforcing file order in inner archives? #181

zstd / conda format not enforcing file order in inner archives? #181

wolfv commented Dec 19, 2022

dholth commented Dec 19, 2022 •

edited

wolfv commented Dec 19, 2022

dholth commented Dec 19, 2022

wolfv commented Dec 19, 2022

dholth commented Dec 19, 2022

mbargull commented Dec 20, 2022

dholth commented Dec 20, 2022

mbargull commented Dec 20, 2022

wolfv commented Dec 20, 2022

wolfv commented Dec 22, 2022

dholth commented Jan 5, 2023

dholth commented Oct 18, 2023

zstd / conda format not enforcing file order in inner archives? #181

zstd / conda format not enforcing file order in inner archives? #181

Comments

wolfv commented Dec 19, 2022

Checklist

What happened?

Conda Info

Conda Config

Conda list

Additional Context

dholth commented Dec 19, 2022 • edited

wolfv commented Dec 19, 2022

dholth commented Dec 19, 2022

wolfv commented Dec 19, 2022

dholth commented Dec 19, 2022

mbargull commented Dec 20, 2022

dholth commented Dec 20, 2022

mbargull commented Dec 20, 2022

wolfv commented Dec 20, 2022

wolfv commented Dec 22, 2022

dholth commented Jan 5, 2023

dholth commented Oct 18, 2023

dholth commented Dec 19, 2022 •

edited