Remove blobs from repository history #136

Open
braun-steven opened this issue Nov 6, 2023 · 3 comments


braun-steven commented Nov 6, 2023

As observed by @felixdivo, the repository history appears to be polluted with large blobs:

$ git clone git@github.com:SPFlow/SPFlow.git
Cloning into 'SPFlow'...
remote: Enumerating objects: 25226, done.
remote: Counting objects: 100% (5699/5699), done.
remote: Compressing objects: 100% (1544/1544), done.
remote: Total 25226 (delta 4186), reused 5580 (delta 4141), pack-reused 19527
Receiving objects: 100% (25226/25226), 128.74 MiB | 7.17 MiB/s, done.
Resolving deltas: 100% (17655/17655), done.
Updating files: 100% (424/424), done.

That is, a fresh clone downloads roughly 128 MiB.
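To check the same figure on an existing clone without re-downloading, git can report the size of the local object store. The sketch below wraps the standard `git count-objects -vH` call; the helper name `repo_size` is made up for illustration, and the local pack size will not necessarily match the transfer size shown above.

```shell
# Report the on-disk size of a repository's object store.
# repo_size is a hypothetical helper name; git count-objects is a standard git command.
repo_size() {
  # -v prints detailed counts, -H prints sizes in human-readable units
  git -C "${1:-.}" count-objects -vH
}

# Example usage, passing the path to a clone:
#   repo_size SPFlow | grep size-pack
```

The `size-pack` line is the one to compare before and after any history rewrite.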

Looking at the $N largest blobs, they are mostly datasets and Jupyter notebooks with embedded data:

# List all blobs in the history, sorted by size in descending order, and show the top N.
# %(rest) already carries the file path, so no second rev-list lookup per blob is needed.
$ N=10; git rev-list --objects --all \
| git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' \
| sed -n 's/^blob //p' \
| sort --numeric-sort --reverse --key=2 \
| head -n "$N" \
| awk 'BEGIN { OFS="\t"; print "Size", "SHA", "File" }
       { sha = $1; size = $2; sub(/^[^ ]+ [^ ]+ /, ""); print size, sha, $0 }'


Size	SHA	File
78400000	dcb8e04c9726c685c87d37ebb9b533171dc5a298	src/spn/data/binary/convex.test.data
73273832	a48113203c301a7ee4ff55865a66b57b96be9873	src/spn/data/categorical/bookmarks/bookmarks.arff
47040016	bbce27659e0fc2b7ed2a64c127849380a477099b	src/spn/data/count/mnist/train-images-idx3-ubyte
23288778	375efd7d894f15b38173ec70a54186ec11b7a683	src/spn/experiments/hyperspectral/cerc15dai175.npz
23051776	09af7e69c90a2e3bcff23f794ecb141f189c4671	src/spn/data/binary/kdd.ts.data
20553260	b5d2937144d0d1340f14c53c807b83f79be21bbd	src/spn/data/binary/c20ng.ts.data
17956533	46bc71904158832b2e940df4f8cae997cd45c1ea	src/spn/tests/parametric_samples/exp_rate_2.csv
17500930	2bad7f5dd511f57fa5ea732c4f291f1fede50bb8	src/spn/tests/parametric_samples/gamma_shape2_scale0.5.csv
17311308	939c8391c77e4764c74553b011ba7ced924337de	src/spn/data/binary/msweb.ts.data
16890649	fc4ce9bff58b54a47a3dfb9d05b5e8cd41dec0a9	src/spn/tests/parametric_samples/norm_mean10_sd3.csv
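To get a feel for how much these rows add up to, the size column can be summed with awk. This is a minimal sketch: `total_bytes` is a hypothetical helper name, and it assumes tab-separated rows in the `Size SHA File` layout shown above (the header row is skipped because its first field is not numeric).

```shell
# Sum the first column (size in bytes) of "size<TAB>sha<TAB>path" rows.
# total_bytes is a hypothetical helper name.
total_bytes() {
  awk -F'\t' '$1 ~ /^[0-9]+$/ { sum += $1 }
              END { printf "%d bytes (%.1f MiB)\n", sum, sum / 1048576 }'
}

# The three largest rows from the listing above:
printf '%s\t%s\t%s\n' \
  78400000 dcb8e04c9726c685c87d37ebb9b533171dc5a298 src/spn/data/binary/convex.test.data \
  73273832 a48113203c301a7ee4ff55865a66b57b96be9873 src/spn/data/categorical/bookmarks/bookmarks.arff \
  47040016 bbce27659e0fc2b7ed2a64c127849380a477099b src/spn/data/count/mnist/train-images-idx3-ubyte \
  | total_bytes
# prints: 198713848 bytes (189.5 MiB)
```

All ten rows together sum to 335267082 bytes, about 319.7 MiB. These are uncompressed object sizes, so they are not directly comparable to the compressed 128.74 MiB transfer reported by the clone.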

We should use BFG Repo-Cleaner (https://github.com/rtyley/bfg-repo-cleaner) to remove these files from the history before v1.0.0.
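For reference, the usual BFG sequence is: mirror-clone, strip, expire and garbage-collect, then push. The sketch below only prints that sequence and does not run it. `bfg_plan` is a hypothetical helper, it assumes a `bfg` launcher is on the PATH, and the 1M threshold is just a placeholder to adjust.

```shell
# Print (do not execute) the usual BFG cleanup sequence for a repository.
# bfg_plan is a hypothetical helper; review each printed command before running it.
bfg_plan() {
  url=$1                      # clone URL of the repository to clean
  limit=${2:-1M}              # threshold passed to --strip-blobs-bigger-than
  dir=$(basename "$url")      # e.g. SPFlow.git
  case $dir in *.git) ;; *) dir=$dir.git ;; esac
  cat <<EOF
git clone --mirror $url $dir
bfg --strip-blobs-bigger-than $limit $dir
cd $dir
git reflog expire --expire=now --all && git gc --prune=now --aggressive
git push
EOF
}

bfg_plan git@github.com:SPFlow/SPFlow.git 1M
```

The final push rewrites every branch on the remote, so existing clones and forks would need to be re-cloned or rebased onto the new history afterwards.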

braun-steven added this to the Version 1.0.0 milestone on Nov 6, 2023
braun-steven changed the title from "[dev] Remove blobs from repository history" to "Remove blobs from repository history" on Dec 11, 2023

braun-steven commented Dec 11, 2023

I tested this locally with a 1M size limit per blob. It shrinks the .git directory from 140M to 14M.

❯ bfg --strip-blobs-bigger-than 1M SPFlow-copy/

Using repo : /home/steven/projects/SPFlow-copy/.git

Scanning packfile for large blobs: 26686
Scanning packfile for large blobs completed in 125 ms.
Found 72 blob ids for large blobs - biggest=78400000 smallest=1159000
Total size (unpacked)=599451016
Found 984 objects to protect
Found 54 commit-pointing refs : HEAD, refs/heads/dev_learnSPN, refs/heads/dev_modules, ...

Protected commits
-----------------

These are your protected commits, and so their contents will NOT be altered:

 * commit d0559f82 (protected by 'HEAD')

Cleaning
--------

Found 1337 commits
Cleaning commits:       100% (1337/1337)
Cleaning commits completed in 765 ms.

Updating 32 Refs
----------------

	Ref                                              Before     After
	--------------------------------------------------------------------
	refs/heads/dev_learnSPN                        | 046b900d | 12e8d32b
	refs/heads/gh-pages                            | a748c667 | f631a248
	refs/heads/learnspn_b_test                     | 32b66ee2 | 49dd5cde
	refs/heads/master                              | d01f71d6 | 29101c0a
	refs/remotes/felixdivo/master                  | ad656c16 | 948b5523
	refs/remotes/origin/dev_learnSPN               | 046b900d | 12e8d32b
	refs/remotes/origin/gh-pages                   | a748c667 | f631a248
	refs/remotes/origin/learnspn_b_test            | 32b66ee2 | 49dd5cde
	refs/remotes/origin/master                     | d01f71d6 | 29101c0a
	refs/remotes/private/add-github-issue-template | c29ae78e | efc68608
	refs/remotes/private/add-sklearn-classifier    | 21c58970 | 2191089f
	refs/remotes/private/add-sphinx-documentation  | 5cac9101 | 19b9213d
	refs/remotes/private/bernoulli-layer           | 8db98c6c | f2270578
	refs/remotes/private/dev_learnSPN              | 046b900d | 12e8d32b
	refs/remotes/private/develop                   | 0f7639e6 | ba08e40c
	...

Updating references:    100% (32/32)
...Ref update completed in 71 ms.

Commit Tree-Dirt History
------------------------

	Earliest                                              Latest
	|                                                          |
	DDDDDDDDDDDDDDDDDDDDDDDDDDDDD....D.....D......Dm.......D.DDD

	D = dirty commits (file tree fixed)
	m = modified commits (commit message or parents changed)
	. = clean commits (no changes to file tree)

	                        Before     After
	-------------------------------------------
	First modified commit | 19266a9c | 762535f7
	Last dirty commit     | eb843114 | d579598c

Deleted files
-------------

	Filename                       Git id
	------------------------------------------------------------------------
	Corel5k-train.arff           | 7581f88b (7.5 MB)
	Corel5k.arff                 | 413f91ee (8.3 MB)
	accidents.ts.data            | cefacb18 (2.7 MB)
	ad.test.data                 | 1a6ecad2 (1.5 MB)
	ad.ts.data                   | b9dfaba7 (7.3 MB)
	adults.csv                   | ac05b131 (2.9 MB)
	ba_notebook.ipynb            | 00d87385 (4.1 MB)
	baudio.ts.data               | 53b36fd2 (2.9 MB)
	bbc.ts.data                  | 1109ad0c (3.4 MB)
	bern_prob0.7.csv             | 75a822ba (1.9 MB)
	bibtex-test.arff             | c6d43a85 (1.2 MB)
	bibtex-train.arff            | 70c68362 (2.2 MB)
	bibtex.arff                  | f702c066 (3.3 MB)
	bnetflix.ts.data             | 606d88a4 (2.9 MB)
	book.test.data               | 90460f56 (1.7 MB)
	...


In total, 2541 object ids were changed. Full details are logged here:

	/home/steven/projects/SPFlow-copy.bfg-report/2023-12-11/14-31-23

BFG run is complete! When ready, run: git reflog expire --expire=now --all && git gc --prune=now --aggressive

~/projects/SPFlow-copy dev_tensorly* ≡
❯ git reflog expire --expire=now --all && git gc --prune=now --aggressive
Enumerating objects: 26525, done.
Counting objects: 100% (26525/26525), done.
Delta compression using up to 8 threads
Compressing objects: 100% (26101/26101), done.
Writing objects: 100% (26525/26525), done.
Total 26525 (delta 18930), reused 5835 (delta 0), pack-reused 0

~/projects
❯ du -sh SPFlow-copy/.git
14M	SPFlow-copy/.git

~/projects
❯ du -sh SPFlow/.git
140M	SPFlow/.git
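Before settling on a threshold, it may help to preview how many blobs a given limit would catch. The sketch below counts blobs above a byte limit from the same `git cat-file --batch-check` output used earlier in this thread. `blobs_over` is a hypothetical helper name, and the counts are not guaranteed to match what BFG reports, since BFG scans the packfiles itself.

```shell
# Count blobs larger than a byte threshold and total their size.
# Reads "objecttype objectname objectsize path" lines on stdin.
# blobs_over is a hypothetical helper name.
blobs_over() {
  awk -v limit="$1" '
    $1 == "blob" && $3 > limit { n++; total += $3 }
    END { printf "%d blobs over %d bytes, %.0f bytes in total\n", n, limit, total }'
}

# Usage inside a clone (1048576 bytes assumes that BFG reads 1M as 1 MiB):
#   git rev-list --objects --all \
#     | git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' \
#     | blobs_over 1048576
```

Running it with a few different limits gives a quick comparison of how aggressive each option would be.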

Let's do this right before the v1.0.0 release, like #146.

@felixdivo
Member

Sounds reasonable. You could even consider lowering the limit to 500 kB or below, as long as we only (or mainly) delete .arff, .data, ... files.

@braun-steven
Member Author

Yes, I think we should additionally filter out any kind of dataset file.
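To see which file types an extension-based filter would need to cover, the historical blobs can be grouped by extension. This is a rough sketch: `size_by_extension` is a hypothetical helper name, it reads the same `git cat-file --batch-check` output as above, and it treats everything after the last dot in the file name as the extension.

```shell
# Group historical blob sizes by file extension, largest total first.
# Reads "objecttype objectname objectsize path" lines on stdin.
# Output columns: total bytes, blob count, extension (tab-separated).
# size_by_extension is a hypothetical helper name.
size_by_extension() {
  awk '
    $1 == "blob" {
      size = $3
      sub(/^[^ ]+ [^ ]+ [^ ]+ /, "")      # keep only the path (may contain spaces)
      name = $0
      sub(/^.*\//, "", name)              # strip directories
      ext = (name ~ /\.[^.]+$/) ? name : ""
      sub(/^.*\./, ".", ext)              # keep the last extension
      if (ext == "") ext = "(none)"
      total[ext] += size; count[ext]++
    }
    END { for (e in total) printf "%.0f\t%d\t%s\n", total[e], count[e], e }' \
  | sort -t "$(printf '\t')" -k1,1nr
}

# Usage inside a clone:
#   git rev-list --objects --all \
#     | git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' \
#     | size_by_extension | head
```

With the extensions known, BFG's `--delete-files` option accepts a file-name glob, for example `bfg --delete-files '*.{arff,data}' SPFlow-copy/`. That pattern is only an illustration and has not been tried on this repository. The glob matches on file name rather than path, and, as the log above shows, the contents of the protected HEAD commit are left untouched.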
