
Rewriting the history to remove all output files (due to excessive repository size that hit GitHub's limits) #46

cipriancraciun opened this issue Aug 5, 2021 · 7 comments


cipriancraciun (Owner) commented Aug 5, 2021

[Also for the attention of the following users who have forked my repository at various points: @amirunpri2018, @Dithn, @elektrotiko, @hmpandey, @jgoerzen, @rafaelsabino, @sbw78, @stillnotjoy.]


Update: at the moment all the original, intermediary and derived files (and plots) are available at (https://data.volution.ro/ciprian/f8ae5c63a7cccce956f5a634a79a293e/); see the readme in the project for details.


Unfortunately, early in April I hit GitHub's 100 GiB repository limit. This happened despite all my efforts to compress the files (with a Git-friendly, i.e. "synchronizable", tool such as gzip --rsyncable or zstd --rsyncable), and despite my hope that the remaining "text-only" files would compress nicely under Git's own delta-based packing algorithm.
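As a minimal illustration of the "synchronizable" compression mentioned above (assuming GNU gzip >= 1.7, which provides --rsyncable; the sample file is hypothetical):

```shell
set -e
# Sample data file standing in for one of the exported datasets.
printf 'date,cases\n2021-08-05,100\n' > daily.csv
# --rsyncable periodically resets the compressor state, so a small change
# in the input yields mostly-identical compressed output, which lets
# Git's delta-based packing (and rsync) work on the .gz files.
gzip --rsyncable --keep daily.csv    # writes daily.csv.gz, keeps daily.csv
```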

Thus, in order to fix the issue and start re-generating the datasets, I had to take the following measures:

  • I've rewritten the history to remove all output files (binary or text), with the sole exception of status.json, which contains only the latest values;
  • I've also removed the plots, which changed quite dramatically on each regeneration (and thus didn't pack nicely);
  • (none of these files will be added to this repository in the future;)
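A local sketch of the kind of rewrite described above; the tool and paths here are assumptions, not the exact commands used (git-filter-repo is the usual recommendation; plain git filter-branch is shown since it ships with Git):

```shell
set -e
# Build a tiny demo repository containing a source script and an output file.
git init -q -b master demo
mkdir -p demo/exports
echo 'output' > demo/exports/values.tsv
echo 'script' > demo/generate.sh
git -C demo -c user.name=demo -c user.email=demo@example.org add .
git -C demo -c user.name=demo -c user.email=demo@example.org \
    commit -q -m 'sources and outputs'
# Remove exports/ from every commit, keeping the rest of each tree intact:
FILTER_BRANCH_SQUELCH_WARNING=1 git -C demo filter-branch -f --index-filter \
    'git rm -r -q --cached --ignore-unmatch exports' --prune-empty -- --all
git -C demo ls-tree -r --name-only HEAD    # only generate.sh remains
```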

However, in the next couple of days I'll republish the output files outside of GitHub, and I'll link them in the readme.

Thus this repository will contain only:

  • the sources and scripts to process and augment the data;
  • the input files as found in the JHU / NY Times / ECDC repositories; (I've opted to keep these in case the original sources are changed or disappear; the output files can always be re-generated, but these input files can't be recreated once they disappear;)

Moreover, because there are a couple of forks of this repository that contain the old history, and because that still causes trouble with GitHub due to the excessive repository sizes, I would kindly ask those who have forked my repository to either remove their forks, or to reset their histories to the current master (which holds the cleaned history) and push that to their GitHub forks.

If anyone needs help resetting their fork, please comment on this issue, and I'll provide some snippets.
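For reference, a minimal sketch of such a reset, simulated here with purely local repositories ("upstream.git" plays the cleaned repository and "fork" the fork; on a real GitHub fork you would use the remote commands shown in the comments, where OWNER/REPOSITORY is a placeholder):

```shell
set -e
# Local stand-ins for the GitHub repositories.
git init -q --bare -b master upstream.git
git clone -q upstream.git seed
git -C seed symbolic-ref HEAD refs/heads/master
git -C seed -c user.name=demo -c user.email=demo@example.org \
    commit -q --allow-empty -m 'cleaned history'
git -C seed push -q origin master
git clone -q upstream.git fork
# On a real GitHub fork, the equivalent steps inside your clone would be:
#   git remote add upstream https://github.com/OWNER/REPOSITORY.git
#   git fetch upstream
#   git reset --hard upstream/master
#   git push --force origin master
git -C fork fetch -q origin
git -C fork reset -q --hard origin/master
```

Note that `git reset --hard` discards all local changes and history, and the final force-push overwrites the fork on GitHub, so anything unique to the fork should be backed up first.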

Thanks, and sorry for the trouble (both to GitHub and to the fellow users who have forked my repository)!

jgoerzen commented Aug 6, 2021

Have you considered publishing the output as GitHub releases instead of as files in the repo? That would presumably be a lot kinder to everything. It's what I'm doing with covid19db.

jgoerzen commented Aug 6, 2021

Is there a way I could still download the last pre-removal data? This would be helpful.

cipriancraciun (Owner) commented

@jgoerzen at the moment I don't think GitHub releases would actually work for a couple of reasons:

  • I used to have lots of outputs, from charts, to JSON files for each country, to the various all-in-one TSV, SQL, SQLite3 DB, JSON, etc.; uploading that many files as GitHub releases would, I think, be counterproductive; (as a rough figure, there are ~2K files and ~2.9K plots;)
  • the total size of all these outputs is around ~30 GiB, so bundling them into a single tar also seems counterproductive;
  • for my own workflow I'll need to use something external to GitHub, so also integrating GitHub releases (in an automatic fashion) would just add more development time;

That being said I am pondering a few options:

  • plain files served via HTTPS; (I could also provide a few index.txt files so that one can automate downloading with wget;)
  • a Git repository served via HTTPS, outside of GitHub;
  • an rsync repository accessible over an rsync:// endpoint;

There are many advantages / disadvantages to all these, so I'm open to discussions.
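A sketch of the first option's index.txt idea (the file name, layout, and contents here are assumptions, not something already published):

```shell
set -e
# Build a sample exports tree, then write an index.txt with one relative
# path per line; a client could then mirror the tree with something like:
#   wget --base=https://data.volution.ro/ciprian/f8ae5c63a7cccce956f5a634a79a293e/ \
#        --input-file=index.txt
mkdir -p exports/jhu exports/nytimes
echo 'daily' > exports/jhu/daily.tsv
echo 'counties' > exports/nytimes/counties.tsv
( cd exports && find . -type f ! -name index.txt | sed 's|^\./||' | sort > index.txt )
cat exports/index.txt
```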


Is there a way I could still download the last pre-removal data? This would be helpful

Yes, I still have the "original" pre-cleaning repository on my local workstation.

Do you need all the output files (including historical outputs), or do you only need the last regeneration (which I did last week)?

cipriancraciun (Owner) commented

@jgoerzen the files you require are available at (https://data.volution.ro/ciprian/f8ae5c63a7cccce956f5a634a79a293e/), under the exports folder, with a structure similar to what was previously on GitHub. Just replace the https://github.com/.../raw/master/ prefix with the URL above.
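For example, the prefix replacement can be scripted with sed; OWNER/REPOSITORY below stands in for the elided repository path, and the exports path is illustrative:

```shell
set -e
# An old GitHub raw URL (hypothetical path), rewritten to the new host.
old='https://github.com/OWNER/REPOSITORY/raw/master/exports/jhu/daily.tsv'
new=$(printf '%s\n' "$old" | sed \
    's|https://github.com/OWNER/REPOSITORY/raw/master/|https://data.volution.ro/ciprian/f8ae5c63a7cccce956f5a634a79a293e/|')
printf '%s\n' "$new"
```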

jgoerzen commented

Thank you! I fetched most of the files. However, the NY Times US counties file is missing, as is the JHU daily file. Thanks again.

cipriancraciun (Owner) commented

@jgoerzen I've just finished regenerating those datasets this morning (the US counties and the daily one), and they should be published at the link above.

However, as noted in our earlier discussions, generating those now takes about 24 hours (each), so I'll be generating them perhaps once or twice a month.
