
Rewriting the history to remove all output files (due to excessive repository size that hit GitHub's limits) #46

cipriancraciun opened this issue Aug 5, 2021 · 7 comments


cipriancraciun (Owner) commented Aug 5, 2021

[Also for the attention of the following users who have forked my repository at various points: @amirunpri2018, @Dithn, @elektrotiko, @hmpandey, @jgoerzen, @rafaelsabino, @sbw78, @stillnotjoy.]


Update: at the moment all the original, intermediary and derived files (and plots) are available at (https://data.volution.ro/ciprian/f8ae5c63a7cccce956f5a634a79a293e/); see the readme in the project for details.


Unfortunately, early in April I hit GitHub's 100 GiB repository limit. This happened despite all my efforts to compress the files (with a Git-friendly, i.e. "synchronizable", tool such as gzip --rsyncable or zstd --rsyncable), and despite my hope that the remaining "text-only" files would compress nicely under Git's own delta-based packing algorithm.
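As a minimal illustration of the "synchronizable" compression mentioned above (assuming GNU gzip >= 1.7, which provides --rsyncable; the sample file is hypothetical):

```shell
set -e
# Sample data file standing in for one of the exported datasets.
printf 'date,cases\n2021-08-05,100\n' > daily.csv
# --rsyncable periodically resets the compressor state, so a small change
# in the input yields mostly-identical compressed output, which lets
# Git's delta-based packing (and rsync) work on the .gz files.
gzip --rsyncable --keep daily.csv    # writes daily.csv.gz, keeps daily.csv
```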

Thus, in order to fix the issue and start re-generating the datasets, I had to take the following measures:

  • I've rewritten the history to remove all output files (binary or text), with the sole exception of status.json, which contains only the latest values;
  • I've also removed the plots, which changed quite dramatically on each regeneration (and thus didn't pack nicely);
  • (none of these files will be added to this repository in the future;)
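A local sketch of the kind of rewrite described above; the tool and paths here are assumptions, not the exact commands used (git-filter-repo is the usual recommendation; plain git filter-branch is shown since it ships with Git):

```shell
set -e
# Build a tiny demo repository containing a source script and an output file.
git init -q -b master demo
mkdir -p demo/exports
echo 'output' > demo/exports/values.tsv
echo 'script' > demo/generate.sh
git -C demo -c user.name=demo -c user.email=demo@example.org add .
git -C demo -c user.name=demo -c user.email=demo@example.org \
    commit -q -m 'sources and outputs'
# Remove exports/ from every commit, keeping the rest of each tree intact:
FILTER_BRANCH_SQUELCH_WARNING=1 git -C demo filter-branch -f --index-filter \
    'git rm -r -q --cached --ignore-unmatch exports' --prune-empty -- --all
git -C demo ls-tree -r --name-only HEAD    # only generate.sh remains
```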

However, in the next couple of days I'll republish the output files outside of GitHub, and I'll link them in the readme.

Thus this repository will contain only:

  • the sources and scripts to process and augment the data;
  • the input files as found in the JHU / NY Times / ECDC repositories; (I've opted to keep these in case the original sources are changed or disappear; the output files can always be re-generated, but these input files can't be recreated once they disappear;)

Moreover, because there are a couple of forks of this repository that contain the old history, and because that still causes trouble with GitHub due to the excessive repository sizes, I would kindly ask those who have forked my repository to either remove their forks, or to reset their histories to the current master (which holds the cleaned history) and push that to their GitHub forks.

If anyone needs help resetting their fork, please comment on this issue, and I'll provide some snippets.
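For reference, a minimal sketch of such a reset, simulated here with purely local repositories ("upstream.git" plays the cleaned repository and "fork" the fork; on a real GitHub fork you would use the remote commands shown in the comments, where OWNER/REPOSITORY is a placeholder):

```shell
set -e
# Local stand-ins for the GitHub repositories.
git init -q --bare -b master upstream.git
git clone -q upstream.git seed
git -C seed symbolic-ref HEAD refs/heads/master
git -C seed -c user.name=demo -c user.email=demo@example.org \
    commit -q --allow-empty -m 'cleaned history'
git -C seed push -q origin master
git clone -q upstream.git fork
# On a real GitHub fork, the equivalent steps inside your clone would be:
#   git remote add upstream https://github.com/OWNER/REPOSITORY.git
#   git fetch upstream
#   git reset --hard upstream/master
#   git push --force origin master
git -C fork fetch -q origin
git -C fork reset -q --hard origin/master
```

Note that `git reset --hard` discards all local changes and history, and the final force-push overwrites the fork on GitHub, so anything unique to the fork should be backed up first.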

Thanks, and sorry for the trouble (both to GitHub and to the fellow users who have forked my repository)!

jgoerzen commented Aug 6, 2021

Have you considered publishing the output as GitHub releases instead of as files in the repo? That would presumably be a lot kinder to everything. It's what I'm doing with covid19db.

jgoerzen commented Aug 6, 2021

Is there a way I could still download the last pre-removal data? This would be helpful.

cipriancraciun (Owner) commented

@jgoerzen at the moment I don't think GitHub releases would actually work for a couple of reasons:

  • I used to have lots of outputs, from charts, to JSON files for each country, to the various all-in-one TSV, SQL, SQLite3 DB, JSON, etc.; uploading that many files as GitHub releases would, I think, be counterproductive; (as a rough figure, there are ~2K files and ~2.9K plots;)
  • the total size of all these outputs is around ~30 GiB, so bundling them into a single tar also seems counterproductive;
  • for my own workflow I'll need to use something external to GitHub, so also integrating GitHub releases (in an automatic fashion) would just add more development time;

That being said I am pondering a few options:

  • plain files served via HTTPS; (I could also provide a few index.txt files so that one can automate downloading with wget;)
  • a Git repository served via HTTPS, outside of GitHub;
  • an rsync repository accessible over an rsync:// endpoint;

There are many advantages / disadvantages to all these, so I'm open to discussions.
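A sketch of the first option's index.txt idea (the file name, layout, and contents here are assumptions, not something already published):

```shell
set -e
# Build a sample exports tree, then write an index.txt with one relative
# path per line; a client could then mirror the tree with something like:
#   wget --base=https://data.volution.ro/ciprian/f8ae5c63a7cccce956f5a634a79a293e/ \
#        --input-file=index.txt
mkdir -p exports/jhu exports/nytimes
echo 'daily' > exports/jhu/daily.tsv
echo 'counties' > exports/nytimes/counties.tsv
( cd exports && find . -type f ! -name index.txt | sed 's|^\./||' | sort > index.txt )
cat exports/index.txt
```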


Is there a way I could still download the last pre-removal data? This would be helpful

Yes, I still have the "original" pre-cleaning repository on my local workstation.

Do you need all the output files (including historical outputs), or do you only need the last regeneration (which I did last week)?

cipriancraciun (Owner) commented

@jgoerzen the files you require are available at (https://data.volution.ro/ciprian/f8ae5c63a7cccce956f5a634a79a293e/), under the exports folder, with a structure similar to what was previously on GitHub. Just replace the https://github.com/.../raw/master/ prefix with the URL above.
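For example, the prefix replacement can be scripted with sed; OWNER/REPOSITORY below stands in for the elided repository path, and the exports path is illustrative:

```shell
set -e
# An old GitHub raw URL (hypothetical path), rewritten to the new host.
old='https://github.com/OWNER/REPOSITORY/raw/master/exports/jhu/daily.tsv'
new=$(printf '%s\n' "$old" | sed \
    's|https://github.com/OWNER/REPOSITORY/raw/master/|https://data.volution.ro/ciprian/f8ae5c63a7cccce956f5a634a79a293e/|')
printf '%s\n' "$new"
```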

jgoerzen commented

Thank you! I fetched most of the files. However, the NY Times US counties file is missing, as is the JHU daily file. Thanks again.

cipriancraciun (Owner) commented

@jgoerzen I've just finished regenerating those datasets this morning (the US counties and the daily one), and they should be published at the link above.

However, as noted in our earlier discussions, generating those now takes about 24 hours (each), so I'll be generating them perhaps once or twice a month.
