Rewriting the history to remove all output files (due to excessive repository size that hit GitHub's limits) #46
Comments
Have you considered publishing the output as GitHub releases instead of as files in the repo? That would presumably be a lot kinder to everything. It's what I'm doing with covid19db.
Is there a way I could still download the last pre-removal data? This would be helpful.
@jgoerzen at the moment I don't think GitHub releases would actually work for a couple of reasons:
That being said I am pondering a few options:
There are many advantages / disadvantages to all these, so I'm open to discussions.
Yes, I still have the "original" pre-cleaning repository on my local workstation. Do you need all the output files (including historical outputs), or do you only need the last regeneration (which I did last week)?
Thank you. Just the last regeneration will be fine. The files I was using are:
https://github.com/cipriancraciun/covid19-datasets/raw/5444d3e19eb2556a93e4d9ac4974762d9489fc1b/exports/combined/v1/locations-diff.tsv
Any of those options will be fine for me. rsync would probably be the most efficient for many.
@jgoerzen the files you require are available at (https://data.volution.ro/ciprian/f8ae5c63a7cccce956f5a634a79a293e/), under the |
Thank you! I fetched most of the files. However, the NY Times US counties file is missing, as is the JHU daily file. Thanks again.
@jgoerzen I finished regenerating those datasets (the US counties and daily) this morning, and they should be published at the link above. However, as noted in our earlier discussions, generating them now takes about 24 hours (each), so I'll be generating them perhaps once or twice a month.
[Also for the attention of the following users who have forked my repository at various points: @amirunpri2018, @Dithn, @elektrotiko, @hmpandey, @jgoerzen, @rafaelsabino, @sbw78, @stillnotjoy.]
Update: at the moment all the original, intermediary and derived files (and plots) are available at (https://data.volution.ro/ciprian/f8ae5c63a7cccce956f5a634a79a293e/); see the `readme` in the project for details.

Unfortunately, early in April I hit the GitHub 100 GiB repository limit. This happened despite all my efforts to compress the files (with a `git` friendly, i.e. "synchronizable", tool like `gzip --rsyncable` or `zstd --rsyncable`), and despite my hope that the rest of the "text-only" files would compress nicely with `git`'s own packing algorithm (based on deltas).

Thus, in order to fix the issue and start re-generating the datasets, I had to take the following measures:
`status.json` that contains only the latest values;

However, in the next couple of days I'll republish the output files outside of GitHub, and I'll link them in the `readme`.

Thus this repository will contain only:
Moreover, because there are a couple of forks of this repository that contain the old history, and because that still causes trouble with GitHub due to the excessive repository sizes, I would kindly ask those who have forked my repository to either remove their forks, or to reset their histories (and push to their GitHub fork) to the current `master` (which holds the cleaned history).

If anyone needs help with how to reset their forks, please comment on this issue, and I'll provide some snippets.
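
For reference, a fork reset along these lines might look as follows. This is only a sketch, not the maintainer's own snippet: the function name `reset_fork_to_upstream` is hypothetical, and it assumes that you run it inside a local clone of your fork, that the fork's remote is named `origin`, and that `upstream` is free to use as the name for the original repository. It discards any local commits on the branch and force-pushes over the fork, so keep a backup if you still need the old history.

```shell
# Sketch: make a fork's branch identical to the upstream branch.
# WARNING: discards local commits on the branch and force-pushes over the fork.
# Assumed remote names (not from the issue): "origin" = your fork,
# "upstream" = the original repository.
reset_fork_to_upstream() {
    upstream_url="$1"
    branch="${2:-master}"

    # Point the "upstream" remote at the original repository.
    if git remote get-url upstream >/dev/null 2>&1; then
        git remote set-url upstream "$upstream_url" || return 1
    else
        git remote add upstream "$upstream_url" || return 1
    fi

    git fetch upstream || return 1                    # download the cleaned history
    git checkout "$branch" || return 1                # switch to the branch to reset
    git reset --hard "upstream/$branch" || return 1   # move the branch to the cleaned history
    git push --force origin "$branch"                 # overwrite the branch in the fork
}

# Example (run inside a clone of your fork):
#   reset_fork_to_upstream https://github.com/cipriancraciun/covid19-datasets.git master
```

The old objects stay in the local clone until `git` garbage-collects them, so a fresh clone of the fork after the push is the simplest way to get a small working copy.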
Thanks, and sorry for the trouble (both to GitHub and the fellow users that have forked my repository)!