bioresources should be standard text files, version controlled, directly readable, etc. #743

Open
kwalcock opened this issue Mar 31, 2021 · 5 comments

Comments

@kwalcock
Member

There are probably good reasons that the bioresources are stored in gzip files, but maybe it's time to revisit them. It is incredibly difficult (for people spoiled by large hard drives and fast network connections, etc.) to do very useful things with them, like observing how they have changed over time or even just reading them. Only one of the files, uniprot-proteins.tsv, expands to a size larger than the 100 MB per-file limit that GitHub imposes. Although there are probably other repercussions, it's just a text file and could easily be split into two parts. If need be, there are ways to create gzip files for deployment during the packaging process. The files we have in kb/ner aren't very large, though, so it seems like compressing them shouldn't be necessary. It would be so great if they were just there like all the other files.
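For illustration, here is a minimal sketch of how a large text file could be split at line boundaries so that each part stays under the per-file limit. The function name `split_by_lines` and the `.partN` naming scheme are assumptions made for this example, not anything that exists in the repository.

```python
from pathlib import Path
from typing import List

GITHUB_FILE_LIMIT = 100 * 1024 * 1024  # the 100 MB per-file limit mentioned above


def split_by_lines(source: Path, max_bytes: int = GITHUB_FILE_LIMIT) -> List[Path]:
    """Split `source` into numbered parts, breaking only at line ends.

    Each part holds at most `max_bytes`, unless a single line is itself
    longer than `max_bytes`, in which case that line gets a part of its own.
    The `.partN` naming is hypothetical.
    """
    parts: List[Path] = []
    out = None
    written = 0
    with source.open("rb") as src:
        for line in src:
            # Start a new part when the next line would push this one over the limit.
            if out is None or (written > 0 and written + len(line) > max_bytes):
                if out is not None:
                    out.close()
                part = source.with_name(
                    f"{source.stem}.part{len(parts) + 1}{source.suffix}"
                )
                parts.append(part)
                out = part.open("wb")
                written = 0
            out.write(line)
            written += len(line)
    if out is not None:
        out.close()
    return parts
```

Because the split happens only between lines, concatenating the parts in order reproduces the original file byte for byte.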

@bgyori
Contributor

bgyori commented Mar 31, 2021

I've been dreaming about this for a long time! Looking at the files with vi has worked for me without decompression but comparing versions for diffs is really a pain. I think the only issues are with the file size limit and the fact that at the level of interacting with the repo itself, things would get more bulky and a bit slower (if there are large diffs being carried around in the git history).

@MihaiSurdeanu
Contributor

This was my call at the time because of file size limits in GitHub. If we can uncompress the files and still push them, I am all in favor!

@enoriega
Member

I did a quick test: the repo's size is 778 MB with gzipped files and 992 MB without compression. This is barely under the 1 GB limit of the free tier. However, I am not sure whether pushing unzipped files will accumulate sizes due to versioning or whether the quota is computed from the size of HEAD.

I will fork it and test it on my personal account.
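A comparison like this could be scripted roughly as follows. This is only a sketch: the helper names `gzip_sizes` and `over_limit` are made up for the example, and it measures individual `.gz` files under a directory rather than the size of the git repository itself.

```python
import gzip
from pathlib import Path
from typing import Dict, List, Tuple


def gzip_sizes(root: Path) -> Dict[Path, Tuple[int, int]]:
    """Map each .gz file under `root` to (compressed bytes, expanded bytes)."""
    sizes: Dict[Path, Tuple[int, int]] = {}
    for path in sorted(root.rglob("*.gz")):
        expanded = 0
        with gzip.open(path, "rb") as handle:
            # Stream in 1 MiB chunks so large files are never held in memory.
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                expanded += len(chunk)
        sizes[path] = (path.stat().st_size, expanded)
    return sizes


def over_limit(
    sizes: Dict[Path, Tuple[int, int]], limit: int = 100 * 1024 * 1024
) -> List[Path]:
    """Return the files whose expanded size exceeds `limit` bytes."""
    return [path for path, (_, expanded) in sizes.items() if expanded > limit]
```

Summing the two columns gives the compressed and uncompressed totals, and `over_limit` lists the files that would need to be split before pushing.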

Another option is to use GitHub's Large File Storage (LFS) and pay $5 a month. That gives us 50GB storage for the repo.

@kwalcock
Member Author

Ah, I wasn't aware of a per repo limit. Are you sure? https://docs.github.com/en/github/managing-large-files/what-is-my-disk-quota and https://stackoverflow.com/questions/38768454/repository-size-limits-for-github-com mention other numbers. When I've checked the LFS possibility before, there was a troublesome data transfer limit to worry about. Even if there is enough space, moving the data back and forth might still be a problem.

@enoriega
Member

You're right, @kwalcock. It is a recommended size. I ran the test in my personal fork and uncompressed all the gz files. Only one of them, uniprot-proteins.tsv, exceeds the 100 MB hard limit on per-file size. However, it can be split into multiple files and pushed that way. It worked well.

Of course, this will require some refactoring of bioresources to account for the split, which shouldn't be too complicated ...
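As a rough idea of what accounting for the split might look like on the reading side, here is a sketch that finds the parts of a split file and streams their lines as if they were one file. The `.partN` naming and the function names `find_parts` and `read_lines` are hypothetical, and the actual bioresources loading code may need something different.

```python
import re
from pathlib import Path
from typing import Iterable, Iterator, List


def find_parts(directory: Path, stem: str, suffix: str = ".tsv") -> List[Path]:
    """Return files named `<stem>.part<N><suffix>` in `directory`, ordered by N."""
    pattern = re.compile(re.escape(stem) + r"\.part(\d+)" + re.escape(suffix) + r"$")
    found = []
    for path in directory.iterdir():
        match = pattern.match(path.name)
        if match:
            # Sort numerically so that part10 comes after part2.
            found.append((int(match.group(1)), path))
    return [path for _, path in sorted(found)]


def read_lines(parts: Iterable[Path]) -> Iterator[str]:
    """Yield lines from each part in turn, as if reading one file."""
    for part in parts:
        with part.open("r", encoding="utf-8") as handle:
            for line in handle:
                yield line.rstrip("\n")
```

With helpers like these, the rest of the code would only need to swap a single file open for `read_lines(find_parts(...))`.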

4 participants