Move data out of git repository #301

Open
Tracked by #304
ivan-aksamentov opened this issue Apr 29, 2022 · 1 comment
Labels
enhancement New feature or request

Comments

@ivan-aksamentov
Member

ivan-aksamentov commented Apr 29, 2022

Git repo has grown beyond any reasonable size due to large amounts of data committed into it over the months:

$ git gc
$ du -bsch .git
1.4G	.git

The worst offenders are

  • /public/acknowledgements/ (previously /acknowledgements/)
  • /public/proteins/
  • /cluster_tables/
  • /web/data/
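
One way to check which paths dominate the history is to sum blob sizes per top-level directory. The sketch below is only an illustration, not how the list above was obtained. It assumes input lines of the form `<type> <sha> <size> <path>`, which can be produced by piping `git rev-list --objects --all` into `git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)'`. The function name `sizes_by_top_dir` is hypothetical.

```python
from collections import Counter


def sizes_by_top_dir(lines):
    """Sum uncompressed blob sizes per top-level directory.

    Each line is expected to look like "<type> <sha> <size> <path>".
    Non-blob objects (commits, trees, tags) are skipped.
    Files in the repository root are grouped under ".".
    """
    totals = Counter()
    for line in lines:
        # maxsplit=3 keeps paths that contain spaces in one piece
        parts = line.rstrip("\n").split(" ", 3)
        if len(parts) < 4 or parts[0] != "blob":
            continue
        _, _, size, path = parts
        top = path.split("/", 1)[0] if "/" in path else "."
        totals[top] += int(size)
    return totals
```

Feeding the git output into this function (for example via `sys.stdin`) and printing `totals.most_common(10)` would give a rough ranking. The sizes are uncompressed, so they overstate what the packed `.git` directory actually holds.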

As a result:

  • the repo has become very slow - even simple operations, like switching branches or commits, can take a few seconds
  • the repo size is now beyond GitHub's recommended limit of 1 GB

This is not an intended use of git; it was not designed to store large volumes of data. At this rate, continuing this way is unsustainable.

I propose to move data away from the git repo, and to only store the code there.

Data, both the final web data, and intermediate data, can be uploaded to another service, e.g. AWS S3, and/or GitHub Releases, and then fetched from there.
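
As a rough illustration of the "fetched from there" step, here is a minimal sketch that downloads a data file over HTTP(S), caches it locally, and optionally verifies a checksum. It uses only the Python standard library. The function names, the caching behaviour and the use of SHA-256 are assumptions made for this example, not an agreed design.

```python
import hashlib
import shutil
import urllib.request
from pathlib import Path


def sha256_of(path):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fetch_data(url, dest, expected_sha256=None):
    """Download `url` to `dest` unless a valid copy is already there.

    If `expected_sha256` is given, an existing file is reused only when
    its hash matches, and a fresh download is verified before it is kept.
    """
    dest = Path(dest)
    if dest.exists() and (expected_sha256 is None or sha256_of(dest) == expected_sha256):
        return dest  # cached copy is usable, skip the download

    dest.parent.mkdir(parents=True, exist_ok=True)
    tmp = dest.with_name(dest.name + ".part")
    with urllib.request.urlopen(url) as response, open(tmp, "wb") as out:
        shutil.copyfileobj(response, out)

    if expected_sha256 is not None and sha256_of(tmp) != expected_sha256:
        tmp.unlink()
        raise ValueError(f"checksum mismatch for {url}")

    tmp.replace(dest)  # move into place only after verification
    return dest
```

Downloading to a temporary `.part` file and renaming it afterwards means an interrupted download never leaves a half-written file at the final path.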

Some of the disadvantages and difficulties of this approach:

  • Data on AWS will not be versioned, so old data cannot be accessed. However, CoVariants is built such that new data is typically a strict superset of old data, so versioning may not be needed for access.
  • Rollbacks, in case of mistakes or breakage, might not be possible, or might be more difficult (e.g. they would depend on the S3 bucket versioning feature).
  • The web app needs to fetch the web data dynamically, instead of bundling it
  • "Forking" data is impossible or difficult. With a git-based flow you could just create a branch and a PR to test the new data. If data is not in git, then the new data needs to be hosted somewhere else.
  • More work for scientists - they'd need to worry about AWS credentials and data uploading, compared to just git commit.
  • More work for engineers - a separate stateful service, and the associated synchronization logic, needs to be maintained
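
The versioning, rollback and "forking" concerns above might be softened by keeping a small manifest in git while the data itself lives elsewhere. This is a minimal sketch under an assumption the issue does not make: that files are uploaded under content-hash keys. The function names and the `sha256/<digest>` key layout are hypothetical. Each data file is mapped to a key derived from its contents, so the manifest in any old commit or branch still points at the exact bytes it was built with.

```python
import hashlib
import json
from pathlib import Path


def build_manifest(data_dir):
    """Map each file under `data_dir` to a content-addressed storage key."""
    data_dir = Path(data_dir)
    manifest = {}
    for path in sorted(p for p in data_dir.rglob("*") if p.is_file()):
        # Reads the whole file into memory; fine for a sketch,
        # chunked hashing would be preferable for very large files.
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        rel = path.relative_to(data_dir).as_posix()
        # Hypothetical key layout: identical content always maps to the
        # same key, so unchanged files need not be uploaded again.
        manifest[rel] = f"sha256/{digest}"
    return manifest


def write_manifest(manifest, out_path):
    """Write the manifest as stable, diff-friendly JSON to commit into git."""
    Path(out_path).write_text(json.dumps(manifest, indent=2, sort_keys=True) + "\n")
```

With such a scheme, a data change in a PR would show up as a small diff of the manifest, and a rollback would be a `git revert` of that diff, provided old objects are never deleted from storage.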

We need to figure out an optimal workflow, such that the scientific activities are not disrupted, and that the correctness is fully preserved. Let's discuss this internally on Slack.

These measures will slow down the growth, but they will not make the git repo smaller. So, additionally, we may consider forcefully pruning the old data from the git history or, as a radical measure, starting over with a new git repo. Either would improve the dev experience.

@ivan-aksamentov ivan-aksamentov added the enhancement New feature or request label Apr 29, 2022
@ivan-aksamentov
Member Author

ivan-aksamentov commented Apr 29, 2022

An important point is that this, along with Split web data into chunks #303, will break any external usage of the web data (i.e. people on the internet using our JSON files)

Although we never supported this use case and don't know most of the downstream users, CoVariants has become an important source of information related to public health, so we need to make this transition graceful by:

  • potentially preserving data separately in the old format (if #303 lands), so that downstream users don't need to make large code changes in order to process it correctly
  • leaving a markdown file in the old locations to explain to downstream users how to transition to the new data source. After they receive a 404 error trying to fetch the data we moved, they will come to the repo and see that the JSON file they need is gone, but that there is a text file telling them how to adjust.
  • making sure this transition is not too technically involved, e.g. if a simple URL swap is possible, then that should be it
  • notifying the downstream users we are aware of, preferably before the breaking change lands on the master branch
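
To illustrate what "a simple URL swap" could mean for a downstream user, here is a minimal sketch of a prefix rewrite. Both base URLs are placeholders invented for this example, since the new data location has not been decided, and `migrate_url` is a hypothetical helper.

```python
# Both base URLs are placeholders for illustration, not real endpoints.
OLD_BASE = "https://old-host.example/web/data/"
NEW_BASE = "https://new-host.example/"


def migrate_url(url, old_base=OLD_BASE, new_base=NEW_BASE):
    """Rewrite an old data URL to the new host, leaving other URLs untouched."""
    if url.startswith(old_base):
        return new_base + url[len(old_base):]
    return url
```

This only works if file names and relative paths stay the same after the move, which is one reason to preserve the data in the old format.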
