Files-to-artifacts database / API / mapping #54

Open
2 tasks
jaimergp opened this issue Feb 22, 2024 · 3 comments

Comments

@jaimergp
Contributor

jaimergp commented Feb 22, 2024


Provide a way for users to find which package(s) provide a certain file (e.g. a header, or a library, or an executable), similar to what portals like pkgs.org do.

We do have the info in the database designed in https://github.com/Quansight-Labs/conda-forge-db, but we need to serve it somewhere, preferably serverless or close-to-zero maintenance (e.g. one-click deployment). This is tricky because populating the database from scratch has a non-negligible overhead.
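To make the lookup concrete, here is a minimal sketch of a file-to-artifact query using Python's built-in `sqlite3`. The table and column names, the artifact names, and the sample paths are all hypothetical; this is not the actual conda-forge-db schema, just an illustration of the kind of mapping being requested.

```python
import sqlite3

# Hypothetical three-table layout: artifacts, unique paths, and a join table.
SCHEMA = """
CREATE TABLE artifacts (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
CREATE TABLE paths (id INTEGER PRIMARY KEY, path TEXT UNIQUE NOT NULL);
CREATE TABLE path_artifacts (
    path_id INTEGER NOT NULL REFERENCES paths(id),
    artifact_id INTEGER NOT NULL REFERENCES artifacts(id),
    PRIMARY KEY (path_id, artifact_id)
) WITHOUT ROWID;
"""


def add_artifact(conn, name, paths):
    """Register an artifact and every path it ships."""
    conn.execute("INSERT OR IGNORE INTO artifacts (name) VALUES (?)", (name,))
    (artifact_id,) = conn.execute(
        "SELECT id FROM artifacts WHERE name = ?", (name,)
    ).fetchone()
    for path in paths:
        conn.execute("INSERT OR IGNORE INTO paths (path) VALUES (?)", (path,))
        (path_id,) = conn.execute(
            "SELECT id FROM paths WHERE path = ?", (path,)
        ).fetchone()
        conn.execute(
            "INSERT OR IGNORE INTO path_artifacts VALUES (?, ?)",
            (path_id, artifact_id),
        )


def find_artifacts(conn, path):
    """Return the names of all artifacts that provide `path`, sorted."""
    rows = conn.execute(
        """
        SELECT a.name
        FROM paths p
        JOIN path_artifacts pa ON pa.path_id = p.id
        JOIN artifacts a ON a.id = pa.artifact_id
        WHERE p.path = ?
        ORDER BY a.name
        """,
        (path,),
    )
    return [name for (name,) in rows]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
# Made-up artifacts and paths, for illustration only.
add_artifact(conn, "pkg-a-1.0-0", ["include/a.h", "lib/liba.so"])
add_artifact(conn, "pkg-a-1.1-0", ["include/a.h", "lib/liba.so", "bin/a-tool"])
add_artifact(conn, "pkg-b-2.0-0", ["bin/b-tool"])

print(find_artifacts(conn, "include/a.h"))
```

Storing each path once and linking it to artifacts through integer IDs keeps the large relationship table narrow, which matters when the same path is shipped by many builds of a package.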

Tasks

  1. area: data 🔢 area: devops 🏗 funding: czi mission: infra 🛠 team: quansight-labs type: task
    zklaus
  2. area: data 🔢 area: devops 🏗 funding: czi mission: infra 🛠 team: quansight-labs type: task
    jaimergp
@jaimergp
Contributor Author

jaimergp commented Mar 5, 2024

We talked with Matt last week and we may be able to unblock this. The main issue is deployment and maintenance of infrastructure. We have several avenues to explore:

  • Free managed database services like Oracle Cloud's, coupled with a self-updating GH Releases page that serves the sqlite dumps (see examples in https://github.com/pypi-data/pypi-json-data)
  • @zklaus is looking into using the Git object database as the main driver, without actual files. We have allocated 4h to explore a prototype in terms of performance / usability. Once the prototype is ready, check whether https://github.com/github/git-sizer reports any problems with the resulting repository.

@jaimergp
Contributor Author

jaimergp commented Mar 5, 2024

@zklaus shared some progress about the git-db prototype in today's mgmt call. Can you add some summary here? 🙏

Also some numbers to give an idea of the scale we are dealing with:

  • 1,602,023 artifacts
  • 18,390,176 unique paths
  • 618,908,726 path-to-artifact relationships
  • A naive dump into a JSON path-to-data table in SQLite takes 61 GB uncompressed, down to 2.1 GB with zstd compression.
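A quick back-of-the-envelope calculation on the numbers above helps put the scale in perspective. The 8-bytes-per-relationship figure is an assumption (two 4-byte integer IDs per pair, ignoring index and page overhead), not a measurement of any existing database.

```python
# Figures quoted in this comment.
ARTIFACTS = 1_602_023
UNIQUE_PATHS = 18_390_176
RELATIONSHIPS = 618_908_726

# Average fan-out in each direction of the mapping.
paths_per_artifact = RELATIONSHIPS / ARTIFACTS
artifacts_per_path = RELATIONSHIPS / UNIQUE_PATHS

# Assumption: each relationship is stored as two 4-byte integer IDs.
# Both ID spaces fit in 32 bits, since each count is well below 2**32.
BYTES_PER_PAIR = 8
raw_pairs_gb = RELATIONSHIPS * BYTES_PER_PAIR / 1e9

# Ratio between the uncompressed (61 GB) and compressed (2.1 GB) dumps.
compression_ratio = 61 / 2.1

print(f"paths per artifact (mean): {paths_per_artifact:.1f}")
print(f"artifacts per path (mean): {artifacts_per_path:.1f}")
print(f"raw ID pairs, no overhead: {raw_pairs_gb:.2f} GB")
print(f"compression ratio:         {compression_ratio:.1f}x")
```

Under that assumption, the bare relationship pairs would occupy roughly 5 GB before any indexing, so most of the 61 GB in the naive dump is likely repeated path strings and JSON overhead rather than the mapping itself.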

@zklaus

zklaus commented Mar 6, 2024

The main idea is to store the mapping in a bare Git repository. By using libgit2 via its Python binding, pygit2, we avoid the need to create a huge tree on the filesystem. I have created a prototype at https://github.com/zklaus/cfgraphman which can add individual artifacts from their JSON info to the Git odb. It remains to be seen how this scales; I will investigate that over this week and next, at most.

Projects
Status: 🏗 In progress