Files-to-artifacts database / API / mapping #54

jaimergp · 2024-02-22T14:38:13Z

Comes from Design and implement a database for the conda-forge graph and relevant metadata #5

Provide a way for users to find which package(s) provide a certain file (e.g. a header, or a library, or an executable), similar to what portals like pkgs.org do.

We do have the info in the database designed in https://github.com/Quansight-Labs/conda-forge-db, but we need to serve it somewhere, preferably serverless or close-to-zero maintenance (e.g. one-click deployment). This is tricky because populating the database from scratch has a non-negligible overhead.
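To make the intended lookup concrete, here is a minimal sketch of a file-to-artifacts query using Python's built-in `sqlite3` module. The table names, column names, and artifact names are hypothetical and chosen only for illustration; they are not taken from the conda-forge-db schema.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical three-table layout."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE artifacts (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
        CREATE TABLE paths (id INTEGER PRIMARY KEY, path TEXT UNIQUE NOT NULL);
        -- Many-to-many link; path_id comes first so lookups by path use the key.
        CREATE TABLE artifact_paths (
            path_id INTEGER NOT NULL REFERENCES paths(id),
            artifact_id INTEGER NOT NULL REFERENCES artifacts(id),
            PRIMARY KEY (path_id, artifact_id)
        ) WITHOUT ROWID;
        """
    )
    return conn


def add_artifact(conn: sqlite3.Connection, name: str, files: list[str]) -> None:
    """Register an artifact and link it to every path it ships."""
    conn.execute("INSERT OR IGNORE INTO artifacts (name) VALUES (?)", (name,))
    artifact_id = conn.execute(
        "SELECT id FROM artifacts WHERE name = ?", (name,)
    ).fetchone()[0]
    for path in files:
        conn.execute("INSERT OR IGNORE INTO paths (path) VALUES (?)", (path,))
        path_id = conn.execute(
            "SELECT id FROM paths WHERE path = ?", (path,)
        ).fetchone()[0]
        conn.execute(
            "INSERT OR IGNORE INTO artifact_paths (path_id, artifact_id) VALUES (?, ?)",
            (path_id, artifact_id),
        )


def find_artifacts(conn: sqlite3.Connection, path: str) -> list[str]:
    """Return the names of all artifacts that provide the given path, sorted."""
    rows = conn.execute(
        """
        SELECT a.name
        FROM paths p
        JOIN artifact_paths ap ON ap.path_id = p.id
        JOIN artifacts a ON a.id = ap.artifact_id
        WHERE p.path = ?
        ORDER BY a.name
        """,
        (path,),
    ).fetchall()
    return [row[0] for row in rows]


# Made-up artifact names, not real conda-forge data.
conn = build_demo_db()
add_artifact(conn, "example-zlib-1.0-0", ["include/zlib.h", "lib/libz.so"])
add_artifact(conn, "example-zlib-1.1-0", ["include/zlib.h", "lib/libz.so"])
add_artifact(conn, "example-png-1.0-0", ["include/png.h"])
print(find_artifacts(conn, "include/zlib.h"))
```

Storing each path and each artifact once and linking them by integer ID is one way to avoid repeating long strings across the relationship table; whether that is the right trade-off at conda-forge scale is an open question here.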

Tasks

Git-based files-to-artifacts database deployment prototype #58
Labels: area: data 🔢, area: devops 🏗, funding: czi, mission: infra 🛠, team: quansight-labs, type: task

Sqlite-based files-to-artifacts database deployment #59
Labels: area: data 🔢, area: devops 🏗, funding: czi, mission: infra 🛠, team: quansight-labs, type: task

jaimergp · 2024-03-05T10:31:38Z

We talked with Matt last week and we may be able to unblock this. The main issue is deployment and maintenance of infrastructure. We have several avenues to explore:

Free managed database services like Oracle Cloud's, coupled with a self-updating GH Releases page that serves the sqlite dumps (see examples in https://github.com/pypi-data/pypi-json-data)
@zklaus is looking into using the Git object database as the main driver, without actual files. We have allocated 4h to build a prototype and assess its performance and usability. Once it is ready, check whether https://github.com/github/git-sizer complains about the resulting repository.

jaimergp · 2024-03-05T17:24:52Z

@zklaus shared some progress about the git-db prototype in today's mgmt call. Can you add some summary here? 🙏

Also some numbers to give an idea of the scale we are dealing with:

1,602,023 artifacts
18,390,176 unique paths
618,908,726 path-to-artifact relationships
A naive dump into a JSON-path-to-data table in SQLite takes 61GB uncompressed, or 2.1GB with zstd compression.
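For a rough sense of what these figures imply, the short sketch below derives a few averages from them. The constants are copied from the list above; the function and key names are only illustrative.

```python
# Figures quoted in the comment above.
ARTIFACTS = 1_602_023
UNIQUE_PATHS = 18_390_176
RELATIONSHIPS = 618_908_726
UNCOMPRESSED_GB = 61
COMPRESSED_GB = 2.1


def scale_summary() -> dict[str, float]:
    """Derive average fan-out and compression ratio from the raw counts."""
    return {
        # Average number of files shipped by one artifact.
        "paths_per_artifact": RELATIONSHIPS / ARTIFACTS,
        # Average number of artifacts that contain a given unique path.
        "artifacts_per_path": RELATIONSHIPS / UNIQUE_PATHS,
        # How much smaller the compressed dump is than the raw one.
        "compression_ratio": UNCOMPRESSED_GB / COMPRESSED_GB,
    }


for key, value in scale_summary().items():
    print(f"{key}: {value:.1f}")
```

On average, an artifact ships roughly 386 paths, each unique path appears in about 34 artifacts, and compression shrinks the dump by a factor of about 29. The high reuse of paths across artifacts is consistent with the dump compressing so well.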

zklaus · 2024-03-06T13:35:00Z

The main idea is to store the mapping in a bare git repository. By using libgit2 via its Python binding pygit2, we avoid the need to create a huge tree on the filesystem. I have created a prototype at https://github.com/zklaus/cfgraphman which is able to add individual artifacts from their JSON info to the Git odb. It remains to be seen how this scales; I will investigate that over this week and next, at most.
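To illustrate why a Git object database can hold this data without a working tree, here is a standard-library sketch of Git-style content addressing. It does not use pygit2 and is not how cfgraphman is implemented; the `InMemoryObjectStore` class and the idea of storing a file list as a single blob are assumptions made purely for illustration. It also assumes Git's default SHA-1 object format.

```python
import hashlib
import zlib


def _encode(kind: str, payload: bytes) -> bytes:
    """Prefix the payload with Git's object header: '<kind> <size>\\0'."""
    return f"{kind} {len(payload)}".encode() + b"\x00" + payload


def git_object_id(kind: str, payload: bytes) -> str:
    """Compute the object ID Git would assign (SHA-1 object format)."""
    return hashlib.sha1(_encode(kind, payload)).hexdigest()


class InMemoryObjectStore:
    """Toy content-addressed store; a stand-in for a real Git odb."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    def write_blob(self, payload: bytes) -> str:
        """Store a blob, zlib-compressed, keyed by its object ID."""
        oid = git_object_id("blob", payload)
        # Identical content maps to the same ID, so it is stored only once.
        self._objects.setdefault(oid, zlib.compress(_encode("blob", payload)))
        return oid

    def read_blob(self, oid: str) -> bytes:
        """Return the original payload for a stored object ID."""
        raw = zlib.decompress(self._objects[oid])
        _header, _sep, payload = raw.partition(b"\x00")
        return payload

    def __len__(self) -> int:
        return len(self._objects)


# Hypothetical file list shared by two artifacts.
store = InMemoryObjectStore()
file_list = b"include/zlib.h\nlib/libz.so\n"
first = store.write_blob(file_list)
second = store.write_blob(file_list)
print(first == second, len(store))
```

Because objects are keyed by a hash of their content, writing the same bytes twice yields the same ID and a single stored object. That deduplication is one reason the approach looks attractive when many artifacts share the same paths.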

jaimergp added type: task team: quansight-labs mission: infra 🛠 area: data 🔢 funding: czi labels Feb 22, 2024

jaimergp added this to the 18 months milestone Feb 22, 2024

This was referenced Feb 22, 2024

Design and implement a database for the conda-forge graph and relevant metadata #5

Closed

Enhancements to feedstocks' verification and validation workflows #8

Open

jaimergp mentioned this issue Feb 22, 2024

Serve Python import maps outside libcfgraph #53

Closed

jaimergp modified the milestones: 18 months, 24 months Apr 30, 2024
