
Investigate more scalable ways of pulling data for the site #378

Open

17cupsofcoffee opened this issue Dec 8, 2020 · 2 comments

Comments

@17cupsofcoffee (Collaborator) commented Dec 8, 2020

Currently, all of the GitHub and Crates.io data used on the site is retrieved via a clever template macro. This is simple and keeps the build self-contained, but it has a few big issues:

  • As the size of the site grows, we may hit a point where the build will trigger rate limits due to the number of requests. I've had this happen to me locally when developing the templates, and it's a pain.
  • It makes the site increasingly slow to build, since every build has to refetch the data.
  • It's hard to maintain/add to, due to the limited logic you can express in a templating engine.

I'm wondering if we might be able to find a better way of grabbing this data (e.g. via an external script or a Rust program). This could also allow us to store the site's data in a nicer format, rather than these massive manually ordered data.toml files.

If we did this, there are more efficient options we could use for pulling the API data:

  • GitHub has a GraphQL API, and I believe that might allow us to query multiple repos in a single request.
  • Crates.io's index can be accessed via Git, avoiding the API altogether (see the sketch at the end of this comment).

This might be overengineering things, but it's worth thinking about, I think!
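
For reference, the index is just a Git repository of newline-delimited JSON, one file per crate. A minimal sketch of reading one crate's metadata (assuming the standard index layout, where crates with names of four or more characters live under {first two letters}/{next two letters}/{name}):

# Shallow-clone the index and inspect one crate's file.
git clone --depth 1 https://github.com/rust-lang/crates.io-index
# Each line is a JSON object describing one published version.
tail -n 1 crates.io-index/te/tr/tetra | jq '{name, vers, yanked}'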

@nickelc (Contributor) commented Jan 18, 2021

GitHub has a GraphQL API, and I believe that might allow us to query multiple repos in a single request.

I played around with the GraphQL Explorer.

{
  r1: repository(owner: "ChariotEngine", name: "Chariot") {
    ...repoFields
  }
  r2: repository(owner: "duysqubix", name: "MuOxi") {
    ...repoFields
  }
}

fragment repoFields on Repository {
  url
  homepageUrl
  description
}
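
For reference, a query like this can also be run outside the explorer with a plain POST to the GraphQL endpoint. A minimal sketch, assuming the query above is saved as query.graphql and a personal access token is available in GITHUB_TOKEN:

# jq -Rs slurps the raw query text and wraps it in a {"query": "..."} JSON body.
jq -Rs '{query: .}' query.graphql | \
    curl -s -H "Authorization: bearer $GITHUB_TOKEN" \
         -H "Content-Type: application/json" \
         --data @- https://api.github.com/graphql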

Crates.io's index can be accessed via Git, avoiding the API altogether.

AFAIK the index contains only the name, version, deps, features, and yanked fields.
But https://crates.io/data-access mentions a database dump that is updated every 24 hours.
The tarball contains a crates.csv that could be processed to get the description, repository_url, homepage, etc.
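
For reference, a rough sketch of pulling crates.csv out of the dump (the dated directory inside the tarball and GNU tar's --wildcards flag are assumptions on my part):

# Download the daily database dump and extract only crates.csv.
curl -sL https://static.crates.io/db-dump.tar.gz -o db-dump.tar.gz
# The tarball unpacks into a dated directory, e.g. 2021-01-18-020000/data/crates.csv.
tar xzf db-dump.tar.gz --wildcards '*/data/crates.csv' --strip-components=2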

@nickelc (Contributor) commented Jan 25, 2021

I wrote a script to combine the data from crates.io and GitHub's GraphQL API into a single CSV file.

Convert content/ecosystem/data.toml to data.csv

The categories are joined into a single string, with : as the separator.

toml get content/ecosystem/data.toml items | \
    jq -r 'map(. + { categories: .categories | join(":")}) | (map(keys) | add | unique) as $cols | map(. as $row | $cols | map($row[.])) as $rows | $cols, $rows[] | @csv' | \
    xsv select name,source,categories,homepage_url,gitter_url > data.csv

Tools: toml-cli, jq, xsv
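
If that works as intended, data.csv should start with a header row along these lines (the data row here is hypothetical):

name,source,categories,homepage_url,gitter_url
ChariotEngine/Chariot,github,engine:games,,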

Generate the final result.csv

The crates.csv from db-dump.tar.gz could be cached with the actions/cache action.

result.csv.txt

#!/usr/bin/env bash

# Get the owner/repo names of all GitHub items and save them to names.csv for later.
repos=$(xsv search -s source github data.csv | xsv select name | tee names.csv | tail -n +2)

# Build a graphql query for all github repos
i=0
echo "{" > github.query
for r in $repos; do
    owner=$(echo "$r" | cut -d "/" -f1)
    repo=$(echo "$r" | cut -d "/" -f2)
    cat <<QUERY >> github.query
  r$i: repository(owner: "${owner}", name: "${repo}") {
    ...repoFields
  }
QUERY
    i=$((i + 1))
done
cat <<TAIL >> github.query
}

fragment repoFields on Repository {
  description
  repository: url
  homepage: homepageUrl
}
TAIL

# Execute the GraphQL query, transform the result to CSV, and pair it up with the names.csv from before.
gh api graphql -f query="$(cat github.query)" | \
    jq -r '[.data[]] |
        (map(keys) | add | unique) as $cols |
        map(. as $row | $cols | map($row[.])) as $rows |
        $cols, $rows[] | @csv' | xsv cat columns names.csv - > github.csv

# Join the GitHub data, dropping the duplicate name column.
xsv join name data.csv name github.csv | xsv select '!name[1]' > joined-github.csv

# Select the needed columns from db-dump.tar.gz's crates.csv.
# The column order must match github.csv's (name,description,homepage,repository),
# since xsv cat rows below stacks the two files positionally.
xsv select name,description,homepage,repository crates.csv > partial-crates.csv

# Join the crates.io data
xsv join name data.csv name partial-crates.csv | xsv select '!name[1]' > joined-crates.csv

# Concat rows and sort by name
xsv cat rows joined-crates.csv joined-github.csv | xsv sort -s name > result.csv

Tools: github-cli, jq, xsv
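
If the joins behave as described, result.csv should end up with the data.csv columns followed by the fetched ones, i.e. a header roughly like:

name,source,categories,homepage_url,gitter_url,description,homepage,repository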
