
Investigate more scalable ways of pulling data for the site #378

Open

17cupsofcoffee opened this issue Dec 8, 2020 · 2 comments

Comments

@17cupsofcoffee (Collaborator) commented Dec 8, 2020

Currently, all of the GitHub and Crates.io data used on the site is retrieved via a clever template macro. This is simple and keeps the build self-contained, but it has a few big issues:

  • As the size of the site grows, we may hit a point where the build will trigger rate limits due to the number of requests. I've had this happen to me locally when developing the templates, and it's a pain.
  • It makes the site increasingly slow to build, since every build has to refetch the data.
  • It's hard to maintain/add to, due to the limited logic you can express in a templating engine.

I'm wondering if we might be able to find a better way of grabbing this data (e.g. via an external script or a Rust program). This could also allow us to store the site's data in a nicer format, rather than these massive manually ordered data.toml files.

If we did this, there are more efficient options we could use for pulling the API data:

  • GitHub has a GraphQL API, and I believe that might allow us to query multiple repos in a single request.
  • Crates.io's index can be accessed via Git, avoiding the API altogether (see the sketch at the end of this comment).

This might be overengineering things, but it's worth thinking about, I think!
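
For reference, the index is just a Git repository of newline-delimited JSON, one file per crate. A minimal sketch of reading one crate's metadata (assuming the standard index layout, where crates with names of four or more characters live under {first two letters}/{next two letters}/{name}):

# Shallow-clone the index and inspect one crate's file.
git clone --depth 1 https://github.com/rust-lang/crates.io-index
# Each line is a JSON object describing one published version.
tail -n 1 crates.io-index/te/tr/tetra | jq '{name, vers, yanked}'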

@nickelc (Contributor) commented Jan 18, 2021

GitHub has a GraphQL API, and I believe that might allow us to query multiple repos in a single request.

I played around with the GraphQL Explorer.

{
  r1: repository(owner: "ChariotEngine", name: "Chariot") {
    ...repoFields
  }
  r2: repository(owner: "duysqubix", name: "MuOxi") {
    ...repoFields
  }
}

fragment repoFields on Repository {
  url
  homepageUrl
  description
}
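
For reference, a query like this can also be run outside the explorer with a plain POST to the GraphQL endpoint. A minimal sketch, assuming the query above is saved as query.graphql and a personal access token is available in GITHUB_TOKEN:

# jq -Rs slurps the raw query text and wraps it in a {"query": "..."} JSON body.
jq -Rs '{query: .}' query.graphql | \
    curl -s -H "Authorization: bearer $GITHUB_TOKEN" \
         -H "Content-Type: application/json" \
         --data @- https://api.github.com/graphql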

Crates.io's index can be accessed via Git, avoiding the API altogether.

AFAIK the index contains only the name, version, deps, features, and yanked fields.
But https://crates.io/data-access mentions a database dump that is updated every 24 hours.
The tarball contains a crates.csv that could be processed to get the description, repository_url, homepage, etc.
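
For reference, a rough sketch of pulling crates.csv out of the dump (the dated directory inside the tarball and GNU tar's --wildcards flag are assumptions on my part):

# Download the daily database dump and extract only crates.csv.
curl -sL https://static.crates.io/db-dump.tar.gz -o db-dump.tar.gz
# The tarball unpacks into a dated directory, e.g. 2021-01-18-020000/data/crates.csv.
tar xzf db-dump.tar.gz --wildcards '*/data/crates.csv' --strip-components=2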

@nickelc (Contributor) commented Jan 25, 2021

I wrote a script to combine the data from crates.io and GitHub's GraphQL API into a single CSV file.

Convert content/ecosystem/data.toml to data.csv

The categories are joined into a single string, with : as the separator.

toml get content/ecosystem/data.toml items | \
    jq -r 'map(. + { categories: .categories | join(":")}) | (map(keys) | add | unique) as $cols | map(. as $row | $cols | map($row[.])) as $rows | $cols, $rows[] | @csv' | \
    xsv select name,source,categories,homepage_url,gitter_url > data.csv

Tools: toml-cli, jq, xsv
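
If that works as intended, data.csv should start with a header row along these lines (the data row here is hypothetical):

name,source,categories,homepage_url,gitter_url
ChariotEngine/Chariot,github,engine:games,,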

Generate the final result.csv

The crates.csv from db-dump.tar.gz could be cached with the actions/cache action.

result.csv.txt

#!/usr/bin/env bash

# Get the owner/repo names of all GitHub items and save them to names.csv for later.
repos=$(xsv search -s source github data.csv | xsv select name | tee names.csv | tail -n +2)

# Build a graphql query for all github repos
i=0
echo "{" > github.query
for r in $repos; do
    owner=$(echo "$r" | cut -d "/" -f1)
    repo=$(echo "$r" | cut -d "/" -f2)
    cat <<QUERY >> github.query
  r$i: repository(owner: "${owner}", name: "${repo}") {
    ...repoFields
  }
QUERY
    i=$((i + 1))
done
cat <<TAIL >> github.query
}

fragment repoFields on Repository {
  description
  repository: url
  homepage: homepageUrl
}
TAIL

# Execute the GraphQL query, transform the result to CSV, and pair it up with the names.csv from before.
gh api graphql -f query="$(cat github.query)" | \
    jq -r '[.data[]] |
        (map(keys) | add | unique) as $cols |
        map(. as $row | $cols | map($row[.])) as $rows |
        $cols, $rows[] | @csv' | xsv cat columns names.csv - > github.csv

# Join the GitHub data, dropping the duplicate name column.
xsv join name data.csv name github.csv | xsv select '!name[1]' > joined-github.csv

# Select the needed columns from db-dump.tar.gz's crates.csv.
# The column order must match github.csv's (name,description,homepage,repository),
# since xsv cat rows below stacks the two files positionally.
xsv select name,description,homepage,repository crates.csv > partial-crates.csv

# Join the crates.io data
xsv join name data.csv name partial-crates.csv | xsv select '!name[1]' > joined-crates.csv

# Concat rows and sort by name
xsv cat rows joined-crates.csv joined-github.csv | xsv sort -s name > result.csv

Tools: github-cli, jq, xsv
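
If the joins behave as described, result.csv should end up with the data.csv columns followed by the fetched ones, i.e. a header roughly like:

name,source,categories,homepage_url,gitter_url,description,homepage,repository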
