digital-sustainability/oss-github-benchmark

Visit our website!

https://ossbenchmark.com

Crawler: How does it work?

We have two different services:

  • The DataService
  • The CrawlerService

The CrawlerService starts up every hour and makes 5000 calls to the GitHub API, saving every response in timestamped files.
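
A rough sketch of what one hourly crawl run could look like; crawlOnce, fetchNext, the crawl-output directory, and the file-name scheme are all illustrative assumptions, not the actual implementation:

```typescript
import { mkdirSync, writeFileSync } from "fs";
import { join } from "path";

const API_BUDGET = 5000; // GitHub API calls per hourly run

// fetchNext is an assumed helper that performs one GitHub API call and
// returns null once there is nothing left to crawl in this run.
async function crawlOnce(fetchNext: () => Promise<object | null>) {
  mkdirSync("crawl-output", { recursive: true });
  for (let calls = 0; calls < API_BUDGET; calls++) {
    const result = await fetchNext();
    if (result === null) break;
    // Save every response in a timestamped file for the DataService to pick up.
    writeFileSync(
      join("crawl-output", `${Date.now()}-${calls}.json`),
      JSON.stringify(result),
    );
  }
}
```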

The DataService also starts every hour. It loads all the files crawled in the last hour and writes their contents to the database, so the data is roughly one hour old by the time it is saved.
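
Correspondingly, a DataService run might look something like this sketch (the directory layout, database name, and collection name are assumptions carried over from the sketch above):

```typescript
import { readdirSync, readFileSync } from "fs";
import { join } from "path";
import { MongoClient } from "mongodb";

// Reads every crawl file written in the last hour and stores it in the DB.
async function importLastHour(outputDir: string, mongoUri: string) {
  const cutoff = Date.now() - 60 * 60 * 1000;
  const client = await MongoClient.connect(mongoUri);
  const repos = client.db("production").collection("repositories"); // assumed names
  for (const name of readdirSync(outputDir)) {
    const ts = Number(name.split("-")[0]); // timestamp encoded in the file name
    if (isNaN(ts) || ts < cutoff) continue; // skip files from earlier runs
    const doc = JSON.parse(readFileSync(join(outputDir, name), "utf8"));
    await repos.insertOne(doc); // the data is roughly one hour old at this point
  }
  await client.close();
}
```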

The CrawlerService can be in one of three states: no data, partial data, or full data.

No data state

Start: there is no data in the database besides the todoInstitutions.

The crawler simply starts with the first institution it gets. When an organisation is finished, it receives a timestamp in the todoInstitution collection; when a whole institution is finished, it also receives a timestamp there. The crawler skips all organisations and institutions whose timestamps are younger than 7 days.
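
A minimal sketch of that freshness check; the timestamp field name and the update shown in the comment are assumptions about the schema, not the actual implementation:

```typescript
const SEVEN_DAYS_MS = 7 * 24 * 60 * 60 * 1000;

// True if an organisation or institution is due for crawling again.
function needsCrawl(lastCrawled?: Date): boolean {
  if (!lastCrawled) return true; // never crawled: always eligible
  return Date.now() - lastCrawled.getTime() > SEVEN_DAYS_MS;
}

// After an organisation (or a whole institution) is finished, it would be
// stamped in the todoInstitution collection, e.g.:
// await todoInstitution.updateOne({ name }, { $set: { timestamp: new Date() } });
```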

Partial data state

Start: some institutions and organisations have already been crawled.

The crawler gets the next institution and/or organisation that was never crawled or whose timestamp is older than 7 days. Since 7 days may not be enough to crawl all the data, some already-crawled repos may be re-crawled before new ones are reached.
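
That selection could look roughly like the following MongoDB query (collection and field names are assumptions):

```typescript
import { Collection } from "mongodb";

const SEVEN_DAYS_MS = 7 * 24 * 60 * 60 * 1000;

// Finds the next institution that was never crawled or is older than 7 days.
async function nextInstitution(todoInstitution: Collection) {
  const staleBefore = new Date(Date.now() - SEVEN_DAYS_MS);
  return todoInstitution.findOne({
    $or: [
      { timestamp: { $exists: false } },   // never crawled
      { timestamp: { $lt: staleBefore } }, // crawled more than 7 days ago
    ],
  });
}
```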

Full data state

Start: all institutions and organisations have already been crawled at least once.

The crawler just updates the existing data, starting with the oldest timestamp.
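
A minimal sketch of that ordering, assuming the same todoInstitution collection and timestamp field as in the sketches above:

```typescript
import { Collection } from "mongodb";

// In the full-data state, refresh the stalest entry first by sorting
// ascending on the crawl timestamp.
async function oldestInstitution(todoInstitution: Collection) {
  return todoInstitution.find().sort({ timestamp: 1 }).limit(1).next();
}
```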

System Diagram

[System diagram image]

Database Structure

[Database structure diagram]

Update Institutions

Once someone has updated the github_repos.json file and the pull request has been merged, the new or updated institution must be added to the database.

This can be done in 4 easy steps:

  1. Pull Repository
  2. Connect to the DB with MongoDB Compass. The credentials can be found in KeePass.
  3. Choose the production DB and then the todoInstitution collection.
  4. Press Add Data -> Import file and choose github_repos.json. Keep in mind that Stop on errors needs to be unchecked. Then just click Import. (A scripted alternative is sketched below.)
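
For reference, a scripted equivalent of steps 2-4 could look like this sketch using the MongoDB Node.js driver; the database and collection names match the steps above, and ordered: false plays the role of unchecking Stop on errors:

```typescript
import { readFileSync } from "fs";
import { MongoClient } from "mongodb";

// Imports github_repos.json into the todoInstitution collection.
async function importInstitutions(mongoUri: string) {
  const docs = JSON.parse(readFileSync("github_repos.json", "utf8"));
  const client = await MongoClient.connect(mongoUri);
  // ordered: false lets the import continue past individual failures,
  // like unchecking "Stop on errors" in Compass.
  await client
    .db("production")
    .collection("todoInstitution")
    .insertMany(docs, { ordered: false });
  await client.close();
}
```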