climate-news-db

The climate-news-db has two goals:

create a dataset of climate change newspaper articles for NLP researchers,
provide a web application for users to view climate change news.

Use

Crawling URLs

Pulls urls.jsonl from S3 and crawls articles into articles/{newspaper}.jsonl and into database:

$ make crawl

Regenerate Database

Take urls from articles/{newspaper}.jsonl and saves into database:

$ make regen-db

This is useful when you want to re-create the database without scraping articles.

Interactive Search for Getting URLs

Requires Go + Gum

$ ./scripts/search-cli.sh

Data Artifacts

Lineage

graph LR
  1(urls.jsonl) -->|make crawl| 2(articles.jsonl)
  2(articles.jsonl) -->|make crawl, make regen-db| 3(database)

urls.jsonl

{"url": "https://www.chinadaily.com.cn/a/202302/21/WS63f4aea4a31057c47ebb004e.html", "search_time_utc": "2023-03-20T00:05:02.998560"}
{"url": "https://www.chinadaily.com.cn/a/202301/19/WS63c8a4a8a31057c47ebaa8e4.html", "search_time_utc": "2023-03-20T00:05:02.998560"}

Append only storage of raw newspaper urls. Created by a daily Google search for each newspaper with the keywords climate change and climate crisis. This file contains many duplicates.

Infra

Webapp

Deployed as a Fly.IO app:

$ make deploy

AWS Infra

Deployed with AWS CDK:

$ make aws-infra

Name		Name	Last commit message	Last commit date
Latest commit History 146 Commits
.github/workflows		.github/workflows
climatedb		climatedb
docker		docker
infra		infra
poc		poc
scripts		scripts
static		static
templates		templates
tests		tests
.dockerignore		.dockerignore
.envrc		.envrc
.gitignore		.gitignore
Makefile		Makefile
README.md		README.md
fly.toml		fly.toml
newspapers.json		newspapers.json
poetry.lock		poetry.lock
poetry.toml		poetry.toml
pyproject.toml		pyproject.toml
scrapy.cfg		scrapy.cfg
stats.json		stats.json

awesome-software/climate-news-db

Folders and files

Latest commit

History

Repository files navigation

climate-news-db

Use

Crawling URLs

Regenerate Database

Interactive Search for Getting URLs

Data Artifacts

Lineage

urls.jsonl

Infra

Webapp

AWS Infra

About

Topics

Resources

Stars

Watchers

Forks

Languages