Scraper to collect data from the math genealogy project.

galbwe/math-genealogy-scraper

Math Genealogy Scraper

A project for scraping student/advisor relationships from the Math Genealogy website.

Local Setup

Install Python

  1. Install a recent version of Python. These instructions were verified with Python 3.7.7; other versions may behave differently. You can check the installed version with python --version.
  2. Create a virtual environment in the root directory of the project:
    python -m venv venv
  3. Activate the virtual environment with source venv/bin/activate
  4. Upgrade pip: pip install --upgrade pip
  5. Install the project as an editable package by running the following from the project root:
    pip install -e .
  6. Install the additional development dependencies:
    pip install -r requirements.development.txt

Install Docker and Docker Compose

  1. Follow the official installation instructions for Docker and Docker Compose if you do not already have them installed.

Source local environment variables

  1. In the project root directory, create a file called .env with the contents below, then load it into your current shell with source .env:
    export ENVIRONMENT="dev"
    export POSTGRES_CONNECTION_DEV="postgresql://postgres:postgres@localhost:5432/postgres"
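
As a rough illustration of how these two variables might be consumed, the sketch below picks the connection string whose suffix matches ENVIRONMENT. The function name get_connection_url and the POSTGRES_CONNECTION_<ENVIRONMENT> lookup rule are assumptions made for this example; the project's own configuration code may work differently.

```python
import os


def get_connection_url(environ=None):
    """Return the PostgreSQL URL for the active environment.

    Hypothetical helper: assumes the variable name is built as
    POSTGRES_CONNECTION_<ENVIRONMENT in upper case>, as the .env file suggests.
    """
    env = os.environ if environ is None else environ
    environment = env.get("ENVIRONMENT", "dev")
    key = "POSTGRES_CONNECTION_" + environment.upper()
    if key not in env:
        raise RuntimeError(key + " is not set; did you run `source .env`?")
    return env[key]


if __name__ == "__main__":
    # Prints the URL taken from the variables exported by .env.
    print(get_connection_url())
```

Passing a plain dictionary instead of os.environ makes the lookup easy to try out without touching the shell. A missing variable fails with an explicit message instead of a bare KeyError.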
    

Database Setup

  1. Run a PostgreSQL database server in a docker container by running the following command in the project root directory:
    docker compose up --build -d
  2. Check that Docker Compose started the services correctly with docker ps. You should see two running containers: math-genealogy-scraper-pgadmin-1 and math-genealogy-scraper-postgres-1.
  3. In a new terminal, check that you can connect to the database by running the following command in the project root directory:
    docker compose exec postgres psql -U postgres
  4. Check that the database is in a clean state with no extra tables with:
    \l
    \c postgres
    \dt
    

    You should not see any tables with "student" or "advisor" in the name.
  5. Keep the psql terminal open; you will need it in the following steps. When you are done, you can exit the psql prompt with \q.
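
The clean-state check from the \dt step can also be scripted. The following is a minimal sketch, not part of the project: leftover_tables and list_public_tables are hypothetical helper names, and list_public_tables assumes that SQLAlchemy and a PostgreSQL driver are installed in the virtual environment.

```python
# Substrings that, per the instructions above, should not appear in any
# table name while the database is still in a clean state.
SCRAPER_TABLE_HINTS = ("student", "advisor")


def leftover_tables(table_names, hints=SCRAPER_TABLE_HINTS):
    """Return, sorted, the table names that contain any hint substring."""
    return sorted(
        name
        for name in table_names
        if any(hint in name.lower() for hint in hints)
    )


def list_public_tables(connection_url):
    """List tables in the public schema (assumes SQLAlchemy is available)."""
    # Imported lazily so leftover_tables works without SQLAlchemy installed.
    from sqlalchemy import create_engine, inspect

    engine = create_engine(connection_url)
    try:
        return inspect(engine).get_table_names(schema="public")
    finally:
        engine.dispose()


if __name__ == "__main__":
    url = "postgresql://postgres:postgres@localhost:5432/postgres"
    found = leftover_tables(list_public_tables(url))
    print("clean" if not found else "unexpected tables: " + ", ".join(found))
```

Keeping the name filter separate from the database call means the filter can be tried on a plain list of names before the container is running.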

Run Alembic Migrations

  1. cd into the math_genealogy/backend directory and run the following command:
    alembic upgrade head
  2. In the psql terminal, run \dt and check that several new tables were created.

Run the Scraper

  1. cd into math_genealogy/scrapers and run the following command:
    scrapy crawl math_genealogy
  2. Let the scraper run for a while so that it can collect some records.
  3. You can check on the progress from the psql terminal by counting the mathematicians and student-advisor relationships stored in the database:
    SELECT COUNT(*) FROM mathematicians;
    SELECT COUNT(*) FROM student_advisor;
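
The same two counts can be collected from Python, which is convenient for polling while the crawl runs. This is a sketch under stated assumptions: table_counts is a hypothetical helper that accepts any DB-API connection (for example one opened with psycopg2), and the demo uses an in-memory SQLite database with made-up columns purely as a stand-in for the real PostgreSQL tables.

```python
import sqlite3  # used only by the stand-in demo below

# Table names taken from the two COUNT queries above.
PROGRESS_TABLES = ("mathematicians", "student_advisor")


def table_counts(connection, tables=PROGRESS_TABLES):
    """Return {table_name: row_count} using a DB-API connection."""
    counts = {}
    cursor = connection.cursor()
    try:
        for table in tables:
            # Table names cannot be bound as query parameters, so only
            # names from the fixed allow-list are interpolated.
            if table not in PROGRESS_TABLES:
                raise ValueError("unexpected table name: " + repr(table))
            cursor.execute("SELECT COUNT(*) FROM " + table)
            counts[table] = cursor.fetchone()[0]
    finally:
        cursor.close()
    return counts


def demo():
    """Run table_counts against a throwaway SQLite database."""
    conn = sqlite3.connect(":memory:")
    try:
        # Hypothetical columns; the real schema comes from the migrations.
        conn.execute("CREATE TABLE mathematicians (id INTEGER PRIMARY KEY)")
        conn.execute(
            "CREATE TABLE student_advisor (student_id INTEGER, advisor_id INTEGER)"
        )
        conn.executemany(
            "INSERT INTO mathematicians (id) VALUES (?)", [(1,), (2,), (3,)]
        )
        conn.execute("INSERT INTO student_advisor VALUES (2, 1)")
        return table_counts(conn)
    finally:
        conn.close()


if __name__ == "__main__":
    print(demo())
```

Calling table_counts in a loop with a short sleep between calls shows whether both numbers keep growing as the scraper runs.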
