q guide scraper

instructions

config

Set the following environment variables to point PHP to your MySQL database (a connection sketch follows the list):

  • Q_SCRAPER_DATABASE_HOST
  • Q_SCRAPER_DATABASE_USER
  • Q_SCRAPER_DATABASE_PASSWORD
  • Q_SCRAPER_DATABASE_NAME

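For reference, this is roughly how a script might pick those variables up. A minimal sketch assuming the mysqli extension; not necessarily how the repo's scripts actually connect:

<?php
// Read the Q_SCRAPER_* variables (names match the list above) and open
// a mysqli connection; error handling kept minimal for brevity.
$db = new mysqli(
    getenv('Q_SCRAPER_DATABASE_HOST'),
    getenv('Q_SCRAPER_DATABASE_USER'),
    getenv('Q_SCRAPER_DATABASE_PASSWORD'),
    getenv('Q_SCRAPER_DATABASE_NAME')
);
if ($db->connect_error) {
    die('Connection failed: ' . $db->connect_error);
}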
install tables

Execute tables.sql on your database. Warning: this will DROP existing tables. Don't run this scraper on a production database.

$ mysql -h HOSTNAME -u USER -pPASSWORD DB_NAME < tables.sql

Note there is no space after -p; with a space, mysql prompts for a password interactively and treats PASSWORD as the database name.

If you've set the environment variables as above, save yourself some typing:

$ mysql -u"$Q_SCRAPER_DATABASE_USER" -p"$Q_SCRAPER_DATABASE_PASSWORD" -h"$Q_SCRAPER_DATABASE_HOST" "$Q_SCRAPER_DATABASE_NAME" < tables.sql

import courses

Run import_courses.php to import courses, faculty, and academic fields from the CS50 Courses API. The Q scraper accesses these tables to link Q guide IDs with course catalog IDs.

$ php import_courses.php
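In outline, the import looks something like this. A sketch with a placeholder API URL and column names; the real endpoint and schema live in import_courses.php and tables.sql, and $db is the connection from the config sketch above:

<?php
// Fetch course data as JSON and insert it row by row.
$json = file_get_contents('https://example.com/cs50/courses'); // placeholder URL
$courses = json_decode($json, true);
$stmt = $db->prepare('INSERT INTO courses (catalog_id, title) VALUES (?, ?)');
foreach ($courses as $c) {
    $stmt->bind_param('ss', $c['catalog_id'], $c['title']); // hypothetical field names
    $stmt->execute();
}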

scrape q

Run scrape_q.php to crawl the Q website and download all relevant HTML into a pages/ directory. The semesters to download are hardcoded near the top of the file.

Since the Q guide requires authentication, you'll need to log in with your PIN to generate a session cookie. Using your browser's web inspector, obtain the value of the JSESSIONID cookie after you've authenticated and pass it to the script:

$ php scrape_q.php COOKIEVAL4F8D49AC
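The heart of the crawl is a cookie-authenticated fetch along these lines. A sketch with a placeholder URL; the real endpoints and link discovery live in scrape_q.php:

<?php
// Fetch one Q guide page using the session cookie passed as argv[1],
// then save the HTML under pages/.
$jsessionid = $argv[1];
$ch = curl_init('https://q.example.harvard.edu/page'); // placeholder URL
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_COOKIE, 'JSESSIONID=' . $jsessionid);
$html = curl_exec($ch);
curl_close($ch);
file_put_contents('pages/example.html', $html);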

import q

Wait for a few billion hours, and the Q will have downloaded! Now run import_q.php to parse the HTML in pages/ and insert the results into the database.

$ php import_q.php
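The parse step amounts to walking each saved file with DOMDocument. A sketch with a hypothetical XPath query; the real queries and INSERT statements are in import_q.php:

<?php
// Parse each downloaded page and pull out the evaluation data.
foreach (glob('pages/*.html') as $path) {
    $doc = new DOMDocument();
    @$doc->loadHTML(file_get_contents($path)); // @ silences warnings on messy HTML
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//table//td') as $cell) { // hypothetical query
        // extract values from $cell and INSERT them into the database
    }
}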

todos

  • Retry on timeout, rather than plowing through the rest of the script.
  • Rewrite this in Python. Or any language other than PHP, really.
  • Refactor so we can multithread. The Q guide is slow, and frequently breaks. A better architecture would use gevent or multithreading to pull links off a global Redis queue. Each coroutine would pull a URL off the queue, parse the HTML, insert any relevant information into the database, and add any additional URLs to scrape to the queue (sketched after this list).
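In PHP terms, one such worker loop might look like this. A rough sketch assuming the phpredis extension and a hypothetical fetch_and_parse helper:

<?php
// One worker: pop a URL, process it, push any newly discovered URLs.
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);
while ($url = $redis->lPop('q:urls')) {
    [$rows, $newUrls] = fetch_and_parse($url); // hypothetical helper
    // insert $rows into the database here
    foreach ($newUrls as $u) {
        $redis->rPush('q:urls', $u);
    }
}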

warning

There may be legal issues releasing Q data to non-Harvard affiliates.

credits

Original PHP scraper written by David Malan.
