
CivicActions/edscrapers


U.S. Department of Education scraping kit

NOTE: More specific documentation is available "on the spot", in the package and subpackages directories (e.g. edscrapers/scrapers or edscrapers/transformers).

Running the tool

Clone this repo using git clone.

Change directory into the directory created/cloned for this repo.

From within the repo directory, run pip install -r requirements.txt to install all package dependencies required to run the toolkit.

The ED_OUTPUT_PATH environment variable must be set before running; if it is not set, the toolkit exits with a fatal error. ED_OUTPUT_PATH specifies the path to the directory where all output generated by this kit will be stored. The path specified must already exist.
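The fail-fast behavior described above can be sketched as follows. The variable name ED_OUTPUT_PATH comes from the docs; the function name and exact error messages are illustrative, not the toolkit's actual code:

```python
import os
import sys

def resolve_output_path() -> str:
    """Return the output directory, exiting fatally if it is unusable.

    Mirrors the documented behavior: ED_OUTPUT_PATH must be set and
    must point to an existing directory.
    """
    path = os.environ.get("ED_OUTPUT_PATH")
    if path is None:
        sys.exit("Fatal: the ED_OUTPUT_PATH environment variable is not set")
    if not os.path.isdir(path):
        sys.exit(f"Fatal: ED_OUTPUT_PATH does not exist: {path}")
    return path
```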

If GNU Make is available in your environment, you can run the command make install. Alternatively, run python setup.py install.

After installing, run the eds command in a command line prompt.

Containerization of Scraping Toolkit - Docker Image

If you would like to run this toolkit in a container environment, we have packaged it as a Docker image. Simply run docker build . in the root directory of this cloned repo. This will build an image of the scraping toolkit from the Dockerfile.

ED Scrapers Command Line Interface

To get more info on the usage of the ED Scrapers Command Line Interface (eds), read the eds cli docs.

Architectural Design

To get more info on the architectural design/approach for the scraping toolkit, read the architectural design doc.

Terminology

  • Scraping Source: a website (or section of a website) from which you scrape information
  • Scraper: A script that collects structured data from (rather unstructured) web pages
    • Crawler: A script that follows links and identifies all the pages containing information to be parsed
    • Parser: A script that identifies data in HTML and loads it into a machine readable data structure
  • Transformer: a script that takes a data structure and adapts it to a target structure
  • ETL: Extract + Transform + Load process for metadata.
  • Data.json: a specific JSON format used by CKAN harvesters
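To make the last term concrete, here is a minimal data.json catalog in the Project Open Data (DCAT-US) shape that CKAN harvesters consume. All field values are illustrative, not taken from the toolkit's actual output:

```python
import json

# A minimal data.json catalog skeleton. The top-level object is a
# dcat:Catalog; each scraped dataset becomes one entry in "dataset".
catalog = {
    "@context": "https://project-open-data.cio.gov/v1.1/schema/catalog.jsonld",
    "@type": "dcat:Catalog",
    "conformsTo": "https://project-open-data.cio.gov/v1.1/schema",
    "dataset": [
        {
            "@type": "dcat:Dataset",
            "title": "Example dataset title",
            "description": "Example description.",
            "identifier": "example-dataset-1",
            "accessLevel": "public",
            "publisher": {
                "@type": "org:Organization",
                "name": "U.S. Department of Education",
            },
            "distribution": [
                {
                    "@type": "dcat:Distribution",
                    "downloadURL": "https://example.gov/data.csv",
                    "mediaType": "text/csv",
                }
            ],
        }
    ],
}

print(json.dumps(catalog, indent=2))
```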

Scrapers

Scrapers are Scrapy-powered scripts that crawl through links and parse HTML pages. The proposed structure is:

  • A crawler class that defines rules for link extraction and page filters
    • This will be instantiated by a CrawlerProcess in the main scraper.py script
  • A parser script that is essentially a callback for fetching HTML pages. It receives a Scrapy Response payload, which can be parsed using any HTML parsing methods
    • An optional Model class, to define the properties of extracted datasets and make them more flexible for dumping or automating operations if needed
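The parser role described above (identify data in HTML, load it into a machine-readable structure) can be sketched with the standard library alone. The real toolkit uses Scrapy selectors on a Response payload; the class and field names here are illustrative:

```python
from html.parser import HTMLParser

class DatasetLinkParser(HTMLParser):
    """Pull the page title and links to downloadable data files out of HTML."""

    DATA_EXTENSIONS = (".csv", ".xls", ".xlsx", ".zip")

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.data_links = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag == "a":
            href = dict(attrs).get("href") or ""
            if href.lower().endswith(self.DATA_EXTENSIONS):
                self.data_links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

def parse(html: str) -> dict:
    """Callback-style parser: HTML in, structured dataset record out."""
    p = DatasetLinkParser()
    p.feed(html)
    return {"title": p.title.strip(), "resources": p.data_links}
```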

Transformers

Transformers are independent scripts that take an input and return it filtered and/or restructured. They are meant to complement the work done by scrapers by taking their output and making it usable for various applications (e.g. the CKAN harvester).

We currently have 7 transformers in place:

  • deduplicate: removes duplicates from scraping

  • sanitize: cleans up the scraping output data/metadata based on specified rules.

  • datajson: creates data.json files from the scraping output; these data.json files can then be ingested/harvested by ckanext-harvest (used to populate a CKAN data portal).

  • rag: produces RAG analyses output files using an agreed weighted-value system for calculating the quality of metadata generated by the datajson transformer and (by extension) the 'raw' scraping output.

  • TODO: Add info about the others
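As an illustration of the simplest of these, a deduplication pass over scraped dataset records can be sketched as below. The real deduplicate transformer may key on different fields; the source_url key is an assumption made for the example:

```python
def deduplicate(datasets):
    """Keep only the first record seen for each source URL.

    `datasets` is a list of dicts as produced by the scrapers; the
    "source_url" key used here is illustrative.
    """
    seen = set()
    unique = []
    for record in datasets:
        key = record.get("source_url")
        if key in seen:
            continue
        seen.add(key)
        unique.append(record)
    return unique
```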

License

GNU AFFERO GENERAL PUBLIC LICENSE
