
CivicActions/edscrapers


U.S. Department of Education scraping kit

NOTE: More specific documentation is available "on the spot", in the package and subpackages directories (e.g. edscrapers/scrapers or edscrapers/transformers).

Running the tool

Clone this repo using git clone.

Change directory into the directory created/cloned for this repo.

From within the repo directory, run pip install -r requirements.txt to install all package dependencies required to run the toolkit.

The ED_OUTPUT_PATH environment variable must be set before running; if it is not set, the toolkit exits with a fatal error. ED_OUTPUT_PATH specifies the path to the directory where all output generated by this kit will be stored. The path specified must already exist.
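The fail-fast behavior described above can be sketched as follows. The variable name ED_OUTPUT_PATH comes from the docs; the function name and exact error messages are illustrative, not the toolkit's actual code:

```python
import os
import sys

def resolve_output_path() -> str:
    """Return the output directory, exiting fatally if it is unusable.

    Mirrors the documented behavior: ED_OUTPUT_PATH must be set and
    must point to an existing directory.
    """
    path = os.environ.get("ED_OUTPUT_PATH")
    if path is None:
        sys.exit("Fatal: the ED_OUTPUT_PATH environment variable is not set")
    if not os.path.isdir(path):
        sys.exit(f"Fatal: ED_OUTPUT_PATH does not exist: {path}")
    return path
```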

If GNU Make is available in your environment, you can run the command make install. Alternatively, run python setup.py install.

After installing, run the eds command in a command line prompt.

Containerization of Scraping Toolkit - Docker Image

If you would like to run this toolkit in a container environment, we have packaged it as a Docker image. Simply run docker build . in the root directory of this cloned repo. This will build an image of the scraping toolkit from the Dockerfile.

ED Scrapers Command Line Interface

To get more info on the usage of the ED Scrapers Command Line Interface (eds), read the eds cli docs.

Architectural Design

To get more info on the architectural design/approach for the scraping toolkit, read the architectural design doc.

Terminology

  • Scraping Source: a website (or section of a website) from which you scrape information
  • Scraper: A script that collects structured data from (rather unstructured) web pages
    • Crawler: A script that follows links and identifies all the pages containing information to be parsed
    • Parser: A script that identifies data in HTML and loads it into a machine readable data structure
  • Transformer: a script that takes a data structure and adapts it to a target structure
  • ETL: Extract + Transform + Load process for metadata.
  • Data.json: a specific JSON format used by CKAN harvesters
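To make the last term concrete, here is a minimal data.json catalog in the Project Open Data (DCAT-US) shape that CKAN harvesters consume. All field values are illustrative, not taken from the toolkit's actual output:

```python
import json

# A minimal data.json catalog skeleton. The top-level object is a
# dcat:Catalog; each scraped dataset becomes one entry in "dataset".
catalog = {
    "@context": "https://project-open-data.cio.gov/v1.1/schema/catalog.jsonld",
    "@type": "dcat:Catalog",
    "conformsTo": "https://project-open-data.cio.gov/v1.1/schema",
    "dataset": [
        {
            "@type": "dcat:Dataset",
            "title": "Example dataset title",
            "description": "Example description.",
            "identifier": "example-dataset-1",
            "accessLevel": "public",
            "publisher": {
                "@type": "org:Organization",
                "name": "U.S. Department of Education",
            },
            "distribution": [
                {
                    "@type": "dcat:Distribution",
                    "downloadURL": "https://example.gov/data.csv",
                    "mediaType": "text/csv",
                }
            ],
        }
    ],
}

print(json.dumps(catalog, indent=2))
```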

Scrapers

Scrapers are Scrapy-powered scripts that crawl through links and parse HTML pages. The proposed structure is:

  • A crawler class that defines rules for link extraction and page filters
    • This will be instantiated by a CrawlerProcess in the main scraper.py script
  • A parser script that is essentially a callback for fetching HTML pages. It receives a Scrapy Response payload, which can be parsed using any HTML parsing methods
    • An optional Model class, to define the properties of extracted datasets and make them more flexible for dumping or automating operations if needed
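The parser role described above (identify data in HTML, load it into a machine-readable structure) can be sketched with the standard library alone. The real toolkit uses Scrapy selectors on a Response payload; the class and field names here are illustrative:

```python
from html.parser import HTMLParser

class DatasetLinkParser(HTMLParser):
    """Pull the page title and links to downloadable data files out of HTML."""

    DATA_EXTENSIONS = (".csv", ".xls", ".xlsx", ".zip")

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.data_links = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag == "a":
            href = dict(attrs).get("href") or ""
            if href.lower().endswith(self.DATA_EXTENSIONS):
                self.data_links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

def parse(html: str) -> dict:
    """Callback-style parser: HTML in, structured dataset record out."""
    p = DatasetLinkParser()
    p.feed(html)
    return {"title": p.title.strip(), "resources": p.data_links}
```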

Transformers

Transformers are independent scripts that take an input and return it filtered and/or restructured. They are meant to complement the work done by scrapers by taking their output and making it usable for various applications (e.g. the CKAN harvester).

We currently have 7 transformers in place:

  • deduplicate: removes duplicates from scraping

  • sanitize: cleans up the scraping output data/metadata based on specified rules.

  • datajson: creates data.json files from the scraping output; these data.json files can then be ingested/harvested by ckanext-harvest (used to populate a CKAN data portal).

  • rag: produces RAG analyses output files using an agreed weighted-value system for calculating the quality of metadata generated by the datajson transformer and (by extension) the 'raw' scraping output.

  • TODO: Add info about the others
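As an illustration of the simplest of these, a deduplication pass over scraped dataset records can be sketched as below. The real deduplicate transformer may key on different fields; the source_url key is an assumption made for the example:

```python
def deduplicate(datasets):
    """Keep only the first record seen for each source URL.

    `datasets` is a list of dicts as produced by the scrapers; the
    "source_url" key used here is illustrative.
    """
    seen = set()
    unique = []
    for record in datasets:
        key = record.get("source_url")
        if key in seen:
            continue
        seen.add(key)
        unique.append(record)
    return unique
```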

License

GNU AFFERO GENERAL PUBLIC LICENSE
