cfpb/crawl-cfgov

HTML Archive of consumerfinance.gov

Description: This repo contains an archive of consumerfinance.gov HTML, generated via web crawl. This can serve as a resource when answering questions about how the site changes and when those changes happen.

  • Technology: Uses wget to crawl the site and download HTML.
  • Status: See the CHANGELOG.

The archive

The archive of consumerfinance.gov is contained in the www.consumerfinance.gov directory of this repo. It holds the HTML of each page the crawler visits. The crawl does not download any CSS, JavaScript, or images associated with the pages, and the archive does not contain any PDF, CSV, or other supplementary files linked from consumerfinance.gov. The HTML is simplified before being archived to prevent unimportant diffs.

Dependencies

This project uses wget to crawl consumerfinance.gov and download the HTML. You can install it on a Mac using brew install wget.

Installation

To get a copy of the consumerfinance.gov archive or run a crawl on your computer, clone this repository.

Usage

Exploring the archive

To view the consumerfinance.gov archive, you can browse the history of this repo here on github.com, or clone this repository.
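Because each crawl is committed to this repo, a local clone can also answer "when did this change?" questions with git log. The following is a minimal sketch, not part of this project: the function names page_history and phrase_history are hypothetical helpers, and any page path you pass in is an assumption about how the archive is laid out.

```shell
# Hypothetical helpers for exploring the archive's history in a local clone.

# List the commits that touched one archived page, newest first.
# $1: path to a file inside the archive directory.
page_history() {
  git log --date=short --format='%ad %h %s' -- "$1"
}

# List the commits in which the number of occurrences of a phrase changed,
# i.e. where the phrase was added to or removed from some page.
# $1: exact phrase to look for (case-sensitive)
# $2: directory or file to limit the search to (defaults to the archive)
phrase_history() {
  git log --date=short --format='%ad %h %s' -S"$1" -- "${2:-www.consumerfinance.gov}"
}

# Example (run from the root of the clone):
#   phrase_history "reverse mortgage"
```

The -S option to git log compares how often the string occurs before and after each commit, so it reports additions and removals of the phrase rather than every commit that touched a page containing it.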

Performing basic searches in a browser

GitHub.com's search functionality can be used to perform basic searches for words or phrases. For example, searching for "reverse mortgage" returns all pages containing that term. Unfortunately, although GitHub does provide the ability to customize search results, it only supports basic querying and filtering. Advanced searches can be more easily performed locally using shell commands after cloning this repository.

Using shell commands to search locally

Once this repository has been cloned locally, the standard shell command grep can handle most searches. For example, to list every line containing the phrase "reverse mortgage", ignoring case:

grep -ri "reverse mortgage" www.consumerfinance.gov

To list only matching filenames, use the -l option:

grep -ril "reverse mortgage" www.consumerfinance.gov

You may also want to sort the results alphabetically:

grep -ril "reverse mortgage" www.consumerfinance.gov | sort

Versions of grep with support for extended regular expressions allow additional searches. For example, to find all occurrences of a GovDelivery code like USCFPB_12345:

grep -rE 'USCFPB_[0-9]+' www.consumerfinance.gov
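The command above prints every matching line in full. If you only want the codes themselves, a sketch like the following may help. The function name list_govdelivery_codes is a hypothetical helper, and it assumes a grep that supports the -o (print only the match) and -h (omit filenames) options, as GNU and BSD grep do.

```shell
# Hypothetical helper: print each distinct GovDelivery code found in the archive.
# $1: directory to search (defaults to the archive directory)
list_govdelivery_codes() {
  # -r: recurse, -h: no filename prefix, -o: print only the matched text,
  # -E: extended regular expressions. sort -u removes duplicate codes.
  grep -rhoE 'USCFPB_[0-9]+' "${1:-www.consumerfinance.gov}" | sort -u
}

# Example (run from the root of the clone):
#   list_govdelivery_codes
```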

The results of a grep search can be piped to tools like sed to do further processing.

For example, let's say you want to check the value of all aria-label attributes on Spanish pages:

# Generate a list of Spanish pages, using the presence of "Un sitio web"
# in the site header to distinguish them from English pages.
grep -rl "Un sitio web" www.consumerfinance.gov > spanish-pages.txt

# Grep only those files again to find all aria-label attributes.
cat spanish-pages.txt | xargs grep aria-label > spanish-aria-labels.txt

# Use sed to extract the list of aria-labels, and show a sorted list of unique values.
cat spanish-aria-labels.txt | sed -n 's/^.*aria-label="\([^"]*\)".*$/\1/p' | sort | uniq
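The three steps above can also be combined into one pipeline without intermediate files. This is a sketch under a few assumptions: the function name spanish_aria_labels is hypothetical, and it relies on grep supporting --null and -o and on xargs supporting -0, which GNU and BSD versions do. It also differs slightly from the sed expression above, which keeps only the last aria-label on a line: grep -o emits every attribute, so lines with several labels are fully covered.

```shell
# Hypothetical helper: list the unique aria-label values used on Spanish pages.
# $1: directory to search (defaults to the archive directory)
spanish_aria_labels() {
  # 1. Find Spanish pages by the "Un sitio web" header text. --null separates
  #    filenames with NUL bytes so unusual characters in names are safe.
  # 2. Print every aria-label="..." attribute in those files, one per line.
  # 3. Strip the attribute name and quotes, then sort and de-duplicate.
  grep -rl --null "Un sitio web" "${1:-www.consumerfinance.gov}" \
    | xargs -0 grep -ho 'aria-label="[^"]*"' \
    | sed 's/^aria-label="//; s/"$//' \
    | sort -u
}

# Example (run from the root of the clone):
#   spanish_aria_labels
```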

Command line tools like grep and sed are very complex (and can vary depending on operating system), so reading their documentation can be helpful in creating searches.

Running the crawler locally

To run a crawl on your computer, cd into the root of this project and use the following command:

./crawl.sh https://www.consumerfinance.gov

A full crawl can take several hours. To limit the crawl depth:

./crawl.sh -d 4 https://www.consumerfinance.gov

Or, to start the crawl at a specific URL:

./crawl.sh https://www.consumerfinance.gov/start/crawl/here/
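For readers curious about what a wget-based HTML crawl looks like, here is a rough sketch. It is not the contents of crawl.sh: the function name crawl_html_sketch, the choice of wget options, and the list of rejected file suffixes are all assumptions chosen to illustrate the constraints described in this README (same domain only, optional depth limit, no CSS, JavaScript, images, or supplementary files).

```shell
# Hypothetical sketch of a same-domain, HTML-oriented wget crawl.
# This is NOT crawl.sh; every option below is an illustrative assumption.
# $1: URL to start from
# $2: maximum link depth (defaults to "inf", wget's value for no limit)
crawl_html_sketch() {
  url="$1"
  depth="${2:-inf}"

  # Derive the host from the URL so the crawl stays on that domain.
  host="${url#*://}"
  host="${host%%/*}"

  wget \
    --recursive \
    --level="$depth" \
    --domains="$host" \
    --reject='css,js,png,jpg,jpeg,gif,svg,ico,pdf,csv,zip' \
    --wait=1 \
    --no-verbose \
    "$url"
}

# Example:
#   crawl_html_sketch https://www.consumerfinance.gov 4
```

For actual crawls, use ./crawl.sh as shown above, since it also handles details this sketch leaves out, such as simplifying the HTML before it is archived.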

Known issues

The crawl has some constraints and limitations.

  • The results intentionally only contain pages that share the same domain.
  • The crawl only finds pages by following links from the site root, so a page that no reachable page links to will not be included.
  • The crawl records each page based on its URL. If a page is accidentally recorded with URL parameters, it counts as a separate page, which could result in duplication.
  • There are some pages on consumerfinance.gov that can only be found by paging through paginated lists of results. We try to configure the crawl to find and download all of these pages, but it's possible there will be omissions.
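To check a local copy of the archive for the URL-parameter duplication described above, a sketch like this can help. It assumes that pages recorded with URL parameters end up with a literal "?" in their filename, which is wget's default behavior on Unix-like systems but may not match how crawl.sh names files. The function name find_query_string_pages is a hypothetical helper.

```shell
# Hypothetical helper: list archived files whose names contain a "?",
# which would indicate a page saved with URL parameters.
# $1: directory to search (defaults to the archive directory)
find_query_string_pages() {
  # [?] matches a literal question mark rather than "any single character".
  find "${1:-www.consumerfinance.gov}" -type f -name '*[?]*' | sort
}

# Example (run from the root of the clone):
#   find_query_string_pages
```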

Getting help

If you have questions, concerns, bug reports, etc., please file an issue in this repository's Issue Tracker.

Getting involved

See our contributing guidelines.


Open source licensing info

  1. TERMS
  2. LICENSE
  3. CFPB Source Code Policy

About

Archive the HTML of consumerfinance.gov daily