census_topic_crawler

This project attempts to get relevant information about Census website topics and enhance a elasticsearch index that contains a set of similar topic names.

Specifically, there are two main areas of this repository. The first is:

Topic Crawler

A spider for crawling Census website topic pages in order to grab relevant information and output that into a formatted JSON document.

There are two spiders (crawlers) that for crawling the Census topic pages.

parent_topics (filename: census_topic_crawler.py)
child_topics (filename: census_child_topic_crawler.py)

In order to execute each of these spiders, run the following command in the base folder of the repository:

scrapy crawl [name of spider] -o [name of output file] -t json

When the above command is executed, the output will be written to the specified filename.

Note: the format of the site may change. If the layout changes, it is not guaranteed that this bot completely execute without errors.

Topic Information Loader

Some scripts to load that information into a new field in an existing topic index and query that index to test the results of the changes. These files are contained in the elastic_scripts directory of the repository (along with an additional README).

Name		Name	Last commit message	Last commit date
Latest commit History 36 Commits
census_topic_crawler		census_topic_crawler
elastic_scripts		elastic_scripts
.gitignore		.gitignore
README.md		README.md
requirements.txt		requirements.txt
scrapy.cfg		scrapy.cfg

Provide feedback

Saved searches

Use saved searches to filter your results more quickly