BlogCrawler

This repository hosts a web crawler that scrapes the contents of blog posts published by security-related companies.

Pre-requisites

Python 3.6
NumPy 1.14+
Pandas 0.23+
Selenium 3.141
Scrapy 1.7.3

Configuration Instructions

In settings.py, set USER_AGENT_LIST and PROXY_LIST to the paths of the project's metainfo files useragents.txt and proxylist.txt, respectively.
Create configurations under [GithubProject]/config
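
As a rough illustration of the settings.py step, the two variables could be set as in the sketch below. The variable names USER_AGENT_LIST and PROXY_LIST and the two file names come from the instructions above; the "metainfo" directory and its location relative to the working directory are assumptions, so adjust the paths to wherever the files sit in your checkout.

```python
from pathlib import Path

# Assumption: the project is run from its root folder, and useragents.txt and
# proxylist.txt sit in a "metainfo" directory there. This layout is a guess,
# not something the repository confirms.
PROJECT_DIR = Path.cwd()
METAINFO_DIR = PROJECT_DIR / "metainfo"

# Setting names taken from the configuration instructions.
USER_AGENT_LIST = str(METAINFO_DIR / "useragents.txt")
PROXY_LIST = str(METAINFO_DIR / "proxylist.txt")
```

Building the paths from one base directory keeps both settings in sync if the files are moved later.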

For blogs that contain AJAX components, create a new dynamic configuration with a file name of your choice. Fill in the following YAML keys:

search_link = blog home page
search_action_config = actions performed before the scraping process (e.g. click, click and fill)
page_scrape_config = xpath configuration to locate the blog article links (page_links_xpath) and next page indicator (next_page_xpath)
blog_scrape = xpath configuration used to scrape each blog article
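
To make the key list above more concrete, here is a minimal sketch of a dynamic configuration mirrored as a Python dict, plus a small check that the documented keys are present. Only the key names (search_link, search_action_config, page_scrape_config, blog_scrape, page_links_xpath, next_page_xpath) come from this README. Every value, the inner layout of search_action_config and blog_scrape, and the helper missing_keys are hypothetical placeholders; consult the existing files under config for the real schema.

```python
# Top-level keys a dynamic configuration is expected to define (from the list above).
DYNAMIC_REQUIRED_KEYS = (
    "search_link",
    "search_action_config",
    "page_scrape_config",
    "blog_scrape",
)

# Hypothetical example; all values below are placeholders, not real settings.
example_dynamic_config = {
    "search_link": "https://blog.example.com/",
    "search_action_config": [
        {"action": "click", "xpath": "//button[@id='load-more']"},
    ],
    "page_scrape_config": {
        "page_links_xpath": "//article//h2/a/@href",
        "next_page_xpath": "//a[@rel='next']/@href",
    },
    "blog_scrape": {
        "title_xpath": "//h1/text()",
        "body_xpath": "//div[@class='post-body']//p//text()",
    },
}


def missing_keys(config, required):
    """Return the required keys that are absent from config, in order."""
    return [key for key in required if key not in config]


# An empty list means every documented top-level key is present.
print(missing_keys(example_dynamic_config, DYNAMIC_REQUIRED_KEYS))
```

A check like this can be run before a crawl so that a misspelled key fails early rather than partway through scraping.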

For blogs that can be scraped statically, create a new static configuration with a file name of your choice. Fill in the following YAML keys:

search_link = blog home page
page_scrape_config = xpath configuration to locate the blog article links (page_links_xpath) and next page indicator (next_page_xpath)
blog_scrape = xpath configuration used to scrape each blog article
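
The role of page_links_xpath and next_page_xpath can be illustrated with the sketch below. It is not the crawler's own code: it uses the standard library's xml.etree.ElementTree, which supports only a limited XPath subset and needs well-formed markup, whereas the project relies on Scrapy and Selenium. The sample page, the XPath values, and the helper extract_links are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Placeholder listing page. Real blog HTML is rarely well-formed XML, so this
# only demonstrates the idea of the two XPath settings.
LISTING_PAGE = """
<html><body>
  <div class="posts">
    <a class="post-link" href="/blog/first-post">First post</a>
    <a class="post-link" href="/blog/second-post">Second post</a>
  </div>
  <a class="next" href="/blog/page/2">Older posts</a>
</body></html>
"""

# Key names from the README; the XPath values are made up for the sample page.
page_scrape_config = {
    "page_links_xpath": ".//a[@class='post-link']",
    "next_page_xpath": ".//a[@class='next']",
}


def extract_links(page_source, config):
    """Return (article hrefs, next-page href or None) for one listing page."""
    root = ET.fromstring(page_source.strip())
    article_links = [a.get("href") for a in root.findall(config["page_links_xpath"])]
    next_element = root.find(config["next_page_xpath"])
    next_page = next_element.get("href") if next_element is not None else None
    return article_links, next_page


links, next_page = extract_links(LISTING_PAGE, page_scrape_config)
print(links, next_page)
```

A crawl loop would follow next_page until it is None, collecting article links along the way and handing each one to the blog_scrape step.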

Running Instructions

Create static/dynamic configuration files using the configuration instructions above. Use the existing configurations as a reference.
Modify BaseStaticScrape.py/BaseDynamicScrape.py to include only the file name(s) of the configuration(s) you would like to run.
Run BaseStaticScrape.py and BaseDynamicScrape.py.
Run BlogScrape.py.
Results will be written under the [GithubProject]/temp folder.
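
The run order described above could be scripted roughly as follows. This is a sketch under assumptions: the script names are taken from the running instructions and may differ in your checkout, the scripts are assumed to be started from the project root with no arguments, and build_commands and run_pipeline are hypothetical helpers. The default dry run only prints the commands.

```python
import subprocess
import sys

# Script names as given in the running instructions, in the order they are run.
PIPELINE = ["BaseStaticScrape.py", "BaseDynamicScrape.py", "BlogScrape.py"]


def build_commands(scripts, python=sys.executable):
    """Build one interpreter command per script, preserving the order."""
    return [[python, script] for script in scripts]


def run_pipeline(scripts, dry_run=True):
    """Print (dry run) or execute each script in order; stop on the first failure."""
    commands = build_commands(scripts)
    for command in commands:
        if dry_run:
            print("would run:", " ".join(command))
        else:
            # check=True raises CalledProcessError if a script exits non-zero.
            subprocess.run(command, check=True)
    return commands


commands = run_pipeline(PIPELINE)
```

Calling run_pipeline(PIPELINE, dry_run=False) from the project root would execute the three scripts in sequence, so BlogScrape.py only starts once both base scrapes have finished.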
