The crawler

This is a crawler written in Scrapy to crawl a site without using a sitemap. It recursively crawls the entire site within its domain, ignoring offsite requests. The crawl is throttled using Scrapy's AutoThrottle feature, limiting it to a maximum of two requests every two seconds. It writes the discovered URLs to an output file in the format shown under Output file below.
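For reference, this kind of throttling is configured through Scrapy's AutoThrottle and download-delay settings. The values below are only a sketch of one way to approximate roughly two requests every two seconds; they are not necessarily the exact settings used in this repository.

# settings.py (sketch; assumed values, not necessarily those used here)
AUTOTHROTTLE_ENABLED = True            # let Scrapy adapt the crawl rate to server latency
AUTOTHROTTLE_START_DELAY = 1.0         # initial delay between requests, in seconds
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # aim for about one request in flight per server
CONCURRENT_REQUESTS_PER_DOMAIN = 2     # never more than two parallel requests per domain
DOWNLOAD_DELAY = 1.0                   # minimum ~1 s between requests, i.e. ~2 requests per 2 s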

Scrapy's documentation is available at https://docs.scrapy.org/

Contributors welcome.

Installation

  1. Download the project
  2. Extract it to a folder and navigate to that location
  3. Install dependencies with pip install -r requirements.txt

It is recommended to activate a virtual environment before installing the dependencies. The project uses Scrapy for the crawling and the tldextract library for working with domain names.
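For illustration, tldextract makes it easy to compare the registered domain of a discovered link against the start URL and drop offsite links. The snippet below is a minimal sketch of that idea with made-up names; it is not copied from the spider.

import tldextract

def same_registered_domain(url, start_url):
    # Compare the "example.com" part only, ignoring subdomains such as "www" or "blog"
    a = tldextract.extract(url)
    b = tldextract.extract(start_url)
    return (a.domain, a.suffix) == (b.domain, b.suffix)

# same_registered_domain("https://blog.example.com/post", "https://www.example.com")  -> True
# same_registered_domain("https://other.org/page", "https://www.example.com")         -> False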

How to run?

Use the following command to run the spider:

scrapy runspider crawler/spiders/crawler.py -a urlList="path/to/input/domain_list.txt"
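The urlList argument points to a plain-text file of start URLs. Judging by the example in the next section, the expected format appears to be one URL per line, for instance:

https://www.example.com
https://www.another-example.org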

Summing up

Assuming you have git installed on your system and input_domain_urls.txt contains, say, https://www.example.com:

mkdir test ; cd test
git clone https://github.com/yackoa/yet_another_crawler.git .
virtualenv env
source env/bin/activate
pip install -r requirements.txt

scrapy runspider crawler/spiders/crawler.py -a urlList="input_domain_urls.txt"

Output file

If the domain name is example.com, the output file will be named example.com.txt and will contain the discovered URLs, one per line:

https://www.example.com
https://www.example.com/product/123
https://www.example.com/page/about-us
[...]
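Assuming tldextract (listed as a dependency) is what derives the registered domain, the file name could be obtained along these lines; the exact code here is a guess, not the spider's implementation.

import tldextract

start_url = "https://www.example.com"
output_filename = tldextract.extract(start_url).registered_domain + ".txt"
# -> "example.com.txt"; the www subdomain is stripped by tldextract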

TODO

  • Get tests working. I lost a lot of time trying betamax and fake_offline_requests. I am willing to learn; if you can show me how to test a CrawlSpider in Scrapy, please do (one possible starting point is sketched below).
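One common way to exercise a Scrapy callback without network access (offered here only as a hedged sketch, not as this project's testing setup) is to build a scrapy.http.HtmlResponse by hand and pass it to the callback. The spider class and callback name below are placeholders.

from scrapy.http import HtmlResponse, Request

def make_fake_response(url, html):
    # Build an offline response that a callback can parse as if it had been downloaded
    return HtmlResponse(url=url, request=Request(url=url), body=html, encoding="utf-8")

def test_parse_extracts_links():
    html = b"<html><body><a href='/page/about-us'>About us</a></body></html>"
    response = make_fake_response("https://www.example.com", html)
    # MySpider and parse_item are placeholder names; substitute the real spider and callback
    # spider = MySpider(urlList="input_domain_urls.txt")
    # results = list(spider.parse_item(response))
    # assert results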

Hasta la vista, baby.

mic drop
