Python Web Crawler

Created by Oliver Wilkins

19/03/2018

This program crawls through an entire domain and exports every link it finds to txt files.

Installing/Running the Project

You do not need to download any libraries. To plug in and play:

Download or clone the repository
Run the main.py file
Saved links are written to the queued.txt and crawled.txt files in the projects folder; that folder includes example projects with their own queued.txt and crawled.txt
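
To check on a crawl after running it, the two output files can be read back with a few lines of Python. This is a minimal sketch, not part of the repository: it assumes each file holds one link per line, and the helper names `read_links` and `summarise_project` are hypothetical.

```python
from pathlib import Path


def read_links(path):
    """Return the set of non-empty lines (assumed: one link per line)."""
    file = Path(path)
    if not file.exists():
        return set()
    with file.open("rt", encoding="utf-8") as handle:
        return {line.strip() for line in handle if line.strip()}


def summarise_project(project_dir):
    """Count the links waiting in queued.txt and finished in crawled.txt."""
    project = Path(project_dir)
    queued = read_links(project / "queued.txt")
    crawled = read_links(project / "crawled.txt")
    return {"queued": len(queued), "crawled": len(crawled)}
```

For example, `summarise_project("projects/example")` would report the counts for a project folder named `example`; that folder name is only a placeholder.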

Important

This program works by reading a webpage and extracting its links to the queued.txt file. When the program reaches a queued link, it reads that page for further links and then moves the completed (crawled) page to the crawled.txt file
You can try to crawl massive domains with many links, but this will take a VERY long time
You may also need to change the NUMBER_OF_THREADS variable in main.py (line 12), depending on your operating system

NUMBER_OF_THREADS = 8
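
The queued/crawled cycle described above can be sketched with a thread pool. This is an illustration under stated assumptions rather than the code in main.py or spider.py: in-memory sets stand in for queued.txt and crawled.txt, and `fetch_links` is a hypothetical callable that takes a URL and returns the links found on that page.

```python
import threading
from queue import Queue

NUMBER_OF_THREADS = 8  # same name and value as shown in this README


def crawl(start_url, fetch_links, number_of_threads=NUMBER_OF_THREADS):
    """Crawl outward from start_url and return the set of crawled pages.

    fetch_links is a hypothetical callable: URL in, iterable of links out.
    """
    queued = {start_url}  # stands in for queued.txt
    crawled = set()       # stands in for crawled.txt
    lock = threading.Lock()
    jobs = Queue()
    jobs.put(start_url)

    def worker():
        while True:
            url = jobs.get()
            try:
                links = fetch_links(url)
            except Exception:
                links = []  # a bad page should not shut the thread down
            with lock:
                # Move the finished page from "queued" to "crawled".
                queued.discard(url)
                crawled.add(url)
                # Queue only links that have not been seen before.
                for link in links:
                    if link not in queued and link not in crawled:
                        queued.add(link)
                        jobs.put(link)
            jobs.task_done()

    for _ in range(number_of_threads):
        threading.Thread(target=worker, daemon=True).start()

    jobs.join()  # returns once every queued page has been crawled
    return crawled
```

New links are put on the job queue before `task_done()` is called for the current page, so `jobs.join()` cannot return while work is still pending. The daemon workers simply stay blocked on the empty queue until the program exits.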

Updates for the Future

Add a tree view for all the links found
Reduce the number of decoding errors
Fix the issue where some URLs completely shut down threads and ultimately the whole program. This issue is described in detail here
Create nicer console output and add a GUI
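
One possible approach to the decoding errors item is to decode pages leniently before extracting links. The sketch below is a suggestion only and does not reflect how linkFinder.py currently works; `LinkCollector` and `extract_links` are hypothetical names, and UTF-8 is an assumed default encoding.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from the href attribute of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.add(urljoin(self.base_url, value))


def extract_links(raw_bytes, base_url, encoding="utf-8"):
    """Decode a page leniently and return the links it contains."""
    # errors="replace" swaps undecodable bytes for U+FFFD instead of
    # raising UnicodeDecodeError, so one odd byte does not abort the page.
    text = raw_bytes.decode(encoding, errors="replace")
    collector = LinkCollector(base_url)
    collector.feed(text)
    return collector.links
```

The trade-off is that a page served in a different encoding may show replacement characters in its text, but its links are usually plain ASCII and are still recovered.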
