Skip to content

okwilkins/Web-Crawler

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

22 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Python Web Crawler

Created by Oliver Wilkins

19/03/2018

This program will crawl through entire domains, exporting every link it can find into a txt file.

Installating/Running the Project

You will not need to download any libraries, plug-in and play by:

  • Downloading or cloning the repository
  • Running the main.py file
  • Links which the program saves are found in the queued.txt and crawled.txt files in the projects folder - the folder has example projects with queued.txt and crawled.txt

Important

  • This program works by reading a webpage and extracting the links to the queued.txt file, when gotten round to the program will read further links from the queued.txt file and will then dump the then completed (crawled) webpage to the crawled.txt file
  • You can try to trawl through massive domains, with many links - this will take a VERY long time however
  • Also note that you may need to change the NUMBER_OF_THREADS variable in the main.py (line 12) file - this will depend on your operating system
NUMBER_OF_THREADS = 8

Updates for the Future

  • Add a tree view for all the links found
  • Reduce the number of decoding errors
  • Fix some URLs completely shutting down threads and ultimately the whole program. This issue is described in detail here
  • Create a nicer output to the console + a GUI

About

This program will crawl through entire domains, exporting every link it can find into a txt file.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages