This repository has been archived by the owner on Nov 11, 2017. It is now read-only.

remram44/crawler-structure


What is this?

This is the basic structure for a web crawler. It doesn't actually crawl yet, but everything needed to add that is in place: the only missing piece is selecting new links to download from each fetched page. I'm not interested in doing that right now, so it's not there.
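As a rough idea of what that missing link-selection step could look like, here is a minimal sketch using only the Python 3 standard library. It is not part of this repository: the names `LinkCollector` and `select_links` are hypothetical, and the selection policy (keep every unique http/https link, drop fragments) is just an assumption for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkCollector(HTMLParser):
    """Collect absolute http(s) URLs found in <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url  # URL of the page being parsed
        self.links = []           # unique links, in document order

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL and
                # drop the "#fragment" part, which names the same page.
                url, _fragment = urldefrag(urljoin(self.base_url, value))
                if url.startswith(("http://", "https://")) and url not in self.links:
                    self.links.append(url)


def select_links(base_url, html):
    """Return the list of candidate URLs to download next (hypothetical helper)."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links
```

A real crawler would probably also filter by domain, respect robots.txt and remember which URLs were already visited; none of that is shown here.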

What is there is a way to start a crawler instance and update the HTML page with new results dynamically (over a websocket). There is also code to perform a search with the Bing API.

This code uses the Twisted framework, an asynchronous network engine which allows it to perform many requests in parallel, to serve the website and to communicate on websockets. The websocket protocol implementation comes from Autobahn.

Why?

I mainly wanted to try Twisted, and I'm not interested in the HTML handling and classification problems. I might add that later on if I feel like it.

About

Basic Twisted structure for web crawling (doesn't actually crawl right now)
