Spider / web crawler. 1 million links in 2 hours.

rbk/python-crawler

About

The founders of Google used Python to fetch pages when building their search engine:

"In order to scale to hundreds of millions of web pages, Google has a fast distributed crawling system. A single URLserver serves lists of URLs to a number of crawlers (we typically ran about 3). Both the URLserver and the crawlers are implemented in Python."

I recommend reading the paper: http://infolab.stanford.edu/~backrub/google.html
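The architecture in the quote (one URL server handing lists of URLs to a few crawlers) can be sketched in a single process with threads. This is only an illustration under stated assumptions, not the code in this repository: `URLServer`, `crawler`, `run_crawl`, and the `FAKE_WEB` link graph are hypothetical names, and a real crawler would fetch pages over HTTP instead of reading a dictionary.

```python
import threading
from collections import deque


class URLServer:
    """Hypothetical in-process stand-in for a URL server: hands out
    batches of unvisited URLs and accepts newly discovered ones."""

    def __init__(self, seeds):
        self._lock = threading.Lock()
        self._seen = set()
        self._frontier = deque()
        self.add(seeds)

    def add(self, urls):
        # Queue only URLs that have never been seen before.
        with self._lock:
            for url in urls:
                if url not in self._seen:
                    self._seen.add(url)
                    self._frontier.append(url)

    def next_batch(self, size=2):
        # Return up to `size` URLs, or an empty list when nothing is queued.
        with self._lock:
            batch = []
            while self._frontier and len(batch) < size:
                batch.append(self._frontier.popleft())
            return batch

    def seen_count(self):
        with self._lock:
            return len(self._seen)


def crawler(server, fetch_links):
    """Worker loop: take a batch, report the links found on each page.
    A worker exits as soon as it sees an empty frontier, so the number of
    active workers can shrink early; any worker still holding a batch
    keeps going, which is enough to finish the crawl."""
    while True:
        batch = server.next_batch()
        if not batch:
            return
        for url in batch:
            server.add(fetch_links(url))


# Assumption: a tiny fake link graph replaces real HTTP fetching.
FAKE_WEB = {"a": ["b", "c"], "b": ["c", "d"], "c": ["a"], "d": []}


def fake_fetch_links(url):
    return FAKE_WEB.get(url, [])


def run_crawl(seeds, fetch_links, workers=3):
    server = URLServer(seeds)
    threads = [
        threading.Thread(target=crawler, args=(server, fetch_links))
        for _ in range(workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return server
```

Calling `run_crawl(["a"], fake_fetch_links)` visits every page reachable from the seed. The lock around the seen set is what keeps several crawlers from queuing the same URL twice.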

This project reflects my curiosity about building a web crawler as a programming challenge.

The stack:

  • Python 3.6
  • Flask
  • BeautifulSoup
  • PyMysql
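As a rough illustration of the link-extraction step a crawler needs, the sketch below pulls absolute `http`/`https` links out of a page. It deliberately uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra dependencies; `LinkParser` and `extract_links` are hypothetical names, not functions from this project.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkParser(HTMLParser):
    """Collects absolute http(s) links from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links and drop the #fragment part.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                # Skip mailto:, javascript:, and other non-web schemes.
                if absolute.startswith(("http://", "https://")):
                    self.links.append(absolute)


def extract_links(html, base_url):
    parser = LinkParser(base_url)
    parser.feed(html)
    return parser.links
```

For example, `extract_links('<a href="/r/python#top">x</a>', "https://www.reddit.com")` returns `["https://www.reddit.com/r/python"]`.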

Progress Log

2019/07/16

In progress...

2017/4/28

  • Start URLs: https://www.reddit.com, https://nytimes.com, http://dmoztools.net
  • Crawl time: Approx 4 hours
  • Unique URLs: 1,003,156
  • Unique domains: 79,302

2017/4/27

  • Start URL: https://www.reddit.com
  • Crawl time: Approx 4 hours
  • Unique URLs: 84,477
  • Unique domains: 2,747
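The unique URL and unique domain figures in the log entries above could be derived from a list of crawled URLs roughly as follows. This is a sketch under an assumption: the log does not say how domains were counted, so here a "domain" is simply the URL's hostname, and `crawl_stats` is a hypothetical helper.

```python
from urllib.parse import urlsplit


def crawl_stats(urls):
    """Return (unique URL count, unique hostname count)."""
    unique_urls = set(urls)
    # Assumption: one "domain" per distinct hostname; URLs without a
    # hostname are ignored.
    hostnames = {urlsplit(url).hostname for url in unique_urls}
    hostnames.discard(None)
    return len(unique_urls), len(hostnames)
```

With this definition, `www.reddit.com` and `old.reddit.com` would count as two domains; grouping by registered domain would need a public suffix list.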
