
Multi Threaded Web-Crawler

Description

The goal of this project is to create a multi-threaded web crawler. A web crawler is an Internet bot that systematically browses the World Wide Web, typically for the purpose of web indexing. Search engines use these indexes, web graphs, and an appropriate algorithm (such as PageRank) to rank pages. The main focus of the project is to implement a multi-threaded downloader that can download multiple websites at the same time. The plan is to implement this in C or C++.

Features/Deliverables

  • Part 1: Multi-threaded web-crawler
  • Part 2: (Extended scope) Web Ranking

Simple Crawler Flowchart


  • HTTP website downloader
    • using the socket library
  • HTTPS website downloader
    • using the OpenSSL library
  • HTML file parser
    • using regex (see the sketch after this list)
  • Domain extractor
    • using regex
  • Crawler loop
  • Website ranker
    • using a simple counter
    ...
    while(!mainQueue.empty() && totalVisitedPages < pagesLimit)
    {
        currWebsite = mainQueue.pop();                 // next website to crawl
        html = websiteDownloader(currWebsite);         // HTTP/HTTPS download
        linkedWebsites = htmlParser(html);             // extract links with regex
        update(discoveredWebsites, mainQueue, totalVisitedPages);
    }
    ...
  • Use make to compile the program
  • maxlinks and pagelimit can be passed as arguments with the make command.
    • For example: make maxlinks=100 pagelimit=100
    • Here the arguments are:
      • maxlinks: Maximum number of links to be extracted from a website
      • pagelimit: Maximum number of websites to be downloaded while crawling
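The HTML file parser and domain extractor above rely on regular expressions. Below is a minimal sketch of how links and their domain names could be pulled out of a downloaded page with C++ std::regex; the patterns and function names are illustrative assumptions, not the repository's actual code.

// Illustrative sketch: extracting links and domains with std::regex.
// Patterns and function names are assumptions, not the repo's code.
#include <iostream>
#include <regex>
#include <set>
#include <string>

std::set<std::string> extractLinks(const std::string& html) {
    std::set<std::string> links;
    // crude pattern for href="http..." / href="https..." attributes
    std::regex linkRe("href=\"(https?://[^\"]+)\"");
    for (auto it = std::sregex_iterator(html.begin(), html.end(), linkRe);
         it != std::sregex_iterator(); ++it)
        links.insert((*it)[1].str());
    return links;
}

std::string extractDomain(const std::string& url) {
    // capture the host part between "://" and the next '/'
    std::regex domainRe("https?://([^/]+)");
    std::smatch m;
    return std::regex_search(url, m, domainRe) ? m[1].str() : "";
}

int main() {
    std::string html = "<a href=\"https://mygov.in/hi/activities/\">link</a>";
    for (const auto& link : extractLinks(html))
        std::cout << link << " -> " << extractDomain(link) << "\n";
}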

Back to Table of Contents

  • Crawler as a thread controller
  • Child thread
    • HTML downloader
    • Link extractor
    • Domain extractor
    • Ranking using different algorithms

Crawler loop code

...
// controller loop: spawn child threads until the page limit is reached
while(1){
    if(pagesLimitReached || visitedpages >= pagesLimit){
        pagesLimitReached = true;
        if(w_threads > 0){
            gotosleep();        // wait for the remaining child threads to finish
        }
        else{
            break;
        }
    }
    else{
        if(w_threads < maxThreads && queue_size > 0){
            createThread();     // room for another worker and work left in the queue
        }
        else if(w_threads == 0){
            break;              // queue empty and no worker running: crawl finished
        }
        else{
            gotosleep();        // wait for a worker to add new links or finish
        }
    }
}
...

Child Thread code


...
// worker thread: process one URL taken from the queue
download(url);                          // fetch the page over HTTP/HTTPS
parse(url);                             // extract links and the domain name
update(queue, visitedLinks, ranking);   // push new links, mark visited, bump rank
if(pagesLimitReached){
    if(workingThreads == 0){
        wake_parent();                  // last worker: let the controller exit
    }
}
else{
    wake_parent();                      // let the controller schedule more work
}
...
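The gotosleep() / wake_parent() calls above form the usual condition-variable handshake between the controller and its workers. Below is a minimal sketch of how it could be wired up with the pthread library; the names (crawler_mutex, crawler_cv, working_threads) are illustrative assumptions rather than the repository's identifiers.

// Minimal sketch of the controller/worker sleep-wake handshake using pthreads.
// Variable and function names are illustrative, not the repo's identifiers.
#include <pthread.h>
#include <cstdio>

pthread_mutex_t crawler_mutex = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t  crawler_cv    = PTHREAD_COND_INITIALIZER;
int working_threads = 1;            // protected by crawler_mutex

void gotosleep() {                  // called by the controller
    pthread_mutex_lock(&crawler_mutex);
    while (working_threads > 0)     // guard against spurious wakeups
        pthread_cond_wait(&crawler_cv, &crawler_mutex);
    pthread_mutex_unlock(&crawler_mutex);
}

void wake_parent() {                // called by a finishing worker
    pthread_mutex_lock(&crawler_mutex);
    working_threads--;
    pthread_cond_signal(&crawler_cv);
    pthread_mutex_unlock(&crawler_mutex);
}

void* worker(void*) { wake_parent(); return nullptr; }

int main() {
    pthread_t tid;
    pthread_create(&tid, nullptr, worker, nullptr);
    gotosleep();                    // blocks until the worker signals
    pthread_join(tid, nullptr);
    std::printf("all workers done\n");
}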
  • Using SINGLE LOCK technique

    • One lock guards all of the shared variables
    • Pros: Easy to implement
    • Cons: The lock has to be held for longer stretches, which can reduce the crawler to essentially serial processing
  • Using THREAD SAFE DATA STRUCTURE technique (see the sketch after this list)

    • Each data structure gets its own lock, or an RW lock where that helps
    • Waiting time is distributed over the different locks
    • Different thread safe data structures:
      • Thread safe integer
      • Thread safe queue
      • Thread safe map
    • Pros: No single lock has to be held for long, so concurrency can be exploited efficiently on multi-processor CPUs
    • Cons: Overhead due to multiple locks
  • Use make to compile the program
  • maxlinks, pagelimit, threads, and rankerFlag can be passed as arguments with the make command.
    • For example: make maxlinks=100 pagelimit=100 threads=20
    • Here the arguments are:
      • maxlinks: Maximum number of links to be extracted from a website
      • pagelimit: Maximum number of websites to be downloaded while crawling
      • threads: Maximum number of threads to be created
      • rankerFlag: Flag to choose which ranking algorithm is executed (n = simple counter based web ranker, sp = PageRank with sampling, ip = PageRank with iteration)
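As a rough illustration of the thread-safe data structure approach, the sketch below wraps each shared structure with its own mutex (a thread-safe counter and a thread-safe URL queue in C++). The class names are illustrative assumptions, not the repository's code.

// Sketch of per-structure locking: each shared object owns its own mutex.
// Class names are illustrative, not the repo's identifiers.
#include <mutex>
#include <queue>
#include <string>

class ThreadSafeCounter {
    int value = 0;
    std::mutex m;
public:
    void increment() { std::lock_guard<std::mutex> lk(m); ++value; }
    int  get()       { std::lock_guard<std::mutex> lk(m); return value; }
};

class ThreadSafeQueue {
    std::queue<std::string> q;
    std::mutex m;
public:
    void push(const std::string& url) {
        std::lock_guard<std::mutex> lk(m);
        q.push(url);
    }
    bool try_pop(std::string& url) {      // returns false if the queue is empty
        std::lock_guard<std::mutex> lk(m);
        if (q.empty()) return false;
        url = q.front();
        q.pop();
        return true;
    }
};

int main() {
    ThreadSafeCounter visited;
    ThreadSafeQueue frontier;
    frontier.push("https://mygov.in");
    std::string url;
    while (frontier.try_pop(url)) visited.increment();
    return visited.get() == 1 ? 0 : 1;
}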

Back to Table of Contents

  • Simple counter based: The intuition behind this approach is to increase the rank of a domain name every time we visit it.

  • PageRank algorithm: The intuition behind our PageRank algorithm is as follows. Suppose a random surfer starts at a randomly chosen website. At each step, with probability 0.85 the surfer follows one of the links on the current website, and with probability 0.15 it jumps to a website chosen uniformly at random from all available websites.

    In this way, the importance of a website is measured by how often the random surfer visits it. Hence, a website is important if it is linked to by many other important websites.

    There are two ways to implement this algorithm: sampling and iteration.


...
corpus = read(csv_file)            # website -> set of linked websites
for website in corpus.keys():
    for x in corpus[website]:
        rank[x] += 1               # a domain gains one point each time it is linked
...

Demo run

------------------------------------------------
  Domain Name rankings using counter
------------------------------------------------

................................................
  Domain Name            Rank
................................................

1 .  mygov.in                                  43
2 .  main.ayush.gov.in                         36
3 .  tourism.gov.in                            24
4 .  digitalindia.gov.in                       19
5 .  asi.nic.in                                16
------------------------------------------------------------

Back to Table of Contents

In this approach, we simulate the random surfer for a fixed number of samples; at each step the next website is chosen from all websites, weighted according to the PageRank intuition above.


...
DAMPING = 0.85
SAMPLES = 10000

corpus = read(csv_file)                         # website -> set of linked websites
page = random.choice(list(corpus.keys()))       # surfer starts on a random website

for i in range(SAMPLES):
    # transition model: probability of the surfer moving from `page` to each website
    for x in corpus.keys():
        if x in corpus[page]:
            model[x] = DAMPING * (1 / len(corpus[page])) + (1 - DAMPING) * (1 / total_websites)
        else:
            model[x] = (1 - DAMPING) * (1 / total_websites)
    page = random.choices(list(corpus.keys()), weights=list(model.values()))[0]
    pagerank[page] += 1 / SAMPLES               # each sample contributes 1/SAMPLES
return pagerank
...

Demo run

-------------------------------------------------------------
  Domain Name ranking using PageRank: Sampling (n = 10000)
-------------------------------------------------------------

................................................
  Domain Name            Rank
................................................

1 .  haryana.mygov.in                          0.1290000000000021
2 .  chhattisgarh.mygov.in                     0.11000000000000212
3 .  blog.mygov.in                             0.07730000000000119
4 .  mygov.in                                  0.07260000000000105
5 .  aatmanirbharbharat.mygov.in               0.04840000000000036
------------------------------------------------------------

Back to Table of Contents

The intuition behind ranking with the iterative PageRank algorithm is as follows. We initialize the probability of the surfer visiting each website to 1/total_websites, and then repeatedly update the probability of every website according to the formula below.

Pagerank equation
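For reference, this is the standard PageRank update rule, where d = 0.85 is the damping factor, N the total number of websites, M(x) the set of websites linking to x, and L(v) the number of links on website v:

    PR(x) = \frac{1 - d}{N} + d \sum_{v \in M(x)} \frac{PR(v)}{L(v)}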

We stop iterating when the difference between the old and updated probabilities falls below a certain threshold.


...
DAMPING = 0.85
THRESHOLD = 0.001
corpus = read(csv_file)                          # website -> set of linked websites

# start from a uniform distribution over all websites
pagerank = {x: 1 / total_websites for x in corpus.keys()}

while(1):
    before = [pagerank[x] for x in corpus.keys()]
    for x in corpus.keys():
        # websites linking to x pass on a share of their own rank
        newrank = (1 - DAMPING) / total_websites
        for v in linkingWebsites(x):             # websites that link to x
            newrank += DAMPING * (pagerank[v] / len(corpus[v]))
        pagerank[x] = newrank

    # stop once no rank changed by more than the threshold
    if change(before, [pagerank[x] for x in corpus.keys()]) <= THRESHOLD:
        break

return pagerank
...

Demo run

----------------------------------------------
  Domain Name ranking using PageRank: Iteration
----------------------------------------------

................................................
  Domain Name            Rank
................................................

1 .  india.gov.in                              0.01762736346840192
2 .  digitalindia-gov.zoom.us                  0.017587054793058533
3 .  tourism.gov.in                            0.016866581191734113
4 .  digitalindiaawards.gov.in                 0.014974653859083446
5 .  mygov.in                                  0.0122561916194452
------------------------------------------------------------

Back to Table of Contents

  • We plotted the number of threads against the time taken to crawl. From this graph we can infer the following:
    • When the number of threads is very low, crawling takes a long time
    • When the number of threads becomes very large, time increases again because of locking overhead
    • With an optimal number of threads, concurrent crawling pays off

Crawler Analytics

** The graph above varies from run to run; the inferences are based on the general trend observed in most runs

  • We also plotted pagelimit against the time taken. From this graph we can infer the following:
    • As pagelimit increases, crawling time increases
    • The multithreaded crawler with a single lock performs better than the single threaded one because of concurrency
    • The multithreaded crawler with thread safe data structures performs worse than the other two approaches because of locking overhead: with an individual lock for each data structure, the crawler has to acquire and release locks back to back to update every structure, and this overhead increases the total time significantly

Page limit vs time elapsed

Back to Table of Contents

  • Sockets
  • OpenSSL
  • Pthread library
    • For concurrency and synchronization techniques
      • Locks
        • Single locks
        • Reader Writer locks
      • Condition Variables
  • Matplotlib
    • Plotting the graphs
  • Efficient downloader to download all websites
  • Efficient parser to parse large websites
  • Creating a search engine by using web indexing and web scraping
  • Three main branches:
    • main: contains the multithreaded using threadsafe data structure web crawler
    • MT_singlelock: contains the multithreaded using single lock web crawler
    • single_threaded: contains the single threaded web crawler

Back to Table of Contents

  • Fork the repo from the top right corner of this page
  • Run the following in your terminal

    $ git clone https://github.com/[USERNAME]/Web-Crawler
    $ cd Web-Crawler
    $ make

Presentation link

Contributors

  • Anupam Kumar
  • Preeti Chiluveru
  • Shruti Katpara
  • Vivek Modi
