
Multi Threaded Web-Crawler

Description

The goal of this project is to create a multi-threaded web crawler. A web crawler is an Internet bot that systematically browses the World Wide Web, typically for the purpose of web indexing. Search engines use these indexes, web graphs, and an appropriate algorithm (such as PageRank) to rank pages. The main focus of the project is to implement a multi-threaded downloader that can download multiple websites at the same time. The plan is to implement this in C or C++.

Features/Deliverables

  • Part 1: Multi-threaded web-crawler
  • Part 2: (Extended scope) Web Ranking

Simple Crawler Flowchart


  • HTTP website downloader
    • using the socket library
  • HTTPS website downloader
    • using the OpenSSL library
  • HTML file parser
    • using regex (see the sketch after this list)
  • Domain extractor
    • using regex
  • Crawler loop
  • Website ranker
    • using a simple counter
    ...
    while(!mainQueue.empty() && totalVisitedPages < pagesLimit)
    {
        currWebsite = mainQueue.pop();                 // next website to crawl
        html = websiteDownloader(currWebsite);         // HTTP/HTTPS download
        linkedWebsites = htmlParser(html);             // extract links with regex
        update(discoveredWebsites, mainQueue, totalVisitedPages);
    }
    ...
  • Use make to compile the program
  • maxlinks and pagelimit can be passed as arguments with the make command.
    • For example: make maxlinks=100 pagelimit=100
    • Here the arguments are:
      • maxlinks: Maximum number of links to be extracted from a website
      • pagelimit: Maximum number of websites to be downloaded while crawling
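The HTML file parser and domain extractor above rely on regular expressions. Below is a minimal sketch of how links and their domain names could be pulled out of a downloaded page with C++ std::regex; the patterns and function names are illustrative assumptions, not the repository's actual code.

// Illustrative sketch: extracting links and domains with std::regex.
// Patterns and function names are assumptions, not the repo's code.
#include <iostream>
#include <regex>
#include <set>
#include <string>

std::set<std::string> extractLinks(const std::string& html) {
    std::set<std::string> links;
    // crude pattern for href="http..." / href="https..." attributes
    std::regex linkRe("href=\"(https?://[^\"]+)\"");
    for (auto it = std::sregex_iterator(html.begin(), html.end(), linkRe);
         it != std::sregex_iterator(); ++it)
        links.insert((*it)[1].str());
    return links;
}

std::string extractDomain(const std::string& url) {
    // capture the host part between "://" and the next '/'
    std::regex domainRe("https?://([^/]+)");
    std::smatch m;
    return std::regex_search(url, m, domainRe) ? m[1].str() : "";
}

int main() {
    std::string html = "<a href=\"https://mygov.in/hi/activities/\">link</a>";
    for (const auto& link : extractLinks(html))
        std::cout << link << " -> " << extractDomain(link) << "\n";
}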

Back to Table of Contents

  • Crawler as a thread controller
  • Child thread
    • HTML downloader
    • Link extractor
    • Domain extractor
    • Ranking using different algorithms

Crawler loop code

...
// controller loop: spawn child threads until the page limit is reached
while(1){
    if(pagesLimitReached || visitedpages >= pagesLimit){
        pagesLimitReached = true;
        if(w_threads > 0){
            gotosleep();        // wait for the remaining child threads to finish
        }
        else{
            break;
        }
    }
    else{
        if(w_threads < maxThreads && queue_size > 0){
            createThread();     // room for another worker and work left in the queue
        }
        else if(w_threads == 0){
            break;              // queue empty and no worker running: crawl finished
        }
        else{
            gotosleep();        // wait for a worker to add new links or finish
        }
    }
}
...

Child Thread code


...
// worker thread: process one URL taken from the queue
download(url);                          // fetch the page over HTTP/HTTPS
parse(url);                             // extract links and the domain name
update(queue, visitedLinks, ranking);   // push new links, mark visited, bump rank
if(pagesLimitReached){
    if(workingThreads == 0){
        wake_parent();                  // last worker: let the controller exit
    }
}
else{
    wake_parent();                      // let the controller schedule more work
}
...
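The gotosleep() / wake_parent() calls above form the usual condition-variable handshake between the controller and its workers. Below is a minimal sketch of how it could be wired up with the pthread library; the names (crawler_mutex, crawler_cv, working_threads) are illustrative assumptions rather than the repository's identifiers.

// Minimal sketch of the controller/worker sleep-wake handshake using pthreads.
// Variable and function names are illustrative, not the repo's identifiers.
#include <pthread.h>
#include <cstdio>

pthread_mutex_t crawler_mutex = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t  crawler_cv    = PTHREAD_COND_INITIALIZER;
int working_threads = 1;            // protected by crawler_mutex

void gotosleep() {                  // called by the controller
    pthread_mutex_lock(&crawler_mutex);
    while (working_threads > 0)     // guard against spurious wakeups
        pthread_cond_wait(&crawler_cv, &crawler_mutex);
    pthread_mutex_unlock(&crawler_mutex);
}

void wake_parent() {                // called by a finishing worker
    pthread_mutex_lock(&crawler_mutex);
    working_threads--;
    pthread_cond_signal(&crawler_cv);
    pthread_mutex_unlock(&crawler_mutex);
}

void* worker(void*) { wake_parent(); return nullptr; }

int main() {
    pthread_t tid;
    pthread_create(&tid, nullptr, worker, nullptr);
    gotosleep();                    // blocks until the worker signals
    pthread_join(tid, nullptr);
    std::printf("all workers done\n");
}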
  • Using SINGLE LOCK technique

    • One lock guards all of the shared variables
    • Pros: Easy to implement
    • Cons: The lock has to be held for longer stretches, which can reduce the crawler to essentially serial processing
  • Using THREAD SAFE DATA STRUCTURE technique (see the sketch after this list)

    • Each data structure gets its own lock, or an RW lock where that helps
    • Waiting time is distributed over the different locks
    • Different thread safe data structures:
      • Thread safe integer
      • Thread safe queue
      • Thread safe map
    • Pros: No single lock has to be held for long, so concurrency can be exploited efficiently on multi-processor CPUs
    • Cons: Overhead due to multiple locks
  • Use make to compile the program
  • maxlinks, pagelimit, threads, and rankerFlag can be passed as arguments with the make command.
    • For example: make maxlinks=100 pagelimit=100 threads=20
    • Here the arguments are:
      • maxlinks: Maximum number of links to be extracted from a website
      • pagelimit: Maximum number of websites to be downloaded while crawling
      • threads: Maximum number of threads to be created
      • rankerFlag: Flag to choose which ranking algorithm is executed (n = simple counter based web ranker, sp = PageRank with sampling, ip = PageRank with iteration)
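As a rough illustration of the thread-safe data structure approach, the sketch below wraps each shared structure with its own mutex (a thread-safe counter and a thread-safe URL queue in C++). The class names are illustrative assumptions, not the repository's code.

// Sketch of per-structure locking: each shared object owns its own mutex.
// Class names are illustrative, not the repo's identifiers.
#include <mutex>
#include <queue>
#include <string>

class ThreadSafeCounter {
    int value = 0;
    std::mutex m;
public:
    void increment() { std::lock_guard<std::mutex> lk(m); ++value; }
    int  get()       { std::lock_guard<std::mutex> lk(m); return value; }
};

class ThreadSafeQueue {
    std::queue<std::string> q;
    std::mutex m;
public:
    void push(const std::string& url) {
        std::lock_guard<std::mutex> lk(m);
        q.push(url);
    }
    bool try_pop(std::string& url) {      // returns false if the queue is empty
        std::lock_guard<std::mutex> lk(m);
        if (q.empty()) return false;
        url = q.front();
        q.pop();
        return true;
    }
};

int main() {
    ThreadSafeCounter visited;
    ThreadSafeQueue frontier;
    frontier.push("https://mygov.in");
    std::string url;
    while (frontier.try_pop(url)) visited.increment();
    return visited.get() == 1 ? 0 : 1;
}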

Back to Table of Contents

  • Simple counter based: The intuition behind this approach is to increase the rank of a domain name every time we visit it.

  • PageRank algorithm: The intuition behind our PageRank algorithm is as follows. Suppose a random surfer starts at a randomly chosen website. At each step, with probability 0.85 the surfer follows one of the links on the current website, and with probability 0.15 it jumps to a website chosen uniformly at random from all available websites.

    In this way, the importance of a website is measured by how often the random surfer visits it. Hence, a website is important if it is linked to by many other important websites.

    There are two ways to implement this algorithm: sampling and iteration.


...
corpus = read(csv_file)            # website -> set of linked websites
for website in corpus.keys():
    for x in corpus[website]:
        rank[x] += 1               # a domain gains one point each time it is linked
...

Demo run

------------------------------------------------
  Domain Name rankings using counter
------------------------------------------------

................................................
  Domain Name            Rank
................................................

1 .  mygov.in                                  43
2 .  main.ayush.gov.in                         36
3 .  tourism.gov.in                            24
4 .  digitalindia.gov.in                       19
5 .  asi.nic.in                                16
------------------------------------------------------------

Back to Table of Contents

In this approach, we simulate the random surfer for a fixed number of samples; at each step the next website is chosen from all websites, weighted according to the PageRank intuition above.


...
DAMPING = 0.85
SAMPLES = 10000

corpus = read(csv_file)                         # website -> set of linked websites
page = random.choice(list(corpus.keys()))       # surfer starts on a random website

for i in range(SAMPLES):
    # transition model: probability of the surfer moving from `page` to each website
    for x in corpus.keys():
        if x in corpus[page]:
            model[x] = DAMPING * (1 / len(corpus[page])) + (1 - DAMPING) * (1 / total_websites)
        else:
            model[x] = (1 - DAMPING) * (1 / total_websites)
    page = random.choices(list(corpus.keys()), weights=list(model.values()))[0]
    pagerank[page] += 1 / SAMPLES               # each sample contributes 1/SAMPLES
return pagerank
...

Demo run

-------------------------------------------------------------
  Domain Name ranking using PageRank: Sampling (n = 10000)
-------------------------------------------------------------

................................................
  Domain Name            Rank
................................................

1 .  haryana.mygov.in                          0.1290000000000021
2 .  chhattisgarh.mygov.in                     0.11000000000000212
3 .  blog.mygov.in                             0.07730000000000119
4 .  mygov.in                                  0.07260000000000105
5 .  aatmanirbharbharat.mygov.in               0.04840000000000036
------------------------------------------------------------

Back to Table of Contents

The intuition behind ranking with the iterative PageRank algorithm is as follows. We initialize the probability of the surfer visiting each website to 1/total_websites, and then repeatedly update the probability of every website according to the formula below.

Pagerank equation
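For reference, this is the standard PageRank update rule, where d = 0.85 is the damping factor, N the total number of websites, M(x) the set of websites linking to x, and L(v) the number of links on website v:

    PR(x) = \frac{1 - d}{N} + d \sum_{v \in M(x)} \frac{PR(v)}{L(v)}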

We stop iterating when the difference between the old and updated probabilities falls below a certain threshold.


...
DAMPING = 0.85
THRESHOLD = 0.001
corpus = read(csv_file)                          # website -> set of linked websites

# start from a uniform distribution over all websites
pagerank = {x: 1 / total_websites for x in corpus.keys()}

while(1):
    before = [pagerank[x] for x in corpus.keys()]
    for x in corpus.keys():
        # websites linking to x pass on a share of their own rank
        newrank = (1 - DAMPING) / total_websites
        for v in linkingWebsites(x):             # websites that link to x
            newrank += DAMPING * (pagerank[v] / len(corpus[v]))
        pagerank[x] = newrank

    # stop once no rank changed by more than the threshold
    if change(before, [pagerank[x] for x in corpus.keys()]) <= THRESHOLD:
        break

return pagerank
...

Demo run

----------------------------------------------
  Domain Name ranking using PageRank: Iteration
----------------------------------------------

................................................
  Domain Name            Rank
................................................

1 .  india.gov.in                              0.01762736346840192
2 .  digitalindia-gov.zoom.us                  0.017587054793058533
3 .  tourism.gov.in                            0.016866581191734113
4 .  digitalindiaawards.gov.in                 0.014974653859083446
5 .  mygov.in                                  0.0122561916194452
------------------------------------------------------------

Back to Table of Contents

  • We plotted the number of threads against the time taken to crawl. From this graph we can infer the following:
    • When the number of threads is very low, crawling takes a long time
    • When the number of threads becomes very large, time increases again because of locking overhead
    • With an optimal number of threads, concurrent crawling pays off

Crawler Analytics

** The graph above varies from run to run; the inferences are based on the general trend observed in most runs

  • We also plotted pagelimit against the time taken. From this graph we can infer the following:
    • As pagelimit increases, crawling time increases
    • The multithreaded crawler with a single lock performs better than the single threaded one because of concurrency
    • The multithreaded crawler with thread safe data structures performs worse than the other two approaches because of locking overhead: with an individual lock for each data structure, the crawler has to acquire and release locks back to back to update every structure, and this overhead increases the total time significantly

Page limit vs time elapsed

Back to Table of Contents

  • Sockets
  • OpenSSL
  • Pthread library
    • For concurrency and synchronization techniques
      • Locks
        • Single locks
        • Reader Writer locks
      • Condition Variables
  • Matplotlib
    • Plotting the graphs
  • Efficient downloader to download all websites
  • Efficient parser to parse large websites
  • Creating a search engine by using web indexing and web scraping
  • Three main branches:
    • main: contains the multithreaded using threadsafe data structure web crawler
    • MT_singlelock: contains the multithreaded using single lock web crawler
    • single_threaded: contains the single threaded web crawler

Back to Table of Contents

  • Fork the repo from the top right corner of this page
  • Run the following in your terminal

    $ git clone https://github.com/[USERNAME]/Web-Crawler
    $ cd Web-Crawler
    $ make

Presentation link

Contributors

  • Anupam Kumar
  • Preeti Chiluveru
  • Shruti Katpara
  • Vivek Modi
