googlesearch

search engine where anyone can register website and this engine do crawling and index web pages developed using django.

Most of search engine have to perform following tasks:

Crawling
Scraping or Extracting Data
Indexing
PageRank

Crawling

There are many packages in python for crawling like BeautifulSoup, scrapy etc. Here we use BeautifulSoup to crawl a website.

Installing BeautifulSoup

  pip install beautifulsoup4

Crawling a website

For crawling a website you need a url of a website and then you have to generate request for that website.

Generate a request

first install requests library to generate request

  pip install requests

generate request and crawl a webpage

  source = requests.get('enter your url here').text
  soup = BeautifulSoup(source, 'lxml')

Scraping or extracting a data

Whenever crawling performs it stores a copy of a webpage and in database store the url of copied webpage and title of webpage and many more information. Below is the example of extracting data:-

  title = soup.title.text
  h2 = soup.find('h2')
  #finding link
  link = soup.find('a')
  #finding href attribute from link
  href = link['href']

Indexing

Indexing means store each and every word with it's url you can think as a dictionary where word is the key and value is url and this is called backward indexing if you want to know more about backward indexing and other search engine indexing follow

Below is the database for backward indexing

  class Backward_Index(models.Model):
      data = models.CharField(max_length=100)
      urls = models.TextField()

      def __str__(self):
          return self.data

Code to performing backward indexing

  #perform backward indexing
  def backward_indexing(name, list_of_words):
      #bi = Backward_Index()
      #bi.back_indexing(name, list_of_words)
  	for words in list_of_words:
  		if Backward_Index.objects.filter(data=words).exists():
  			indexes = Backward_Index.objects.filter(data=words)
  			for index in indexes:
  				if name in index.urls:
  					continue
  				else:
  					index.urls = index.urls + ',' + name
  					index.save()

  		else:
  			Backward_Index.objects.create(data=words, urls=name)

PageRank Algorithm

Page rank algorithm is the heart of the every search engine to get best search results. There are many page ranking algorithm exists, here point distribution algorithm implemented.

To perform this algorithm you have to make graph of every url where search keyword exists and then distribute point. Making a graph using networkx library.

Install networkx

  pip install networkx

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
engine		engine
googlesearch		googlesearch
search		search
README.md		README.md
manage.py		manage.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

engine

engine

googlesearch

search

search

README.md

README.md

manage.py

manage.py

Repository files navigation

googlesearch - Search Engine

Crawling

Installing BeautifulSoup

Crawling a website

Scraping or extracting a data

Indexing

PageRank Algorithm

About

Releases

Packages

Languages

apurva-ajmera/googlesearch

Folders and files

Latest commit

History

Repository files navigation

googlesearch - Search Engine

Crawling

Installing BeautifulSoup

Crawling a website

Scraping or extracting a data

Indexing

PageRank Algorithm

About

Topics

Resources

Stars

Watchers

Forks

Languages