- Stages:
  - Setting up the environment
  - Get the HTML
  - Parse the HTML
  - HTML tree traversal
  - Use the BeautifulSoup library to scrape data
  - Save the result in a .csv file
- Ways to scrape a website:
  - Use the API
  - Scrape the HTML using tools (like Selenium) and libraries (like BeautifulSoup)
- Basic Requirements:
  - Python with environment setup
  - VSCode with the Python (Microsoft) and Jupyter extensions
  - Jupyter Notebook (optional; in VSCode or the browser)
- The requests library is used to make GET and POST requests to the webpage
- The html5lib library is used to parse the HTML
- The BeautifulSoup library is used to manage the scraping of the HTML
pip install requests
pip install html5lib
pip install bs4
import json
import requests
from bs4 import BeautifulSoup
url="https://www.codewithharry.com"
r = requests.get(url)
htmlContent = r.content
print(htmlContent)
soup = BeautifulSoup(htmlContent, 'html.parser')
print(soup)
print(soup.prettify()) # .prettify() returns the markup formatted with proper indentation
title = soup.title
print(title) # Prints the <title> tag on console
print(title.string) # Prints the string of the <title> on console
- Tag
- NavigableString - It differs from a normal Python string because it comes with extra built-in functions for navigating the tree
- BeautifulSoup
- Comment
print(type(title)) # Prints the type on console. Here, it is Tag <class 'bs4.element.Tag'>
print(type(title.string)) # Prints the type on console. Here, it is String <class 'bs4.element.NavigableString'>
print(type(soup)) # Here, it is BeautifulSoup object <class 'bs4.BeautifulSoup'>
markup = "<p><!-- this is a comment --></p>"
soup2 = BeautifulSoup(markup, 'html.parser') # Name the parser explicitly to avoid a warning
print(soup2.p) # Prints the p tag on console
print(soup2.p.string) # Prints the content of p tag on console ("this is a comment")
print(type(soup2.p.string)) # Here, it is Comment <class 'bs4.element.Comment'>
# exit() # To terminate the execution of program at any time
paras = soup.find_all('p')
print(paras)
anchors = soup.find_all('a')
print(anchors)
para = soup.find('p')
print(para)
paraClass = (soup.find('p')['class'])
print(paraClass)
print(soup.find_all("p", class_="lead"))
print(soup.find('p').get_text())
print(soup.get_text())
for link in anchors:
    print(link.get('href'))
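all_links = set()
for link in anchors:
    href = link.get('href')
    if href and href != '#': # Skip anchors without an href and placeholder links
        linkText = "https://codewithharry.com" + href
        all_links.add(linkText) # Store the URL string, not the Tag object
        print(linkText)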
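The loop above prefixes every href with the site root, which works only when each href is a root-relative path. A minimal sketch of a more forgiving approach uses `urljoin` from the standard library; the sample href values below are made up for illustration and are not taken from the site.

```python
from urllib.parse import urljoin

BASE_URL = "https://codewithharry.com"  # site root used in the loop above

# Hypothetical href values of the kinds a page commonly contains.
sample_hrefs = ["/videos", "blog/", "https://example.com/page", "#", None]

all_links = set()
for href in sample_hrefs:
    if not href or href.startswith("#"):
        continue  # skip anchors without an href and in-page fragments
    # urljoin keeps absolute URLs as they are and resolves relative ones.
    all_links.add(urljoin(BASE_URL + "/", href))

for url in sorted(all_links):
    print(url)
```

Absolute links pass through unchanged, so external URLs are not mangled by the prefix.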
navbarSupportedContent = soup.find(id="navbarSupportedContent")
print(navbarSupportedContent)
print(navbarSupportedContent.children)
print(navbarSupportedContent.contents) # Prints all content elements/tag inside the div in form of list
for elem in navbarSupportedContent.contents: # Prints all content one by one using loops
    print(elem)
for elem in navbarSupportedContent.children: # Prints the same as contents
    print(elem)
- The difference between .contents and .children:
  - .contents - A tag's children are available as a list
  - .children - A tag's children are available as a generator
  - A list keeps all of its items in memory
  - For very large webpages, .children can be lighter on memory because items are produced one at a time
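The list-versus-generator trade-off mentioned above is a general Python one. The following is a small sketch in plain Python (it does not use BeautifulSoup, and the sizes are arbitrary) showing that a generator object stays small while a list grows with its contents.

```python
import sys

numbers_list = [n * n for n in range(100_000)]  # all items stored at once
numbers_gen = (n * n for n in range(100_000))   # items produced on demand

list_size = sys.getsizeof(numbers_list)
gen_size = sys.getsizeof(numbers_gen)
print(list_size, gen_size)  # exact byte counts vary by Python version

# A generator is consumed as it is read; a list can be indexed and reused.
first_three = [next(numbers_gen) for _ in range(3)]
print(first_three)
```

The price of the smaller footprint is that a generator can be walked only once and cannot be indexed.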
for item in navbarSupportedContent.strings:
    print(item)
for item in navbarSupportedContent.stripped_strings: # Strips the string for ease of use
    print(item)
print(navbarSupportedContent.parent) # Immediate Parents only
print(navbarSupportedContent.parents) # A generator object is displayed, which means it can be iterated
for item in navbarSupportedContent.parents:
    # print(item)
    print(item.name)
print(navbarSupportedContent.next_sibling.next_sibling)
print(navbarSupportedContent.previous_sibling.previous_sibling)
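The doubled `.next_sibling` in the lines above is typically needed because the whitespace between two tags is itself a sibling node. Here is a minimal offline sketch, using made-up markup rather than the real page, that shows this and the `find_next_sibling` / `find_previous_sibling` methods, which skip text nodes.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration: three tags separated by newlines.
html = """<div id="a">first</div>
<div id="b">second</div>
<div id="c">third</div>"""

soup = BeautifulSoup(html, "html.parser")
middle = soup.find(id="b")

# The immediate sibling is the newline between the tags, not the next <div>.
print(repr(middle.next_sibling))
print(middle.next_sibling.next_sibling)

# These methods jump straight to the nearest matching tag.
next_div = middle.find_next_sibling("div")
prev_div = middle.find_previous_sibling("div")
print(next_div["id"], prev_div["id"])
```

Using the `find_*_sibling` methods avoids depending on how much whitespace the page happens to contain.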
elem = soup.select('#loginModal')
print(elem)
elem = soup.select('.modal-footer')
print(elem)
- Saving the result requires the pandas library
pip install pandas
- Import pandas into the program to store the data in a DataFrame and later save the DataFrame to a .csv file
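As a rough sketch of that step, the snippet below builds a DataFrame from a few records and writes it out as CSV. The rows and column names are hypothetical stand-ins for whatever the scraper collects, and the output goes to an in-memory buffer so the example leaves no file behind.

```python
import io

import pandas as pd

# Hypothetical scraped records; in practice these come from the parsed HTML.
rows = [
    {"text": "Home", "url": "https://codewithharry.com/"},
    {"text": "Videos", "url": "https://codewithharry.com/videos"},
]

df = pd.DataFrame(rows, columns=["text", "url"])

# Pass a path such as "links.csv" instead of the buffer to write a real file.
buffer = io.StringIO()
df.to_csv(buffer, index=False)  # index=False leaves out the row numbers
csv_text = buffer.getvalue()
print(csv_text)
```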
- Python code to request the URL:
# Send a browser-like User-Agent to work around the blocking issue
agent = {"User-Agent":'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'}
# Make the request to the link
response = requests.get('https://www.naukri.com/jobs-in-andhra-pradesh', headers=agent)
- Output when printing the HTML:
<!DOCTYPE html>
<html>
<head>
<title>Naukri reCAPTCHA</title>
# this is a reCAPTCHA page title, not the title of the page that was requested
<meta name="robots" content="noindex, nofollow" />
<link rel="stylesheet" href="https://static.naukimg.com/s/4/101/c/common_v62.min.css" />
<script src="https://www.google.com/recaptcha/api.js" async defer></script>
</head>
</html>
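A scraper can check for this kind of response before trying to parse it. The sketch below is only a heuristic: `looks_like_captcha` is a hypothetical helper, and the marker it searches for is taken from the output shown above, so other sites may need a different check.

```python
def looks_like_captcha(html_text: str) -> bool:
    """Heuristic: the blocked page above mentions reCAPTCHA in its markup."""
    return "recaptcha" in html_text.lower()


# Trimmed copy of the blocked response shown above.
blocked_page = """<!DOCTYPE html>
<html>
<head>
<title>Naukri reCAPTCHA</title>
<script src="https://www.google.com/recaptcha/api.js" async defer></script>
</head>
</html>"""

# Made-up example of an ordinary page for comparison.
normal_page = "<html><head><title>Jobs</title></head><body></body></html>"

print(looks_like_captcha(blocked_page))
print(looks_like_captcha(normal_page))
```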
- Using Google Cache along with a referer can prevent these captchas (remember not to send more than 2 requests/sec, or you may get blocked):
header = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36" ,'referer':'https://www.google.com/'}
r = requests.get("http://webcache.googleusercontent.com/search?q=cache:www.naukri.com/jobs-in-andhra-pradesh",headers=header)
- This gives:
>>> r.content
[Squeezed 2554 lines]
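One way to stay under the 2 requests/sec limit mentioned above is to space the calls out. This is a minimal sketch with a hypothetical `Throttle` helper; the URL list holds placeholders and no network request is actually made.

```python
import time


class Throttle:
    """Hypothetical helper that keeps calls at least min_interval seconds apart."""

    def __init__(self, min_interval: float):
        self.min_interval = min_interval
        self._last_call = None

    def wait(self) -> None:
        now = time.monotonic()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                time.sleep(remaining)
        self._last_call = time.monotonic()


# 2 requests/sec corresponds to at least 0.5 s between requests.
throttle = Throttle(min_interval=0.5)

urls = ["page-1", "page-2", "page-3"]  # placeholders, no network access here
timestamps = []
for url in urls:
    throttle.wait()
    # A call such as requests.get(url, headers=header) would go here.
    timestamps.append(time.monotonic())

gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
print(gaps)
```

The first call goes through immediately; only the following ones are delayed.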
- Name - Abhinav
- GitHub - github.com/abhinavg916