Can't Get H2 Text using BeautifulSoup (Python 3) #62223

anasadiek · 2023-07-30T15:48:02Z

I have a problem when scraping the following web page:

https://wuzzuf.net/jobs/p/j5IuFAHi1HH0-Sales-Manager-Automotive-spare-parts--Assiut-Assiut-Egypt?o=1&l=sp&t=sj&a=sales%20manager|search-v3|navbl&s=33737584

I use the following code to get the h2 text:

    import requests
    from bs4 import BeautifulSoup

    req3list = []
    url3 = "https://wuzzuf.net/jobs/p/j5IuFAHi1HH0-Sales-Manager-Automotive-spare-parts--Assiut-Assiut-Egypt?o=1&l=sp&t=sj&a=sales%20manager|search-v3|navbl&s=33737584"
    resp3 = requests.get(url3)
    print(resp3)
    soup3 = BeautifulSoup(resp3.text, 'html')
    req3 = soup3.find_all('h2')
    print(req3)
    for r in req3:
        req3list.append(r.text.strip())

    print(req3list)

But I get this result:

<Response [200]>
[]
[]

Charlotte-br560 · 2024-03-20T13:47:32Z

It seems like the HTML structure of the webpage might have changed or there might be some issue with the selector. I would suggest trying to inspect the webpage directly to confirm the presence and structure of the h2 elements. Additionally, ensure that the webpage content is being fetched correctly.
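
One way to check that the content is being fetched correctly is to summarize the HTML before looking for specific tags. The following is only a sketch: `summarize_html` is a hypothetical helper name, the sample markup is made up, and it assumes `beautifulsoup4` is installed.

```python
from bs4 import BeautifulSoup


def summarize_html(html: str) -> dict:
    """Report basic facts about an HTML string (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.get_text(strip=True) if soup.title else None
    return {
        "length": len(html),                           # size of the raw markup
        "title": title,                                # <title> text, if any
        "h2_count": len(soup.find_all("h2")),          # what the scraper can see
        "script_count": len(soup.find_all("script")),  # hint of JS-driven pages
    }


# Made-up sample standing in for the text of a real response
sample = (
    "<html><head><title>Demo</title></head>"
    "<body><h2> Job Details </h2><script></script></body></html>"
)
print(summarize_html(sample))
```

With the code from the question, this would be called as `summarize_html(resp3.text)`. An `h2_count` of 0 together with a 200 status suggests that the server returned different markup from what the browser displays.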

Mustafahubs · 2024-05-25T04:57:17Z

Hi @anasadiek,

I see you had an issue when trying to scrape the webpage for h2 tags. The problem was due to missing headers in your request, which can result in the website returning a different (often less detailed) version of the page or even blocking your request.

The code below includes headers, which makes the request mimic a real browser so that the website returns the full content, including the h2 tags.

Here is the working code:

import requests
from bs4 import BeautifulSoup

headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'accept-language': 'en-US,en;q=0.9',
    'cache-control': 'max-age=0',
    # The 'cookie' and 'if-none-match' values copied from the browser session are omitted:
    # the analytics cookies are session-specific, and 'if-none-match' makes the request
    # conditional, so the server may answer 304 Not Modified with an empty body.
    'priority': 'u=0, i',
    'sec-ch-ua': '"Not/A)Brand";v="8", "Chromium";v="126", "Microsoft Edge";v="126"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Windows"',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'none',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36 Edg/126.0.0.0',
}

url = 'https://wuzzuf.net/jobs/p/j5IuFAHi1HH0-Sales-Manager-Automotive-spare-parts--Assiut-Assiut-Egypt?o=1&l=sp&t=sj&a=sales%20manager|search-v3|navbl&s=33737584'
resp = requests.get(url, headers=headers)
soup = BeautifulSoup(resp.text, 'html.parser')
h2_tags = soup.find_all('h2')

for tag in h2_tags:
    print(tag.text.strip())

Output:

Job Details
Job Description
Job Requirements
Similar Jobs

Key Points:

Headers Inclusion: Adding headers makes your request more "browser-like," which helps in getting the correct response from the server.
User-Agent: This is crucial as it tells the server what type of device and browser is making the request. Many websites serve different content based on this.
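Correct Parser: Passing 'html.parser' explicitly makes BeautifulSoup use Python's built-in HTML parser, so the result does not depend on which optional parsers happen to be installed.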

By including these headers, the server is more likely to treat your request as a legitimate one from a web browser, thereby providing you with the correct page content.
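
A full set of headers copied from a browser is usually more than a server checks, so it can be worth starting from a small set and adding headers only if needed. The sketch below assumes `requests` is installed; `MINIMAL_HEADERS`, `make_session`, and `fetch_html` are hypothetical names, and whether this particular header set is enough for a given site is an assumption that has to be tested against that site.

```python
import requests

# Assumed minimal header set; the headers a given site requires may differ.
MINIMAL_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}


def make_session() -> requests.Session:
    """Return a Session that sends the headers above on every request."""
    session = requests.Session()
    session.headers.update(MINIMAL_HEADERS)
    return session


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Fetch a page and return its HTML, raising on 4xx/5xx responses."""
    with make_session() as session:
        resp = session.get(url, timeout=timeout)
        resp.raise_for_status()  # fail loudly instead of parsing an error page
        return resp.text
```

The returned string can then be passed to `BeautifulSoup(html, 'html.parser')` as in the code above. Using a `Session` also keeps any cookies the server sets between requests, which avoids hard-coding a cookie string.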

Hope this helps!

Best regards,
Mustafa

MrShadowRIFAT · 2024-05-25T06:42:08Z

Hi there,

The issue you're encountering likely stems from the fact that the content on the webpage you're trying to scrape is dynamically loaded via JavaScript. The requests library and BeautifulSoup can only scrape the static HTML content that's initially loaded, and they don't execute JavaScript.

Here are a few steps and a potential solution using Selenium to handle JavaScript-rendered content:

Verify the HTML Content: First, check if the <h2> tags exist in the static HTML content returned by requests. Print out the HTML content:
```
print(resp3.text)
```
If you don't see the <h2> tags in the output, it means they are loaded dynamically.

Use Selenium for Dynamic Content: To scrape content that's loaded via JavaScript, you can use Selenium, a browser automation tool. Here’s how you can modify your code:

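from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from webdriver_manager.chrome import ChromeDriverManager

# Set up the WebDriver (webdriver_manager downloads a matching ChromeDriver)
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

try:
    # Open the webpage
    driver.get('https://wuzzuf.net/jobs/p/j5IuFAHi1HH0-Sales-Manager-Automotive-spare-parts--Assiut-Assiut-Egypt?o=1&l=sp&t=sj&a=sales%20manager|search-v3|navbl&s=33737584')

    # Wait up to 10 seconds for at least one <h2> to appear, since
    # JavaScript-rendered content may arrive after the initial page load
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.TAG_NAME, 'h2'))
    )

    # Get the rendered page source and parse it with BeautifulSoup
    soup3 = BeautifulSoup(driver.page_source, 'html.parser')

    # Find and print the h2 text
    req3 = soup3.find_all('h2')
    req3list = [r.text.strip() for r in req3]
    print(req3list)
finally:
    # Close the WebDriver even if an error occurs
    driver.quit()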

Check the Correct Tag and Class: Ensure that you are targeting the correct HTML elements. Sometimes, tags might have specific classes or IDs.

Using Selenium will allow the browser to fully render the page, including the content loaded by JavaScript, and then BeautifulSoup can parse the rendered HTML.
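
The advice about checking the correct tag and class can be sketched as follows. This is an illustration only: the markup and the class names (`section-title`, `sidebar-title`) are made up and not taken from the real page, and `h2_texts` is a hypothetical helper. It assumes `beautifulsoup4` is installed.

```python
from bs4 import BeautifulSoup

# Made-up markup; the class names are hypothetical, not from the real page.
SAMPLE_HTML = """
<div class="job">
  <h2 class="section-title">Job Details</h2>
  <h2 class="section-title">Job Requirements</h2>
  <h2 class="sidebar-title">Similar Jobs</h2>
</div>
"""


def h2_texts(html, css_class=None):
    """Return stripped <h2> texts, optionally limited to one CSS class."""
    soup = BeautifulSoup(html, "html.parser")
    if css_class is None:
        tags = soup.find_all("h2")
    else:
        # class_ (with underscore) is bs4's keyword for the HTML class attribute
        tags = soup.find_all("h2", class_=css_class)
    return [tag.get_text(strip=True) for tag in tags]


print(h2_texts(SAMPLE_HTML))                   # every h2 in the sample
print(h2_texts(SAMPLE_HTML, "section-title"))  # only h2 tags with that class
```

The same function works on `driver.page_source` or on `resp.text`, so it can be used to compare what each approach actually returns.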

Let me know if this helps or if you need further assistance!

Best regards,
MrShadowRIFAT
