Web Scraping and Data Parsing Using Beautiful Soup


This project provides a clear and concise example of how to fetch content from a website using the Requests module and then parse it using BeautifulSoup.

Setting Up

To run this example, you will need Python 3. We recommend setting up a virtual environment.

Install dependencies by running

$ pip install requests
$ pip install beautifulsoup4
$ pip install pandas

Note: You can also install them by using the requirements.txt file included in this repository.

$ pip install -r src/requirements.txt

Web Scraping

A mock bookstore website called https://books.toscrape.com is our scraping target.

Use the requests module to fetch a page from it

import requests

response = requests.get('https://books.toscrape.com')

Once the response is retrieved, check whether the request was successful by inspecting the status_code property

if response.status_code != 200:
    print('Page not found')
    exit(1)

print('Successfully fetched the page')

Save the script as src/scrape.py and run it.

$ python3 src/scrape.py

Successfully fetched the page

The requests module has successfully retrieved the HTML content from the website, and now all that's left is to parse it.
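Put together, the fetch and the status check can live in one small helper. This is a sketch only; fetch_page is a name introduced here for illustration, not taken from the repository.

```python
import requests

def fetch_page(url):
    """Fetch a page; return the Response on HTTP 200, otherwise None."""
    response = requests.get(url)
    if response.status_code != 200:
        print('Page not found')
        return None
    print('Successfully fetched the page')
    return response
```

Returning None instead of exiting makes the helper easier to reuse and test than the standalone script.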

A working example can be found in src/scrape.py.

Parse HTML

Take a look at the structure of the HTML that you're trying to scrape.

<article class="product_pod">
    ...
    <h3>
        <a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a>
    </h3>
    ...
</article>

The book info is neatly wrapped in an article tag. Inside the article, there's a heading (h3) that contains an anchor (a), which holds the title of the book in its title attribute.

<a ... title="A Light in the Attic">...</a>

To parse this HTML content, use the BeautifulSoup4 library.

First, import BeautifulSoup

from bs4 import BeautifulSoup

Then, create an instance of the BeautifulSoup class and load the HTML content that has been retrieved from the web page previously.

soup = BeautifulSoup(response.content, 'html.parser')

Retrieve all the article tags

articles = soup.find_all('article')

Define a titles list that will hold all the book titles extracted from the current HTML

titles = []

Iterate through every article to extract the title attribute of the anchor tag. You may want to print the title as well, just to see whether the script works as expected

for article in articles:
    title = article.h3.a.attrs['title']
    titles.append(title)
    print(title)
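These steps can be checked offline against the sample snippet shown earlier. A minimal sketch; the inline HTML here is a trimmed copy of one product card from the page.

```python
from bs4 import BeautifulSoup

# Trimmed copy of one product card from books.toscrape.com.
html = '''
<article class="product_pod">
    <h3>
        <a href="catalogue/a-light-in-the-attic_1000/index.html"
           title="A Light in the Attic">A Light in the ...</a>
    </h3>
</article>
'''

soup = BeautifulSoup(html, 'html.parser')

titles = []
for article in soup.find_all('article'):
    # The full book title lives in the anchor's title attribute, not its text.
    titles.append(article.h3.a.attrs['title'])

print(titles)  # ['A Light in the Attic']
```

Note that the anchor's visible text is truncated ("A Light in the ..."), which is why the title attribute is the right place to read from.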

Save the script as src/parse.py and run it

$ python3 src/parse.py                                              
Successfully fetched the page         
A Light in the Attic       
Tipping the Velvet
Soumission
...

All the book titles have been parsed successfully!
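As an alternative to tag navigation, the same titles can be extracted with a CSS selector via select(). A minimal offline sketch with inline HTML:

```python
from bs4 import BeautifulSoup

html = '''
<article class="product_pod"><h3><a title="Tipping the Velvet">...</a></h3></article>
<article class="product_pod"><h3><a title="Soumission">...</a></h3></article>
'''

soup = BeautifulSoup(html, 'html.parser')
# select() takes a CSS selector and returns matching tags in document order.
titles = [a['title'] for a in soup.select('article.product_pod h3 a')]
print(titles)  # ['Tipping the Velvet', 'Soumission']
```

The selector form is more resilient if the page ever nests the article tags differently, since it matches on the product_pod class rather than on tag order.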

A working example can be found in src/parse.py.

Save to csv

Printing everything to standard output can become messy at times. Instead, it is a good idea to save the results into a CSV file.

Start by deleting the print call.

print(title) # delete this!

Next, create a data frame by using the pandas library. In the constructor, pass a dictionary that maps the column name ("Title") to the list of titles parsed previously.

import pandas

data_frame = pandas.DataFrame({'Title': titles})

Finally, save the data frame to a file by using the to_csv method

data_frame.to_csv('books.csv', index=False, encoding='utf-8')
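To preview the output without writing a file, to_csv() can also be called without a path, in which case it returns the CSV text as a string. A sketch using sample titles from the page:

```python
import pandas

titles = ['A Light in the Attic', 'Tipping the Velvet', 'Soumission']
data_frame = pandas.DataFrame({'Title': titles})

# With no path argument, to_csv() returns the CSV as a string.
csv_text = data_frame.to_csv(index=False)
print(csv_text)
```

This is handy for a quick sanity check before committing to the file format.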

Save the script as src/save.py and execute it.

$ cd src
$ python3 save.py
Successfully fetched the page

Use the cat Unix utility to print the CSV file.

$ cat books.csv
Title
A Light in the Attic       
Tipping the Velvet
Soumission
...

The newly created file now contains all the book titles from the web page.

The final version of the script can be found in src/save.py.
