nuslds/intro-beautifulsoup

Introduction to Web Scraping with BeautifulSoup in Python

This workshop introduces how to extract information from a static HTML website with BeautifulSoup in Python. Please note that it is designed for participants with no programming experience.

Workshop Materials

Introduction

1. What is Web Scraping?

Web scraping is the process of retrieving information from websites in an automated way.

2. Why Web Scraping?

Web scraping saves you the trouble of repeatedly copying or downloading data from different websites. It produces datasets for data-driven projects by automating the extraction of online data and transforming the scraped data into structured formats.

Source: How Web Scraping is Transforming the World with its Applications

3. What to Consider before Web Scraping

  • Terms and Conditions of the hosting sites
    • Do make sure you read the terms and conditions of the websites carefully and understand the restrictions.
    • Some sites have a robots.txt file that disallows scraping of particular content.
    • Check if an API exists or if the data is otherwise available for download or sale.
  • The bandwidth of the hosting sites
    • To avoid placing an excessive burden on the hosting sites, limit your bandwidth use, e.g., wait a few seconds between requests and scrape during off-peak hours.

Source: Web Scraping, Columbia University Mailman School of Public Health
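
The two considerations above (respecting robots.txt and pausing between requests) can be sketched with Python's standard library alone. This is a minimal illustration, not part of the workshop materials: the helper names `build_robot_parser` and `polite_urls`, the sample rules, and the `example.com` URLs are all hypothetical.

```python
import time
from urllib.robotparser import RobotFileParser


def build_robot_parser(robots_txt):
    """Build a parser from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_urls(parser, urls, user_agent="*", delay=2.0):
    """Yield only the URLs that robots.txt allows, pausing after each one."""
    for url in urls:
        if parser.can_fetch(user_agent, url):
            yield url
            # Wait before the caller moves on to the next request.
            time.sleep(delay)


# Hypothetical rules for illustration; a real site publishes its own file.
SAMPLE_RULES = """\
User-agent: *
Disallow: /private/
"""

parser = build_robot_parser(SAMPLE_RULES)
candidates = [
    "https://example.com/page/1/",
    "https://example.com/private/data.html",
]
allowed = list(polite_urls(parser, candidates, delay=0))
```

To check a live site instead of a string, `RobotFileParser` also offers `set_url()` followed by `read()`, which download and parse the file for you. Note that robots.txt does not replace reading the site's terms and conditions.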

Overview

Task Today

To demonstrate, we will extract quotes from Quotes to Scrape, a project created by Scrapinghub (GitHub repo). We will create a reusable function that scrapes quotes from the website for a given range of page numbers. The outputs will be converted into tabular format and exported to CSV.

# scrape quotes from page 1 to page 10
outputs = scrape_quotes(1, 10)
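
One possible shape for such a function is sketched below. It is not the workshop's actual implementation. It assumes that pages live at `https://quotes.toscrape.com/page/<n>/` and that each quote sits in a `div` with class `quote` containing a `span.text`, a `small.author`, and `a.tag` links. The helpers `parse_quotes` and `save_to_csv`, the `delay` parameter, and the column names are hypothetical, and the standard library's `urllib` and `csv` stand in for whatever the workshop itself uses to fetch pages and build tables.

```python
import csv
import time
from urllib.request import urlopen

from bs4 import BeautifulSoup

# Assumed URL pattern for the paginated site.
BASE_URL = "https://quotes.toscrape.com/page/{}/"


def parse_quotes(html):
    """Extract one record per quote from the HTML of a single page."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for block in soup.find_all("div", class_="quote"):
        tags = [a.get_text(strip=True) for a in block.find_all("a", class_="tag")]
        records.append({
            "text": block.find("span", class_="text").get_text(strip=True),
            "author": block.find("small", class_="author").get_text(strip=True),
            "tags": ", ".join(tags),
        })
    return records


def scrape_quotes(start_page, end_page, delay=1.0):
    """Scrape quotes from start_page to end_page (both inclusive)."""
    outputs = []
    for page in range(start_page, end_page + 1):
        with urlopen(BASE_URL.format(page)) as response:
            html = response.read().decode("utf-8")
        outputs.extend(parse_quotes(html))
        time.sleep(delay)  # be gentle with the hosting site
    return outputs


def save_to_csv(records, path):
    """Write the scraped records to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["text", "author", "tags"])
        writer.writeheader()
        writer.writerows(records)
```

Keeping the parsing step in its own function means it can be tried on a saved HTML file or a short string before any request is sent, and `save_to_csv(outputs, "quotes.csv")` would then write the result to disk.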
