compmonk/Data-Collection-and-Web-Scraping

Data-Collection-and-Web-Scraping

Overview

A Study Session on Data Collection and Web Scraping

Learning Objectives

  • Use Beautiful Soup to parse HTML.
  • Use Chrome Developer Tools to identify HTML elements.
  • Scrape websites using Beautiful Soup.
  • Automate scraping using Splinter.
  • Collect and organize scraped data in a pandas DataFrame.

0. Prework

0.1 Installations

The activities require the following tools and packages:

  • Chrome
  • Chrome Driver
  • Beautiful Soup
  • requests
  • splinter
  • html5lib
  • lxml
  • pandas

Chrome

Installation instructions for Chrome Driver

pip install requests
pip install beautifulsoup4
pip install "splinter[selenium4]"
pip install html5lib
pip install lxml
pip install pandas

0.2 Installation Check

Run the Installation Check to verify all installations. Install any missing packages as needed.
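For a quick check from a plain Python prompt, something like the following may help. It is a minimal sketch, not the repository's own Installation Check: the helper name `find_missing` is made up for illustration, and the script only looks for the Python packages from section 0.1. It does not verify that Chrome or Chrome Driver are installed.

```python
from importlib import util

# Import names for the packages listed in section 0.1.
# Note that the beautifulsoup4 package is imported as "bs4".
REQUIRED_MODULES = ["requests", "bs4", "splinter", "html5lib", "lxml", "pandas"]


def find_missing(modules):
    """Return the module names that cannot be located in this environment."""
    return [name for name in modules if util.find_spec(name) is None]


missing = find_missing(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are installed.")
```

Any name reported as missing can be installed with the matching `pip install` command from section 0.1.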

1. HTML, CSS and JavaScript Scraping

Activity Time: 0:20 Elapsed Time: 0:20

1.1 Building A Webpage (10 min)

Starter : index.html

Solution : index.html

1.2 Scraping A Webpage (10 min)

Starter : 1_2_Scraping_A_Webpage.ipynb

Solution : 1_2_Scraping_A_Webpage.ipynb
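As a rough illustration of what parsing a page with Beautiful Soup looks like, here is a small sketch. The HTML string is a hypothetical stand-in for the activity's index.html (its real contents are not shown here), and the built-in `html.parser` is used so the example runs without lxml or html5lib.

```python
from bs4 import BeautifulSoup

# Hypothetical page content, standing in for the activity's index.html.
html = """
<html>
  <head><title>My First Page</title></head>
  <body>
    <h1 id="main-heading">Hello, scraper</h1>
    <p class="intro">First paragraph.</p>
    <p>Second paragraph.</p>
    <a href="https://example.com">A link</a>
  </body>
</html>
"""

# Parse the markup into a navigable tree.
soup = BeautifulSoup(html, "html.parser")

# Access a tag directly, look one up by attribute, and collect repeated tags.
title = soup.title.get_text()
heading = soup.find("h1", id="main-heading").get_text()
paragraphs = [p.get_text() for p in soup.find_all("p")]
link_url = soup.find("a")["href"]

print(title)
print(heading)
print(paragraphs)
print(link_url)
```

`find` returns the first match (or `None`), while `find_all` returns every match, so the first suits unique elements such as a heading and the second suits repeated ones such as paragraphs.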

2. Scraping the World Wide Web

Activity Time: 0:25 Elapsed Time: 0:45

2.1 Inspect using Chrome Dev Tools (5 min)

Site: Laptops Site

2.2 Webscraper (20 min)

Site: Laptops Site

Starter : 2_2_Webscraper.ipynb

Solution : 2_2_Webscraper.ipynb
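Once Chrome Dev Tools has revealed which elements hold the data, the scraping step could look roughly like the sketch below. The markup, the class names (`product`, `title`, `price`), and the function `parse_products` are all assumptions made for illustration; the real Laptops Site will use its own structure, which should be read off in Dev Tools first.

```python
from bs4 import BeautifulSoup

# Hypothetical product listing; the real site's tags and classes will differ.
SAMPLE_HTML = """
<div class="product">
  <a class="title" href="/laptop/1">Laptop A</a>
  <span class="price">$295.99</span>
</div>
<div class="product">
  <a class="title" href="/laptop/2">Laptop B</a>
  <span class="price">$1,099.00</span>
</div>
"""


def parse_products(html):
    """Extract title, URL, and numeric price from each product card."""
    soup = BeautifulSoup(html, "html.parser")
    products = []
    for card in soup.select("div.product"):
        title = card.select_one("a.title")
        price = card.select_one("span.price")
        if title is None or price is None:
            # Skip cards that do not match the assumed layout.
            continue
        price_text = price.get_text(strip=True).lstrip("$").replace(",", "")
        products.append(
            {
                "title": title.get_text(strip=True),
                "url": title.get("href"),
                "price": float(price_text),
            }
        )
    return products


products = parse_products(SAMPLE_HTML)
for product in products:
    print(product)
```

To run this against a live page, the HTML string can be fetched first, for example with `requests.get(url, timeout=10).text`, and passed to `parse_products` after adjusting the selectors.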

3. Splinter

Activity Time: 0:30 Elapsed Time: 1:15

3.1 Stacking and Over Flowing (15 min)

Site: Stack Overflow

Starter : 3_1_Stacking_and_Over_Flowing.ipynb

Solution : 3_1_Stacking_and_Over_Flowing.ipynb

3.2 What's New (15 min)

Site: Global Voices

Starter : 3_2_Whats_New.ipynb

Solution : 3_2_Whats_New.ipynb
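The general pattern in this section is to let Splinter drive a real browser, then hand the rendered HTML to Beautiful Soup. The sketch below shows one way that might be arranged; it is not taken from the notebooks. The function names, the sample markup, and the `h3 a` selector are assumptions, and `fetch_rendered_html` needs Chrome and Chrome Driver installed, so it is defined here but not called.

```python
from bs4 import BeautifulSoup


def fetch_rendered_html(url):
    """Open `url` in Chrome via Splinter and return the rendered page HTML."""
    # Imported here so the parsing helper below works without a browser.
    from splinter import Browser

    browser = Browser("chrome")
    try:
        browser.visit(url)
        return browser.html
    finally:
        # Always close the browser, even if the visit fails.
        browser.quit()


def extract_headlines(html, selector="h3 a"):
    """Return the text of every element matching `selector` (assumed layout)."""
    soup = BeautifulSoup(html, "html.parser")
    return [link.get_text(strip=True) for link in soup.select(selector)]


# Hypothetical markup standing in for a rendered page.
SAMPLE_HTML = """
<h3><a href="/story-1">First headline</a></h3>
<h3><a href="/story-2">Second headline</a></h3>
"""

headlines = extract_headlines(SAMPLE_HTML)
print(headlines)
```

Keeping the browser step and the parsing step in separate functions makes the parsing logic easy to try out on saved HTML before pointing it at a live site with `extract_headlines(fetch_rendered_html(url))`.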

4. Data Collection

Activity Time: 0:25 Elapsed Time: 1:40

4.1 Framing the Quotes (25 min)

Site: Quotes to Scrape

Starter : 4_1_Framing_the_Quotes.ipynb

Solution : 4_1_Framing_the_Quotes.ipynb
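Collecting scraped records into a pandas DataFrame usually means building one dictionary per item and passing the list to `pd.DataFrame`. The following is a sketch under assumptions: the sample markup and its class names (`quote`, `text`, `author`, `tag`) are illustrative and should be checked against the actual site, and `quotes_to_frame` is a hypothetical helper, not the notebook's solution.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical markup; verify the real page structure in Chrome Dev Tools.
SAMPLE_HTML = """
<div class="quote">
  <span class="text">First sample quote.</span>
  <small class="author">Author One</small>
  <div class="tags"><a class="tag">life</a><a class="tag">humor</a></div>
</div>
<div class="quote">
  <span class="text">Second sample quote.</span>
  <small class="author">Author Two</small>
  <div class="tags"><a class="tag">books</a></div>
</div>
"""


def quotes_to_frame(html):
    """Build a DataFrame with one row per quote block found in `html`."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for block in soup.select("div.quote"):
        rows.append(
            {
                "text": block.select_one("span.text").get_text(strip=True),
                "author": block.select_one("small.author").get_text(strip=True),
                # Join the tags so each quote stays on a single row.
                "tags": ", ".join(
                    tag.get_text(strip=True) for tag in block.select("a.tag")
                ),
            }
        )
    # Fixing the column order keeps the frame stable even when no rows exist.
    return pd.DataFrame(rows, columns=["text", "author", "tags"])


df = quotes_to_frame(SAMPLE_HTML)
print(df)
```

From here the frame can be filtered, grouped by author, or written out with `df.to_csv("quotes.csv", index=False)`.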

5. Q & A

Activity Time: 0:10 Elapsed Time: 1:50

End Session