uc-python/python-for-web-scraping

Python for Web Scraping

Course Description

This 3-week course covers the fundamentals of interacting with websites using Python. Students will learn how to fetch web pages and parse useful information out of HTML. To accomplish this, the requests and beautifulsoup libraries will be covered in some depth, and the pandas library will be used to wrangle the scraped data. The course will also introduce the selenium package for interacting with JavaScript-heavy sites and the scrapy package for recursively crawling multiple web pages.
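As a rough illustration of the parse-then-wrangle workflow described above, the sketch below turns a small HTML table into records and then filters, sorts, and aggregates them. The HTML snippet, the `TableParser` class, and the `parse_table` helper are all made up for this example. To keep it runnable without extra installs, it uses the standard library's `html.parser` and plain lists as stand-ins; the course itself relies on requests, beautifulsoup, and pandas for these steps.

```python
from html.parser import HTMLParser

# Hypothetical page content; in practice this text would come from an HTTP response.
SAMPLE_HTML = """
<table>
  <tr><th>item</th><th>category</th><th>price</th></tr>
  <tr><td>apple</td><td>fruit</td><td>1.50</td></tr>
  <tr><td>bread</td><td>bakery</td><td>3.00</td></tr>
  <tr><td>cherry</td><td>fruit</td><td>4.25</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        # Ignore whitespace between tags; keep only text inside a cell.
        if self._in_cell:
            self._row[-1] += data.strip()


def parse_table(html):
    """Return one dict per body row, keyed by the header row's cell text."""
    parser = TableParser()
    parser.feed(html)
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


records = parse_table(SAMPLE_HTML)
for record in records:
    record["price"] = float(record["price"])  # scraped values arrive as strings

fruit = [r for r in records if r["category"] == "fruit"]          # filter
by_price = sorted(fruit, key=lambda r: r["price"], reverse=True)  # sort
total = sum(r["price"] for r in fruit)                            # aggregate
```

Scraped values start out as strings, so the explicit conversion to `float` has to happen before any sorting or summing on the price column.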

Objectives

  1. Students will be able to pull data from a simple website into Python and filter, sort, and aggregate it (primary goal of the course).
  2. Students will be able to explain the basics of how web pages are represented as HTML and list the common tools available in Python to convert that HTML into useful information.
  3. Students will be able to articulate the differences between a static web page and a dynamic one. They will be able to explain, at a high level, how the selenium library can help parse dynamic pages and will know where to look for more information on selenium.
  4. Students will be able to articulate the use cases that necessitate web crawlers. They will be able to explain, at a high level, how the scrapy library can be used to build web crawlers.
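To make the crawling idea in objective 4 concrete, here is a minimal sketch of a breadth-first crawler that follows links and avoids revisiting pages. The three-page `FAKE_SITE`, the `LinkParser` class, and the `crawl` function are hypothetical and exist only for this example; the site lives in memory so nothing is fetched over the network. This is not how scrapy is used, only an illustration of the bookkeeping (a queue of pages to visit and a set of pages already seen) that a crawling framework takes care of.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical three-page site kept in memory so the sketch needs no network access.
FAKE_SITE = {
    "/": '<a href="/a">A</a> <a href="/b">B</a>',
    "/a": '<a href="/b">B</a> <a href="/">home</a>',
    "/b": '<a href="/missing">broken link</a>',
}


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    parser = LinkParser()
    parser.feed(html)
    return parser.links


def crawl(start, fetch):
    """Visit pages breadth-first from `start`; `fetch` returns HTML or None."""
    seen = {start}          # every URL ever queued, to avoid loops
    queue = deque([start])  # URLs waiting to be fetched
    visited = []            # URLs fetched successfully, in visit order
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:    # page does not exist; skip it
            continue
        visited.append(url)
        for link in extract_links(html):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


pages = crawl("/", FAKE_SITE.get)
```

Passing `fetch` as a parameter keeps the traversal logic separate from how pages are retrieved, so the in-memory lookup could be swapped for a real HTTP call.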

Prerequisites

Students should be familiar with the core concepts of Python: assignments, if-statements, for-loops, using functions, and ideally writing functions as well. Experience with Jupyter notebooks (or JupyterLab) and pandas DataFrames will make the course easier but is not required.

Attendees who need a refresher might consider two excellent Python books by Jake VanderPlas:

  • A Whirlwind Tour of Python – This book is an overview of the very basics of Python, and is meant as an introduction to the following, longer text. It's available for free online here.
  • The Python Data Science Handbook – An absolutely incredible resource, managing to cover enormous breadth and reasonable depth of the core Python libraries for data science, without being dry. Available free online here.

Format

The course content will be delivered via pre-recorded lectures, which I will release weekly for 3 weeks in chunks of approximately 5 hours. I will also distribute supplemental exercises to go along with the lectures, meant to take a few hours to complete. We will meet live for 4 hours each Saturday during the course, 9am-1pm eastern. During this time, I will briefly review the week's lectures; students are encouraged to use the remaining time to complete the exercises, as a TA and I will be available to answer questions and help with issues during this session.

Preparation

Students will need to have Python installed on their computer along with an editor, ideally JupyterLab, though attendees who are more comfortable with another tool are welcome to use it. To install JupyterLab as part of the Anaconda Python distribution, follow these steps:

  1. Visit the Anaconda download page.
  2. Select your appropriate operating system.
  3. Click the "Download" button for Python 3.8; this will begin downloading the Anaconda installer.
  4. Open the installer when the download completes, and then follow the prompts. If you are prompted about installing PyCharm, elect not to do so.
  5. Once installed, open the Anaconda Navigator and launch JupyterLab to ensure it works.
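Once JupyterLab is running, one way to confirm that the course libraries can be imported is a quick check like the sketch below. The package list is an assumption based on the libraries named in the course description (beautifulsoup is imported under the name `bs4`), and `check_packages` is a hypothetical helper written for this example. Some of these packages may not ship with Anaconda by default, so a "missing" result just means the package still needs to be installed.

```python
from importlib.util import find_spec

# Import names assumed from the libraries named in the course description;
# beautifulsoup is imported as "bs4".
COURSE_PACKAGES = ["requests", "bs4", "pandas", "selenium", "scrapy"]


def check_packages(names):
    """Return a dict mapping each import name to True if it can be found."""
    return {name: find_spec(name) is not None for name in names}


status = check_packages(COURSE_PACKAGES)
for name, available in status.items():
    print(f"{name}: {'ok' if available else 'missing'}")
```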

Additionally, you may want to download the course materials from this repository. There are two ways to do so:

  1. Clone the repo - If you're familiar with this, you can clone this repository.
  2. Download the files as a zip - use this link.

Agenda

Sunday, May 16: First lecture series released
Saturday, May 22: Live session, 11:00-15:00 eastern
Sunday, May 23: Second lecture series released
Saturday, May 29: Live session, 11:00-15:00 eastern
Sunday, May 30: Third lecture series released
Saturday, June 5: Live session, 11:00-15:00 eastern

Your Instructor

If you have any specific questions prior to or during the class, you may reach out to me directly via GitHub or email:
