Web Scraping in Python: From HTML Soup to Tidy Data

A workshop created by Sami Friedrich for the BioData Club Workshop Series.

Overview

The internet is overflowing with data ripe for harvesting. The challenge is that not all of that data is neatly formatted or easily accessible. Enter the web scraping multitool! With the power of web scraping, the contents of virtually any webpage can be transformed into analysis-ready data. During this workshop, you'll learn how to use Python to:

Scavenge the contents of an HTML webpage
Extract only the data you want
Format the data into a table
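
As a rough preview, the three steps above might look something like the sketch below. It is not taken from the workshop notebook: the HTML string, tag names, and class names are made up for illustration, and an inline string stands in for a downloaded page so the example runs without a network connection. For a live page, the text returned by `requests.get(url).text` would take the place of `html`.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical HTML standing in for a downloaded page. The structure and
# class names here are invented for this example.
html = """
<ul id="menu">
  <li class="item"><span class="name">Pale Ale</span><span class="abv">5.2%</span></li>
  <li class="item"><span class="name">Stout</span><span class="abv">6.8%</span></li>
</ul>
"""

# Step 1: parse the raw HTML into a searchable tree.
soup = BeautifulSoup(html, "html.parser")

# Step 2: extract only the pieces we want from each list item.
rows = []
for item in soup.find_all("li", class_="item"):
    rows.append({
        "name": item.find("span", class_="name").get_text(strip=True),
        "abv": item.find("span", class_="abv").get_text(strip=True),
    })

# Step 3: format the extracted records as a table.
df = pd.DataFrame(rows)
print(df)
```

Collecting each record as a dictionary keeps the column names next to the values, which makes the final `pd.DataFrame(rows)` call a one-liner.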

Libraries used:

requests
BeautifulSoup4
pandas

Developer tools used:

the browser Inspector/Inspect tool
Google Colab

Prerequisites

Some basic Python knowledge (looping through list elements, passing arguments to functions, writing basic functions) is a prerequisite for this workshop.

If you are new to Python or want to brush up on these topics before the workshop, check out these free tutorials:
- http://introtopython.org/introducing_functions.html
- http://introtopython.org/lists_tuples.html#Lists-and-Looping

We will also be working with HTML; no prior experience is necessary. However, it will be helpful to have a surface-level understanding of HTML elements, namely their open/close tag structure and how they nest within each other.

If you are not familiar with HTML elements or tags, please take a look at this short overview on HTML Basics before beginning.
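
To make the idea of nesting concrete, here is a small sketch (the HTML snippet is invented for this example) showing how BeautifulSoup exposes the open/close tag structure as parent and child relationships:

```python
from bs4 import BeautifulSoup

# A made-up snippet: <b> is nested inside <p>, which is nested inside <div>.
snippet = "<div><p>Hello, <b>world</b>!</p></div>"
soup = BeautifulSoup(snippet, "html.parser")

bold = soup.find("b")

# Each element knows its own tag name and the element that encloses it.
print(bold.name)                # the tag itself
print(bold.parent.name)         # the tag that directly contains it
print(bold.parent.parent.name)  # one level further out

# get_text() gathers the text of an element and everything nested in it.
print(soup.find("p").get_text())
```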

Files

webscraping_workshop.ipynb is the Jupyter Notebook (without solutions) for the workshop. Follow the badge at the top to open it in Google Colab, or download it and run it locally (just make sure you've already installed the libraries listed above).
solutions_to_webscraping_workshop.ipynb contains solutions to the Jupyter Notebook exercises.
taphunter_belmont_station.html is the downloaded .html file for the webpage this workshop is designed to scrape. If you're running things locally, be sure to place this file in the same folder as webscraping_workshop.ipynb.
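
If you are running locally, one way to check that the .html file is in the right place is to try loading it, as in the sketch below. The helper name `load_soup` and the UTF-8 encoding are assumptions made for this example, not something taken from the notebook.

```python
from pathlib import Path

from bs4 import BeautifulSoup


def load_soup(path):
    """Read a saved .html file and parse it with BeautifulSoup.

    Assumes the file is UTF-8 encoded (an assumption for this sketch).
    """
    html = Path(path).read_text(encoding="utf-8")
    return BeautifulSoup(html, "html.parser")


# A relative path resolves against the current working directory, which for a
# notebook is normally the folder the notebook was opened from.
html_path = Path("taphunter_belmont_station.html")

if html_path.exists():
    soup = load_soup(html_path)
    print(soup.title)
else:
    print(f"{html_path} not found; place it next to the notebook first.")
```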

Other materials

The Google Slides presentation accompanying this workshop can be found here.

Author

Sami Friedrich, PhD candidate at Oregon Health and Science University. Please feel free to reach out with questions or comments!

License

This project is licensed under the MIT License (see LICENSE).
