scraping

Boilerplate Code / Tutorial for Web Scraping

Scraping is very much a trial-and-error process. It's best dealt with on a case-by-case basis. I've included a few examples, though, that highlight different ways of scraping data from websites.

None of them work perfectly, but take them as a starting point and as a reference for how to extract different types of information.

I've made it so all you have to do is download this repository as a zip, extract it, and start coding. That means you have to work from this directory, though, which isn't the best practice. You can follow along below if you want to set things up on your own machine.

Installing Stuff

There are 2 components to web scraping: a process for getting the data from the web and a process for parsing it. Python is probably the easiest to understand, so I'm going to use it today. The same concepts apply to other languages, though. The two libraries I'll make use of are Requests and Beautiful Soup.

You'll need 5 things:

  • Python
  • Pip (Python Package/Module Manager)
  • Requests (Python Module)
  • BeautifulSoup4 (Python Module)
  • lxml (Python Library for Parsing Websites)

You can install everything super quickly:

python -m pip install requests bs4 lxml --user

Finding Your Dataset

The world wide web is your oyster; you can make a dataset out of almost anything these days.

Some good options include making requests to public APIs (Twitter, Spotify, Google, etc.), websites with tables (Davidson, ESPN, etc.), and websites with formatted information (NYTimes, Google Search, etc.). But if it exists on the web, you can scrape it. Some pages are harder than others, but it can be done. Websites follow a similar structure to one another, so feel free to follow this process and apply it to other websites that interest you.

I'm going to use ESPN, the Twitter API, and the Davidson Website as examples for this demo.

The Setup

After you've got everything installed, make a directory to keep track of your things. If you download this repository, it'll create a "scraping-master" directory, so I'll refer to everything in relation to that. Feel free to rename it, though.

Open terminal and enter the following:

cd "$(find . -type d -iname 'scraping-master')"

The main file is going to be 'scraping.py'.

Actually Scraping

There are 3 elements to scraping: the url, the sauce, and the soup. The url is where you point Requests; it's where the data lives on the web. The sauce is what Requests returns -- it's the data, it just doesn't make sense yet. And the soup is what you're after -- at this point you can start to make sense of the data.

You can edit the url to point at the page that has your data, and know that it's possible to nest multiple requests (to different urls) within a single session. You might write a script that goes to one page, then opens another and gets more data from there. I've used this for paginated pages, and to access data that stems from a calendar.
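Here's a minimal sketch of that pattern, using a requests.Session and a made-up paginated url (swap in your own):

import requests
from bs4 import BeautifulSoup

# Reuse one session (cookies, connection pooling) across several requests.
session = requests.Session()

# Hypothetical paginated url pattern -- swap in the page that has your data.
base_url = "https://example.com/results?page={}"

for page in range(1, 4):
    sauce = session.get(base_url.format(page)).text
    soup = BeautifulSoup(sauce, "lxml")
    # Do something small with each page, e.g. print its title.
    print(soup.title.get_text() if soup.title else "no title")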

Step 1: the url

Go to the website with your source (ESPN) and navigate until you have the data on the screen. Copy that url and paste it into the code (in your url variable).
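In scraping.py that's just a string assignment; the ESPN homepage below is only a placeholder for whatever page you landed on:

# Placeholder url -- replace it with the exact page you navigated to.
url = "https://www.espn.com/"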

Step 2: the sauce

This is what the website spits back at you. It's html, but not yet in a form you can query. We pass it to bs4, though, and it gets parsed.
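A minimal sketch of that step, assuming the url variable from Step 1:

import requests

# The "sauce": the raw html the server sends back for that url.
sauce = requests.get(url).text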

Step 3: the soup

This is where you grab the information from. The soup is an object you can query -- usually some combination of soup.find("...", {"class": "..."}) and .get("...") calls will pull out what you need.
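For example, something like the following -- the td tag and score class here are made up, so use your browser's inspector to find the real names on your page:

from bs4 import BeautifulSoup

# Parse the sauce into a queryable soup object.
soup = BeautifulSoup(sauce, "lxml")

# Hypothetical tag and class names -- inspect the page to find the real ones.
for cell in soup.find_all("td", {"class": "score"}):
    print(cell.get_text(strip=True))

# Attributes come off tags with .get(), e.g. every link on the page:
for link in soup.find_all("a"):
    print(link.get("href"))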

Step 4: the output

Now that you've found the information you want, you can write it out to a csv. You might write the file line by line based on the table rows in the sauce, or you could collect your own information from the website and write that to a file instead.
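A minimal sketch using Python's csv module, with stand-in rows where your scraped values would go:

import csv

# Stand-in rows -- in practice you'd build this list from the soup.
rows = [
    ["team", "score"],
    ["Davidson", "78"],
]

with open("output.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerows(rows)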
