rebelcoding/startScraping

Start Scraping with Python

This repository uses LXML to scrape webpages. Beautiful Soup is an awesome package, but for those who want more fine-grained control over the pages they scrape, LXML is one of the parsers that Beautiful Soup can run on under the hood, and it can also be used directly.

That said, we're going to start with the basics, and add info as requested by users.

Python Variable Types

A quick note about numbers

In programming there are two basic types of numbers: integers and floats. Integers are whole numbers, positive or negative (e.g. 4, 231, -34). Floats are numbers that use decimal places (e.g. 3.76, .12, 32.00).
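
As a quick illustration of the difference, here is a small sketch you can paste into a Python 3 session. The variable names are made up for this example and are not used elsewhere in the tutorial.

```python
whole = 4          # an integer: no decimal point
decimal = 3.76     # a float: has a decimal point

print(type(whole))     # <class 'int'>
print(type(decimal))   # <class 'float'>

# In Python 3, dividing with / always produces a float,
# even when the answer is a whole number.
half = 32 / 2
print(half)            # 16.0

# Mixing an integer and a float in arithmetic produces a float.
total = whole + decimal
print(type(total))     # <class 'float'>
```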

Back to Basics

A string is a value that contains only text. Here we store one in a variable:

test = 'This is a test variable that we are declaring and defining at the same time'

A list is a collection of variables. Lists can contain strings, numbers, other lists, and dictionaries.

new_list = ['red', 'blue', 12, 3.45, ['dog', 'cat', 'taco'], {'color': 'red', 'animal': 'bird', 'travel_method': 'flight'}]

A dictionary is a collection of key/value items. The last item in our previous list is a dictionary. There are 3 keys: 'color', 'animal', and 'travel_method'. The ':' separates each key from its value. It is important to note that the keys here are strings, so they need quotes just like any other string; without them, Python would treat color as a variable name. Dictionaries can also be created in a different-looking format, dict(color='blue', animal='fish'), where the keys are written without quotes.

Just like lists, dictionary values can be strings, numbers, lists, or other dictionaries.

new_dictionary = {'color': 'blue', 'animal': 'fish', 'travel_method': 'swim'}
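
To show how items are pulled back out of these collections, here is a minimal sketch. It assumes the dictionary keys are written as quoted strings, and the variable names on the left (first, pets, and so on) are invented for this example.

```python
new_list = ['red', 'blue', 12, 3.45, ['dog', 'cat', 'taco'],
            {'color': 'red', 'animal': 'bird', 'travel_method': 'flight'}]

# List items are looked up by position, and counting starts at 0.
first = new_list[0]            # 'red'
pets = new_list[4]             # the inner list ['dog', 'cat', 'taco']
taco = new_list[4][2]          # 'taco', the third item of the inner list

# Dictionary values are looked up by key instead of by position.
bird = new_list[5]['animal']   # 'bird'

# Values can be replaced, and new keys can be added, by assignment.
new_dictionary = {'color': 'blue', 'animal': 'fish', 'travel_method': 'swim'}
new_dictionary['color'] = 'green'
new_dictionary['legs'] = 0

print(first, taco, bird)
print(new_dictionary)
```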

Homebrew Installation

Homebrew is a package manager for macOS (formerly OS X). You need to have Xcode installed. Xcode is software that allows one to write programs for Apple products, which is entirely immaterial for our needs. What is important is that it will help us obtain a collection of C libraries that are fundamental to our ability to program. You may need to download Xcode from the App Store; it should be free.

Now that you've got that downloaded, we can run the following code:

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

This command is taken directly from the Homebrew website, which also adds the following information:

The script explains what it will do and then pauses before it does it.

Now that we have Homebrew installed ...

... we can run the following code to install a version of Python that will not affect the core Python used by the rest of the operating system. This can be considered a form of decoupling the operating environment from a development environment. While we're at it, we're also going to install Git for future use.

brew install python
brew install git

VirtualEnv and the Development Environment

The excitement is only just beginning. We're going to take this decoupling one step further, and create separate Python development environments. If we mess anything up, we can just delete the folder and try again! No big deal.

A Python package manager ought to have also been installed when we brewed up Python with our previous code. So now we'll get into using Python's package manager, pip.

pip install virtualenv

Thus far it has not mattered where we are on the file tree in Terminal. So now we'll go through a few basic commands as a list:

  • pwd - print working directory, i.e. show the folder we are currently in
  • mkdir - seems self-explanatory, make a new directory/folder
  • cd - change directory, or move to a different folder
  • ls - list the contents of a folder
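
For the curious, the same four operations are also available from inside Python through the standard library's os module. The sketch below is only an illustration: it works inside a throwaway temporary folder (an assumption made here so that nothing on your Desktop is touched), and the folder name is the placeholder used later in this tutorial.

```python
import os
import tempfile

start = os.getcwd()                      # like pwd: where are we right now?
print(start)

with tempfile.TemporaryDirectory() as scratch:
    os.chdir(scratch)                    # like cd: move into the scratch folder
    os.mkdir("PickYourOwnName")          # like mkdir: make a new folder
    contents = os.listdir(".")           # like ls: list what is in this folder
    print(contents)                      # ['PickYourOwnName']
    os.chdir(start)                      # move back before the scratch folder is deleted
```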

First we'll run the following command to move to the Desktop. The ~ stands for our home directory, and adding /Desktop takes us into the Desktop folder inside it:

cd ~/Desktop

Next we'll run a command to create a new folder that will house all of our future projects:

mkdir PickYourOwnName

Voila, there ought now to be a new folder sitting on your computer's desktop!

Here we will use our previously installed VirtualEnv program to create a fancy fresh Python development environment, but first we need to change into our directory (and again, you are encouraged to pick your own name for this development environment; remember, these environments are meant to be disposable):

cd TheNameOfTheFolderYouCreatedInThePreviousCommand
virtualenv TheNameYouChose
cd TheNameYouChose
ls -la

What we've just done is move into the folder that was previously created on the Desktop. Next we created a virtual Python environment, and then we moved into the folder that was just created under the name we chose. Lastly, we printed out the contents of our present working directory.

We added flags to our last command, prefixing them with a dash. The l flag tells the command to display more detailed information about the items listed, besides just their names. The a flag tells the command to also display hidden files; these are files whose names begin with a '.'.

Activating the Virtual Environment

We are going to use a series of commands to activate the virtual Python environment, and then install a few Python packages that we need for scraping webpages.

source bin/activate
pip install requests
pip install lxml
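
If you'd like to confirm that the activation and installs worked, a small check along these lines can help. This is a sketch rather than part of the repository's script: the function names are invented for this example, and it relies on the standard library only, so it runs whether or not the packages were installed.

```python
import importlib.util
import sys


def in_virtual_environment():
    """Return True when Python is running from a virtual environment."""
    # venv and recent virtualenv releases point sys.base_prefix at the
    # original interpreter; older virtualenv releases set sys.real_prefix.
    base = getattr(sys, "base_prefix", sys.prefix)
    return hasattr(sys, "real_prefix") or base != sys.prefix


def is_installed(package_name):
    """Return True when the named top-level package can be found."""
    return importlib.util.find_spec(package_name) is not None


print("virtual environment active:", in_virtual_environment())
for name in ("requests", "lxml"):
    print(name, "installed:", is_installed(name))
```

Save it under any file name you like and run it with python while the environment is active. If the first line prints False, the source bin/activate step likely did not take effect in the current Terminal window.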

It is suggested that you create yet another folder in which to house your scraping scripts. This is all for the purposes of organization and other open source practices. Regardless, if you'd like to recreate the Python script that is provided, you may do so by opening up a basic text editor and typing the code out yourself. This really is a better practice than copying and pasting; you can choose to copy the notes, or not. Remember to save the script to the scripts folder, which is inside the virtual Python environment folder, which is inside the folder we initially created on the Desktop.

After that has been accomplished one could run the following command to run the file:

cd TheNameOfTheScriptsFolder
python TheNameOfTheScriptThatScrapesTheWebsite.py

If there are any issues with the preceding commands, don't hesitate to ask. There are always at least a couple dozen different ways any process can boink at any time; it is definitely something to get used to.

Future additions

  • Build Development environment
  • How to scrape pages that use JavaScript

About

Small tutorial to teach folks how to scrape webpages with Python using LXML.
