BeautifulScraper

Python web-scraping library that wraps urllib2 and BeautifulSoup

Simple wrapper around BeautifulSoup for HTML parsing and urllib2 for HTTP(S) request/response handling. BeautifulScraper also overrides some of urllib2's default handlers in order to:

  • Handle cookies properly
  • Offer full control of included cookies
  • Return the actual response from the server, un-mangled and not reprocessed
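The cookie handling above is built on the standard library's cookie-jar machinery. As a rough sketch of that mechanism (using Python 3's `http.cookiejar` and `urllib.request`, the successors to the `cookielib` and `urllib2` modules this library wraps — the code below is stdlib only, not BeautifulScraper's own API):

```python
# Sketch of a cookie-aware opener, the mechanism BeautifulScraper
# builds on. Stdlib modules only; not BeautifulScraper's API.
from http.cookiejar import CookieJar
from urllib.request import build_opener, HTTPCookieProcessor

jar = CookieJar()
# An opener wired to this jar records Set-Cookie headers from every
# response it sees and replays matching cookies on later requests.
opener = build_opener(HTTPCookieProcessor(jar))
```

Because the jar is an ordinary object you hold a reference to, the cookies it accumulates stay fully inspectable and editable — which is what "full control of included cookies" comes down to.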

Installation

# pip install beautifulscraper

or

# git clone git://github.com/adregner/beautifulscraper.git
# cd beautifulscraper/
# python setup.py install

Examples

Getting started is brain-dead simple.

>>> from beautifulscraper import BeautifulScraper
>>> scraper = BeautifulScraper()

Start by requesting something.

>>> body = scraper.go("https://github.com/adregner/beautifulscraper")

The response is a plain BeautifulSoup object. See the BeautifulSoup documentation for how to use it.

>>> body.select(".repository-meta-content")[0].text
'\n\n            Python web-scraping library that wraps urllib2 and BeautifulSoup\n          \n'
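Since the return value is an ordinary BeautifulSoup object, anything BeautifulSoup supports works on it directly. A minimal standalone sketch (requires the `bs4` package; the HTML snippet here is invented for illustration, not real BeautifulScraper output):

```python
# Standalone BeautifulSoup example; the HTML is made up for illustration.
from bs4 import BeautifulSoup

html = '<div class="repository-meta-content">A description</div>'
body = BeautifulSoup(html, "html.parser")
print(body.select(".repository-meta-content")[0].text)  # A description
```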

The headers from the server's response are accessible.

>>> for header, value in scraper.response_headers.items():
...     print "%s: %s" % (header, value)
...
status: 200 OK
content-length: 36179
set-cookie: _gh_sess=BAh7BzoQX2NzcmZfdG9rZW4iMUNCOWxnbFpVd3EzOENqVk9GTUFXbDlMVUJIbGxsNEVZUFZJNiswRjhwejQ9Og9zZXNzaW9uX2lkIiUyNmQ2ODE5ZDdiZjM3MTA2N2VlZDk3Y2VlMDViYzI2OA%3D%3D--5d31df13d5c0eeb8f3cccb140392124968abc374; path=/; expires=Sat, 01-Jan-2022 00:00:00 GMT; secure; HttpOnly
strict-transport-security: max-age=2592000
connection: close
server: nginx
x-runtime: 98
etag: "1c595b5b6a25eb7f021e68c3476d61da"
cache-control: private, max-age=0, must-revalidate
date: Wed, 31 Oct 2012 02:14:08 GMT
x-frame-options: deny
content-type: text/html; charset=utf-8

So is the response code as an integer.

>>> type(scraper.response_code), scraper.response_code
(<type 'int'>, 200)

The scraper keeps track of every cookie it sees via the cookielib.CookieJar class. You can read the cookies if you'd like; the Cookie objects are just a collection of properties.

>>> scraper.cookies[0].name
'_gh_sess'
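A self-contained sketch of what those Cookie objects look like, using Python 3's `http.cookiejar` (the successor to the `cookielib` module named above). The cookie values here are invented for illustration:

```python
# Stdlib-only sketch of reading cookies out of a CookieJar;
# the cookie contents are invented for illustration.
from http.cookiejar import Cookie, CookieJar

jar = CookieJar()
jar.set_cookie(Cookie(
    version=0, name="_gh_sess", value="example-session-id",
    port=None, port_specified=False,
    domain="github.com", domain_specified=True, domain_initial_dot=False,
    path="/", path_specified=True,
    secure=True, expires=None, discard=True,
    comment=None, comment_url=None, rest={"HttpOnly": None},
))

# Each Cookie is a plain bag of attributes: name, value, domain, path, ...
for cookie in jar:
    print(cookie.name, cookie.domain)  # _gh_sess github.com
```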

See the pydoc for more information.
