GitHub - santinic/htmlmatch: Python tool for automatic data scraping from Html templates

htmlmatch: Automatic data scraping

Suppose you have a page with a list of videos (videos.html), and you want to extract the data for every video:

<html>
<head><title>Example</title></head>
<body>
<div class="video">
        <a href="watch?v=0001">Title first video</a><img src="preview1.jpg"/></div>
<div class="video">
        <a href="watch?v=0002">Title second video</a><img src="preview2.jpg"/></div>
<div class="video">
        <a href="watch?v=0003">Title third video</a><img src="preview3.jpg"/></div>
...
</body>
</html>

You can extract the data from this page by writing an extraction template like this (pattern.html):

<div class="video"><a href="watch?v=$code$">$title$</a><img src="$preview$"/></div>

Put a $variable$ placeholder wherever you want to capture a value. If you then run the script against videos.html and pattern.html, you get the raw data:

claudio@laptop:~$ ./htmlmatch.py videos.html pattern.html
code: 0001
title: Title first video
preview: preview1.jpg

code: 0002
title: Title second video
preview: preview2.jpg

code: 0003
title: Title third video
preview: preview3.jpg
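
To make the template idea more concrete, here is a minimal sketch of how $name$ placeholders could be turned into a regular expression with named groups. This is not taken from htmlmatch.py, whose internals are not described here; the helper names `template_to_regex` and `match_template` are hypothetical, the sketch is written for Python 3, and it assumes each placeholder name appears only once in the template and that only whitespace between tags may differ between the template and the page.

```python
import re

# A placeholder is a word wrapped in dollar signs, e.g. $title$.
PLACEHOLDER = re.compile(r"\$(\w+)\$")


def escape_literal(text):
    """Escape literal template text, tolerating whitespace between tags."""
    pieces = re.split(r">\s*<", text)
    return r">\s*<".join(re.escape(piece) for piece in pieces)


def template_to_regex(template):
    """Compile a $name$ template into a regex (hypothetical helper)."""
    parts = []
    pos = 0
    for m in PLACEHOLDER.finditer(template):
        parts.append(escape_literal(template[pos:m.start()]))
        # Non-greedy named group: capture as little as possible.
        parts.append("(?P<%s>.*?)" % m.group(1))
        pos = m.end()
    parts.append(escape_literal(template[pos:]))
    return re.compile("".join(parts), re.DOTALL)


def match_template(html, template):
    """Return one dict per match, keyed by placeholder name."""
    regex = template_to_regex(template)
    return [m.groupdict() for m in regex.finditer(html)]


page = '''<div class="video">
        <a href="watch?v=0001">Title first video</a><img src="preview1.jpg"/></div>
<div class="video">
        <a href="watch?v=0002">Title second video</a><img src="preview2.jpg"/></div>'''
template = ('<div class="video"><a href="watch?v=$code$">$title$</a>'
            '<img src="$preview$"/></div>')

results = match_template(page, template)
for row in results:
    print(row)
```

A regex-based approach like this is fragile against attribute reordering or nested markup, so treat it only as an illustration of the placeholder concept.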

You can also access all these fields from your own Python code by calling the library as a function and iterating over the list of dictionaries it returns. For example:

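import urllib2
from htmlmatch import htmlmatch  # assumes htmlmatch.py is on the import path

videos_page = urllib2.urlopen("http://www.videos-website.com/")
pattern = open("pattern.html", "r")
matches = htmlmatch(videos_page, pattern)  # list of dictionaries, one per match
for match in matches:
    # each dictionary maps a template variable name to its extracted value
    for k, v in match.iteritems():
        print k, v
    print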
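
Since the result is a list of dictionaries, it can be handed straight to other tools. As a sketch, the snippet below writes such a list to CSV using only the standard library. It is written for Python 3, `matches_to_csv` is a hypothetical helper that is not part of htmlmatch, and the sample rows are typed in by hand from the videos.html example rather than produced by the library.

```python
import csv
import io


def matches_to_csv(matches, fieldnames):
    """Serialize a list of dicts to CSV text (hypothetical helper)."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(matches)
    return buf.getvalue()


# Hand-written stand-in for the list that htmlmatch would return.
sample = [
    {"code": "0001", "title": "Title first video", "preview": "preview1.jpg"},
    {"code": "0002", "title": "Title second video", "preview": "preview2.jpg"},
]

csv_text = matches_to_csv(sample, ["code", "title", "preview"])
print(csv_text)
```

Passing the field names explicitly fixes the column order, which plain dictionary iteration does not guarantee on older Python versions.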
