Properties
In this documentation we refer to the word property several times, so let's define it: for Wombat, properties are the pieces of information that you want to extract from the parsed document. In the previous example, they were some_data and another_info. You can choose your own names for them, as long as you respect the naming conventions (no spaces or special characters). Properties can be of a few different types: Text, HTML, List, Iterator or Follow. We'll get into the details of each one of them.
As of Wombat 2.1.2, properties also accept hashes for the selector. You can find an example in the project Readme.
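To make the relationship between the two selector spellings concrete, here is a minimal sketch in plain Ruby. It assumes the hash form looks like `{ css: "..." }` or `{ xpath: "..." }`; check the README for the exact syntax. The `normalize_selector` helper is hypothetical and is not part of Wombat.

```ruby
# Hypothetical helper, not part of Wombat: reduce either selector spelling
# to a [type, expression] pair so the two forms can be compared.
def normalize_selector(selector)
  case selector
  when Hash
    # Assumed hash spelling: { css: ".foo" } or { xpath: "//foo" }
    type, expression = selector.first
    [type.to_sym, expression]
  when String
    # String spelling used throughout this page: "css=.foo" or "xpath=//foo"
    type, expression = selector.split("=", 2)
    [type.to_sym, expression]
  end
end

p normalize_selector("css=.column.secondary p")
p normalize_selector({ css: ".column.secondary p" })
```

Under these assumptions, both calls describe the same CSS selector.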
Text is the default type of property if you don't specify anything. It will search the document for the provided selector and return the text of the first matching element.
Sometimes you want to extract verbatim content from the page. In order to do that, you can specify additional parameters for your property. After the selector, the next parameter is the content format. If you specify the :html format, the entire inner_html string will be returned for that selector. Again with our GithubScraper:
require 'wombat'

class GithubScraper
  include Wombat::Crawler

  base_url "http://www.github.com"
  path "/"

  # The second argument selects the content format
  what_is "css=.column.secondary p", :html
end

puts GithubScraper.new.crawl
# outputs =>
{
  "what_is"=>"GitHub is the best way to collaborate with others. Fork, send pull requests and manage all your <strong>public</strong> and <strong>private</strong> git repositories."
}
As you can see, <strong> elements are preserved in the resulting hash, since the property is marked as :html.
The :list type works just like :text properties, but returns all the matching nodes from the document.
# coding: utf-8
require 'wombat'

class RubyGemsScraper
  include Wombat::Crawler

  base_url "http://www.rubygems.org"
  path "/"

  gems do |g|
    g.new "css=#new_gems li", :list
    g.most_downloaded "css=#most_downloaded li", :list
    g.just_updated "css=#just_updated li", :list
  end
end

p RubyGemsScraper.new.crawl
# outputs =>
{
  "gems"=>{
    "new"=>["hashbang (0.0.1.alpha)", "zunari (0.1.0)", "fuzzy-string (0.1.0)", "cul_image_props (0.1.0)", "pea (0.0.1)"],
    "most_downloaded"=>["mime-types-1.17.2 (3,866)", "json-1.6.5 (3,860)", "rake-0.9.2.2 (3,506)", "treetop-1.4.10 (3,484)", "multi_json-1.0.4 (3,404)"],
    "just_updated"=>["wombat (0.2.4)", "tengine_job (0.6.10)", "omf_rc (5.4.1)", "rapnd (0.1.1)", "rspec-puppet (0.1.2)"]
  }
}
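To compare the three formats described so far side by side, here is a minimal sketch in plain Ruby. It is not Wombat's implementation: `Node` is a hypothetical stand-in for a parsed element, and `extract` is an illustrative helper that only mirrors the behaviour described above (first match as text, first match as inner HTML, all matches as text).

```ruby
# Hypothetical stand-in for a parsed element; not a Wombat class.
Node = Struct.new(:text, :inner_html)

# Illustrative helper: how each format treats the nodes matched by a selector.
def extract(nodes, format = :text)
  case format
  when :text then nodes.first&.text        # text of the first match
  when :html then nodes.first&.inner_html  # inner HTML of the first match
  when :list then nodes.map(&:text)        # text of every match
  else raise ArgumentError, "unknown format: #{format.inspect}"
  end
end

matches = [
  Node.new("public repos", "<strong>public</strong> repos"),
  Node.new("private repos", "<strong>private</strong> repos")
]

p extract(matches)
p extract(matches, :html)
p extract(matches, :list)
```

Only :list looks past the first matching node; the other two formats differ in whether markup is kept.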
Whenever Wombat finds a property whose selector is not a string starting with xpath= or css=, it does not search the document; it returns the value itself, converted to a string, in the resulting hash.
For example:
require 'wombat'

result = Wombat.crawl do
  base_url "http://www.rubygems.org"
  path "/"

  # Neither value starts with css= or xpath=, so both are returned as-is
  static_data "This goes into the result"
  symbol_data :this_symbol
end

p result
# outputs =>
{
  "static_data"=>"This goes into the result",
  "symbol_data"=>"this_symbol"
}
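The rule for static values can be sketched in plain Ruby as follows. This is an illustration of the behaviour described above, not Wombat's source: `selector?` and `resolve` are hypothetical names, and the sketch covers only string selectors, not the hash form.

```ruby
# Hypothetical helper: does this value look like a string selector?
def selector?(value)
  value.is_a?(String) && value.start_with?("css=", "xpath=")
end

# Hypothetical helper: selectors would be looked up in the document,
# anything else is passed through as a string.
def resolve(value)
  selector?(value) ? :needs_lookup : value.to_s
end

p resolve("This goes into the result")
p resolve(:this_symbol)
p resolve("css=#new_gems li")
```

The symbol is converted with `to_s`, which matches the "this_symbol" string in the output above.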