felipecsl edited this page Feb 3, 2013 · 12 revisions

Properties

This documentation uses the word property several times, so let's define it: for Wombat, properties are the pieces of information that you want to extract from the parsed document. In the previous example, they were some_data and another_info. You can choose your own names for them, as long as they respect the naming conventions (no spaces or special characters). Properties can be of a few different types: Text, HTML, List, Iterator or Follow. We'll get into the details of each one of them.

As of Wombat 2.1.2, properties also accept hashes for the selector. You can find an example in the project Readme.

Text Properties

This is the default property type, used when you don't specify a format. It searches the document for the provided selector and returns the text of the first matching element.

HTML properties

Sometimes you want to extract verbatim content from the page. To do that, you can specify additional parameters for your property. After the selector, the next parameter is the content format. If you specify the :html format, the entire inner_html string will be returned for that selector. Again with our GithubScraper:

class GithubScraper
  include Wombat::Crawler
  base_url "http://www.github.com"
  path "/"

  what_is "css=.column.secondary p", :html
end

puts GithubScraper.new.crawl
# outputs => 

{
  "what_is"=>"GitHub is the best way to collaborate with others.  Fork, send pull requests and manage all your <strong>public</strong> and <strong>private</strong> git repositories."
}

As you can see, the <strong> elements are preserved in the resulting hash, since the property is marked as :html.

List properties

List properties work just like :text properties, but return an array with the text of every matching node in the document instead of only the first one.

#coding: utf-8
require 'wombat'

class RubyGemsScraper
  include Wombat::Crawler

  base_url "http://www.rubygems.org"
  path "/"

  gems do |g|
    g.new "css=#new_gems li", :list
    g.most_downloaded "css=#most_downloaded li", :list
    g.just_updated "css=#just_updated li", :list
  end
end

p RubyGemsScraper.new.crawl
# outputs =>

{
  "gems"=>{
    "new"=>["hashbang (0.0.1.alpha)", "zunari (0.1.0)", "fuzzy-string (0.1.0)", "cul_image_props (0.1.0)", "pea (0.0.1)"], 
    "most_downloaded"=>["mime-types-1.17.2 (3,866)", "json-1.6.5 (3,860)", "rake-0.9.2.2 (3,506)", "treetop-1.4.10 (3,484)", "multi_json-1.0.4 (3,404)"], 
    "just_updated"=>["wombat (0.2.4)", "tengine_job (0.6.10)", "omf_rc (5.4.1)", "rapnd (0.1.1)", "rspec-puppet (0.1.2)"]
  }
}
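To compare the three formats covered so far, here is a minimal sketch in plain Ruby. It is a toy model, not Wombat's implementation: strip_tags and extract are hypothetical helpers, the matched elements are represented simply by their inner HTML strings, and removing tags with a regular expression is only a rough stand-in for reading an element's text.

```ruby
# Hypothetical helper (not part of Wombat): approximates an element's text
# by dropping the tags from its inner HTML.
def strip_tags(inner_html)
  inner_html.gsub(/<[^>]+>/, "").strip
end

# Hypothetical helper (not part of Wombat): applies a property format to the
# elements matched by a selector, given as an array of inner HTML strings.
def extract(matched_html, format = :text)
  case format
  when :text then matched_html.empty? ? nil : strip_tags(matched_html.first)
  when :html then matched_html.first
  when :list then matched_html.map { |html| strip_tags(html) }
  else raise ArgumentError, "unknown format: #{format.inspect}"
  end
end

# Pretend the selector matched these two elements.
matches = [
  "All your <strong>public</strong> repositories",
  "All your <strong>private</strong> repositories"
]

puts extract(matches)                # All your public repositories
puts extract(matches, :html)         # All your <strong>public</strong> repositories
puts extract(matches, :list).length  # 2
```

The only difference between :text and :list in this model is whether the first match or every match is used, which mirrors the descriptions above.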

Other properties

Whenever Wombat finds a property whose selector is not a string starting with xpath= or css=, it does not treat it as a selector. Instead, the value itself is returned, as a string, in the resulting hash. E.g.:

result = Wombat.crawl do
  base_url "http://www.rubygems.org"
  path "/"

  static_data "This goes into the result"
  symbol_data :this_symbol
end

p result
# outputs =>

{
  "static_data"=>"This goes into the result", 
  "symbol_data"=>"this_symbol" 
}
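The prefix rule described above can be sketched in plain Ruby. This is only an illustration of the rule, not Wombat's actual code: classify_selector and the shape of the hash it returns are assumptions made for this example.

```ruby
# Hypothetical helper (not part of Wombat): decides how a property argument
# would be handled under the css=/xpath= prefix rule.
def classify_selector(selector)
  value = selector.to_s # symbols become strings, as in the output above
  if value.start_with?("css=")
    { kind: :css, value: value.delete_prefix("css=") }
  elsif value.start_with?("xpath=")
    { kind: :xpath, value: value.delete_prefix("xpath=") }
  else
    # No recognized prefix: the value is passed through unchanged.
    { kind: :static, value: value }
  end
end

puts classify_selector("css=#new_gems li")[:kind]           # css
puts classify_selector("xpath=//h1")[:value]                # //h1
puts classify_selector("This goes into the result")[:kind]  # static
puts classify_selector(:this_symbol)[:value]                # this_symbol
```

Anything that falls into the static branch ends up in the result as-is, which is why static_data and symbol_data appear verbatim in the output above.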