Kurt Edelbrock edited this page Feb 4, 2015 · 5 revisions

XML and Namespaces

You can add namespaces to properties. They are passed along with the selector when the document is queried, which is useful for parsing XML files. Yes, Wombat can also scrape XML files. Yay! The syntax is:

class LastFmScraper
  include Wombat::Crawler
  base_url "http://ws.audioscrobbler.com"
  path "/2.0/?method=geo.getevents&location=San%20Francisco&api_key=<YOUR_LASTFM_API_KEY>"
  document_format :xml

  locations 'xpath=//event', :iterator do
    latitude "xpath=./venue/location/geo:point/geo:lat", :text, { 'geo' => 'http://www.w3.org/2003/01/geo/wgs84_pos#' }
    longitude "xpath=./venue/location/geo:point/geo:long", :text, { 'geo' => 'http://www.w3.org/2003/01/geo/wgs84_pos#' }
  end
end

Note the document_format :xml option used above. It is another special property that tells Wombat what type of document to parse. It defaults to :html, so you usually won't need to specify it. To parse an XML document, set document_format :xml. The only two formats supported so far are :html and :xml.

If you specify a namespace, you must also give the type of property you are requesting (:text, :html or :list) as the second argument, after the selector and before the namespace. The namespace must be a hash whose keys are the namespace prefixes used in the selector and whose values are the namespace URIs.
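
To see what the namespace hash does on its own, here is a minimal sketch that runs the same kind of namespaced XPath query without Wombat, a network connection, or an API key. It uses Ruby's REXML library rather than whatever parser Wombat uses internally, so treat it as an illustration of the prefix-to-URI mapping and not as Wombat's implementation. The sample XML, the coordinates, and the names SAMPLE_XML, GEO_NAMESPACES and extract_locations are all made up for this example. Keep in mind that the URI in the hash has to match the one declared in the document exactly.

```ruby
require 'rexml/document'

# Made-up sample data, loosely shaped like the feed queried by the crawler above.
SAMPLE_XML = <<~XML
  <events xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#">
    <event>
      <venue>
        <location>
          <geo:point>
            <geo:lat>37.7749</geo:lat>
            <geo:long>-122.4194</geo:long>
          </geo:point>
        </location>
      </venue>
    </event>
  </events>
XML

# Same shape as the third argument of a Wombat property:
# keys are the prefixes used in the selector, values are the namespace URIs.
GEO_NAMESPACES = { 'geo' => 'http://www.w3.org/2003/01/geo/wgs84_pos#' }

# Hypothetical helper: iterates over every <event> and reads the two
# namespaced elements, mirroring the iterator block in the crawler.
def extract_locations(xml, namespaces)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, '//event').map do |event|
    lat  = REXML::XPath.first(event, './venue/location/geo:point/geo:lat', namespaces)
    long = REXML::XPath.first(event, './venue/location/geo:point/geo:long', namespaces)
    { 'latitude' => lat && lat.text, 'longitude' => long && long.text }
  end
end

locations = extract_locations(SAMPLE_XML, GEO_NAMESPACES)
puts locations.inspect
```

Each element of locations is a hash with 'latitude' and 'longitude' keys, one per event in the sample document.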