Razor is a simple web scraper built on watir-webdriver.
Razor avoids ‘magical’ Ruby code. It aims to be a straightforward tool that uses the web browser to access web pages and xpath (only xpath) to parse the DOM of a web page.
    require 'razor'

    razor = Razor.new(:blade => :firefox)
    razor.goto "http://www.google.com"
    razor.enter_text "//input[@name='q']", "Buy"
    razor.submit

    results = razor.shave do
      value(:stats, "//p[@id='resultStats']") do |x|
        x.text.scan(/of about (.+) for/)[0][0]
      end
      array :suggestions, "//table[@id='brs']/tbody/tr/td" do |x|
        x.text
      end
    end

    results # =>
    # {
    #   :suggestions=> ["buy.com coupon", "buy.com promotion code",
    #                   "overstock", "newegg", "deals2buy", "shopping",
    #                   "buy cars", "buy costumes"],
    #   :stats=>"1,740,000,000"
    # }
Since Razor uses watir-webdriver, it can in principle support every browser Selenium does: :ie, :chrome, :firefox, even :remote.
Razor is developed on Firefox because of Firebug. All testing is done in Firefox, and bugs will be addressed as they relate to Firefox. Contributions from users that add support for other browsers are welcome.
Shaving is as simple as possible, while maintaining power. Shaving returns a hash of name => block.call(element(s).at(xpath))
    razor.shave do
      value name, xpath, block
    end
value returns a single value (the first element matching the xpath), and processes it through the block, if a block is specified.
The block takes a Watir::Element (obtained from the xpath). See the Watir documentation for more information.
    razor.shave do
      array name, xpath, block
    end
array is the same as value, except that it returns an array with one result per matching element.
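The difference between value and array can be sketched without a browser. The following is a minimal illustration, not Razor's implementation: ShaveSketch and FAKE_PAGE are hypothetical names, the "page" is a plain hash from xpath to matching texts, and the blocks receive strings rather than Watir::Element objects.

```ruby
# Hypothetical stand-in for a page: maps an xpath string to the texts it matches.
FAKE_PAGE = {
  "//p[@id='resultStats']" => ["Results 1 - 10 of about 1,740,000,000 for buy"],
  "//td" => ["overstock", "newegg", "shopping"]
}.freeze

# Minimal collector illustrating the value/array contract (an assumption-based
# sketch, not Razor's code).
class ShaveSketch
  attr_reader :results

  def initialize(page)
    @page = page
    @results = {}
  end

  # Stores the first match, passed through the block when one is given.
  def value(name, xpath, &block)
    first = @page.fetch(xpath, []).first
    @results[name] = block && !first.nil? ? block.call(first) : first
  end

  # Stores every match, each passed through the block when one is given.
  def array(name, xpath, &block)
    matches = @page.fetch(xpath, [])
    @results[name] = block ? matches.map(&block) : matches
  end
end

sketch = ShaveSketch.new(FAKE_PAGE)
sketch.instance_eval do
  value(:stats, "//p[@id='resultStats']") { |text| text[/of about (.+) for/, 1] }
  array(:suggestions, "//td") { |text| text.upcase }
end
results = sketch.results
```

Here results holds one entry per name: a single processed string for :stats and an array of processed strings for :suggestions.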
We provide a convenience method for next page elements. The next_page method will continue to click Next until it cannot find the element.
    razor.shave do
      next_page "//a[text()='Next']"
    end
Because a bad xpath simply ends the next page loop without raising an error, a mistyped xpath can be hard to debug.
To debug your xpath, click it directly, which will throw an error if the xpath cannot be found:
    razor.click "xpath"
Next page will also call a block if one is specified. This block will be called right before clicking next_page.
    next_page "//a[text()='Next']" do
      @@next_page_count += 1
    end
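The loop behaviour described above can be sketched in plain Ruby. This is only an illustration under stated assumptions: follow_next and PAGES are hypothetical, and a "page" is just the list of xpaths present on it. It shows the block running before each click, and why a mistyped xpath stops silently after the first page.

```ruby
# Hypothetical stand-in for a paginated site: each page lists the xpaths on it.
PAGES = [
  ["//a[text()='Next']"],
  ["//a[text()='Next']"],
  [] # last page: no Next link
].freeze

# Follows the "next" link until it is missing, calling the block right before
# each click. Returns the number of pages visited.
def follow_next(pages, xpath)
  index = 0
  while index < pages.size - 1 && pages[index].include?(xpath)
    yield if block_given?
    index += 1 # stands in for clicking the link and loading the next page
  end
  index + 1
end

clicks = 0
visited = follow_next(PAGES, "//a[text()='Next']") { clicks += 1 }

# A mistyped xpath matches nothing, so the loop ends on the first page
# without any error being raised.
mistyped = follow_next(PAGES, "//a[text()='Nxet']")
```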
If you have several hundred pages to traverse, you may want to consume the results as you go instead of accumulating them. In that case, use with_results.
    razor.shave(:flush_results_between_pages => true) do
      #...
      next_page "//a[text()='Next']"
      with_results do |results|
        puts results.inspect
      end
    end
Note: with_results is often useful with flush_results_between_pages. This flag will empty results between pages instead of accumulating them.
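To make the effect of the flag concrete, here is a small sketch that contrasts accumulating and flushing. It assumes the results block is invoked once per page; the scrape method and PAGE_ITEMS are hypothetical and do not use Razor itself.

```ruby
# Hypothetical per-page scrape output.
PAGE_ITEMS = [%w[a b], %w[c], %w[d e]].freeze

# Yields a snapshot of the results after each page, then optionally empties
# them so the next page starts from scratch.
def scrape(pages, flush_between_pages: false)
  results = []
  pages.each do |items|
    results.concat(items)
    yield results.dup
    results.clear if flush_between_pages
  end
  results
end

seen_accumulating = []
scrape(PAGE_ITEMS) { |r| seen_accumulating << r }

seen_flushing = []
scrape(PAGE_ITEMS, flush_between_pages: true) { |r| seen_flushing << r }
```

Without flushing, each snapshot contains everything scraped so far; with flushing, each snapshot contains only the current page, which keeps memory use flat over several hundred pages.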
There is a finalizer that closes the browser window. However, Ruby finalizers aren’t great, and sometimes the window will need to be closed manually.
To do so:
    razor.close
We poll each requested shaving every 0.5 seconds until it returns a result, or 5 seconds elapse. You can override this behaviour if it does not suit the website you are polling.
To do so, add the following options to razor.shave:
    # Will poll 10 times a second for a total of 100 polls
    razor.shave(:number_of_steps => 100, :step_time => 0.1) do
      ...
    end
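The relationship between the two options can be sketched as a plain polling loop. This is an assumption-based illustration rather than Razor's code: the poll method is hypothetical, and its default of 10 steps is inferred from the 0.5 second interval and 5 second limit mentioned above.

```ruby
# Calls the block up to number_of_steps times, sleeping step_time seconds
# between attempts. Returns the first non-nil result, or nil if every
# attempt comes back empty.
def poll(number_of_steps: 10, step_time: 0.5)
  number_of_steps.times do |attempt|
    result = yield(attempt)
    return result unless result.nil?
    sleep(step_time) if attempt < number_of_steps - 1
  end
  nil
end

# The element "appears" on the third attempt.
attempts = 0
found = poll(number_of_steps: 100, step_time: 0.01) do
  attempts += 1
  attempts >= 3 ? "resultStats text" : nil
end

# The element never appears, so polling gives up after 3 attempts.
missing = poll(number_of_steps: 3, step_time: 0.01) { nil }
```

The total wait is bounded by roughly number_of_steps times step_time, so raising either value gives slow pages more time to render.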
Razor is capable of validating the results of each page scraping. There is a :validation_limit option on shave that determines how many times to attempt validation on a page before giving up (while sleeping for :step_time in between).
    razor.shave(:validation_limit => 5, :step_time => 1) do
      # This will call the validations for each page 5 times,
      # sleeping 1 second in between, before raising an exception
    end
validation_limit defaults to 10
Shave#validate takes a block that performs an arbitrary check on a page’s shaving result hash.
    razor = Razor.new(:blade => :firefox)
    razor.goto "http://www.google.com"
    razor.enter_text "//input[@name='q']", "Buy"
    razor.submit

    results = razor.shave do
      array :suggestions, "//table[@id='brs']/tbody/tr/td" do |x|
        x.text
      end
      validate do |r|
        r[:suggestions].length == 8
      end
    end
When next_page is used, validate only runs on pages where the next page element is present.
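The retry behaviour controlled by :validation_limit and :step_time can be sketched as follows. This is a minimal illustration under assumptions: shave_with_validation and ValidationFailed are hypothetical names, and the sketch assumes the page is scraped again before each validation attempt, which the text above does not confirm.

```ruby
# Hypothetical error class; the exception Razor actually raises is not
# specified here.
class ValidationFailed < StandardError; end

# Re-runs the scrape until the validation accepts the result, up to
# validation_limit attempts, sleeping step_time seconds between attempts.
def shave_with_validation(scrape, validate, validation_limit: 10, step_time: 0.5)
  validation_limit.times do |attempt|
    result = scrape.call
    return result if validate.call(result)
    sleep(step_time) if attempt < validation_limit - 1
  end
  raise ValidationFailed, "validation failed after #{validation_limit} attempts"
end

# Simulated page that only shows all 8 suggestions on the second look.
loads = 0
scrape = lambda do
  loads += 1
  { suggestions: Array.new([loads * 4, 8].min) { |i| "suggestion #{i}" } }
end
validate = ->(r) { r[:suggestions].length == 8 }

result = shave_with_validation(scrape, validate, validation_limit: 5, step_time: 0.01)

# A page that never validates exhausts the limit and raises.
empty = { suggestions: [] }
gave_up = begin
  shave_with_validation(-> { empty }, validate, validation_limit: 2, step_time: 0.01)
  false
rescue ValidationFailed
  true
end
```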
You can access the webdriver directly with:
    razor.webdriver
This webdriver follows the Watir interface.
You can access the Selenium object with:
    razor.webdriver.driver
The Selenium object follows the Selenium interface.
- Fork the project.
- Make your feature addition or bug fix.
- Add tests for it. This is important so I don’t break it in a future version unintentionally.
- Commit, do not mess with rakefile, version, or history. (if you want to have your own version, that is fine but bump version in a commit by itself I can ignore when I pull)
- Send me a pull request. Bonus points for topic branches.
There are many great web scraping utilities available for Ruby (Scrubyt, ScrAPI to name two).
None of them use Firefox natively (Scrubyt is able to with JSSH, but JSSH is buggy).
Copyright © 2010 Mikkel Garcia. See LICENSE for details.