Skip to content

mikkel/razor

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

45 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Razor

Razor is a simplistic web scraper built on watir-webdriver.

Razor avoids ‘magical’ ruby code. It aims to be a straight-forward tool that uses the web browser to access web pages and xpath (only xpath) to parse the DOM of a web page.

Usage

Example 1: Scraping Google for # of hits / suggestions

require 'razor'

razor = Razor.new(:blade => :firefox)
razor.goto "http://www.google.com"
razor.enter_text "//input[@name='q']", "Buy"
razor.submit

results = razor.shave do
  value(:stats, "//p[@id='resultStats']") do |x|
    x.text.scan(/of about (.+) for/)[0][0]
  end
  array :suggestions, "//table[@id='brs']/tbody/tr/td" do |x|
    x.text
  end
end

results # =>
# {
#   :suggestions=> ["buy.com coupon", "buy.com promotion code",
#                   "overstock", "newegg", "deals2buy", "shopping",
#                   "buy cars", "buy costumes"],
#   :stats=>"1,740,000,000"
#}

Supported Browsers

Since Razor uses watir-webdriver, it can hypothetically support all the browsers Selenium does. :ie, :chrome, :firefox, even :remote

Razor is developed on Firefox because of Firebug. All testing is done in Firefox. All bugs will be addressed as they relate to Firefox. Any other support is welcome from our users.

Razor#shave

Shaving is as simple as possible, while maintaining power. Shaving returns a hash of name => block.call(element(s).at(xpath))

razor.shave do
  value name, xpath, block
end

value returns a single value (the first matching xpath), and processes it through the block, if a block is specified.

The block takes an Watir::Element (obtained from the xpath). See the watir documentation for more information watir

razor.shave do
  array name, xpath, block
end

Array is the same as value, except this returns an array of results.

next_page xpath

We provide a convenience method for next page elements. The next_page method will continue to click Next until it cannot find the element.

razor.shave do
  next_page "//a[text()='Next']"
end

Since all bad xpaths will immediately break from the next page loop, it can be hard to debug.

To debug your xpath, add:

razor.click "xpath"

Which will throw an error if the xpath cannot be found.

Next page will also call a block if one is specified. This block will be called right before clicking next_page.

next_page "//a[text()='Next']" do
  @@next_page_count += 1
end

with_results

If you have several hundred pages to traverse you may want to consume this information instead of accumulating it. In that situation you may look at with_results.

razor.shave(:flush_results_between_pages => true) do
  #...
  next_page "a[text()='Next']"

  with_results do |results|
    puts results.inspect
  end
end

Note: with_results is often useful with flush_results_between_pages. This flag will empty results between pages instead of accumulating them.

Closing webdriver

There is a finalizer that closes the browser window. However, Ruby finalizers aren’t great, and sometimes the window will need to be closed manually.

To do so:

razor.close

Waiting for page load

We poll each requested shaving every 0.5 seconds until we return a result, or 5 seconds elapse. You can override this behaviour if it does not suite the website you are polling.

To do so, add the following options to razor.shave

# Will poll 10 times a second for a total of 100 polls
razor.shave(:number_of_steps => 100, :step_time => 0.1) do
  ...
end

Page Validation

Razor is capable of validating the results of each page scraping. There is a :validation_limit option on shave that determines how many times to attempt validation on a page before giving up (while sleeping for :step_time in between).

razor.shave(:validation_limit => 5, :step_time => 1) do
  # This will call the validations for each page 5 times, sleeping 1 second in between before raising an exception
end

validation_limit defaults to 10

Shave#validate

Shave#validate takes a block that validates a page’s shaving result hash arbitrarily.

razor = Razor.new(:blade => :firefox)
razor.goto "http://www.google.com"
razor.enter_text "//input[@name='q']", "Buy"
razor.submit

results = razor.shave do
  array :suggestions, "//table[@id='brs']/tbody/tr/td" do |x|
    x.text
  end

  validate do |r|
    r[:suggestions].length == 8
  end
end

validate only runs if there is a next_page when using next_page.

Accessing the webdriver

You can access the webdriver straight from:

razor.webdriver

This webdriver follows the interface of watir

Accessing the selenium object

You can access the selenium object with:

razor.webdriver.driver

The selenium object follows the interface of selenium

Note on Patches/Pull Requests

  • Fork the project.

  • Make your feature addition or bug fix.

  • Add tests for it. This is important so I don’t break it in a future version unintentionally.

  • Commit, do not mess with rakefile, version, or history. (if you want to have your own version, that is fine but bump version in a commit by itself I can ignore when I pull)

  • Send me a pull request. Bonus points for topic branches.

Why create this?

There are many great web scraping utilities available for Ruby (Scrubyt, ScrAPI to name two).

None of them use Firefox natively (Scrubyt is able to with JSSH, but JSSH is buggy).

Copyright © 2010 Mikkel Garcia. See LICENSE for details.

About

A simplistic web scraper built on watir-webdriver. It is used to power foreclosure listings

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages