Razor¶ ↑

Razor is a simplistic web scraper built on watir-webdriver.

Razor avoids ‘magical’ ruby code. It aims to be a straight-forward tool that uses the web browser to access web pages and xpath (only xpath) to parse the DOM of a web page.

Usage¶ ↑

Example 1: Scraping Google for # of hits / suggestions¶ ↑

require 'razor'

razor = Razor.new(:blade => :firefox)
razor.goto "http://www.google.com"
razor.enter_text "//input[@name='q']", "Buy"
razor.submit

results = razor.shave do
  value(:stats, "//p[@id='resultStats']") do |x|
    x.text.scan(/of about (.+) for/)[0][0]
  end
  array :suggestions, "//table[@id='brs']/tbody/tr/td" do |x|
    x.text
  end
end

results # =>
# {
#   :suggestions=> ["buy.com coupon", "buy.com promotion code",
#                   "overstock", "newegg", "deals2buy", "shopping",
#                   "buy cars", "buy costumes"],
#   :stats=>"1,740,000,000"
#}

Supported Browsers¶ ↑

Since Razor uses watir-webdriver, it can hypothetically support all the browsers Selenium does. :ie, :chrome, :firefox, even :remote

Razor is developed on Firefox because of Firebug. All testing is done in Firefox. All bugs will be addressed as they relate to Firefox. Any other support is welcome from our users.

Razor#shave¶ ↑

Shaving is as simple as possible, while maintaining power. Shaving returns a hash of name => block.call(element(s).at(xpath))

razor.shave do
  value name, xpath, block
end

value returns a single value (the first matching xpath), and processes it through the block, if a block is specified.

The block takes an Watir::Element (obtained from the xpath). See the watir documentation for more information watir

razor.shave do
  array name, xpath, block
end

Array is the same as value, except this returns an array of results.

next_page xpath¶ ↑

We provide a convenience method for next page elements. The next_page method will continue to click Next until it cannot find the element.

razor.shave do
  next_page "//a[text()='Next']"
end

Since all bad xpaths will immediately break from the next page loop, it can be hard to debug.

To debug your xpath, add:

razor.click "xpath"

Which will throw an error if the xpath cannot be found.

Next page will also call a block if one is specified. This block will be called right before clicking next_page.

next_page "//a[text()='Next']" do
  @@next_page_count += 1
end

with_results¶ ↑

If you have several hundred pages to traverse you may want to consume this information instead of accumulating it. In that situation you may look at with_results.

razor.shave(:flush_results_between_pages => true) do
  #...
  next_page "a[text()='Next']"

  with_results do |results|
    puts results.inspect
  end
end

Note: with_results is often useful with flush_results_between_pages. This flag will empty results between pages instead of accumulating them.

Closing webdriver¶ ↑

There is a finalizer that closes the browser window. However, Ruby finalizers aren’t great, and sometimes the window will need to be closed manually.

To do so:

razor.close

Waiting for page load¶ ↑

We poll each requested shaving every 0.5 seconds until we return a result, or 5 seconds elapse. You can override this behaviour if it does not suite the website you are polling.

To do so, add the following options to razor.shave

# Will poll 10 times a second for a total of 100 polls
razor.shave(:number_of_steps => 100, :step_time => 0.1) do
  ...
end

Page Validation¶ ↑

Razor is capable of validating the results of each page scraping. There is a :validation_limit option on shave that determines how many times to attempt validation on a page before giving up (while sleeping for :step_time in between).

razor.shave(:validation_limit => 5, :step_time => 1) do
  # This will call the validations for each page 5 times, sleeping 1 second in between before raising an exception
end

validation_limit defaults to 10

Shave#validate¶ ↑

Shave#validate takes a block that validates a page’s shaving result hash arbitrarily.

razor = Razor.new(:blade => :firefox)
razor.goto "http://www.google.com"
razor.enter_text "//input[@name='q']", "Buy"
razor.submit

results = razor.shave do
  array :suggestions, "//table[@id='brs']/tbody/tr/td" do |x|
    x.text
  end

  validate do |r|
    r[:suggestions].length == 8
  end
end

validate only runs if there is a next_page when using next_page.

Accessing the webdriver¶ ↑

You can access the webdriver straight from:

razor.webdriver

This webdriver follows the interface of watir

Accessing the selenium object¶ ↑

You can access the selenium object with:

razor.webdriver.driver

The selenium object follows the interface of selenium

Note on Patches/Pull Requests¶ ↑

Fork the project.
Make your feature addition or bug fix.
Add tests for it. This is important so I don’t break it in a future version unintentionally.
Commit, do not mess with rakefile, version, or history. (if you want to have your own version, that is fine but bump version in a commit by itself I can ignore when I pull)
Send me a pull request. Bonus points for topic branches.

Why create this?¶ ↑

There are many great web scraping utilities available for Ruby (Scrubyt, ScrAPI to name two).

None of them use Firefox natively (Scrubyt is able to with JSSH, but JSSH is buggy).

Name		Name	Last commit message	Last commit date
Latest commit History 45 Commits
examples		examples
lib		lib
test		test
.document		.document
.gitignore		.gitignore
LICENSE		LICENSE
README.rdoc		README.rdoc
Rakefile		Rakefile
VERSION		VERSION
razor.gemspec		razor.gemspec

License

mikkel/razor

Folders and files

Latest commit

History

Repository files navigation

Razor¶ ↑

Usage¶ ↑

Example 1: Scraping Google for # of hits / suggestions¶ ↑

Supported Browsers¶ ↑

Razor#shave¶ ↑

next_page xpath¶ ↑

with_results¶ ↑

Closing webdriver¶ ↑

Waiting for page load¶ ↑

Page Validation¶ ↑

Shave#validate¶ ↑

Accessing the webdriver¶ ↑

Accessing the selenium object¶ ↑

Note on Patches/Pull Requests¶ ↑

Why create this?¶ ↑

Copyright¶ ↑

About

Resources

License

Stars

Watchers

Forks

Languages