This repository has been archived by the owner on Feb 8, 2018. It is now read-only.

Testing your spiders

Ed Finkler edited this page Sep 2, 2013 · 2 revisions

Automated testing

Automated tests run existing spiders against HTML documents to ensure that recipes are extracted correctly.

Running automated tests

To run the test suite, do the following:

cd scrapy_proj/tests
nosetests

You should get output like the following:

/Users/coj/Dropbox/Sites/openrecipes/scrapy_proj/openrecipes/pipelines.py:7: ScrapyDeprecationWarning: Module `scrapy.conf` is deprecated, use `crawler.settings` attribute instead
from scrapy.conf import settings
...........................................................................................
----------------------------------------------------------------------
Ran 91 tests in 10.213s

OK

Adding New Tests

The testing scripts will automatically run tests against HTML files placed in directories named for the corresponding spider. For example, if you have a spider class file at scrapy_proj/openrecipes/spiders/foobar_spider.py, the testing scripts will look for a directory scrapy_proj/tests/html_data/foobar. If a directory is found, any files with a .html extension in that directory will be run through the spider's parse_item method. The results are tested against assertions in the do_test_scraped_item() method defined in scraper_tests.py.
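The discovery step described above can be sketched as a small helper. Note this is an illustrative sketch, not the actual code in scraper_tests.py, and the function name collect_test_files is hypothetical:

```python
import glob
import os

def collect_test_files(tests_dir):
    """Map each spider name to the .html fixtures found under
    tests_dir/html_data/<spider_name>/ (a sketch of the discovery step)."""
    fixtures = {}
    html_root = os.path.join(tests_dir, "html_data")
    for spider_name in sorted(os.listdir(html_root)):
        spider_dir = os.path.join(html_root, spider_name)
        if not os.path.isdir(spider_dir):
            continue
        # Only files with a .html extension are fed to parse_item
        html_files = sorted(glob.glob(os.path.join(spider_dir, "*.html")))
        if html_files:
            fixtures[spider_name] = html_files
    return fixtures
```

Each fixture found this way would then be wrapped in a response object and passed to the corresponding spider's parse_item method.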

Note: assertions in do_test_scraped_item() are run against every spider's output. We haven't yet defined a method for creating spider-specific tests. Input is welcome.
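To give a feel for the kind of checks involved, here is a sketch of generic assertions that could be applied to every scraped item. The field names and rules below are assumptions for illustration; the actual assertions live in do_test_scraped_item() in scraper_tests.py:

```python
def check_scraped_item(item):
    """Generic sanity checks on a scraped recipe item (illustrative only;
    the required fields chosen here are an assumption)."""
    required_fields = ("name", "url", "source", "ingredients")
    for field in required_fields:
        # Field must be present and non-empty
        assert field in item and item[field], "missing field: %s" % field
    # URLs should be absolute so items are usable outside the crawl
    assert item["url"].startswith("http"), "url must be absolute"
```

Because these checks run against every spider, they must only assert properties that hold for all recipes, which is why spider-specific tests are still an open question.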

You can grab HTML data to test against for a given spider using the openrecipes/scrapy_proj/grab_html.py utility script. Run it like so:

python grab_html.py foobar http://www.foobar.com/2013/04/10-minute-thai-shrimp-cucumber-avocado-salad-recipe/

The script will download the HTML document and write it to scrapy_proj/tests/html_data/foobar/item_<document_title>.html. The next time you run the automated tests, this file will be used to test the foobar_spider.parse_item() method.
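The filename construction can be sketched as follows. This is a guess at the behavior, not the real grab_html.py logic, and its exact sanitization rules may differ:

```python
import re

def fixture_filename(document_title):
    """Build an item_<document_title>.html filename from a page title
    (a sketch; grab_html.py's actual sanitization may differ)."""
    # Collapse runs of non-alphanumeric characters into underscores
    slug = re.sub(r"[^A-Za-z0-9]+", "_", document_title).strip("_").lower()
    return "item_%s.html" % slug
```

For example, a page titled "Ratio Rally: Quick Breads" would produce item_ratio_rally_quick_breads.html under this scheme.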

We recommend creating at least 3 HTML files for each spider.

Interactive testing

You can use the scrapy shell and Python's reloading capabilities to quickly test your spiders. This example will use elanaspantry.com.

To test a spider:

  1. cd into scrapy_proj.
  2. Open the scrapy shell with scrapy shell.
  3. Fetch a recipe with fetch('http://www.elanaspantry.com/ratio-rally-quick-breads/').
  4. Import the spider with from openrecipes.spiders import elanaspantry_spider.
  5. Test your spider with elanaspantry_spider.ElanaspantryMixin().parse_item(response).

This should return something like this:

[{'datePublished': u'April 4, 2011',
  'description': [u'This gluten free muffin recipe is made with almond flour and is part of the quick bread ratio rally and my attempt to make a basic template for a muffin recipe.'],
  'image': [u'http://www.elanaspantry.com/blog/wp-content/uploads/2011/04/gluten-free-almond-flour-quick-bread-muffins-ratio-rally-recipe.jpg'],
  'ingredients': [u'4 ounces blanched almond flour (about 1 cup)',
                  u'4 ounces eggs (about 2 large eggs)',
                  u'1 ounce agave nectar or honey (around 1 tablespoon)',
                  u'\xbc teaspoon baking soda',
                  u'\xbd teaspoon apple cider vinegar'],
  'name': [u'Almond Flour Muffins'],
  'recipeYield': u'Makes 4 muffins',
  'source': 'elanaspantry',
  'url': 'http://www.elanaspantry.com/ratio-rally-quick-breads/'}]

After making changes to your spider, you'll need to:

  1. Reload the spider with reload(elanaspantry_spider).
  2. Test it again with elanaspantry_spider.ElanaspantryMixin().parse_item(response).