
KeyError with the initialization of an Item Field defined with None using ItemLoader #33

Open
Kiizuna067 opened this issue May 23, 2018 · 8 comments
Labels
enhancement New feature or request

Comments

@Kiizuna067

Using Scrapy 1.5.0
I took a look at the FAQ section and found nothing relevant, and the same goes for searching existing issues for the keyword KeyError on GitHub, Reddit, and Google Groups.

As you can see below, there is an inconsistency between loading an Item through an ItemLoader and initializing it directly when a field value is None or an empty string. First we add a value to a field (here, title) through an ItemLoader; then the loader creates an item with the load_item() method. Once that's done, we can't access the field if the value was None or an empty string. The inconsistency is that the other approach (initializing an Item directly with a field set to None or an empty string) doesn't raise a KeyError: the field is set.

The TakeFirst processor doesn't return a value when given only None or an empty string, which prevents the load_item() method of the ItemLoader class from adding an entry for the field.
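To illustrate, here is a minimal re-implementation of TakeFirst's documented behavior (a sketch for illustration only; the real class lives in scrapy.loader.processors), showing why empty values never reach the item:

```python
# Sketch of TakeFirst: return the first value that is neither None nor ''.
class TakeFirstSketch:
    def __call__(self, values):
        for value in values:
            if value is not None and value != '':
                return value
        # Implicitly returns None when every value is None or ''.

tf = TakeFirstSketch()
first = tf(['', None, 'real title'])  # -> 'real title'
nothing = tf(['', None])              # -> None, so load_item() sets nothing
```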

Here is a minimal example that reproduces the inconsistency.

import scrapy
from scrapy.loader import ItemLoader
from scrapy.loader.processors import TakeFirst, MapCompose

class MyItem(scrapy.Item):
    title = scrapy.Field()

class MyItemLoader(ItemLoader):
    default_output_processor = TakeFirst()
    title_in = MapCompose()

class MySpider(scrapy.Spider):
    name = "My"
    start_urls = ['https://blog.scrapinghub.com']

    def parse(self, response):
        titles = ['fake title', '', None]

        for title in titles:
            # First case: with ItemLoader
            loader = MyItemLoader(item=MyItem())
            loader.add_value('title', title)
            loaded_item = loader.load_item()

            # Second case: without ItemLoader
            item = MyItem(title=title)

            if title in ('', None):
                # inconsistency!
                assert 'title' not in loaded_item
                assert 'title' in item

We're using Python 3.5 for our project and found the following workaround to prevent this error.
We introduce a new class (DefaultAwareItem) which fills unset fields whose 'default' metadata has been set.

import scrapy

class DefaultAwareItem(scrapy.Item):
    """Item class aware of 'default' metadata of its fields.

    For this to work, each field that should have a default value must
    set a new `default` parameter in its Field constructor, e.g.::

        class MyItem(DefaultAwareItem):
            my_defaulted_field = scrapy.Field()
            # Identical to:
            #my_defaulted_field = scrapy.Field(default=None)
            my_other_defaulted_field = scrapy.Field(default='a value')

    """
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

        for field_name, field_metadata in self.fields.items():
            self.setdefault(field_name, field_metadata.get('default'))

class MyItem(DefaultAwareItem):
    title = scrapy.Field()
    title_explicitely_set = scrapy.Field(default="empty title")
@Kiizuna067
Author

@vincent-ferotin

@vincent-ferotin

Other workarounds we found could be:

  • Remove the test cases in TakeFirst.__call__() and ItemLoader.load_item(). This would be simpler, but would surely break compatibility.
  • Introduce our own loader class, DefaultAwareItemLoader(ItemLoader), whose load_item() method calls its parent's and then fills undefined fields with values from the 'default' metadata.
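The second workaround can be sketched without a Scrapy install. MiniItem, MiniLoader, and DefaultAwareMiniLoader below are stand-ins invented for illustration, not Scrapy's real classes; the mimic only reproduces the drop-empty-values behavior discussed in this issue:

```python
# Stand-in for scrapy.Item: a dict with per-field metadata.
class MiniItem(dict):
    fields = {'title': {'default': None}, 'price': {'default': 0}}

# Stand-in for ItemLoader: mirrors its behavior of dropping empty values.
class MiniLoader:
    def __init__(self, item):
        self.item = item
        self._values = {}

    def add_value(self, field_name, value):
        if value is not None and value != '':
            self._values[field_name] = value

    def load_item(self):
        self.item.update(self._values)
        return self.item

class DefaultAwareMiniLoader(MiniLoader):
    """load_item() that backfills unset fields from 'default' metadata."""
    def load_item(self):
        item = super().load_item()
        for field_name, meta in item.fields.items():
            item.setdefault(field_name, meta.get('default'))
        return item

loader = DefaultAwareMiniLoader(MiniItem())
loader.add_value('title', None)   # dropped by the loader, as in Scrapy
item = loader.load_item()         # but 'title' is backfilled with None
```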

Also, please note that the default behavior of the Item class is not obvious for a newcomer: fields not explicitly set at initialization are absent from the created instance, and accessing them raises a KeyError. Although this is documented (in the API examples), it's not obvious, as one could expect all "attributes" of an Item class to exist by default. Hence the workaround proposed here by Nadir.

@wRAR
Member

wRAR commented Jun 25, 2018

Hello @NadirRisc and @vincent-ferotin.

In my opinion it's OK for code that uses an ItemLoader to behave differently from code that constructs an Item directly, as item loaders are expected to process their input.

It's also fine to subclass the item and item loader classes to get the behavior you need. You can also subclass an exporter and use export_empty_fields if you need the empty fields only in the export files and not in code. I think your DefaultAwareItem is an interesting class that can be useful in specific scenarios, but it does more than what you are discussing here, as it seems you are only concerned with setting empty values. If you want an Item that returns all its declared fields even when they are not set, it should be enough to override a few methods such as __getitem__ to take self.fields into account.

Finally, I don't think it's OK to break the existing behavior by keeping empty values, as the common pattern with item loaders assumes you don't know at run time whether a value is empty and want such values ignored (this is mostly about add_css and add_xpath). This is ensured not only in the two places you mention, but in some others too, including ItemLoader.add_value.
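For reference, the __getitem__ override mentioned above could look roughly like this (FieldsAwareMapping and its fields dict are hypothetical; a real version would subclass scrapy.Item and reuse its machinery):

```python
# Hypothetical mapping that returns None for declared-but-unset fields.
class FieldsAwareMapping(dict):
    fields = {'title': {}, 'trim': {}}

    def __getitem__(self, key):
        try:
            return super().__getitem__(key)
        except KeyError:
            if key in self.fields:
                return None  # declared but never set
            raise  # truly unknown key: keep raising KeyError

m = FieldsAwareMapping()
m['title'] = 'a title'
# m['title'] -> 'a title'; m['trim'] -> None; m['unknown'] -> KeyError
```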

@OliverCrosby

Newbie here,

I'm having the same problem and a hard time understanding the workaround.

The issue is that when the scraper encounters an empty field, it raises a KeyError and that item gets dropped. I want to be able to store the data, even if some of the fields are empty.

Would any of you be able to better explain and walk me through the workaround?

Here's an example of the error:

Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 577, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/home/vagrant/usedcars/usedcars/pipelines.py", line 41, in process_item
    self.cur.execute("insert into listings(price, year, make, model, trim, body, kms, description, dealer, url) values(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)",(item['price'],item['year'],item['make'],item['model'],item['trim'],item['body'],item['kms'],item['description'],item['dealer'],item['url']))
  File "/usr/local/lib/python2.7/dist-packages/scrapy/item.py", line 59, in __getitem__
    return self._values[key]
KeyError: 'trim'

Here is what I assume is the relevant source code:

# Imports inferred from usage (format_me is a project-specific helper):
import scrapy
from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose, Join
from w3lib.html import remove_tags

class KijijiSpider(scrapy.Spider):
	name = 'carcrawler'
	start_urls = [
		'https://www.kijiji.ca/b-cars-trucks/ottawa/acura__audi__bmw__eagle__honda__hyundai__kia__lexus__scion__toyota__volvo-a3__a4__accent__accord__accord_crosstour__allroad__avalon__camry__civic__corolla__elantra__es__fit__forte__hs_250h__insight__ls__optima_hybrid__prius__prius_v__sonata__sonata_hybrid__soul__tsx__v70__x1__xb-used/c174l1700185a54a1000054a49?ad=offering',
	]
	def parse(self, response):
		for listing in response.css('div.title a::attr(href)').extract():
			yield scrapy.Request(response.urljoin(listing), callback=self.parse_listing)
	
	def parse_listing(self, response):
		l = ListingLoader(item=UsedcarsItem(), response=response)
		l.add_xpath('price', '//span[@itemprop="price"]/@content')
		l.add_xpath('year', '//dd[@itemprop="vehicleModelDate"]/text()')
		l.add_xpath('make', '//dd[@itemprop="brand"]')
		l.add_xpath('model', '//dd[@itemprop="model"]/text()')
		l.add_xpath('trim', '//dd[@itemprop="vehicleConfiguration"]')
		l.add_xpath('body', '//dd[@itemprop="bodyType"]')
		l.add_xpath('kms', '//dd[@itemprop="mileageFromOdometer"]')
		l.add_xpath('description', '//div[@itemprop="description"]')
		l.add_xpath('dealer', '//div[@id="R2SProfile"]/header/text()')
		l.add_value('url', response.url)
		# l.add_value('last_updated', 'today')
		return l.load_item()

class UsedcarsItem(scrapy.Item):
	price = scrapy.Field()
	year = scrapy.Field()
	make = scrapy.Field()
	model = scrapy.Field()
	trim = scrapy.Field(default=None)
	body = scrapy.Field()
	kms = scrapy.Field()
	description = scrapy.Field(
		input_processor=MapCompose(remove_tags, unicode.strip),
		output_processor=Join(),
	)
	dealer = scrapy.Field()
	url = scrapy.Field()
	# last_updated = scrapy.Field()

class ListingLoader(ItemLoader):
	default_output_processor = Join()
	default_item_class = UsedcarsItem

	kms_in = MapCompose(remove_tags, format_me)
	make_in = MapCompose(remove_tags)
	description_in = MapCompose(remove_tags)
	body_in = MapCompose(remove_tags)
	trim_in = MapCompose(remove_tags)

I also have a pipeline that loads everything into my DB.
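One practical fix on the pipeline side (a suggestion, not from the thread) is to read fields with dict-style .get(), so unset fields become None and are inserted as SQL NULL instead of raising KeyError. row_from_item is a hypothetical helper:

```python
# Build the SQL parameter tuple with .get() so unset fields become None
# (most DB drivers store None as NULL) instead of raising KeyError.
def row_from_item(item, columns):
    return tuple(item.get(col) for col in columns)

COLUMNS = ('price', 'year', 'make', 'model', 'trim',
           'body', 'kms', 'description', 'dealer', 'url')

# In process_item(), something like:
# self.cur.execute("insert into listings(price, ..., url) "
#                  "values(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)",
#                  row_from_item(item, COLUMNS))
```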

@dashw00d

I used Nadir's solution and it works perfectly for my purposes. I tweaked it a bit to catch any fields without a default and added a "No default set" fallback.

class DefaultAwareItem(Item):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Use dict's built-in setdefault() on every declared field,
        # falling back to a sentinel when no 'default' metadata exists
        # (using .get()'s default argument also keeps falsy defaults
        # such as 0 or '' instead of treating them as unset).
        for field_name, field_metadata in self.fields.items():
            self.setdefault(field_name,
                            field_metadata.get('default', 'No default set'))

Thanks Nadir! :)

@pokedo0

pokedo0 commented Jan 1, 2019

def add_value(self, field_name, value, *processors, **kw):
    value = self.get_value(value, *processors, **kw)
    if value is None:
        return
    if not field_name:
        for k, v in six.iteritems(value):
            self._add_value(k, v)
    else:
        self._add_value(field_name, value)

What exactly does this part of the code do?

        if value is None:
            return

So, if the field's value is None, the item loader won't insert the key-value pair into the item, and accessing that key on the item will then raise a KeyError.
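That guard can be seen in isolation with a minimal mimic (AddValueSketch is an illustration, not Scrapy's real class; the field and processor machinery is stripped out):

```python
# Mimic of ItemLoader's add_value() guard: None is silently dropped,
# so it never appears among the collected values.
class AddValueSketch:
    def __init__(self):
        self._values = {}

    def add_value(self, field_name, value):
        if value is None:
            return  # the early return asked about: None is never stored
        self._values.setdefault(field_name, []).append(value)

loader = AddValueSketch()
loader.add_value('title', None)    # dropped
loader.add_value('title', 'real')  # stored
```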

@ejulio
Collaborator

ejulio commented May 3, 2019

I think we can work something out here, mostly because of the API.
When I call loader.load_item(), I'd expect all the fields I defined to be there, even if I set them to None.

Maybe we can add some kind of setting to ItemLoader, or even a new ItemLoader with this behavior, and let the user choose between them.

Also, I think this one is related to scrapy/scrapy#2498.

@tofarias

Thanks @dashw00d and @NadirRisc, perfect solutions!!!!

@Gallaecio Gallaecio transferred this issue from scrapy/scrapy Oct 30, 2020
@Gallaecio Gallaecio added the enhancement New feature or request label Feb 21, 2021