
Selectors return corrupted, recursive DOM on some sites #184

Open
shervinmathieu opened this issue Feb 7, 2020 · 3 comments

Comments

@shervinmathieu

Description

On a specific site, Scrapy selectors (CSS and XPath) corrupt the DOM recursively and return an incorrect number of items as a result. I encountered this issue while parsing base-search.net search results, but the bug may occur on other sites as well.

Steps to Reproduce

Example for base-search.net

  1. Begin parsing a base-search.net search results page, e.g.: scrapy shell "https://www.base-search.net/Search/Results?lookfor=graph+visualisation"
  2. Count the div elements with the class .record-panel: response.css(".record-panel") returns 10 items
  3. Now select an element inside this div, for example response.css(".link-gruen"): the output is also 10 items
  4. Now chain these two selectors: response.css(".record-panel").css(".link-gruen") now returns 55(!) items, even though there are only 10 .link-gruen elements in the DOM
  5. Notice that response.css(".record-panel .record-panel") returns a non-zero number of items, although no such nesting exists in the original DOM
  6. Chain selectors on this non-existent element and notice that the number of .link-gruen items returned grows recursively: response.css(".record-panel").css(".record-panel").css(".link-gruen") returns 220 items, and response.css(".record-panel").css(".record-panel").css(".record-panel").css(".link-gruen") returns 715 items

Expected behavior:
Only ten items should be returned in this example.

Actual behavior:
Each selector's DOM contains not only its own .record-panel but also all following .record-panel divs, nested recursively. Chaining selectors on this corrupted DOM corrupts it further, increasing the number of items returned without bound.

Reproduces how often: Always

Versions

Scrapy : 1.8.0
lxml : 4.5.0.0
libxml2 : 2.9.10
cssselect : 1.1.0
parsel : 1.5.2
w3lib : 1.21.0
Twisted : 19.10.0
Python : 3.7.5 (default, Nov 7 2019, 10:50:52) - [GCC 8.3.0]
pyOpenSSL : 19.1.0 (OpenSSL 1.1.1d 10 Sep 2019)
cryptography : 2.8
Platform : Linux-4.15.0-76-generic-x86_64-with-Ubuntu-18.04-bionic

Additional context

The issue happens with both CSS and XPath selectors; equivalent XPath expressions lead to the same result.
Opening view(response) shows that the DOM Scrapy receives for parsing does not contain any recursive items: selecting .record-panel .record-panel in the browser's element inspector (on the local file, not the live page) yields no results. In Scrapy, however, response.css(".record-panel .record-panel") returns 9 items, response.css(".record-panel .record-panel .record-panel") returns 8 items, and so on.

@elacuesta transferred this issue from scrapy/scrapy Feb 7, 2020
@elacuesta
Member

Transferred the issue here since it doesn't seem to be a problem with Scrapy specifically but rather with Parsel, the underlying selector library:

In [1]: from parsel import Selector, __version__

In [2]: __version__
Out[2]: '1.5.2'

In [3]: import requests

In [4]: sel = Selector(text=requests.get("https://www.base-search.net/Search/Results?lookfor=graph+visualisation").text)

In [5]: len(sel.css(".record-panel"))
Out[5]: 10

In [6]: len(sel.css(".link-gruen"))
Out[6]: 10

In [7]: len(sel.css(".record-panel").css(".link-gruen"))
Out[7]: 55

@Gallaecio added the bug label Feb 11, 2020
@Gallaecio
Member

Gallaecio commented May 22, 2020

👀

>>> for panel in sel.css('.record-panel'):
...     print(len(panel.css('.link-gruen')))
... 
10
9
8
7
6
5
4
3
2
1

@Gallaecio
Member

Gallaecio commented May 22, 2020

There is a bug in the source HTML which browsers manage to recover from but lxml does not: the site closes HTML comments with --!> instead of -->. (The HTML5 tokenizer explicitly treats --!> as closing the comment, flagging an "incorrectly closed comment" parse error, which is why browsers render the page correctly.)

Workaround: .replace('--!>', '-->')

>>> text = requests.get("https://www.base-search.net/Search/Results?lookfor=graph+visualisation").text
>>> text = text.replace('--!>', '-->')
>>> sel = Selector(text=text)
>>> len(sel.css(".record-panel").css(".link-gruen"))
10
>>> for panel in sel.css('.record-panel'):
...     print(len(panel.css('.link-gruen')))
... 
1
1
1
1
1
1
1
1
1
1
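
For reuse, the workaround can be wrapped in a small helper that normalises the bad terminator before the markup reaches the parser (the function name is mine; note that the blind replacement would also touch a literal --!> inside a script, which is unlikely but possible):

```python
def normalize_comment_closers(html: str) -> str:
    # libxml2 only recognises "-->" as a comment terminator, so rewrite
    # the non-standard "--!>" before parsing. This is a blind string
    # replacement: a literal "--!>" inside a <script> would also change.
    return html.replace('--!>', '-->')

print(normalize_comment_closers('<!-- note --!><div class="record-panel"></div>'))
```

The result can then be passed to Selector(text=...) as in the session above.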

I suggest we leave this open as a feature request.

Hopefully #83 will allow fixing this, but this issue should remain open: if a new parser introduced as part of #83 does not fix it, we should look for alternative parsers that handle this markup, or get support for it upstream in one of the supported parsers.
