Support "html5" type to use html5lib parser #83

redapple · 2017-05-09T10:19:48Z

Every now and then we get a bug report about some HTML source not being parsed as a browser would.

There was the idea in Scrapy of adding an "html5" type to switch to an HTML5 compliant parser.
One such parser is html5lib, which can be used with lxml.

joaquingx · 2019-01-11T20:25:52Z

I'll work on this 💪

grahamanderson · 2019-05-07T18:26:50Z

Is there any update on this?
In my project, I can scrape items with beautifulsoup that fail with scrapy.
In my case, it happens in 1/20 pages in my scrapy project.
I hate to waste page data--if I don't have to :)
Or, is there an elegant workaround?

Example Code

import requests
from bs4 import BeautifulSoup
from parsel import Selector

url = 'https://www.homeadvisor.com/rated.CoventryAdditions.60530954.html'
r = requests.get(url)

# Author Fails
sel = Selector(text=r.text)
print('Page Title is {}'.format(sel.xpath("//title//text()").get()))  # Success
print(sel.xpath('//span[contains(@itemprop,"author")]//text()').get())  # None (no match)
print(sel.css('span[itemprop="author"]').get())  # None (no match)

# Author Works
soup = BeautifulSoup(r.content, 'lxml')  # also works with html5lib
print('title is: {}'.format(soup.title.text))  # Success
for author in soup.find_all("span", {"itemprop": "author"}):
    print(author.text)  # Success

Gallaecio · 2019-05-08T07:44:43Z

@grahamanderson You can try and review the pull request at #133

Alternatively, you can use the following workaround in a downloader middleware or in the callbacks of your spider:

from bs4 import BeautifulSoup

# …

response = response.replace(body=str(BeautifulSoup(response.body, "html5lib")))

grahamanderson · 2019-05-08T17:22:37Z

Thank you @Gallaecio !
I used the scrapy-beautifulsoup code...as middleware
Strangely, I did not have to resort to using html5lib. BeautifulSoup's lxml parser seems a bit more robust than Scrapy/Parsel's LXML parser.

from bs4 import BeautifulSoup


class BeautifulSoupMiddleware(object):
    def __init__(self, crawler):
        super(BeautifulSoupMiddleware, self).__init__()

        self.parser = crawler.settings.get('BEAUTIFULSOUP_PARSER', "html.parser")

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def process_response(self, request, response, spider):
        """Overridden process_response would "pipe" response.body through BeautifulSoup."""
        return response.replace(body=str(BeautifulSoup(response.body, self.parser)))

Gallaecio · 2020-02-04T14:01:58Z

From @whalebot-helmsman:

> There is an HTML5 parser implementation for lxml (https://lxml.de/api/lxml.html.html5parser-pysrc.html)

aryamanpuri · 2020-03-16T13:28:29Z

Hi all,
I am late to start for GSoC 2020, but I found this issue interesting, and I have good knowledge of web development with Python and JavaScript.
Can someone help me with how to get started?

Gallaecio · 2020-03-16T19:22:03Z

Start by reading http://gsoc2020.scrapinghub.com/participate and the links at the top to Python and Google resources. Mind that student applications have just started and will close in a couple of weeks.

aryamanpuri · 2020-03-17T06:43:30Z

So should I start contributing to the project or start making a good proposal?

Gallaecio · 2020-03-17T09:35:36Z

You can start with whichever you prefer, but you need to do both before the deadline. Proposals from students who have not submitted any patch will not be considered.

If you start with your proposal, and you can manage to isolate a small part of the proposal that you can implement in a week or less, you could implement that as your contribution, and that would speak highly of your ability to complete the rest of the project.

aryamanpuri · 2020-03-17T09:55:40Z

Parsel can extract data from HTML and XML, but because of some exceptions in HTML, such as the use of # in tag attributes and the different techniques used to visualize tags in HTML, there is a need for the html5lib parser.
Do I get it right?
Anything more that can help me?

Gallaecio · 2020-03-17T10:17:16Z

Make sure you have a look at the issues linked from this thread.

Another benefit of supporting a parser like html5lib, for example, is that the HTML tree that it builds in memory is closer to what you see in a web browser when you use the Inspect feature.

lopuhin · 2020-09-22T13:00:46Z

> There is an HTML5 parser implementation for lxml (https://lxml.de/api/lxml.html.html5parser-pysrc.html)

In my tests it looked quite slow (e.g. 130 ms to parse an html which took lxml.html only 9 ms), while html5-parser looks fast (only 7 ms for the same html) and returns lxml tree as well: https://html5-parser.readthedocs.io/en/latest/

EDIT: although there is a problem: html5-parser returns lxml.etree._Element while lxml.html returns lxml.html.HtmlElement, and the two have slightly different APIs.
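
For anyone wanting to reproduce this kind of comparison, here is a minimal timing sketch. It is not the benchmark used for the figures above. The names `parse_with_stdlib`, `time_parser` and `SAMPLE` are made up for illustration, and the standard-library `html.parser` is only a stand-in so that the snippet runs without extra packages. To compare the parsers discussed here, pass `lxml.html.fromstring` or `html5_parser.parse` as the `parse` argument instead, assuming those packages are installed.

```python
import timeit
from html.parser import HTMLParser


def parse_with_stdlib(markup: str) -> None:
    """Stand-in parser so the sketch runs without third-party packages."""
    parser = HTMLParser()
    parser.feed(markup)
    parser.close()


def time_parser(parse, markup: str, repeat: int = 5, number: int = 10) -> float:
    """Return the best per-call time, in milliseconds, of parse(markup)."""
    timings = timeit.repeat(lambda: parse(markup), repeat=repeat, number=number)
    return min(timings) / number * 1000.0


# Hypothetical sample document; a real comparison would use a saved page body.
SAMPLE = "<html><body>" + "<p>row <b>cell</b></p>" * 1000 + "</body></html>"

best_ms = time_parser(parse_with_stdlib, SAMPLE)
print("html.parser: {:.2f} ms per parse".format(best_ms))
```

Taking the minimum over several repeats reduces noise from other processes. Absolute numbers depend on the machine and the document, so only ratios between parsers measured on the same input are meaningful.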

99bcsagar · 2021-03-10T17:28:24Z

Can I work on it for GSoC?

Gallaecio · 2021-03-10T20:51:27Z

> Can I work on it for GSoC?

That would be great. Please have a look at https://gsoc2021.zyte.com/participate for details.

99bcsagar · 2021-03-12T11:20:34Z

Sir, is it a continuation of previous contributions, or should I do it completely from scratch?

Gallaecio · 2021-03-12T11:24:52Z

There has been a previous attempt with feedback, #133, which could serve as a starting point or inform an alternative approach. Other than that, this would need to be done from scratch, yes.

ashishsun · 2021-03-17T03:27:00Z

Hello, I am new here. Should I work on this project? There are not many new issues listed here.

Gallaecio · 2021-03-17T09:49:31Z

> Hello, I am new here. Should I work on this project? There are not many new issues listed here.

Do you mean as a Google Summer of Code student candidate?

garput2 · 2021-04-11T10:03:54Z

Hello, my name is Garry Putranto Arimurti, a GSoC candidate. I am interested in contributing to this project and I would like to learn more about the issue so I can work on it. Is there any specific issue I can work on and improve here? Thanks!

Gallaecio · 2021-04-13T08:34:16Z

@garput2 It’s hard to provide feedback without specific questions, but I guess #153 is a somewhat related pull request that gives a view of what would probably be a good first step towards supporting an HTML5 parser.

On the other hand, to participate in GSoC with us you need a pre-application pull request, in addition to presenting a proposal. Since today is the last day to present a proposal, your timing is a little tight.

tonal · 2021-08-11T04:06:44Z

Create Selector for html5:

from lxml.html.html5parser import document_fromstring

def selector_from_html5(response):
  root = document_fromstring(response.text)
  selector = Selector(response, type='html', root=root)
  return selector

lopuhin · 2021-08-11T10:27:49Z

I think recent work done by @whalebot-helmsman on https://github.com/kovidgoyal/html5-parser/ is relevant here - now it's possible to use a fast and compliant html5 parser (using a variant of the gumbo parser) and get an lxml.html tree as a result with treebuilder='lxml_html'

whalebot-helmsman · 2021-08-12T12:37:01Z

Yes, it is possible. There is one thing which hinders widespread adoption of html5-parser: you need to install lxml from sources.

vladiscripts · 2021-10-07T14:41:37Z

  response.replace(body=str(BeautifulSoup(response.body, self.parser)))

You can get a charset error using this if the original page was not UTF-8 encoded, because the response is set to another encoding.
So, you must first change the encoding.

In addition, there may be a problem of character escaping.
For example, if the character < is encountered in the text of the HTML, then it must be escaped as &lt;. Otherwise, "lxml" will delete it and the text near it, considering it an erroneous HTML tag.
"html5lib" escapes characters, but is slow.

r = response.replace(encoding='utf-8', body=str(BeautifulSoup(response.body, 'html5lib')))

"html.parser" is faster, but from_encoding must also be specified (for example, 'cp1251').

r = response.replace(encoding='utf-8', body=str(BeautifulSoup(response.body, 'html.parser', from_encoding='cp1251')))
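
The two pitfalls above can be seen without Scrapy or BeautifulSoup. The following standard-library sketch uses a made-up cp1251 page body (the text and variable names are illustrative assumptions). It shows that the raw bytes are not valid UTF-8, that decoding with the page's real charset and re-encoding fixes this, and what a properly escaped < looks like in a text node.

```python
from html import escape

# Hypothetical text node containing a bare "<", as found on some pages.
text = "Цена < 100 руб."

# The body as a server might send it: encoded as cp1251, not UTF-8.
raw_body = "<p>{}</p>".format(text).encode("cp1251")

# Treating the bytes as UTF-8 fails, which is the charset error described above.
try:
    raw_body.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# Decode with the real charset first, then re-encode as UTF-8.
utf8_body = raw_body.decode("cp1251").encode("utf-8")

# Escape the text node so a parser cannot mistake "<" for the start of a tag.
safe_fragment = "<p>{}</p>".format(escape(text))

print(utf8_failed)     # True
print(safe_fragment)   # <p>Цена &lt; 100 руб.</p>
```

This only illustrates the underlying byte and entity handling. In the workaround above, BeautifulSoup performs the decoding (via from_encoding) and the serialization.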

averms · 2021-10-09T00:34:33Z

> Yes, it is possible. There is one thing which hinders widespread adoption of html5-parser: you need to install lxml from sources.

Another option is selectolax. The only issue would be a possible (idk if this is an actual issue) legal problem: rushter/selectolax#18.

Gallaecio · 2021-10-09T06:06:03Z

I believe there is no legal issue.

That said, Parsel heavily relies on lxml, whereas https://github.com/rushter/selectolax seems to go a different route, offering much better performance according to them. So I think integrating selectolax into Parsel while keeping the Parsel API and behavior would be rather hard, compared to something like #83 (comment).

On the other hand, if the upstream benchmark results are to be trusted (~7 times faster than lxml), in the long term it may be worth looking into replacing, or at least allowing to replace, the Parsel lxml backend with one based on selectolax. But that should probably be logged as a different issue. Maybe a good idea for a Google Summer of Code project.

deepakdinesh1123 · 2022-02-23T10:12:11Z

Seems like selectolax does not offer support for XPath selectors and supports only CSS selectors. If the lxml backend were to be replaced with selectolax, should XPath selectors be supported by converting XPath to CSS? This could be done by adding support for the conversion in cssselect. I found a quick workaround using the cssify library.

Gallaecio · 2022-02-23T18:55:41Z

> should XPath selectors be supported by converting XPath to CSS?

I would not go that route because while all CSS Selectors expressions can be expressed as XPath 1.0, it does not work the other way around. I think supporting CSS Selectors expressions only would be OK in this case.

deepakdinesh1123 · 2022-02-27T13:13:13Z

> I think supporting CSS Selectors expressions only would be OK in this case.

So, should the existing backend be preserved for supporting XPath, along with a new parser for CSS? Or should another parser which supports XPath be added?

Gallaecio · 2022-02-28T15:50:54Z

I am just thinking out loud here, I have no strong opinions, but my guess is that, from the user perspective, you would choose a parser (or pass an instance of it) when creating a Selector, and for this alternative parser calls to xpath and related methods would raise NotImplementedError.
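
A rough sketch of that idea, under loud assumptions: `SketchSelector` and `ToyCssBackend` are hypothetical names, not Parsel API, and the toy backend understands only bare tag names so that the example runs on the standard library alone. The point is only the dispatch: the selector delegates to whichever backend it was given and raises NotImplementedError for a query language that backend lacks.

```python
from html.parser import HTMLParser


class _TagTextCollector(HTMLParser):
    """Collect the text of every element with a given tag name."""

    def __init__(self, tag):
        super().__init__()
        self._tag = tag
        self._depth = 0
        self.results = []

    def handle_starttag(self, tag, attrs):
        if tag == self._tag:
            self._depth += 1
            if self._depth == 1:
                self.results.append("")

    def handle_endtag(self, tag):
        if tag == self._tag and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.results[-1] += data


class ToyCssBackend:
    """Hypothetical CSS-only backend; supports bare tag selectors only."""

    def __init__(self, text):
        self._text = text

    def css(self, query):
        collector = _TagTextCollector(query)
        collector.feed(self._text)
        collector.close()
        return collector.results


class SketchSelector:
    """Hypothetical selector that delegates to the backend chosen at creation."""

    def __init__(self, text, backend=ToyCssBackend):
        self._backend = backend(text)

    def css(self, query):
        return self._backend.css(query)

    def xpath(self, query):
        # Backends without an xpath() method cannot answer XPath queries.
        if not hasattr(self._backend, "xpath"):
            raise NotImplementedError(
                "{} does not support XPath".format(type(self._backend).__name__)
            )
        return self._backend.xpath(query)


sel = SketchSelector("<html><title>Hi</title><p>one</p><p>two</p></html>")
print(sel.css("p"))  # ['one', 'two']
try:
    sel.xpath("//p")
except NotImplementedError as exc:
    print(exc)  # ToyCssBackend does not support XPath
```

A user who needs XPath would then create a second selector with an XPath-capable backend, rather than having one object parse the input twice.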

aadityasinha-dotcom · 2022-03-18T05:22:34Z

Hey!! Is this issue available?
Can anyone discuss it with me?

lopuhin · 2022-03-18T07:14:59Z

hi @aadityasinha-dotcom yes the issue is open and available, please continue discussion here.

aadityasinha-dotcom · 2022-03-18T07:28:13Z

So, I want to work on this project for GSOC'22.
Here's what I found:

- Adding support to use html5lib instead of the default HTML parser
- Adding an HTML5Parser option (create a Selector for html5, as mentioned above)

aadityasinha-dotcom · 2022-03-18T07:29:26Z

Also, I want to know what a pre-application pull request looks like.

Gallaecio · 2022-03-18T09:49:54Z

> Also, I want to know what a pre-application pull request looks like.

It can be anything, really. Please, check out https://gsoc2022.zyte.com/participate#pre-application-pull-request and let us know if you have any question beyond what is covered there.

aadityasinha-dotcom · 2022-03-18T10:19:14Z

@Gallaecio Can I make a PR with some description and ideas/tasks regarding this?

Gallaecio · 2022-03-18T10:20:22Z

Sure, go ahead.

deepakdinesh1123 · 2022-03-19T09:19:36Z

> and for this alternative parser calls to xpath and related methods would raise NotImplementedError.

Would it be better to use selectolax as the parser for css and if xpath method is called on the object, parse it through html5lib or html5-parser or lxml? This way it would be easy to use both css and xpath selectors.

Gallaecio · 2022-03-21T12:15:08Z

I don’t think it is a good idea to have a Selector class that parses the input data twice with 2 different parsers and returns data from one or another depending on the method used to extract data.

If a user really wants that, I think it is OK to ask them to instantiate 2 different Selector objects.

deepakdinesh1123 · 2022-03-21T17:02:58Z

So the basic idea is to add support for

- selectolax: if xpath calls are made to this object, a NotImplementedError is to be raised; users can create a different Selector object with another parser if required.
- html5lib: I am having a little trouble with this. I tried to create a parser with etree and lxml, but both returned None; I am looking into it.
- html5-parser: installing html5-parser on Windows is a tedious process, as it relies on libxml2, zlib and libxslt, and the docs mention using Visual Studio 2015 to install html5-parser and its dependencies. It's not so tedious on Unix, though. Should adding support for it still be considered?

right? If there are no other changes I'll upload a draft proposal tomorrow.

Gallaecio · 2022-03-21T18:34:01Z

I believe the idea is to implement 1 of those 3 solutions (and it is open to alternative solutions as well), not all 3.

Ideally we should compare different aspects of proposed solutions, and choose one. I think performance may be the main determining factor, although query language support (e.g. XPath 1.0, XPath 2, CSS Selector) and behavior details (i.e. can it be an in-place replacement for lxml, or would some outputs differ?) may play a role if 2 or more solutions offer similar performance.

lopuhin · 2022-03-22T07:11:26Z

> installing html5-parser is not so tedious in unix though

It still requires compilation, so we'd still have to provide wheels to be able to install scrapy without a compiler, like it's possible today.

deepakdinesh1123 · 2022-03-22T15:29:43Z

I have uploaded a draft proposal; it still needs a lot of work. The timeline on the website doesn't specify separate timelines for 175-hour and 350-hour projects, so I still have to change that part a bit. Please suggest any changes that I need to make to my proposal.

GSOC Proposal

joaquingx mentioned this issue Jan 11, 2019

Add HTML5Parser option #133

Closed

Gallaecio added the enhancement label Aug 22, 2019

Gallaecio added the discuss label Sep 24, 2019

Gallaecio mentioned this issue May 22, 2020

Selectors return corrupted, recursive DOM on some sites #184

Open

aadityasinha-dotcom mentioned this issue Mar 18, 2022

Added "html5" support to use html5lib parser #238

Closed
