HTMLParser stops parsing upon encountering <style> tag #118350

Open
savchenko opened this issue Apr 27, 2024 · 3 comments
Labels
type-bug An unexpected behavior, bug, or error

Comments

savchenko commented Apr 27, 2024

Bug report

Bug description:

An example where parsing stops after the <style color="red">:

from html.parser import HTMLParser
from io import StringIO

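class HTML2text(HTMLParser):
    """Collect the text content of a document, discarding the markup."""

    def __init__(self):
        super().__init__()
        self.data = StringIO()

    def handle_data(self, html):
        # Called by HTMLParser for every run of text between tags.
        self.data.write(html)

    def get_data(self):
        return self.data.getvalue().strip()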

html_test = '''
<!DOCTYPE html>
<head><title>Glued</title></head><body><some><style color="red">title</bar>
<h1>Spacious             </h1><a href="https://heading.net">heading.net</a>
<span>not<a href="https://www.arpa.home">my.home.arpa</a><p>        URL</p>
</body></html>
'''

parser = HTML2text()
parser.feed(html_test)
print(parser.get_data())

Changing a single character in the tag name "style", so that it is no longer a <style> tag, restores normal parsing.

CPython versions tested on:

3.11

Operating systems tested on:

Linux

savchenko added the type-bug label Apr 27, 2024
@JelleZijlstra
Member

Isn't this because you didn't close your <style> tag? If I remember correctly style tags go on until </style> is seen regardless of any other tag-like text within the tag, because they may contain text in other languages.
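The effect described above can be checked with a minimal sketch that records which start tags the parser reports. `TagLogger` and `start_tags` are hypothetical helper names introduced only for this illustration, and the two input strings are shortened variants of the snippet from the report. The assumption is that `<h1>` is reported as a start tag only when `<style>` has been closed.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Hypothetical helper: records the name of every start tag seen."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)


def start_tags(html):
    """Feed `html` to a fresh TagLogger and return the start tags it saw."""
    parser = TagLogger()
    parser.feed(html)
    parser.close()  # flush anything still buffered
    return parser.start_tags


# Same markup twice; only the closing tag after "title" differs.
unclosed = '<body><style color="red">title</bar><h1>Spacious</h1></body>'
closed = '<body><style color="red">title</style><h1>Spacious</h1></body>'

print(start_tags(unclosed))  # <h1> is swallowed as raw style text
print(start_tags(closed))    # <h1> is reported as a start tag
```

How the swallowed text is delivered to `handle_data` may differ between Python versions, so the sketch compares only the start tags.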

@savchenko
Author

@JelleZijlstra, indeed! Closing <style> allows the snippet to be parsed. However, isn't that inconsistent with the behaviour observed when parsing other tags?

For example, this broken HTML is parsed correctly:

<head><title>Rebelious<h1>Heading<a href="https://example.net">example.net
<span>not<a href="https://www.arpa.home">arpa.home<p>Paragraph<h2>and more

@vadmium
Member

vadmium commented Apr 28, 2024

The difference is that <style> is a “raw text element”. It cannot contain HTML markup. Most other tags can be followed by other markup. See https://html.spec.whatwg.org/multipage/syntax.html#raw-text-elements

However I believe your HTML with <title>Rebelious<h1> . . . does trigger a bug. The <title> element is supposed to be an “escapable raw text element”, so <h1> should be counted as part of the raw title text. However it gets parsed as a tag:

Encountered a start tag: title
Encountered some data : Rebelious
Encountered a start tag: h1
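Output of this kind can be produced with an event-printing subclass along the following lines. This is a sketch rather than the exact script used above: `EventPrinter` and its `events` list are hypothetical names, and the input is a shortened form of the `<title>` example. Whether `<h1>` shows up as a start tag or as part of the title text may depend on the Python version in use.

```python
from html.parser import HTMLParser


class EventPrinter(HTMLParser):
    """Hypothetical helper: prints and records each parser event."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))
        print("Encountered a start tag:", tag)

    def handle_endtag(self, tag):
        self.events.append(("end", tag))
        print("Encountered an end tag :", tag)

    def handle_data(self, data):
        self.events.append(("data", data))
        print("Encountered some data :", data)


parser = EventPrinter()
parser.feed("<head><title>Rebelious<h1>Heading")
parser.close()  # flush any text still held in the buffer
```

If `<title>` were treated as an escapable raw text element, `("start", "h1")` would be absent from `parser.events` and `<h1>` would appear inside a data event instead.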
