Work around incorrect extraction of "reserved" HTML entities #76

Open
immerrr opened this issue Feb 20, 2017 · 1 comment

immerrr commented Feb 20, 2017

The entities marked as reserved here (scroll down to see the list) are extracted literally by lxml, whereas it should probably aim for compatibility with browsers, which interpret them according to CP1252.

A quick example:

In [13]: etree.fromstring('<p>&#133;</p>').text
Out[13]: u'\x85'

whereas modern browsers usually show it as an ellipsis:

In [5]: u'\u2026'
Out[5]: '…'
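
To make the expected mapping concrete: a reference such as &#133; names code point 0x85, and a browser treats that number as a CP1252 byte rather than as a C1 control character. A minimal sketch using only Python 3's standard codecs (the helper name `browser_char` is made up for illustration and is not part of lxml):

```python
def browser_char(codepoint):
    """Return the character a browser would likely show for a reserved reference.

    Assumes codepoint is in the range 0x80-0x9F and is defined in CP1252;
    undefined positions (e.g. 0x81) raise UnicodeDecodeError here.
    """
    return bytes([codepoint]).decode("cp1252")


# &#133; interpreted the CP1252 way, printed as an escaped literal.
print(ascii(browser_char(133)))
```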

redapple commented

Thanks for reporting, @immerrr!
It does not look straightforward to fix, though.
html5lib does the replacement explicitly,
while libxml2's HTMLParser does not seem to handle this case.

Maybe one could use the parser target interface to intercept the data and replace the chars, but I don't know about the processing penalty.
Sample code:

>>> import lxml.etree
>>> from html5lib.constants import replacementCharacters
>>> 
>>> # Map each reserved character to its replacement (Python 3; use unichr on Python 2).
>>> table = {chr(i): r for i, r in replacementCharacters.items()}
>>> 
>>> def charref_replace(s):
...     # Substitute reserved characters and leave everything else untouched.
...     return u''.join(table.get(c, c) for c in s)
... 
>>> class ReservedReplacementTarget(lxml.etree.TreeBuilder):
...     def data(self, data):
...         # Rewrite text chunks before the tree builder stores them.
...         return super(ReservedReplacementTarget, self).data(charref_replace(data))
... 
>>> parser = lxml.etree.HTMLParser(target=ReservedReplacementTarget())
>>> print(lxml.etree.fromstring('<p>hello, &#133; world!</p>', parser=parser).xpath('//p')[0].text)
hello, … world!
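
If the per-chunk Python loop turns out to be the costly part, one possible alternative is to post-process already extracted text with `str.translate`, which does the substitution in C. The sketch below is an assumption on my side rather than anything lxml provides: it builds the table from Python 3's `cp1252` codec instead of html5lib, and the names `CP1252_TABLE` and `fix_reserved` are hypothetical. Positions that CP1252 leaves undefined (such as 0x81) are left as they are.

```python
def _build_cp1252_table():
    """Map C1 code points (0x80-0x9F) to their CP1252 characters."""
    table = {}
    for code in range(0x80, 0xA0):
        try:
            table[code] = bytes([code]).decode("cp1252")
        except UnicodeDecodeError:
            # CP1252 does not define this position; keep the character as is.
            pass
    return table


# Hypothetical module-level table, built once and reused.
CP1252_TABLE = _build_cp1252_table()


def fix_reserved(text):
    """Replace literally extracted reserved characters in a text value."""
    return text.translate(CP1252_TABLE)


# Applied to the value lxml currently returns for the example above.
print(fix_reserved('hello, \x85 world!'))
```

Like the parser target above, this rewrites every character in that range, whether it came from a character reference or was present literally in the input, so it only makes sense where that is acceptable.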
