Work around incorrect extraction of "reserved" HTML entities #76

immerrr · 2017-02-20T16:32:37Z

The numeric character references marked as reserved here (scroll down to see the list) are extracted literally by lxml, whereas it should probably strive for more compatibility with browsers, which interpret them according to CP1252.

A quick example:

In [13]: etree.fromstring('<p>&#133;</p>').text
Out[13]: u'\x85'

whereas modern browsers usually show it as an ellipsis …:

In [5]: u'\u2026'
Out[5]: '…'
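
For comparison, here is a small check using only the Python 3 standard library. As far as I can tell, `html.unescape` applies the HTML5 replacement rule for these references, and the CP1252 codec maps byte 0x85 to the same character. This is a sketch for illustration only; it does not involve lxml or change its behaviour.

```python
import html

# html.unescape follows the HTML5 rule for "reserved" numeric references,
# so &#133; is not returned as the raw C1 control character U+0085.
browser_like = html.unescape('&#133;')

# The same character is what CP1252 assigns to byte 0x85.
cp1252_char = b'\x85'.decode('cp1252')

print(repr(browser_like), repr(cp1252_char))
```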

redapple · 2017-02-23T17:17:09Z

Thanks for reporting, @immerrr!
It does not look straightforward to fix, though.
html5lib does the replacement explicitly, while the libxml2 HTMLParser does not seem to handle this case.

Maybe one could use the parser target interface to intercept the data and replace the chars, but I don't know about the processing penalty.
Sample code:

>>> import lxml.etree
>>> from html5lib.constants import replacementCharacters
>>> 
>>> # Keys are the reserved code points as one-character strings (Python 3: chr, not unichr).
>>> table = {chr(i): r for i, r in replacementCharacters.items()}
>>> 
>>> def charref_replace(s):
...     out = u''
...     for c in s:
...         if c in table:
...             out += table[c]
...         else:
...             out += c
...     return out
... 
>>> class ReservedReplacementTarget(lxml.etree.TreeBuilder):
...     def data(self, data):
...         return super(ReservedReplacementTarget, self).data(charref_replace(data))
... 
>>> parser = lxml.etree.HTMLParser(target = ReservedReplacementTarget())
>>> print(lxml.etree.fromstring('<p>hello, &#133; world!</p>', parser=parser).xpath('//p')[0].text)
hello, … world!
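
The character-by-character loop could probably be replaced with `str.translate`, which may reduce the per-call overhead. The sketch below is a standard-library-only variant and rests on some assumptions: it builds the table from Python's CP1252 codec instead of html5lib's `replacementCharacters`, so it only covers the 0x80-0x9F range, and the names `build_c1_to_cp1252_table` and `fix_reserved_charrefs` are made up for illustration.

```python
def build_c1_to_cp1252_table():
    """Map C1 control code points (0x80-0x9F) to their CP1252 characters."""
    table = {}
    for code in range(0x80, 0xA0):
        try:
            table[code] = bytes([code]).decode("cp1252")
        except UnicodeDecodeError:
            # A few bytes (e.g. 0x81) are undefined in CP1252; leave them as-is.
            continue
    return table


# Built once; str.translate accepts a dict keyed by code point.
C1_TO_CP1252 = build_c1_to_cp1252_table()


def fix_reserved_charrefs(text):
    """Replace literally extracted C1 characters with their CP1252 meaning."""
    return text.translate(C1_TO_CP1252)


print(fix_reserved_charrefs("hello, \x85 world!"))
```

A function like this could be called from the `data` method of the parser target in place of `charref_replace`; whether it is measurably faster on real pages has not been benchmarked here.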

Gallaecio added the enhancement label Sep 17, 2019

Gallaecio added the discuss label Sep 24, 2019
