Pages with Emoji fail to render on macOS (due to lxml bug) #3686

Open
LudovicRousseau opened this issue May 8, 2023 · 4 comments

Comments

@LudovicRousseau

Environment

Python Version:
Python 3.11.3 (main, Apr 7 2023, 19:29:16) [Clang 14.0.0 (clang-1400.0.29.202)] on darwin
Installed from Homebrew

Nikola Version:
Nikola 8.2.4

Operating System:
macOS Monterey 12.6.5

Description:

If I use an emoji such as 😺 in a .rst page, the generated HTML is broken.
Source unicode.rst page:

.. title: Unicode
.. slug: unicode
.. date: 2023-05-08 18:48:37 UTC+02:00
.. tags: 
.. category: 
.. link: 
.. description: 
.. type: text

😺

The generated html page contains:

[...]
</header><div class="e-content entry-content" itemprop="articleBody text">
    <p>h   t   m   l   &gt;   </p>
    </div>
[...]

And in the browser I see: "h t m l > " for the content of the post.

I have no problem with other non-ASCII characters, such as the accented letter "è".

Debian is OK

I then repeated the same steps on Debian GNU/Linux 12 (the next Debian stable release) and had no problem.
On Debian I use:

  • Python 3.11.2 (main, Mar 13 2023, 12:18:29) [GCC 12.2.0] on linux
  • Nikola 8.2.4

In both cases I use a venv.

Debug

I tried to debug this, but I am new to Nikola.
I ran nikola rst2html on the post.

On macOS I get:

$ nikola rst2html posts/2023/05/unicode.rst 

<!DOCTYPE html>
<html><body><p>!   D   O   C   T   Y   P   E       h   t   m   l   &gt;   
   </p></body></html>

Here again the result is correct if run on Debian.

Maybe the problem is in a dependency used by Nikola.

@Kwpolska
Member

Kwpolska commented May 8, 2023

This is most likely a bug in lxml. Please report it to the lxml project.

@LudovicRousseau
Author

If I use the program ./nikola/bin/rst2html.py (on macOS) to convert my .rst post, I have no problem.
I get:

[...]
<div class="document">


<!-- title: Unicode -->
<!-- slug: unicode -->
<!-- date: 2023-05-08 18:48:37 UTC+02:00 -->
<!-- tags: -->
<!-- category: -->
<!-- link: -->
<!-- description: -->
<!-- type: text -->
<p>😺</p>
</div>
</body>
</html>

I have no idea how lxml is used by Nikola.
Can you provide sample code using lxml that should fail, so that I can report the issue to lxml?

@Kwpolska
Member

Kwpolska commented May 9, 2023

Sure, here’s some sample code:

import lxml.html

# Minimal HTML document containing U+1F63A (the cat emoji from the report).
html = """<!DOCTYPE html>
<head><meta charset="utf-8"></head>
<body>
<h1>Hello, world!</h1>
<div>
<p>\U0001f63a</p>
</div>
</body>
</html>"""

# Parse the str, then serialize the tree back to UTF-8 bytes.
parser = lxml.html.HTMLParser(remove_blank_text=True)
doc = lxml.html.document_fromstring(html, parser)
data = lxml.html.tostring(
    doc,
    encoding='utf8',
    method='html',
    pretty_print=True,
    doctype='<!DOCTYPE html>',
)
print(data)

Can you reproduce the issue using this code on macOS? For reference, I get the following output on Windows and Linux:

b'<!DOCTYPE html>\n<html>\n<head><meta charset="utf-8"></head>\n<body>\n<h1>Hello, world!</h1>\n<div>\n<p>\xf0\x9f\x98\xba</p>\n</div>\n</body>\n</html>\n'
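
When reporting this upstream, it may help to attach the lxml and libxml2 versions together with a yes/no result instead of eyeballing the byte string. The following is only a sketch: the helper names `emoji_survived`, `roundtrip`, and `report` are made up for illustration, and the check simply looks for the UTF-8 bytes of the emoji in the serialized output.

```python
import sys

EMOJI = "\U0001f63a"  # the character used in the sample above


def emoji_survived(serialized: bytes, char: str = EMOJI) -> bool:
    """Return True if the UTF-8 bytes of `char` appear in the serialized HTML."""
    return char.encode("utf-8") in serialized


def roundtrip(html: str) -> bytes:
    """Parse `html` with lxml and serialize it back to UTF-8 bytes."""
    import lxml.html  # imported lazily so emoji_survived() works without lxml

    parser = lxml.html.HTMLParser(remove_blank_text=True)
    doc = lxml.html.document_fromstring(html, parser)
    return lxml.html.tostring(doc, encoding="utf8", method="html")


def report() -> None:
    """Print version details and whether the emoji survived the round trip."""
    import lxml.etree

    print("platform:", sys.platform)
    print("lxml:", lxml.etree.LXML_VERSION)
    print("libxml2 (runtime):", lxml.etree.LIBXML_VERSION)
    print("libxml2 (compiled):", lxml.etree.LIBXML_COMPILED_VERSION)
    data = roundtrip(f"<!DOCTYPE html><html><body><p>{EMOJI}</p></body></html>")
    print("emoji survived:", emoji_survived(data))
```

Calling `report()` inside the same venv that Nikola uses prints the details to paste into the bug report.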

@LudovicRousseau
Author

Bingo!
On macOS I get:

>>> print(data)
b'<!DOCTYPE html>\n<html><body><p>!   D   O   C   T   Y   P   E       h   t   m   l   &gt;   \n   </p></body></html>\n'

I reported the lxml issue at https://bugs.launchpad.net/lxml/+bug/2019038
Thanks
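
One possible reading of the garbled output, not confirmed anywhere in this thread: the spacing looks like the input string laid out at four bytes per character, with the three zero bytes after each ASCII character ending up as whitespace. CPython stores a str containing a character outside the Basic Multilingual Plane (such as 😺) at four bytes per character, while a str whose characters all fit in Latin-1 (such as "è") uses one byte per character, which would fit the observation that only the emoji triggers the problem. The sketch below only imitates that pattern with UTF-32-LE; the function name `simulate_misread` is hypothetical, and whether lxml or libxml2 actually does this on macOS is an assumption.

```python
def simulate_misread(text: str) -> str:
    """Encode `text` as UTF-32-LE, then show the bytes as if each byte were
    one character, with NUL bytes displayed as spaces."""
    raw = text.encode("utf-32-le")  # 4 bytes per character, low byte first
    return raw.replace(b"\x00", b" ").decode("latin-1")


print(repr(simulate_misread("html>")))
# prints 'h   t   m   l   >   '
```

Applied to "<!DOCTYPE html>", the same function yields seven spaces between "E" and "h", which is the gap seen in the macOS output above.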

@Kwpolska changed the title from "Some Unicode characters (like 😺) breaks the HTML page generation" to "Pages with Emoji fail to render on macOS" on May 9, 2023
@Kwpolska changed the title from "Pages with Emoji fail to render on macOS" to "Pages with Emoji fail to render on macOS (due to lxml bug)" on May 9, 2023
@Kwpolska pinned this issue on May 9, 2023