Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider switching from lxml's clean_html for enhanced security (and possibly performance) #209

Open
frenzymadness opened this issue Aug 30, 2023 · 7 comments

Comments

@frenzymadness
Copy link

I'd like to bring to your attention that we are discussing the possibility of removing lxml's clean_html functionality from lxml library. Over the past years, there have been several concerning security vulnerabilities discovered within the lxml library's clean_html functionality – CVE-2021-43818, CVE-2021-28957, CVE-2020-27783, CVE-2018-19787 and CVE-2014-3146.

The main problem is in the design. Because the lxml's clean_html functionality is based on a blocklist, it's hard to keep it up to date with all new possibilities in HTML and JS.

Two viable alternatives worth considering are bleach and nh3. Here's why:

bleach:

  • Bleach is a widely adopted Python library specifically designed for sanitizing and cleaning HTML input.
  • It has a strong track record in terms of security – it's allowed-list-based.
  • It was deprecated in January but it will still receive security updates, support for new Pythons and bugfixes, see upstream issue.

nh3:

  • nh3 is Python binding for the ammonia library. Ammonia is written in Rust and it's also allowed-list-based.
  • Thanks to the Rust backend, nh3 is also significantly faster than bleach.
  • Rust backend is nothing to be afraid of. nh3 uses the latest PyO3 compatible with Python 3.12 and provides wheels built on top of compatible ABI for different architectures and platforms.

We'll probably move the cleaning part of the lxml to a distinct project first so it will still be possible to use it but better is to find a suitable alternative sooner rather than later.

Let me know if we can help you with this transition anyhow and have a nice day.

@lopuhin
Copy link
Member

lopuhin commented Aug 30, 2023

Thanks again @frenzymadness , it looks as if the only usage of cleaner here is in relation with html_text library, so we could actually remove the usage of cleaner here and solve the issue in html_text in TeamHG-Memex/html-text#30

clean_node = cleaner.clean_html(node)
return html_text.etree_to_text(clean_node)

@westurner
Copy link

Could/should this be a separate component on PyPI that just wraps all of the methods with docs on the default?

@westurner
Copy link

Who builds LXML and nh3?

With e.g. cibuildwheel?

And are there gpg signatures of the package archive and its manifest of installable package files' checksums?

@frenzymadness
Copy link
Author

Could/should this be a separate component on PyPI that just wraps all of the methods with docs on the default?

The HTML clean functionality is now a separate project on PyPI: https://pypi.org/project/lxml-html-clean/ Next step is to make lxml itself use it.

Who builds LXML and nh3?

LXML uses cibuildwheel in its github workflows. nh3 also provides wheels but I'm not sure how they build them.

And are there gpg signatures of the package archive and its manifest of installable package files' checksums?

I don't know.

@westurner
Copy link

westurner commented Feb 27, 2024 via email

@frenzymadness
Copy link
Author

There's also https://github.com/mozilla/bleach

Yes, but it's deprecated and in maintenance-only mode (which might be enough).

@frenzymadness
Copy link
Author

Just an update on this. The latest version of lxml (5.2.0) no longer contains the HTML cleaner. Its code is now available as a dedicated project on GitHub and PyPI.

If you want to continue using it, you can either depend on lxml[html_clean] or on lxml_html_clean directly. lxml contains backward-compatible import so there is nothing else you need to change than the dependency.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants