Module for automatic summarization of text documents and HTML pages.
Updated May 6, 2024 - Python
Reworked https://www.readability.com/ parsing library (https://mercury.postlight.com/ is now the living alternative)
Automatically extract the main text content (and more) from an HTML document
Extracts the main text from HTML, intended for news web pages.
PHP library that determines which CSS is used by HTML snippets.
Go package that cleans an HTML page for better readability.
Media Graper is an open source tool for Linux that extracts all the images, links, and videos from a web page.
A simple extractor based on BeautifulSoup. You can use it to iterate through all the HTML files in the website root directory and get the text, placeholders, and other text.