SueChaplain/html2text2

 
 


html2text is a Python script that converts a page of HTML into clean, easy-to-read plain ASCII text. Better yet, that ASCII also happens to be valid Markdown (a text-to-HTML format).

Usage: html2text.py [(filename|url) [encoding]]

| Option | Description |
| --- | --- |
| --version | Show program's version number and exit |
| -h, --help | Show this help message and exit |
| --ignore-links | Don't include any formatting for links |
| --ignore-images | Don't include any formatting for images |
| -g, --google-doc | Convert an HTML-exported Google Document |
| -d, --dash-unordered-list | Use a dash rather than a star for unordered list items |
| -b BODY_WIDTH, --body-width=BODY_WIDTH | Number of characters per output line, 0 for no wrap |
| -i LIST_INDENT, --google-list-indent=LIST_INDENT | Number of pixels Google indents nested lists |
| -s, --hide-strikethrough | Hide strike-through text. Only relevant when -g is specified as well |
| --escape-all | Escape all special characters. Output is less readable, but avoids corner-case formatting issues |

Or you can use it from within Python:

import html2text
print(html2text.html2text("<p>Hello, world.</p>"))

Or with some configuration options:

import html2text
h = html2text.HTML2Text()
h.ignore_links = True
print(h.handle("<p>Hello, <a href='http://earth.google.com/'>world</a>!"))
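
To give a feel for the kind of transformation involved, and for what an option like ignore_links toggles, here is a toy sketch built only on the standard library's html.parser. It is not html2text's implementation: the names TinyHTML2Text and tiny_html2text are hypothetical, it assumes Python 3, and it handles only a handful of tags.

```python
from html.parser import HTMLParser


class TinyHTML2Text(HTMLParser):
    """Toy converter (hypothetical): handles only <p>, <a>, <em>, <strong>."""

    def __init__(self, ignore_links=False):
        super().__init__()
        self.ignore_links = ignore_links
        self.out = []    # output fragments, joined at the end
        self.hrefs = []  # stack of href values for currently open <a> tags

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.append(dict(attrs).get("href"))
            if not self.ignore_links:
                self.out.append("[")
        elif tag == "em":
            self.out.append("_")
        elif tag == "strong":
            self.out.append("**")

    def handle_endtag(self, tag):
        if tag == "a" and self.hrefs:
            href = self.hrefs.pop()
            if not self.ignore_links:
                self.out.append("](%s)" % (href or ""))
        elif tag == "em":
            self.out.append("_")
        elif tag == "strong":
            self.out.append("**")
        elif tag == "p":
            # A closed paragraph becomes a blank line in Markdown.
            self.out.append("\n\n")

    def handle_data(self, data):
        self.out.append(data)

    def convert(self, html):
        self.feed(html)
        self.close()  # flush any text still buffered by the parser
        return "".join(self.out).strip() + "\n"


def tiny_html2text(html, ignore_links=False):
    """Convert a small HTML snippet to Markdown-like text (toy sketch)."""
    return TinyHTML2Text(ignore_links=ignore_links).convert(html)


snippet = "<p>Hello, <a href='http://earth.google.com/'>world</a>!"
print(tiny_html2text(snippet))
print(tiny_html2text(snippet, ignore_links=True))
```

With links enabled the anchor is written in Markdown's [text](url) form; with ignore_links set, only the link text survives. The real library covers far more (lists, images, wrapping, escaping), so treat this purely as an illustration of the idea.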

Originally written by Aaron Swartz. This code is distributed under the GPLv3.

How to install

html2text is available on PyPI: https://pypi.python.org/pypi/html2text

$ pip install html2text

How to run unit tests

PYTHONPATH=$PYTHONPATH:. coverage run --source=html2text setup.py test -v
