ArticleParse

ArticleParse is a Python library that strips boilerplate HTML from news pages and uses heuristic analysis to identify the body of the article. It ranks the text sections of a page by how likely each one is to be news content.
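
As a rough illustration of the first step (splitting a page into candidate text sections while dropping obvious boilerplate), here is a minimal sketch that uses only the standard library's `html.parser`. It is not ArticleParse's actual implementation: the names `SectionCollector` and `extract_sections` and the `SKIP_TAGS` and `BLOCK_TAGS` sets are assumptions made for this example.

```python
from html.parser import HTMLParser

# Assumption: tags treated as boilerplate vs. section boundaries.
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}
BLOCK_TAGS = {"p", "div", "article", "section", "td", "li"}


class SectionCollector(HTMLParser):
    """Collect visible text per block element, skipping boilerplate tags."""

    def __init__(self):
        super().__init__()
        self.sections = []    # finished sections as (text, anchor_count)
        self._buf = []        # text fragments of the section being built
        self._anchors = 0     # <a> tags seen in the current section
        self._skip_depth = 0  # > 0 while inside a boilerplate tag

    def _flush(self):
        # Join fragments and collapse runs of whitespace.
        text = " ".join("".join(self._buf).split())
        if text:
            self.sections.append((text, self._anchors))
        self._buf, self._anchors = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif self._skip_depth == 0:
            if tag in BLOCK_TAGS:
                self._flush()
            elif tag == "a":
                self._anchors += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in BLOCK_TAGS and self._skip_depth == 0:
            self._flush()

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._buf.append(data)


def extract_sections(html):
    """Return a list of (text, anchor_count) tuples in page order."""
    parser = SectionCollector()
    parser.feed(html)
    parser.close()
    parser._flush()  # emit any trailing text outside a block element
    return parser.sections
```

Each tuple's index in the returned list can serve as the section's position on the page. Real news pages are much messier than this sketch assumes, so treat the tag sets as a starting point only.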

The analysis currently uses the following features of each section:

Section Length
Section Position
Number of Anchors in a Section
Anchor Density in a Section
Word Count
Uppercase Word Count
Average Word Length
Average Sentence Length
Number of Sentences
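
To make the features above concrete, the following sketch computes them for a single text section. It is an illustration under stated assumptions, not the library's code: `SectionFeatures` and `section_features` are hypothetical names, anchor density is assumed to mean anchor-text characters divided by total characters, and sentences are split with a crude regular expression. How the library actually weights these features when ranking is not stated here, so no scoring is shown.

```python
import re
from dataclasses import dataclass


@dataclass
class SectionFeatures:
    length: int                 # section length in characters
    position: int               # index of the section within the page
    anchor_count: int           # number of anchors in the section
    anchor_density: float       # anchor-text chars / total chars (assumed)
    word_count: int
    uppercase_word_count: int   # words written entirely in uppercase
    avg_word_length: float
    avg_sentence_length: float  # words per sentence
    sentence_count: int


def section_features(text, position, anchor_texts=()):
    """Compute the listed features for one section.

    text: visible text of the section
    position: index of the section within the page
    anchor_texts: visible text of each <a> element inside the section
    """
    words = re.findall(r"[A-Za-z0-9']+", text)
    # Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    anchor_chars = sum(len(a) for a in anchor_texts)
    return SectionFeatures(
        length=len(text),
        position=position,
        anchor_count=len(anchor_texts),
        anchor_density=anchor_chars / len(text) if text else 0.0,
        word_count=len(words),
        uppercase_word_count=sum(1 for w in words if w.isupper()),
        avg_word_length=sum(map(len, words)) / len(words) if words else 0.0,
        avg_sentence_length=len(words) / len(sentences) if sentences else 0.0,
        sentence_count=len(sentences),
    )
```

For example, `section_features("NASA launched a rocket today. It reached orbit!", 0, ["rocket"])` describes a two-sentence section containing one anchor. Navigation blocks tend to show high anchor density and short sentences, which is why features like these are useful for telling them apart from article text.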

This is a work in progress. I have manually tested it on several news websites, but extensive testing still needs to be performed.

Supports Python 3.

bmoscon/ArticleParse

Folders and files

Latest commit

History

Repository files navigation

ArticleParse

About

Topics

Resources

License

Stars

Watchers

Forks

Languages