Releases: adbar/trafilatura
v1.4.1
Extraction:
- extraction bugs fixed (#263, #266), more robust HTML doctype parsing
- XML output improvements by @knit-bee (#273, #274)
- adjust thresholds for link density in paragraphs
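
Link density here refers to the share of a paragraph's text that sits inside links. As a rough illustration of the idea only, and not trafilatura's actual code, the sketch below computes that ratio with the standard library. The function names and the 0.5 cutoff are hypothetical and chosen purely for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical cutoff for illustration only; not a value taken from trafilatura.
LINK_DENSITY_THRESHOLD = 0.5


def link_density(paragraph_markup: str) -> float:
    """Return the share of a paragraph's text that sits inside <a> elements."""
    element = ET.fromstring(paragraph_markup)
    total = len("".join(element.itertext()).strip())
    if total == 0:
        return 0.0
    linked = sum(len("".join(a.itertext()).strip()) for a in element.iter("a"))
    return linked / total


def is_link_heavy(paragraph_markup: str) -> bool:
    """Flag paragraphs that consist mostly of link text, e.g. navigation menus."""
    return link_density(paragraph_markup) > LINK_DENSITY_THRESHOLD


prose = "<p>Read the <a href='/docs'>docs</a> for a full introduction to the tool.</p>"
menu = "<p><a href='/a'>Home</a> <a href='/b'>About</a> <a href='/c'>Contact</a></p>"

print(is_link_heavy(prose))
print(is_link_heavy(menu))
```

Raising the threshold keeps more link-rich paragraphs in the output, while lowering it discards them more aggressively.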
Metadata:
- improved title and sitename detection (#284)
- faster author, categories, domain name, and tags extraction
- fixes to author emoji regexes by @felipehertzer (#269)
Command-line interface:
- review argument consistency and add deprecation warnings (#261)
Setup:
- make download timeout configurable (#263)
- updated dependencies, use of faust-cchardet for Python 3.11
Full Changelog: v1.4.0...v1.4.1
trafilatura-1.4.0
Impact on extraction and output format:
- better extraction (#233, #243 & #250 with @knit-bee, #246 with @mrienstra, #258)
- XML: preserve list type as attribute (#229)
- XML TEI: better conformity with @knit-bee (#238, #242, #253, #254)
- faster text cleaning and shorter code (#237 with @deedy5, #245)
- metadata: add language when detector is activated (#224)
- metadata: extend fallbacks and test coverage for json_metadata functions by @felipehertzer (#235)
- TXT: change markdown formatting of headers by @LaundroMat (#257)
Smaller changes in convenience functions:
- add function to clear caches (#219)
- CLI: change exit code if download fails (#223)
- settings: use "\n" for multiple user agents by @k-sareen (#241)
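
To illustrate the newline-separated user agents, here is a minimal sketch that reads such a value with Python's configparser. The option name USER_AGENTS, its placement in the DEFAULT section, and the agent strings are assumptions made for the example; check the settings file shipped with your version for the real layout.

```python
import configparser

# Assumed layout: one user agent per line under a single option.
SETTINGS_TEXT = """
[DEFAULT]
USER_AGENTS =
    ExampleBot/1.0 (+https://example.org/bot)
    Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/2.0
"""


def read_user_agents(settings_text: str) -> list:
    """Split a multi-line option value into a list of user-agent strings."""
    config = configparser.ConfigParser()
    config.read_string(settings_text)
    raw = config.get("DEFAULT", "USER_AGENTS", fallback="")
    # configparser joins continuation lines with "\n"; drop empty entries.
    return [line.strip() for line in raw.split("\n") if line.strip()]


user_agents = read_user_agents(SETTINGS_TEXT)
print(user_agents)
```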
Updates:
- docs updated (including #244 by @dsgibbons)
- package dependencies updated
Full Changelog: v1.3.0...v1.4.0
trafilatura-1.3.0
- fast and robust html2txt() function added (#221)
- more robust parsing (#228)
- fixed bugs in metadata extraction, with @felipehertzer in #213 & #226
- extraction about 10-20% faster, slightly better recall
- partial fixes for memory leaks (#216)
- docs extended and updated (#217, #225)
- prepared deprecation of old process_record() function
- more stable processing with updated dependencies
Full Changelog: v1.2.2...v1.3.0
trafilatura-1.2.2
- more efficient rules for extraction
- metadata: further attributes used (with @felipehertzer)
- better baseline extraction
- issues fixed: #202, #204, #205
- evaluation updated
Full Changelog: v1.2.1...v1.2.2
trafilatura-1.2.1
What's Changed
- --precision and --recall arguments added to the CLI
- better text cleaning: paywalls and comments
- improvements for Chinese websites (with @glacierck & @immortal-autumn): #186, #187, #188
- further bugs fixed: #189, #192 (with @felipehertzer), #200
- efficiency: faster module loading and improved RAM footprint
Full Changelog: v1.2.0...v1.2.1
trafilatura-1.2.0
- efficiency: replaced module readability-lxml by trimmed fork
- bugs fixed (#179, #180, #183, #184)
- improved baseline extraction
- cleaner metadata (with @felipehertzer)
Full Changelog: v1.1.0...v1.2.0
trafilatura-1.1.0
- encodings: better detection, output NFC-normalized Unicode
- maintenance and performance: more efficient code
- bugs fixed (#119, #136, #147, #160, #161, #162, #164, #167 and others)
- prepare compatibility with upcoming Python 3.11
- changed default settings
- extended documentation
Full Changelog: v1.0.0...v1.1.0
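
For readers unfamiliar with NFC normalization, the following sketch shows what it does to a string, using only the standard library. It illustrates the Unicode form mentioned above and is not the library's own code; the helper name to_nfc is hypothetical.

```python
import unicodedata


def to_nfc(text: str) -> str:
    """Return the NFC (canonically composed) form of a string."""
    return unicodedata.normalize("NFC", text)


# "e" followed by a combining acute accent: two code points for one glyph.
decomposed = "cafe\u0301"
composed = to_nfc(decomposed)

# After NFC, the pair is merged into the single code point U+00E9.
print(len(decomposed), len(composed))
```

Normalized output means that visually identical strings compare equal, which matters for deduplication and search.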
v1.0.0
- compress HTML backup files & seamlessly open .gz files
- support JSON web feeds
- graphical user interface integrated into main package
- faster downloads: reviewed backoff, compressed data
- optional modules: downloads with pycurl, language identification with py3langid
- bugs fixed (#111, #125, #132, #136, #140)
- minor optimizations and fixes by @vbarbaresi in #124 & #130
- fixed arrays with single or multiple entries in the JSON extractor by @felipehertzer in #143
- code base refactored with @sourcery-ai #121, improved and optimized for Python 3.6+
- drop support for Python 3.5
Full Changelog: v0.9.3...v1.0.0
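
One common way to read plain and gzip-compressed HTML backups through the same code path is to check the file's first two bytes. The sketch below shows that general technique with the standard library; it is not trafilatura's implementation, the function names are hypothetical, and UTF-8 is assumed as the file encoding.

```python
import gzip
from pathlib import Path

# Every gzip stream starts with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"


def write_backup(path: str, html: str) -> None:
    """Store an HTML document as a gzip-compressed backup file."""
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        handle.write(html)


def read_html(path: str) -> str:
    """Read an HTML file, decompressing it first if it is gzip data."""
    raw = Path(path).read_bytes()
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    # Assumption for this sketch: the content is UTF-8 encoded.
    return raw.decode("utf-8")
```

Checking the magic bytes rather than the file extension means a compressed file is still handled correctly if it has been renamed.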
trafilatura-0.9.3
- better, faster encoding detection: replaced chardet with charset_normalizer
- faster execution: updated justext to 3.0
- better extraction of sub-elements in tables (#78, #90)
- more robust web feed parsing
- further defined precision- and recall-oriented settings
- license extraction in footers (#118)
Full Changelog: v0.9.2...v0.9.3
trafilatura-0.9.2
- first precision- and recall-oriented presets defined
- improvements in authorship extraction (thanks @felipehertzer)
- requesting TXT output with formatting now results in Markdown format
- bugs fixed: notably extraction robustness and consistency (#109, #111, #113)
- setting for cookies in request headers (thanks @muellermartin)
- better date extraction thanks to htmldate update