Releases: adbar/trafilatura
trafilatura-1.9.0
Extraction:
- add markdown as explicit output (#550)
- improve recall preset (#571)
- speedup for readability-lxml (#547)
- add global options object for extraction and use it in CLI (#552)
- fix: better encoding detection (#548)
- recall: fix for lists inside tables with @mikhainin (#534)
- add symbol to preserve vertical spacing in Markdown (#499)
- fix: table cell separators in non-XML output (#563)
- slightly better accuracy and execution speed overall
Metadata:
- add file creation date (date extraction, JSON & XML-TEI) (#561)
- fix: empty content in meta tag by @felipehertzer (#545)
Maintenance:
- restructure and simplify code (#543, #556)
- CLI & downloads: revamp and use global options (#565)
- eval: review code, add guidelines and small benchmark (#542)
- fix: raise error if config file does not exist (#554)
- deprecate process_record() (#549)
- docs: convert readme to markdown and update info (#564, #578)
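The config-file fix listed under Maintenance (#554) addresses a common pitfall: Python's standard `configparser.ConfigParser.read()` silently skips files that do not exist, so a mistyped path can go unnoticed. The following is a minimal sketch of an explicit check, assuming a standard-library parser; `load_config` is a hypothetical helper written for illustration, not trafilatura's own code.

```python
import configparser
from pathlib import Path


def load_config(path):
    """Hypothetical helper: parse an INI-style config, failing loudly if the file is missing."""
    config_path = Path(path)
    # ConfigParser.read() would return an empty list for a missing file
    # instead of raising, so check for the file explicitly first.
    if not config_path.is_file():
        raise FileNotFoundError(f"config file not found: {config_path}")
    config = configparser.ConfigParser()
    config.read(config_path, encoding="utf-8")
    return config
```

Raising early keeps a wrong path from being mistaken for an intentionally empty configuration.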
trafilatura-1.8.1
trafilatura-1.8.0
Extraction:
- Better precision by @felipehertzer (#509, #520)
- Code formatting in TXT/Markdown output added (#498)
- Improved CSV output (#496)
- LXML: compile XPath expressions (#504)
- Overall speedup of about 5%
Downloads and Navigation:
- More robust scans with is_live_page() (#501)
- Better sitemap start and safeguards (#503, #506)
- Fix for headers in response object (#513)
Maintenance:
trafilatura-1.7.0
trafilatura-1.6.4
Maintenance:
- MacOS: fix setup, update htmldate and add tests (#460)
- drop invalid XML element attributes with @vbarbaresi in #462
- remove cyclic imports (#458)
Navigation:
- introduce MAX_REDIRECTS config setting and fix urllib3 redirect handling by @vbarbaresi in #461
- improve feed detection (#457)
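As a rough illustration of the MAX_REDIRECTS setting introduced above, the sketch below reads such a key from an INI-style fragment with the standard library. Only the key name comes from these notes; the `[DEFAULT]` section and the value `2` are assumptions chosen for the example.

```python
import configparser

# Illustrative settings fragment: the section name and the value are
# assumptions, only the MAX_REDIRECTS key name is taken from the notes.
SETTINGS = """
[DEFAULT]
MAX_REDIRECTS = 2
"""

config = configparser.ConfigParser()
config.read_string(SETTINGS)

# Option names are matched case-insensitively by configparser.
max_redirects = config.getint("DEFAULT", "MAX_REDIRECTS")
print(max_redirects)
```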
Documentation:
trafilatura-1.6.3
Extraction:
- preserve space in certain elements with @idoshamun (#429)
- optional list of xPaths to prune by @HeLehm (#414)
Metadata:
- more precise date extraction (see htmldate)
- new htmldate extensive search parameter in config (#434)
- changes in URLs: normalization, trackers removed (see courlan)
Navigation:
Documentation:
trafilatura-1.6.2
Extraction:
- more lenient HTML parsing (#370)
- improved code block support with @idoshamun (#372, #401)
- conversion of relative links to absolute by @feltcat (#377)
- remove use of signal from core functions (#384)
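The relative-to-absolute link conversion listed above (#377) can be pictured with the standard library's `urljoin`. This is only a sketch of the general technique, not the code used in trafilatura; `absolutize` and the example URLs are hypothetical.

```python
from urllib.parse import urljoin


def absolutize(base_url, href):
    """Hypothetical helper: resolve a link found on a page against the page URL."""
    # urljoin leaves absolute links unchanged and resolves relative ones
    # (including "../" segments and root-relative paths) against the base.
    return urljoin(base_url, href)


page_url = "https://example.org/blog/post.html"
for link in ("../img/a.png", "/about", "https://other.example/x"):
    print(absolutize(page_url, link))
```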
Metadata:
- JSON-LD fix for sitenames by @felipehertzer (#383)
Command-line interface:
- more robust batch processing (#381)
- added --probe option to CLI to check for extractable content (#378, #392)
Maintenance:
- simplified code (#408)
- support for Python 3.12
- pinned LXML version for MacOS (#393)
- updated dependencies and parameters (notably htmldate and courlan)
- code cleaning by @marksmayo (#406)
trafilatura-1.6.1
Extraction:
Metadata:
- simplify and fully test JSON parsing code, with @felipehertzer (#352, #368)
- authors, JSON and unicode fixes by @felipehertzer in #365
- fix for authors without additionalName by @awwitecki in #363
Navigation:
- reviewed link processing in feeds and sitemaps (#340, #350)
- more robust spider (#359)
- updated underlying courlan package (#360)
Full Changelog: v1.6.0...v1.6.1
trafilatura-1.6.0
Extraction:
- new content hashes and default file names (#314)
- fix deprecation warning with @sdondley in #321
- fix for metadata image by @andremacola in #328
- fix potential unicode issue in third-party extraction with @Korben00 in #331
- review logging levels (#347)
Command-line interface:
- more efficient sitemap processing (#326)
- more efficient downloads (#338)
- fix for single URL processing (#324) and URL blacklisting (#339)
Navigation:
- additional safety check on domain similarity for feeds and sitemaps
- new function is_live test() using HTTP HEAD request (#327)
- code parts supported by new courlan version
Maintenance:
- allow urllib3 version 2.0+
- minor code simplification and fixes
Full Changelog: v1.5.0...v1.6.0
trafilatura-1.5.0
Extraction:
- fixes for metadata extraction with @felipehertzer (#295, #296), @andremacola (#282, #310), and @edkrueger (#303)
- pagetype and image urls added to metadata by @andremacola (#282, #310)
- add as_dict method to Document class with @edkrueger in #306
- XML output fix with @knit-bee in #315
- various smaller fixes: lists (#309), XPaths, metadata hardening
Navigation:
Maintenance:
- simplify code and extend tests
- underlying packages htmldate and courlan, update setup and docs
Full Changelog: v1.4.1...v1.5.0