Releases: adbar/trafilatura

trafilatura-1.9.0

02 May 10:18
11255bd

Extraction:

  • add markdown as explicit output (#550)
  • improve recall preset (#571)
  • speedup for readability-lxml (#547)
  • add global options object for extraction and use it in CLI (#552)
  • fix: better encoding detection (#548)
  • recall: fix for lists inside tables with @mikhainin (#534)
  • add symbol to preserve vertical spacing in Markdown (#499)
  • fix: table cell separators in non-XML output (#563)
  • slightly better accuracy and execution speed overall

Metadata:

  • add file creation date (date extraction, JSON & XML-TEI) (#561)
  • fix: empty content in meta tag by @felipehertzer (#545)

Maintenance:

  • restructure and simplify code (#543, #556)
  • CLI & downloads: revamp and use global options (#565)
  • eval: review code, add guidelines and small benchmark (#542)
  • fix: raise error if config file does not exist (#554)
  • deprecate process_record() (#549)
  • docs: convert readme to markdown and update info (#564, #578)
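
The config-file fix (#554) can be illustrated with a minimal standard-library sketch. `load_config` is a hypothetical helper, not trafilatura's own function, and `FileNotFoundError` is an assumed choice since the note does not name the exception. The explicit check matters because `ConfigParser.read()` silently skips files it cannot open.

```python
from configparser import ConfigParser
from pathlib import Path


def load_config(path):
    """Parse an INI-style config file, failing loudly if it is missing.

    Hypothetical helper for illustration only.
    """
    config_path = Path(path)
    # ConfigParser.read() silently ignores unreadable or missing files,
    # so the existence check has to be explicit.
    if not config_path.is_file():
        raise FileNotFoundError(f"config file not found: {config_path}")
    config = ConfigParser()
    config.read(config_path, encoding="utf-8")
    return config
```

Without the check, a mistyped path would quietly fall back to default values, which is hard to notice in batch runs.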

trafilatura-1.8.1

03 Apr 11:47
d9d47a7

Maintenance:

  • Pin LXML to prevent broken dependency (#535)

Extraction:

  • Improve extraction accuracy for major news outlets (#530)
  • Fix formatting by correcting order of element generation and space handling with @dlwh (#528)
  • Fix: prevent tail insertion before children in nested elements by @knit-bee (#536)

trafilatura-1.8.0

20 Mar 15:24
ff38644

Extraction:

  • Better precision by @felipehertzer (#509, #520)
  • Code formatting in TXT/Markdown output added (#498)
  • Improved CSV output (#496)
  • LXML: compile XPath expressions (#504)
  • Overall speedup of about 5%

Downloads and Navigation:

  • More robust scans with is_live_page() (#501)
  • Better sitemap start and safeguards (#503, #506)
  • Fix for headers in response object (#513)

Maintenance:

  • License changed to Apache 2.0
  • Response class: convenience functions added (#497)
  • lxml.html.Cleaner removed (#491)
  • CLI fixes: parallel cores and processing (#524)

trafilatura-1.7.0

25 Jan 13:05
97dc088

Extraction:

  • improved html2txt() function (#483)

Downloads:

  • add advanced fetch_response() function
    → pending deprecation for fetch_url(decode=False)

Maintenance:

trafilatura-1.6.4

08 Jan 14:33
85cd3d8

Maintenance:

  • macOS: fix setup, update htmldate and add tests (#460)
  • drop invalid XML element attributes with @vbarbaresi in #462
  • remove cyclic imports (#458)

Navigation:

  • introduce MAX_REDIRECTS config setting and fix urllib3 redirect handling by @vbarbaresi in #461
  • improve feed detection (#457)
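
As a rough illustration of how a setting such as MAX_REDIRECTS can cap redirects, the sketch below reads the value from an INI fragment and passes it to a redirect handler. It uses the standard library's `urllib` rather than urllib3, which the note refers to. The `[DEFAULT]` section, the value `2`, and the `CappedRedirectHandler` class are assumptions made for the example.

```python
import urllib.request
from configparser import ConfigParser

# Hypothetical settings fragment: only the MAX_REDIRECTS name comes from
# the release note; the section name and value are assumptions.
SETTINGS = """
[DEFAULT]
MAX_REDIRECTS = 2
"""


class CappedRedirectHandler(urllib.request.HTTPRedirectHandler):
    """Redirect handler whose limit comes from configuration."""

    def __init__(self, max_redirects):
        # urllib consults this attribute when following redirects.
        self.max_redirections = max_redirects


config = ConfigParser()
config.read_string(SETTINGS)
max_redirects = config.getint("DEFAULT", "MAX_REDIRECTS")

# The custom handler replaces urllib's default redirect handler.
opener = urllib.request.build_opener(CappedRedirectHandler(max_redirects))
```

Requests sent through `opener.open(url)` would then raise an `HTTPError` once the configured redirect limit is reached.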

Documentation:

  • enhancements to documentation and testing with @Maddesea in #456

trafilatura-1.6.3

29 Nov 13:42
e7b5723

Extraction:

Metadata:

  • more precise date extraction (see htmldate)
  • new htmldate extensive search parameter in config (#434)
  • changes in URLs: normalization, trackers removed (see courlan)

Navigation:

  • reviewed code for feeds (#443)
  • new config option: external URLs for feeds/sitemaps (#441)

Documentation:

trafilatura-1.6.2

06 Sep 15:45
5ce31d9

Extraction:

  • more lenient HTML parsing (#370)
  • improved code block support with @idoshamun (#372, #401)
  • conversion of relative links to absolute by @feltcat (#377)
  • remove use of signal from core functions (#384)
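
The conversion of relative links to absolute ones (#377) can be sketched with the standard library alone. This is an illustration of the general technique, not the code used in trafilatura; `absolutize` and the example URLs are hypothetical.

```python
from urllib.parse import urljoin


def absolutize(base_url, links):
    """Resolve each link against base_url; absolute links pass through unchanged."""
    return [urljoin(base_url, link) for link in links]


page_url = "https://example.org/blog/post.html"
links = ["/about", "images/fig1.png", "../index.html", "https://other.example/x"]

for resolved in absolutize(page_url, links):
    print(resolved)
# https://example.org/about
# https://example.org/blog/images/fig1.png
# https://example.org/index.html
# https://other.example/x
```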

Metadata:

Command-line interface:

  • more robust batch processing (#381)
  • added --probe option to CLI to check for extractable content (#378, #392)

Maintenance:

  • simplified code (#408)
  • support for Python 3.12
  • pinned LXML version for macOS (#393)
  • updated dependencies and parameters (notably htmldate and courlan)
  • code cleaning by @marksmayo (#406)

trafilatura-1.6.1

15 Jun 12:59
d85d584

Extraction:

  • minor fixes: tables in figures (#301), headings (#354) and lists (#318)

Metadata:

Navigation:

  • reviewed link processing in feeds and sitemaps (#340, #350)
  • more robust spider (#359)
  • updated underlying courlan package (#360)

Full Changelog: v1.6.0...v1.6.1

trafilatura-1.6.0

11 May 11:00
0bce218

Extraction:

  • new content hashes and default file names (#314)
  • fix deprecation warning with @sdondley in #321
  • fix for metadata image by @andremacola in #328
  • fix potential unicode issue in third-party extraction with @Korben00 in #331
  • review logging levels (#347)

Command-line interface:

  • more efficient sitemap processing (#326)
  • more efficient downloads (#338)
  • fix for single URL processing (#324) and URL blacklisting (#339)

Navigation:

  • additional safety check on domain similarity for feeds and sitemaps
  • new function is_live_page() using HTTP HEAD request (#327)
  • code parts supported by new courlan version
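
The idea behind a liveness test based on an HTTP HEAD request (#327) can be sketched as follows. This uses the standard library's `urllib` and a hypothetical helper named `is_live`; it is not the library's implementation, and the status threshold and timeout are assumptions.

```python
import urllib.error
import urllib.request


def is_live(url, timeout=10):
    """Return True if a HEAD request to url gets a non-error response.

    Hypothetical helper for illustration only.
    """
    try:
        # HEAD asks for the headers only, so no page body is downloaded.
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status < 400
    except (urllib.error.URLError, OSError, ValueError):
        # Covers HTTP errors (4xx/5xx), DNS and connection failures,
        # and malformed URLs.
        return False
```

A HEAD request keeps the check cheap, though some servers answer HEAD differently from GET, so a negative result is a hint rather than proof.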

Maintenance:

  • allow urllib3 version 2.0+
  • minor code simplification and fixes

Full Changelog: v1.5.0...v1.6.0

trafilatura-1.5.0

30 Mar 16:11
2639b24

Extraction:

Navigation:

  • transfer URL management to courlan.UrlStore (#232, #312)
  • fixes for spider module

Maintenance:

  • simplify code and extend tests
  • underlying packages htmldate and courlan, update setup and docs

Full Changelog: v1.4.1...v1.5.0