Releases: adbar/trafilatura

v1.4.1

19 Jan 17:02

Extraction:

  • extraction bugs fixed (#263, #266), more robust HTML doctype parsing
  • XML output improvements by @knit-bee (#273, #274)
  • adjust thresholds for link density in paragraphs

Metadata:

  • improved title and sitename detection (#284)
  • faster author, categories, domain name, and tags extraction
  • fixes to author emoji regexes by @felipehertzer (#269)

Command-line interface:

  • review argument consistency and add deprecation warnings (#261)

Setup:

  • make download timeout configurable (#263)
  • updated dependencies, use of faust-cchardet for Python 3.11
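A configurable download timeout can be pictured as a settings-file entry that is read at startup. The sketch below uses only the standard library's `configparser`. The key name `DOWNLOAD_TIMEOUT`, its place in the `[DEFAULT]` section, and the helper `read_timeout` are assumptions made for illustration, so check the settings file shipped with your installed version.

```python
import configparser

# Hypothetical settings fragment; the key name is an assumption.
SETTINGS = """
[DEFAULT]
DOWNLOAD_TIMEOUT = 10
"""

def read_timeout(text, fallback=30):
    """Return the download timeout in seconds, or the fallback if unset."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return parser.getint("DEFAULT", "DOWNLOAD_TIMEOUT", fallback=fallback)

timeout = read_timeout(SETTINGS)
```

The fallback keeps the behaviour predictable when the key is missing from a custom settings file.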

Full Changelog: v1.4.0...v1.4.1

trafilatura-1.4.0

18 Oct 13:59

Impact on extraction and output format:

Smaller changes in convenience functions:

  • add function to clear caches (#219)
  • CLI: change exit code if download fails (#223)
  • settings: use "\n" for multiple user agents by @k-sareen (#241)
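Listing several user agents separated by "\n" maps naturally onto a multi-line value in an INI-style settings file. The following is a minimal sketch with the standard library's `configparser`, not the library's own parsing code. The key name `USER_AGENTS`, the agent strings, and the helper `read_user_agents` are hypothetical.

```python
import configparser

# Hypothetical settings fragment: one user agent per indented line.
SETTINGS = """
[DEFAULT]
USER_AGENTS =
    agent-one/1.0
    agent-two/2.0
"""

def read_user_agents(text):
    """Split a multi-line option into a list with one user agent per entry."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    raw = parser.get("DEFAULT", "USER_AGENTS", fallback="")
    # configparser joins continuation lines with "\n"; drop empty lines.
    return [line.strip() for line in raw.split("\n") if line.strip()]

agents = read_user_agents(SETTINGS)
```

An empty or missing option yields an empty list, which a caller can treat as "use the default user agent".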

Updates:

Full Changelog: v1.3.0...v1.4.0

trafilatura-1.3.0

29 Jul 14:42
  • fast and robust html2txt() function added (#221)
  • more robust parsing (#228)
  • fixed bugs in metadata extraction, with @felipehertzer in #213 & #226
  • extraction about 10-20% faster, slightly better recall
  • partial fixes for memory leaks (#216)
  • docs extended and updated (#217, #225)
  • prepared deprecation of old process_record() function
  • more stable processing with updated dependencies

Full Changelog: v1.2.2...v1.3.0

trafilatura-1.2.2

18 May 15:55
  • more efficient rules for extraction
  • metadata: further attributes used (with @felipehertzer)
  • better baseline extraction
  • issues fixed: #202, #204, #205
  • evaluation updated

Full Changelog: v1.2.1...v1.2.2

trafilatura-1.2.1

02 May 10:24

What's Changed

Full Changelog: v1.2.0...v1.2.1

trafilatura-1.2.0

07 Mar 11:49
  • efficiency: replaced module readability-lxml by trimmed fork
  • bugs fixed (#179, #180, #183, #184)
  • improved baseline extraction
  • cleaner metadata (with @felipehertzer)

Full Changelog: v1.1.0...v1.2.0

trafilatura-1.1.0

21 Feb 16:28
  • encodings: better detection, output NFC-normalized Unicode
  • maintenance and performance: more efficient code
  • bugs fixed (#119, #136, #147, #160, #161, #162, #164, #167 and others)
  • prepare compatibility with upcoming Python 3.11
  • changed default settings
  • extended documentation
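NFC normalization means that characters written as a base letter plus combining marks are composed into a single code point where one exists, so visually identical strings compare equal. A small illustration with the standard library, independent of how the package applies it internally:

```python
import unicodedata

# "é" written as 'e' followed by a combining acute accent (U+0301).
decomposed = "cafe\u0301"

# NFC composes the pair into the single code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

same_text = composed == "caf\u00e9"
```

After normalization the string has four code points instead of five, which matters for length checks, deduplication, and string comparison downstream.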

Full Changelog: v1.0.0...v1.1.0

v1.0.0

30 Nov 17:27
  • compress HTML backup files & seamlessly open .gz files
  • support JSON web feeds
  • graphical user interface integrated into main package
  • faster downloads: reviewed backoff, compressed data
  • optional modules: downloads with pycurl, language identification with py3langid
  • bugs fixed (#111, #125, #132, #136, #140)
  • minor optimizations and fixes by @vbarbaresi in #124 & #130
  • fixed handling of arrays with single or multiple entries in the JSON extractor by @felipehertzer in #143
  • code base refactored with @sourcery-ai (#121), improved and optimized for Python 3.6+
  • drop support for Python 3.5
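Opening compressed backups "seamlessly" can be approximated by checking for the gzip magic bytes before decoding. The helper `read_html_bytes` below is a hypothetical sketch that works on bytes in memory; it is not the package's actual implementation.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream

def read_html_bytes(data: bytes) -> str:
    """Decode HTML from raw bytes, decompressing first if they are gzipped."""
    if data[:2] == GZIP_MAGIC:
        data = gzip.decompress(data)
    return data.decode("utf-8", errors="replace")

html = "<html><body><p>Hello</p></body></html>"
backup = gzip.compress(html.encode("utf-8"))  # simulated compressed backup

restored = read_html_bytes(backup)
```

The same function accepts uncompressed input unchanged, so callers do not need to know how a backup was stored.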

Full Changelog: v0.9.3...v1.0.0

trafilatura-0.9.3

21 Oct 17:25
  • better, faster encoding detection: replaced chardet with charset_normalizer
  • faster execution: updated justext to 3.0
  • better extraction of sub-elements in tables (#78, #90)
  • more robust web feed parsing
  • further defined precision- and recall-oriented settings
  • license extraction in footers (#118)

Full Changelog: v0.9.2...v0.9.3

trafilatura-0.9.2

06 Oct 16:08
  • first precision- and recall-oriented presets defined
  • improvements in authorship extraction (thanks @felipehertzer)
  • requesting TXT output with formatting now results in Markdown format
  • bugs fixed: notably extraction robustness and consistency (#109, #111, #113)
  • setting for cookies in request headers (thanks @muellermartin)
  • better date extraction thanks to htmldate update