Releases · adbar/trafilatura

19 Jan 17:02

adbar

v1.4.1

14d9782

v1.4.1

Extraction:

extraction bugs fixed (#263, #266), more robust HTML doctype parsing
XML output improvements by @knit-bee (#273, #274)
adjust thresholds for link density in paragraphs
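As an illustration of the link-density idea mentioned above, the following is a minimal sketch using only the standard library. It is not trafilatura's implementation: the names `LinkDensityParser`, `link_density` and `is_link_heavy`, as well as the 0.5 threshold, are hypothetical and chosen for the example only.

```python
from html.parser import HTMLParser


class LinkDensityParser(HTMLParser):
    """Count text characters inside and outside <a> elements."""

    def __init__(self):
        super().__init__()
        self.link_depth = 0   # > 0 while inside at least one <a>
        self.link_chars = 0   # characters of text found inside links
        self.total_chars = 0  # characters of text overall

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.link_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.link_depth > 0:
            self.link_depth -= 1

    def handle_data(self, data):
        length = len(data.strip())
        self.total_chars += length
        if self.link_depth > 0:
            self.link_chars += length


def link_density(html_fragment):
    """Return the share of text characters that sit inside links (0.0 to 1.0)."""
    parser = LinkDensityParser()
    parser.feed(html_fragment)
    parser.close()
    if parser.total_chars == 0:
        return 0.0
    return parser.link_chars / parser.total_chars


def is_link_heavy(html_fragment, threshold=0.5):
    # The threshold is an illustrative assumption, not a value taken from trafilatura.
    return link_density(html_fragment) > threshold
```

A paragraph that consists mostly of anchor text (navigation, "related articles" lists) gets a density close to 1.0 and can be discarded, while running text with an occasional link stays well below the threshold.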

Metadata:

improved title and sitename detection (#284)
faster author, categories, domain name, and tags extraction
fixes to author emoji regexes by @felipehertzer (#269)

Command-line interface:

review argument consistency and add deprecation warnings (#261)

Setup:

make download timeout configurable (#263)
updated dependencies, use of faust-cchardet for Python 3.11

Full Changelog: v1.4.0...v1.4.1

Contributors

felipehertzer and knit-bee

18 Oct 13:59

adbar

v1.4.0

f9e35aa

trafilatura-1.4.0

Impact on extraction and output format:

better extraction (#233, #243 & #250 with @knit-bee, #246 with @mrienstra, #258)
XML: preserve list type as attribute (#229)
XML TEI: better conformity with @knit-bee (#238, #242, #253, #254)
faster text cleaning and shorter code (#237 with @deedy5, #245)
metadata: add language when detector is activated (#224)
metadata: extend fallbacks and test coverage for json_metadata functions by @felipehertzer (#235)
TXT: change markdown formatting of headers by @LaundroMat (#257)

Smaller changes in convenience functions:

add function to clear caches (#219)
CLI: change exit code if download fails (#223)
settings: use "\n" for multiple user agents by @k-sareen (#241)
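The newline-separated user agents can be pictured with a small standard-library sketch. It assumes an INI-style settings file with a `[DEFAULT]` section and a `USER_AGENTS` key; the sample content, the agent strings and the helper name `load_user_agents` are made up for illustration and may not match the actual settings file.

```python
from configparser import ConfigParser

# Hypothetical settings content: one user agent per line,
# written as an indented multi-line value.
SAMPLE = """
[DEFAULT]
DOWNLOAD_TIMEOUT = 30
USER_AGENTS =
    ExampleBot/1.0 (+https://example.org/bot)
    ExampleBot/2.0 (+https://example.org/bot)
"""


def load_user_agents(config_text):
    """Return the configured user agents as a list, one entry per line."""
    parser = ConfigParser()
    parser.read_string(config_text)
    raw = parser.get("DEFAULT", "USER_AGENTS", fallback="")
    # Split on "\n" and drop empty lines and surrounding whitespace.
    return [line.strip() for line in raw.split("\n") if line.strip()]


agents = load_user_agents(SAMPLE)
```

Splitting on newlines rather than on commas or spaces keeps user-agent strings intact, since they commonly contain both.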

Updates:

docs updated (including #244 by @dsgibbons)
package dependencies updated

Full Changelog: v1.3.0...v1.4.0

Contributors

LaundroMat, mrienstra, and 5 other contributors

29 Jul 14:42

adbar

v1.3.0

c3f9a9f

trafilatura-1.3.0

fast and robust html2txt() function added (#221)
more robust parsing (#228)
fixed bugs in metadata extraction, with @felipehertzer in #213 & #226
extraction about 10-20% faster, slightly better recall
partial fixes for memory leaks (#216)
docs extended and updated (#217, #225)
prepared deprecation of old process_record() function
more stable processing with updated dependencies

Full Changelog: v1.2.2...v1.3.0

Contributors

felipehertzer

18 May 15:55

adbar

v1.2.2

168e660

trafilatura-1.2.2

more efficient rules for extraction
metadata: further attributes used (with @felipehertzer)
better baseline extraction
issues fixed: #202, #204, #205
evaluation updated

Full Changelog: v1.2.1...v1.2.2

Contributors

felipehertzer

02 May 10:24

adbar

v1.2.1

1bb5fee

trafilatura-1.2.1

What's Changed

--precision and --recall arguments added to the CLI
better text cleaning: paywalls and comments
improvements for Chinese websites (with @glacierck & @immortal-autumn): #186, #187, #188
further bugs fixed: #189, #192 (with @felipehertzer), #200
efficiency: faster module loading and improved RAM footprint

Full Changelog: v1.2.0...v1.2.1

Contributors

felipehertzer, glacierck, and immortal-autumn

07 Mar 11:49

adbar

v1.2.0

daf5d8d

trafilatura-1.2.0

efficiency: replaced the readability-lxml module with a trimmed fork
bugs fixed (#179, #180, #183, #184)
improved baseline extraction
cleaner metadata (with @felipehertzer)

Full Changelog: v1.1.0...v1.2.0

Contributors

felipehertzer

21 Feb 16:28

adbar

v1.1.0

776eb91

trafilatura-1.1.0

encodings: better detection, output NFC-normalized Unicode
maintenance and performance: more efficient code
bugs fixed (#119, #136, #147, #160, #161, #162, #164, #167 and others)
prepare compatibility with upcoming Python 3.11
changed default settings
extended documentation
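To show what NFC-normalized output means in practice, here is a short standard-library sketch. It only demonstrates the normalization form itself, not how trafilatura applies it internally; the helper name `to_nfc` is hypothetical.

```python
import unicodedata


def to_nfc(text):
    """Return the text in Unicode Normalization Form C (canonical composition)."""
    return unicodedata.normalize("NFC", text)


# "e" followed by a combining acute accent (two code points) ...
decomposed = "Cafe\u0301"
# ... becomes the single precomposed character U+00E9 after normalization.
composed = to_nfc(decomposed)
```

Both strings render identically, but only the normalized one compares equal to text typed with the precomposed character, which matters for deduplication and search.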

Full Changelog: v1.0.0...v1.1.0

30 Nov 17:27

adbar

v1.0.0

d7846a6

v1.0.0

compress HTML backup files & seamlessly open .gz files
support JSON web feeds
graphical user interface integrated into main package
faster downloads: reviewed backoff, compressed data
optional modules: downloads with pycurl, language identification with py3langid
bugs fixed (#111, #125, #132, #136, #140)
minor optimizations and fixes by @vbarbaresi in #124 & #130
fixed handling of arrays with single or multiple entries in the JSON extractor by @felipehertzer in #143
code base refactored with @sourcery-ai (#121), improved and optimized for Python 3.6+
drop support for Python 3.5
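The idea of compressed HTML backups that open transparently can be sketched with the standard library as follows. This is an assumption-laden illustration rather than the package's own code: the function names `write_html_backup` and `read_html_backup` are hypothetical, and detection here relies on the gzip magic bytes instead of the file extension.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def write_html_backup(path, html):
    """Store an HTML document as a gzip-compressed file."""
    with gzip.open(path, "wb") as handle:
        handle.write(html.encode("utf-8"))


def read_html_backup(path):
    """Read an HTML file, decompressing it first if it is gzip data."""
    with open(path, "rb") as handle:
        data = handle.read()
    if data[:2] == GZIP_MAGIC:
        data = gzip.decompress(data)
    return data.decode("utf-8", errors="replace")
```

Checking the content instead of the file name means the same reader works for plain `.html` files, `.gz` files, and compressed files that were saved without the expected suffix.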

Full Changelog: v0.9.3...v1.0.0

Contributors

felipehertzer, vbarbaresi, and sourcery-ai

21 Oct 17:25

adbar

v0.9.3

0546265

trafilatura-0.9.3

better, faster encoding detection: replaced chardet with charset_normalizer
faster execution: updated justext to 3.0
better extraction of sub-elements in tables (#78, #90)
more robust web feed parsing
further defined precision- and recall-oriented settings
license extraction in footers (#118)

Full Changelog: v0.9.2...v0.9.3

06 Oct 16:08

adbar

v0.9.2

85e28c6

trafilatura-0.9.2

first precision- and recall-oriented presets defined
improvements in authorship extraction (thanks @felipehertzer)
requesting TXT output with formatting now results in Markdown format
bugs fixed: notably extraction robustness and consistency (#109, #111, #113)
setting for cookies in request headers (thanks @muellermartin)
better date extraction thanks to htmldate update

Contributors

muellermartin and felipehertzer
