Releases · jsvine/pdfplumber

07 Mar 12:57

jsvine

v0.11.0

53306dc

v0.11.0 Latest

Summary: More control over the {left-to-right, right-to-left, top-to-bottom, bottom-to-top} direction that pdfplumber reads/writes text (many thanks to @afriedman412 for the idea and prototype in #1040), plus upgrading to pdfminer.six's latest release (which provides more detailed paths for curves), and some fixes.

Added

Add {line,char}_dir{,rotated,render} params, to provide better support for non–top-to-bottom, left-to-right text (h/t @afriedman412). (850fd45)
Add curve["path"] and curve["dash"], thanks to pdfminer.six upgrade (see below). (1820247)

Changed

Upgrade pdfminer.six from 20221105 to 20231228. (cd2f768)
Change value of word["direction"] from {1,-1} to {"ltr","rtl","ttb","btt"}. (850fd45)
Deprecate vertical_ttb, horizontal_ltr in favor of char_dir and char_dir_rotated. (850fd45)
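
Code that still compares word["direction"] against 1 or -1 will need adjusting. The sketch below shows one way to translate old-style values into the new strings. It rests on an assumption not stated in these notes: that the legacy integer was read together with the word's upright flag (1 meaning left-to-right for upright text and top-to-bottom otherwise). normalize_direction is a hypothetical helper, not part of pdfplumber.

```python
# Hypothetical migration helper; not part of pdfplumber's API.
# Assumption: the legacy 1/-1 value was interpreted together with "upright".
LEGACY_DIRECTIONS = {
    (True, 1): "ltr",
    (True, -1): "rtl",
    (False, 1): "ttb",
    (False, -1): "btt",
}


def normalize_direction(word: dict) -> str:
    """Return a string-style direction for a word dict."""
    direction = word["direction"]
    if isinstance(direction, str):
        return direction  # already in the v0.11.0 string format
    upright = bool(word.get("upright", True))
    return LEGACY_DIRECTIONS[(upright, direction)]
```

Words produced by v0.11.0 pass through unchanged, so the helper can wrap data from either version.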

Fixed

Fix layout-caching issue caused by 0bfffc2. (#1097 + efca277)
Fix missing ParentTree edge-case. (#1094)

Contributors

afriedman412

10 Feb 23:38

jsvine

v0.10.4

3bb642b

v0.10.4

Added

Add x_tolerance_ratio parameter to extract_text and similar functions, to account for text size when spacing characters (instead of a fixed number of pixels) (h/t @afriedman412). (#1041)
Add support for PDF 1.3 logical structure via Page.structure_tree (h/t @dhdaines). (#963)
Add "gswin64c" as another possible Ghostscript executable in repair.py (h/t @echedey-ls). (#1032)
Re-add Page.close() method, have PDF.close() close all pages as well, and improve relevant documentation (h/t @luketudge). (#1042)
Add force_mediabox parameter to Page.to_image(...). (#1054)
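
The idea behind x_tolerance_ratio can be sketched in a few lines: the allowed gap between characters scales with the text size instead of staying fixed. This is a simplified illustration, not pdfplumber's implementation. The function name begins_new_word and the choice to multiply the ratio by the previous character's size are assumptions.

```python
from typing import Optional


def begins_new_word(
    prev_char: dict,
    curr_char: dict,
    x_tolerance: float = 3,
    x_tolerance_ratio: Optional[float] = None,
) -> bool:
    """Decide whether the horizontal gap between two chars starts a new word."""
    # Assumption: a ratio, when given, replaces the fixed tolerance and is
    # scaled by the size of the previous character.
    if x_tolerance_ratio is not None:
        x_tolerance = x_tolerance_ratio * prev_char["size"]
    gap = curr_char["x0"] - prev_char["x1"]
    return gap > x_tolerance
```

With a fixed tolerance, a 4-unit gap splits words regardless of font size. With a ratio of 0.5, the same gap splits 6-unit text but not 24-unit text.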

Fixed

Standardize handling of cropbox, fixing various issues with PageImage. (#1054)
Fix Page.get_textmap caching to allow for extra_attrs=[...], by preconverting list kwargs to tuples. (#1030)
Explicitly close pypdfium2.PdfDocument in get_page_image (h/t @dhdaines). (#1090)
In PDFPageAggregatorWithMarkedContent.tag_cur_item, check self.cur_item._objs length before trying to access [-1]. (4f39d03)
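
The get_textmap caching fix rests on a general Python constraint: functools.lru_cache needs hashable arguments, and lists are not hashable. A minimal sketch of the "preconvert list kwargs to tuples" approach follows. The decorator freeze_list_kwargs and the stand-in get_textmap function are illustrative assumptions, not pdfplumber code.

```python
from functools import lru_cache, wraps


def freeze_list_kwargs(func):
    """Convert list-valued keyword arguments to tuples before calling func."""

    @wraps(func)
    def wrapper(*args, **kwargs):
        frozen = {
            key: tuple(value) if isinstance(value, list) else value
            for key, value in kwargs.items()
        }
        return func(*args, **frozen)

    return wrapper


@freeze_list_kwargs
@lru_cache(maxsize=None)
def get_textmap(text: str, extra_attrs: tuple = ()) -> str:
    # Stand-in for an expensive computation whose result is cached.
    return f"{text}|{','.join(extra_attrs)}"
```

Without the outer decorator, calling the cached function with extra_attrs=[...] raises TypeError because the list cannot be used as part of a cache key.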

Contributors

dhdaines, luketudge, and 2 other contributors

26 Oct 14:08

jsvine

v0.10.3

2e838d1

v0.10.3

Added

Add support for marked-content sequences, represented by mcid and tag attributes on char/rect/line/curve/image objects (h/t @dhdaines). (#961)
Add gs_path argument to pdfplumber.open(...) and pdfplumber.repair(...), to allow passing a custom Ghostscript path to be used for repairing. (#953)
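
As an illustration of what the mcid and tag attributes make possible, the sketch below groups character text by marked-content sequence. It operates on plain dicts shaped like char objects. The helper name group_text_by_mcid is hypothetical, and the assumption that objects outside any marked-content sequence carry mcid set to None should be checked against the library's documentation.

```python
from collections import defaultdict


def group_text_by_mcid(chars):
    """Join char text per (mcid, tag) pair, skipping unmarked chars."""
    groups = defaultdict(list)
    for char in chars:
        mcid = char.get("mcid")
        if mcid is None:
            continue  # assumption: unmarked objects have no mcid
        groups[(mcid, char.get("tag"))].append(char["text"])
    return {key: "".join(parts) for key, parts in groups.items()}
```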

Fixed

Respect use_text_flow in extract_text (h/t @dhdaines). (#983)

Contributors

dhdaines

29 Jul 19:04

jsvine

v0.10.2

f92a687

v0.10.2

Added

Add PDF.path: A Path object for PDFs loaded by passing a path (unless repair=True), and None otherwise. (30a52cb + #948)
Accept Iterable objects for geometry utils (h/t @dhdaines). (53bee23 + #945)
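
Accepting any Iterable means a geometry helper can consume a generator directly, without the caller building a list first. The sketch below shows the pattern for merging bounding boxes. merge_bboxes_sketch is an illustrative stand-in written for this example, not the library's own function, and it assumes (x0, top, x1, bottom) tuples.

```python
from typing import Iterable, Tuple

BBox = Tuple[float, float, float, float]  # (x0, top, x1, bottom)


def merge_bboxes_sketch(bboxes: Iterable[BBox]) -> BBox:
    """Return the smallest bbox containing every bbox in the iterable."""
    # zip(*...) consumes the iterable once, so generators work too.
    x0s, tops, x1s, bottoms = zip(*bboxes)
    return (min(x0s), min(tops), max(x1s), max(bottoms))
```

An empty iterable raises ValueError here, so callers that may have no objects should guard for that case.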

Changed

Use pypdfium2's public (not private) .render(...) method (h/t @mara004). (28f4ebe + #899)

Fixed

Fix .to_image() for ZipExtFiles (h/t @Urbener). (30a52cb + #948)

Contributors

dhdaines, mara004, and Urbener

19 Jul 19:03

jsvine

v0.10.1

90742bd

v0.10.1

A simple release:

Added

Add antialias boolean parameter to Page.to_image(...) and associated methods (h/t @cmdlineluser). (7e28931)

Contributors

cmdlineluser

16 Jul 22:37

jsvine

v0.10.0

00386ad

v0.10.0

Changed

Normalize color representation to tuple[float|int, ...] (#917). (57d51bb)
Replace Wand with pypdfium2 for page.to_image(...). (b049373)
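
To show what normalizing colors to tuple[float|int, ...] can look like in practice, here is a simplified sketch. It is not the library's logic, which also has to deal with patterns and color spaces. normalize_color is a hypothetical name, and the handling of None and scalar values is an assumption.

```python
from numbers import Number


def normalize_color(color):
    """Coerce a scalar or sequence color value into a tuple of numbers."""
    if color is None:
        return None  # assumption: a missing color stays None
    if isinstance(color, Number):
        return (color,)  # e.g. a single grayscale value
    return tuple(color)  # e.g. an RGB or CMYK list
```

Downstream code can then compare or hash colors uniformly, for example when grouping rects by fill color.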

Added

Add pdfplumber.repair(...) and .open(repair=True) (#824). (db6ae97)
Add Page.find_table(...) (#873). (3772af6)
Add quantize=True, colors=256, bits=8 arguments/defaults to PageImage.save(...). (b049373)
Extract and handle patterns + (some) color spaces. (97ca4b0)

Removed

Remove support for Python 3.7 (EOL'ed June 2023). (c9d24d5)
Remove vestigial 'font' and 'name' properties from PDF objects. (6d62054)

Fixed

Fix bug for re-crops that use relative=True (#914). (0de6da9)
Handle use_text_flow more consistently (#912). (b1db5b8)

13 Apr 12:58

jsvine

v0.9.0

255eaac

v0.9.0

Changed

Make word segmentation (via WordExtractor.char_begins_new_word(...)) more explicit and rigorous; should help in catching edge-cases in the future. (6acd580 + ebb93ea + #840)
Use curve_edge objects (instead of just line and rect_edge objects) in default table-detection strategy. (6f6b465 + #858)
By default, expand ligatures into their constituent letters (e.g., ﬃ to ffi), and add the expand_ligatures boolean parameter to text-extraction methods. (86e935d + #598)
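
Ligature expansion amounts to a character-level substitution. The sketch below covers the common Latin f-ligatures. The mapping and the expand_ligatures function shown here are illustrative and may not match the exact set the library handles.

```python
# Illustrative mapping of single-codepoint ligatures to their letters.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}

_LIGATURE_TABLE = str.maketrans(LIGATURES)


def expand_ligatures(text: str) -> str:
    """Replace each known ligature with its constituent letters."""
    return text.translate(_LIGATURE_TABLE)
```

Expanding ligatures by default makes extracted text searchable with ordinary strings such as "efficient".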

Added

Add Page.extract_text_lines(...) method. (4b37397 + #852)
Add main_group, return_groups, return_chars parameters to Page.search(...). (4b37397)
Add .curve_edges property to PDF and Page. (6f6b465)

Fixed

Fix handling of bytes-typed fontnames. (9441ff7 + #461 + #842)
Fix handling of whitespace-only and empty results of Page.search(...). (6f6b465 + #853)

14 Feb 03:05

jsvine

v0.8.0

b6847ad

v0.8.0

Changed

Change the (still experimental) Page/utils.extract_text(layout=True) approach so that it pads, to the degree necessary, the ends of lines with spaces and the end of the text with blank lines to achieve better mimicry of page layout. (d3662de)
Refactor handling of pts attribute and, in doing so, deprecate the curve_obj["points"] attribute, and fix PageImage.draw_line(...)'s handling of diagonal lines. (216bedd)
Breaking change: In Page.extract_table[s](...), keep_blank_chars must now be passed as text_keep_blank_chars, for consistency's sake. (c4e1b29)

Added

Add Page.extract_table[s](...) support for all Page.extract_text(...) keyword arguments. (c4e1b29)
Add height and width keyword arguments to Page.to_image(...). (#798 + 93f7dbd)
Add layout_width, layout_width_chars, layout_height, and layout_height_chars parameters to Page/utils.extract_text(layout=True). (d3662de)
Add CITATION.cff. (#755) [h/t @joaoccruz]
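
The padding behavior described for layout=True can be pictured with a small sketch: each line is right-padded with spaces to a target width, and trailing lines are appended to reach a target height. pad_layout is a hypothetical helper, and filling the trailing lines with spaces (instead of leaving them empty) is an assumption made for this example.

```python
def pad_layout(lines, width_chars, height_chars):
    """Pad lines to width_chars columns and the text to height_chars rows."""
    padded = [line.ljust(width_chars) for line in lines]
    missing_rows = max(0, height_chars - len(padded))
    # Assumption: trailing blank lines are filled with spaces.
    padded += [" " * width_chars] * missing_rows
    return "\n".join(padded)
```

The result is a rectangular block of text, which makes column positions comparable across lines.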

Fixed

Fix simple edge-case for when page rotation is (incorrectly) set to None. (#811) [h/t @toshi1127]

Development Changes

Convert utils.py into utils/ submodules. Retains same interface, just an improvement in organization. (6351d97)
Fix typing hints to include io.BytesIO. (d4107f6) [h/t @conitrade-as]
Refactor text-extraction utilities, paving way for better consistency across various entrypoints to text extraction (e.g., via utils.extract_text(...), via Page.extract_text(...), via Page.extract_table(...)). (3424b57)

Contributors

toshi1127, joaoccruz, and conitrade-as

22 Nov 18:03

jsvine

v0.7.6

f6741d3

v0.7.6

Changed

Bump pinned pdfminer.six version to 20221105. (e63a038)

Fixed

Restore text attribute to .textboxhorizontal/etc., regression introduced in 9587cc7 / v0.6.2. (8a0c126)
Fix lru_cache usage, which is discouraged for class methods due to garbage-collection issues. (e3142a0)
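
The garbage-collection issue can be reproduced in isolation. Decorating a method with lru_cache at class level stores self in a cache that lives as long as the class, so instances are never freed. One common workaround, shown below, is to build the cache per instance. This is a general Python sketch with made-up class names, and not necessarily how e3142a0 resolved the problem.

```python
import gc
import weakref
from functools import lru_cache


class ClassLevelCache:
    @lru_cache(maxsize=None)  # cache is shared and holds a reference to self
    def double(self, value):
        return value * 2


class PerInstanceCache:
    def __init__(self):
        # The cache lives on the instance, so it is freed along with it.
        self.double = lru_cache(maxsize=None)(self._double)

    def _double(self, value):
        return value * 2


def is_collected_after_use(cls) -> bool:
    """Create an instance, call the cached method, and check it can be freed."""
    obj = cls()
    obj.double(3)
    ref = weakref.ref(obj)
    del obj
    gc.collect()
    return ref() is None
```

In CPython, is_collected_after_use(ClassLevelCache) reports that the instance is still alive, while the per-instance variant is collected once the reference cycle is cleaned up.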

Development Changes

Upgrade nbexec development requirement from 0.1.0 to 0.2.0. (30dac25)

01 Oct 13:50

jsvine

v0.7.5

5aca57c

v0.7.5

Added

Add PageImage.show() as alias for PageImage.annotated.show(). (#715 + 5c7787b)

Fixed

Fixed issue where py.typed file was not included in PyPI distribution. (#698 + #703 + 6908487) [h/t @jhonatan-lopes]
Reinstated the ability to call utils.cluster_objects(...) with any hashable value (str, int, tuple, etc.) as the key_fn parameter, reverting breaking change in 58b1ab1. (#691 + 1e97656) [h/t @jfuruness]
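
Accepting either a callable or a plain key as key_fn usually comes down to one dispatch step. The sketch below pairs that step with a simplified one-dimensional clustering loop. Both resolve_key_fn and cluster_by are illustrative names written for this example, and the clustering logic is a simplification, not the library's algorithm.

```python
from operator import itemgetter


def resolve_key_fn(key_fn):
    """Use callables as-is; treat anything else as a lookup key."""
    return key_fn if callable(key_fn) else itemgetter(key_fn)


def cluster_by(objects, key_fn, tolerance=0):
    """Group objects whose sorted key values differ by at most tolerance."""
    get = resolve_key_fn(key_fn)
    clusters = []
    for obj in sorted(objects, key=get):
        if clusters and get(obj) - get(clusters[-1][-1]) <= tolerance:
            clusters[-1].append(obj)
        else:
            clusters.append([obj])
    return clusters
```

A string key works for dict-like objects, an integer key for tuples, and a lambda for anything else.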

Development Changes

Update Wand version in requirements.txt from >=0.6.7 to >=0.6.10. (#713 + 3457d79)

Contributors

jfuruness and jhonatan-lopes
