Releases: jsvine/pdfplumber

v0.11.0

07 Mar 12:57

Summary: More control over the {left-to-right, right-to-left, top-to-bottom, bottom-to-top} direction that pdfplumber reads/writes text (many thanks to @afriedman412 for the idea and prototype in #1040), plus upgrading to pdfminer.six's latest release (which provides more detailed paths for curves), and some fixes.

Added

  • Add {line,char}_dir{,rotated,render} params, to provide better support for non–top-to-bottom, left-to-right text (h/t @afriedman412). (850fd45)
  • Add curve["path"] and curve["dash"], thanks to pdfminer.six upgrade (see below). (1820247)

Changed

  • Upgrade pdfminer.six from 20221105 to 20231228. (cd2f768)
  • Change value of word["direction"] from {1,-1} to {"ltr","rtl","ttb","btt"}. (850fd45)
  • Deprecate vertical_ttb, horizontal_ltr in favor of char_dir and char_dir_rotated. (850fd45)
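
Code that still compares word["direction"] against the old integers could translate the new string values with a small shim. This is a minimal sketch, not part of pdfplumber: the helper name legacy_direction is hypothetical, and the mapping assumes the old value 1 stood for the default reading directions (left-to-right, top-to-bottom) and -1 for their reverses.

```python
# Hypothetical shim, not part of pdfplumber.
# Assumption: pre-0.11.0, 1 meant "ltr"/"ttb" and -1 meant "rtl"/"btt".
_LEGACY_DIRECTION = {"ltr": 1, "ttb": 1, "rtl": -1, "btt": -1}


def legacy_direction(direction):
    """Translate a v0.11.0-style direction string to the older integer form."""
    if direction in (1, -1):
        # Already in the old format; pass it through unchanged.
        return direction
    try:
        return _LEGACY_DIRECTION[direction]
    except KeyError:
        raise ValueError(f"unknown direction: {direction!r}") from None


word = {"text": "example", "direction": "rtl"}  # made-up word dict
print(legacy_direction(word["direction"]))  # -1
```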

Fixed

  • Fix layout-caching issue caused by 0bfffc2. (#1097 + efca277)
  • Fix missing ParentTree edge-case. (#1094)

v0.10.4

10 Feb 23:38

Added

  • Add x_tolerance_ratio parameter to extract_text and similar functions, to account for text size when spacing characters (instead of a fixed number of pixels) (h/t @afriedman412). (#1041)
  • Add support for PDF 1.3 logical structure via Page.structure_tree (h/t @dhdaines). (#963)
  • Add "gswin64c" as another possible Ghostscript executable in repair.py (h/t @echedey-ls). (#1032)
  • Re-add Page.close() method, have PDF.close() close all pages as well, and improve relevant documentation (h/t @luketudge). (#1042)
  • Add force_mediabox parameter to Page.to_image(...). (#1054)
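
The idea behind x_tolerance_ratio can be illustrated without a PDF. The sketch below is a simplified stand-in, not pdfplumber's implementation: split_words and the char dicts are made up for illustration, and it assumes the effective tolerance is the ratio multiplied by the previous character's size.

```python
def split_words(chars, x_tolerance=3, x_tolerance_ratio=None):
    """Group left-to-right chars into words, breaking on wide horizontal gaps."""
    words, current = [], []
    for char in chars:
        if current:
            prev = current[-1]
            # Assumption: when a ratio is given, the tolerance scales with
            # the previous character's font size instead of being fixed.
            if x_tolerance_ratio is None:
                tolerance = x_tolerance
            else:
                tolerance = x_tolerance_ratio * prev["size"]
            if char["x0"] - prev["x1"] > tolerance:
                words.append("".join(c["text"] for c in current))
                current = []
        current.append(char)
    if current:
        words.append("".join(c["text"] for c in current))
    return words


# Large (size-40) text: 5 units between "A" and "B", 15 units before "C".
chars = [
    {"text": "A", "x0": 0, "x1": 30, "size": 40},
    {"text": "B", "x0": 35, "x1": 65, "size": 40},
    {"text": "C", "x0": 80, "x1": 110, "size": 40},
]
print(split_words(chars))                          # ['A', 'B', 'C']
print(split_words(chars, x_tolerance_ratio=0.15))  # ['AB', 'C']
```

With the fixed tolerance, normal letter spacing in large text is mistaken for a word break; the ratio-based tolerance grows with the font size and keeps "AB" together.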

Fixed

  • Standardize handling of cropbox, fixing various issues with PageImage. (#1054)
  • Fix Page.get_textmap caching to allow for extra_attrs=[...], by preconverting list kwargs to tuples. (#1030)
  • Explicitly close pypdfium2.PdfDocument in get_page_image (h/t @dhdaines). (#1090)
  • In PDFPageAggregatorWithMarkedContent.tag_cur_item, check self.cur_item._objs length before trying to access [-1]. (4f39d03)
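
The get_textmap caching fix relies on a general pattern: lists are unhashable, so list-valued keyword arguments have to be converted to tuples before they reach a cache. The following is a sketch of that pattern using functools.lru_cache; freeze_list_kwargs and cached_textmap are hypothetical names, and this is not pdfplumber's actual code.

```python
from functools import lru_cache, wraps


def freeze_list_kwargs(func):
    """Convert list-valued keyword arguments to tuples before calling func."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        frozen = {
            key: tuple(value) if isinstance(value, list) else value
            for key, value in kwargs.items()
        }
        return func(*args, **frozen)
    return wrapper


calls = []  # records each cache miss, for demonstration only


@freeze_list_kwargs
@lru_cache(maxsize=None)
def cached_textmap(**kwargs):
    # Stand-in for an expensive computation keyed on its keyword arguments.
    calls.append(kwargs)
    return sorted(kwargs)


# Without the decorator, passing a list here would raise TypeError
# (unhashable type) inside lru_cache.
cached_textmap(extra_attrs=["fontname", "size"])
cached_textmap(extra_attrs=["fontname", "size"])
print(len(calls))  # 1: the second call was served from the cache
```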

v0.10.3

26 Oct 14:08

Added

  • Add support for marked-content sequences, represented by mcid and tag attributes on char/rect/line/curve/image objects (h/t @dhdaines). (#961)
  • Add gs_path argument to pdfplumber.open(...) and pdfplumber.repair(...), to allow passing a custom Ghostscript path to be used for repairing. (#953)

Fixed

v0.10.2

29 Jul 19:04

Added

  • Add PDF.path: A Path object for PDFs loaded by passing a path (unless repair=True), and None otherwise. (30a52cb + #948)

  • Accept Iterable objects for geometry utils (h/t @dhdaines). (53bee23 + #945)

Changed

Fixed

v0.10.1

19 Jul 19:03

A simple release:

Added

  • Add antialias boolean parameter to Page.to_image(...) and associated methods (h/t @cmdlineluser). (7e28931)

v0.10.0

16 Jul 22:37
00386ad

Changed

  • Normalize color representation to tuple[float|int, ...] (#917). (57d51bb)
  • Replace Wand with pypdfium2 for page.to_image(...). (b049373)
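
To illustrate what normalizing colors to tuple[float|int, ...] means in practice, here is a minimal sketch. The function normalize_color is hypothetical, and the handling of None and bare numbers is an assumption; the library's own logic may cover more cases.

```python
from numbers import Real


def normalize_color(color):
    """Return color as a tuple of numbers (or None if no color is set)."""
    if color is None:
        return None
    if isinstance(color, Real):
        # A bare number (e.g., a grayscale value) becomes a 1-tuple.
        return (color,)
    # Lists and other sequences of components become tuples.
    return tuple(color)


print(normalize_color(0.5))        # (0.5,)
print(normalize_color([1, 0, 0]))  # (1, 0, 0)
```

A single tuple type means downstream code can compare and hash colors without first checking whether it received a number, a list, or a tuple.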

Added

  • Add pdfplumber.repair(...) and .open(repair=True) (#824). (db6ae97)
  • Add Page.find_table(...) (#873). (3772af6)
  • Add quantize=True, colors=256, bits=8 arguments/defaults to PageImage.save(...). (b049373)
  • Extract and handle patterns + (some) color spaces. (97ca4b0)

Removed

Fixed

  • Fix bug for re-crops that use relative=True (#914). (0de6da9)
  • Handle use_text_flow more consistently (#912). (b1db5b8)

v0.9.0

13 Apr 12:58
255eaac

Changed

  • Make word segmentation (via WordExtractor.char_begins_new_word(...)) more explicit and rigorous; should help in catching edge-cases in the future. (6acd580 + ebb93ea + #840)
  • Use curve_edge objects (instead of just line and rect_edge objects) in default table-detection strategy. (6f6b465 + #858)
  • By default, expand ligatures into their constituent letters (e.g., ﬃ to ffi), and add the expand_ligatures boolean parameter to text-extraction methods. (86e935d + #598)
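
Ligature expansion can be sketched as a simple character translation. The table below covers the Latin ligatures in Unicode's Alphabetic Presentation Forms block; whether it matches pdfplumber's own table is an assumption, and the standalone function here is illustrative only.

```python
# Single-codepoint ligatures mapped to their constituent letters.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
    "\ufb05": "st",  # long-s + t ligature
    "\ufb06": "st",
}
_TRANSLATION = str.maketrans(LIGATURES)


def expand_ligatures(text, expand=True):
    """Replace ligature characters with plain letters when expand is True."""
    return text.translate(_TRANSLATION) if expand else text


print(expand_ligatures("o\ufb03ce"))  # office
```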

Added

  • Add Page.extract_text_lines(...) method. (4b37397 + #852)
  • Add main_group, return_groups, return_chars parameters to Page.search(...). (4b37397)
  • Add .curve_edges property to PDF and Page. (6f6b465)

Fixed

  • Fix handling of bytes-typed fontnames. (9441ff7 + #461 + #842)
  • Fix handling of whitespace-only and empty results of Page.search(...). (6f6b465 + #853)

v0.8.0

14 Feb 03:05

Changed

  • Change the (still experimental) Page/utils.extract_text(layout=True) approach so that it pads, to the degree necessary, the ends of lines with spaces and the end of the text with blank lines to achieve better mimicry of page layout. (d3662de)
  • Refactor handling of pts attribute and, in doing so, deprecate the curve_obj["points"] attribute, and fix PageImage.draw_line(...)'s handling of diagonal lines. (216bedd)
  • Breaking change: In Page.extract_table[s](...), keep_blank_chars must now be passed as text_keep_blank_chars, for consistency's sake. (c4e1b29)
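
Callers that build their table settings as a dict could adapt to the keep_blank_chars rename with a small helper. This is a sketch under assumptions: migrate_table_settings is a hypothetical name, and the settings dict shown is made up for illustration.

```python
def migrate_table_settings(settings):
    """Return a copy of settings with keep_blank_chars renamed."""
    migrated = dict(settings)  # leave the caller's dict untouched
    if "keep_blank_chars" in migrated:
        value = migrated.pop("keep_blank_chars")
        # An explicitly set new-style key takes precedence over the old one.
        migrated.setdefault("text_keep_blank_chars", value)
    return migrated


old_settings = {"keep_blank_chars": True, "snap_tolerance": 3}
print(migrate_table_settings(old_settings))
```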

Added

  • Add Page.extract_table[s](...) support for all Page.extract_text(...) keyword arguments. (c4e1b29)
  • Add height and width keyword arguments to Page.to_image(...). (#798 + 93f7dbd)
  • Add layout_width, layout_width_chars, layout_height, and layout_height_chars parameters to Page/utils.extract_text(layout=True). (d3662de)
  • Add CITATION.cff. (#755) [h/t @joaoccruz]
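
The padding behavior of layout=True, and the role of a character-based width such as layout_width_chars, can be approximated in a few lines. This is an illustrative sketch only: pad_layout and its parameters are hypothetical, and filling the trailing blank lines with spaces is a choice made here, not confirmed library behavior.

```python
def pad_layout(lines, width_chars, height_lines):
    """Pad each line to width_chars and the text to height_lines lines."""
    # Pad the end of every line with spaces.
    padded = [line.ljust(width_chars) for line in lines]
    # Pad the end of the text with blank lines.
    missing = max(0, height_lines - len(padded))
    padded.extend([" " * width_chars] * missing)
    return "\n".join(padded)


text = pad_layout(["ab", "c"], width_chars=4, height_lines=3)
print(repr(text))  # 'ab  \nc   \n    '
```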

Fixed

  • Fix simple edge-case for when page rotation is (incorrectly) set to None. (#811) [h/t @toshi1127]

Development Changes

  • Convert utils.py into utils/ submodules. Retains same interface, just an improvement in organization. (6351d97)
  • Fix typing hints to include io.BytesIO. (d4107f6) [h/t @conitrade-as]
  • Refactor text-extraction utilities, paving way for better consistency across various entrypoints to text extraction (e.g., via utils.extract_text(...), via Page.extract_text(...), via Page.extract_table(...)). (3424b57)

v0.7.6

22 Nov 18:03

Changed

  • Bump pinned pdfminer.six version to 20221105. (e63a038)

Fixed

Development Changes

  • Upgrade nbexec development requirement from 0.1.0 to 0.2.0. (30dac25)

v0.7.5

01 Oct 13:50

Added

  • Add PageImage.show() as alias for PageImage.annotated.show(). (#715 + 5c7787b)

Fixed

  • Fixed issue where py.typed file was not included in PyPI distribution. (#698 + #703 + 6908487) [h/t @jhonatan-lopes]
  • Reinstated the ability to call utils.cluster_objects(...) with any hashable value (str, int, tuple, etc.) as the key_fn parameter, reverting breaking change in 58b1ab1. (#691 + 1e97656) [h/t @jfuruness]
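
Accepting either a callable or a plain hashable key is commonly done by wrapping non-callables in operator.itemgetter. The sketch below shows that pattern with a simplified clustering routine; resolve_key_fn and cluster_sketch are hypothetical names, and the clustering logic is a stand-in rather than the behavior of utils.cluster_objects(...).

```python
from operator import itemgetter


def resolve_key_fn(key_fn):
    """Accept a callable as-is; treat anything else as a lookup key."""
    return key_fn if callable(key_fn) else itemgetter(key_fn)


def cluster_sketch(objects, key_fn, tolerance):
    """Group objects whose key values lie within tolerance of a neighbor."""
    key = resolve_key_fn(key_fn)
    clusters = []
    for obj in sorted(objects, key=key):
        if clusters and key(obj) - key(clusters[-1][-1]) <= tolerance:
            clusters[-1].append(obj)
        else:
            clusters.append([obj])
    return clusters


objs = [{"top": 30}, {"top": 10}, {"top": 11}]  # made-up objects
by_name = cluster_sketch(objs, "top", tolerance=2)
by_func = cluster_sketch(objs, lambda o: o["top"], tolerance=2)
print(by_name == by_func)  # True
```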

Development Changes

  • Update Wand version in requirements.txt from >=0.6.7 to >=0.6.10. (#713 + 3457d79)