Releases: jsvine/pdfplumber
v0.11.0
Summary: More control over the direction ({left-to-right, right-to-left, top-to-bottom, bottom-to-top}) in which pdfplumber reads/writes text (many thanks to @afriedman412 for the idea and prototype in #1040), plus an upgrade to pdfminer.six's latest release (which provides more detailed paths for curves), and some fixes.
Added
- Add {line,char}_dir{,_rotated,_render} params, to provide better support for non–top-to-bottom, left-to-right text (h/t @afriedman412). (850fd45)
- Add curve["path"] and curve["dash"], thanks to pdfminer.six upgrade (see below). (1820247)
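The brace shorthand in the first item above stands for six separate parameter names. A minimal sketch of the expansion, assuming the suffixes follow the char_dir_rotated pattern named in the Changed section of this release:

```python
from itertools import product

# Expand the shorthand {line,char}_dir{,_rotated,_render} into full names.
# Assumption: suffixes carry a leading underscore, as in char_dir_rotated.
prefixes = ["line", "char"]
suffixes = ["", "_rotated", "_render"]

param_names = [f"{prefix}_dir{suffix}" for prefix, suffix in product(prefixes, suffixes)]

for name in param_names:
    print(name)
```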
Changed
- Upgrade pdfminer.six from 20221105 to 20231228. (cd2f768)
- Change value of word["direction"] from {1,-1} to {"ltr","rtl","ttb","btt"}. (850fd45)
- Deprecate vertical_ttb, horizontal_ltr in favor of char_dir and char_dir_rotated. (850fd45)
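Code that compared word["direction"] against 1 or -1 will need updating. The helper below is a hypothetical sketch (is_forward is not part of pdfplumber) that accepts both styles. It assumes the old value 1 marked the forward reading order (left-to-right or top-to-bottom) and -1 the reverse; these notes do not state that mapping.

```python
# Hypothetical compatibility helper; not part of pdfplumber.
# Assumption: legacy 1 = forward (ltr or ttb), legacy -1 = reverse (rtl or btt).
FORWARD = {"ltr", "ttb"}
REVERSE = {"rtl", "btt"}


def is_forward(direction):
    """Return True for forward reading order, for old or new direction values."""
    if isinstance(direction, int):
        if direction not in (1, -1):
            raise ValueError(f"unexpected legacy direction: {direction!r}")
        return direction == 1
    if direction in FORWARD:
        return True
    if direction in REVERSE:
        return False
    raise ValueError(f"unexpected direction: {direction!r}")
```

A caller can then write is_forward(word["direction"]) and run unchanged on either side of the upgrade.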
Fixed
v0.10.4
Added
- Add x_tolerance_ratio parameter to extract_text and similar functions, to account for text size when spacing characters (instead of a fixed number of pixels) (h/t @afriedman412). (#1041)
- Add support for PDF 1.3 logical structure via Page.structure_tree (h/t @dhdaines). (#963)
- Add "gswin64c" as another possible Ghostscript executable in repair.py (h/t @echedey-ls). (#1032)
- Re-add Page.close() method, have PDF.close() close all pages as well, and improve relevant documentation (h/t @luketudge). (#1042)
- Add force_mediabox parameter to Page.to_image(...). (#1054)
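To illustrate why a size-relative tolerance helps, here is a simplified sketch of gap-based word splitting. It is not pdfplumber's implementation: split_words is a hypothetical function, the character dicts are made-up sample data, and scaling by the previous character's size is an assumption.

```python
def split_words(chars, x_tolerance=3, x_tolerance_ratio=None):
    """Group left-to-right chars into words, splitting on horizontal gaps.

    chars: dicts with "text", "x0", "x1", "size", sorted by x0.
    If x_tolerance_ratio is given, the allowed gap scales with font size
    (assumption: the previous character's size) instead of being fixed.
    """
    words, current = [], []
    for char in chars:
        if current:
            prev = current[-1]
            if x_tolerance_ratio is None:
                tolerance = x_tolerance
            else:
                tolerance = x_tolerance_ratio * prev["size"]
            if char["x0"] - prev["x1"] > tolerance:
                words.append("".join(c["text"] for c in current))
                current = []
        current.append(char)
    if current:
        words.append("".join(c["text"] for c in current))
    return words


# Made-up sample: large (size 40) text with a 5-unit gap between letters.
big_text = [
    {"text": "H", "x0": 0, "x1": 30, "size": 40},
    {"text": "i", "x0": 35, "x1": 45, "size": 40},
]
fixed_result = split_words(big_text, x_tolerance=3)
ratio_result = split_words(big_text, x_tolerance_ratio=0.15)
```

With the fixed tolerance, the large text is split into single letters; with the ratio, the allowed gap grows to 6 units and the letters stay together.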
Fixed
- Standardize handling of cropbox, fixing various issues with PageImage. (#1054)
- Fix Page.get_textmap caching to allow for extra_attrs=[...], by preconverting list kwargs to tuples. (#1030)
- Explicitly close pypdfium2.PdfDocument in get_page_image (h/t @dhdaines). (#1090)
- In PDFPageAggregatorWithMarkedContent.tag_cur_item, check self.cur_item._objs length before trying to access [-1]. (4f39d03)
v0.10.3
Added
- Add support for marked-content sequences, represented by mcid and tag attributes on char/rect/line/curve/image objects (h/t @dhdaines). (#961)
- Add gs_path argument to pdfplumber.open(...) and pdfplumber.repair(...), to allow passing a custom Ghostscript path to be used for repairing. (#953)
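One possible use of the mcid and tag attributes is to regroup extracted characters by the marked-content sequence they belong to. The sketch below works on plain dicts; group_text_by_mcid is a hypothetical helper and the sample characters are invented for illustration.

```python
from collections import defaultdict


def group_text_by_mcid(chars):
    """Join char text per (mcid, tag) pair, preserving input order."""
    groups = defaultdict(list)
    for char in chars:
        # .get() is used because objects outside marked content may lack these keys.
        groups[(char.get("mcid"), char.get("tag"))].append(char["text"])
    return {key: "".join(texts) for key, texts in groups.items()}


# Invented sample data shaped like char objects with mcid/tag attributes.
sample_chars = [
    {"text": "T", "mcid": 0, "tag": "H1"},
    {"text": "i", "mcid": 0, "tag": "H1"},
    {"text": "a", "mcid": 1, "tag": "P"},
]
grouped = group_text_by_mcid(sample_chars)
```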
Fixed
v0.10.2
v0.10.1
v0.10.0
Changed
- Normalize color representation to tuple[float|int, ...] (#917). (57d51bb)
- Replace Wand with pypdfium2 for page.to_image(...). (b049373)
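As a rough illustration of what normalizing to tuple[float|int, ...] can mean, the sketch below converts scalars and lists to tuples. It is a hypothetical helper, not pdfplumber's code, and passing None through unchanged is an assumption.

```python
def normalize_color(color):
    """Return color as a tuple of numbers (None is passed through)."""
    if color is None:
        return None  # assumption: "no color" stays None
    if isinstance(color, (int, float)):
        return (color,)  # a single gray level becomes a 1-tuple
    return tuple(color)  # lists and other sequences become tuples
```

A uniform tuple type means callers can compare or hash colors without first checking whether a value is a bare number, a list, or a tuple.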
Added
- Add pdfplumber.repair(...) and .open(repair=True) (#824). (db6ae97)
- Add Page.find_table(...) (#873). (3772af6)
- Add quantize=True, colors=256, bits=8 arguments/defaults to PageImage.save(...). (b049373)
- Extract and handle patterns + (some) color spaces. (97ca4b0)
Removed
- Remove support for Python 3.7 (EOL'ed June 2023). (c9d24d5)
- Remove vestigial 'font' and 'name' properties from PDF objects. (6d62054)
Fixed
v0.9.0
Changed
- Make word segmentation (via WordExtractor.char_begins_new_word(...)) more explicit and rigorous; should help in catching edge-cases in the future. (6acd580 + ebb93ea + #840)
- Use curve_edge objects (instead of just line and rect_edge objects) in default table-detection strategy. (6f6b465 + #858)
- By default, expand ligatures into their constituent letters (e.g., ﬃ to ffi), and add the expand_ligatures boolean parameter to text-extraction methods. (86e935d + #598)
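Ligature expansion replaces a single ligature code point with its constituent letters. A minimal sketch, assuming a lookup table of the common Latin f-ligatures (the table and the expand_ligatures function here are illustrative, not pdfplumber's own):

```python
# Illustrative table of common Latin f-ligatures (Unicode U+FB00 to U+FB04).
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}


def expand_ligatures(text):
    """Replace each known ligature character with its constituent letters."""
    return "".join(LIGATURES.get(char, char) for char in text)


expanded = expand_ligatures("o\ufb03ce")  # one ligature character inside a word
```

Without expansion, a search for "office" would miss text where the PDF encodes the middle three letters as one ligature glyph.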
Added
- Add Page.extract_text_lines(...) method. (4b37397 + #852)
- Add main_group, return_groups, return_chars parameters to Page.search(...). (4b37397)
- Add .curve_edges property to PDF and Page. (6f6b465)
Fixed
v0.8.0
Changed
- Change the (still experimental) Page/utils.extract_text(layout=True) approach so that it pads, to the degree necessary, the ends of lines with spaces and the end of the text with blank lines to achieve better mimicry of page layout. (d3662de)
- Refactor handling of pts attribute and, in doing so, deprecate the curve_obj["points"] attribute, and fix PageImage.draw_line(...)'s handling of diagonal lines. (216bedd)
- Breaking change: In Page.extract_table[s](...), keep_blank_chars must now be passed as text_keep_blank_chars, for consistency's sake. (c4e1b29)
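The padding behavior described in the first item can be pictured with a small sketch. pad_layout is a hypothetical function, and filling the trailing blank lines with spaces is an assumption made here so every line has the same width.

```python
def pad_layout(lines, width_chars, height_chars):
    """Pad lines to a fixed character grid.

    Short lines are right-padded with spaces; missing lines at the end are
    added as blank lines (assumption: also filled with spaces).
    """
    padded = [line.ljust(width_chars) for line in lines]
    missing = max(0, height_chars - len(padded))
    padded.extend([" " * width_chars] * missing)
    return "\n".join(padded)


grid = pad_layout(["ab", "c"], width_chars=4, height_chars=3)
```

The result is a rectangular block of text, so a character's row and column in the string roughly track its position on the page.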
Added
- Add Page.extract_table[s](...) support for all Page.extract_text(...) keyword arguments. (c4e1b29)
- Add height and width keyword arguments to Page.to_image(...). (#798 + 93f7dbd)
- Add layout_width, layout_width_chars, layout_height, and layout_height_chars parameters to Page/utils.extract_text(layout=True). (d3662de)
- Add CITATION.cff. (#755) [h/t @joaoccruz]
Fixed
- Fix simple edge-case for when page rotation is (incorrectly) set to None. (#811) [h/t @toshi1127]
Development Changes
- Convert utils.py into utils/ submodules. Retains same interface, just an improvement in organization. (6351d97)
- Fix typing hints to include io.BytesIO. (d4107f6) [h/t @conitrade-as]
- Refactor text-extraction utilities, paving way for better consistency across various entrypoints to text extraction (e.g., via utils.extract_text(...), via Page.extract_text(...), via Page.extract_table(...)). (3424b57)
v0.7.6
Changed
- Bump pinned pdfminer.six version to 20221105. (e63a038)
Fixed
- Restore text attribute to .textboxhorizontal/etc., regression introduced in 9587cc7 / v0.6.2. (8a0c126)
- Fix lru_cache usage, which is discouraged for class methods due to garbage-collection issues. (e3142a0)
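The garbage-collection issue with lru_cache on methods can be reproduced in a few lines. This sketch shows the general Python behavior and one common workaround (a per-instance cache); the class names are hypothetical and the workaround is not claimed to be the exact change made in e3142a0.

```python
import gc
import weakref
from functools import lru_cache


class SharedCache:
    # The cache lives on the class and stores `self` in its keys,
    # so every instance that calls compute() is kept alive by the cache.
    @lru_cache(maxsize=None)
    def compute(self, x):
        return x * 2


class PerInstanceCache:
    def __init__(self):
        # The cache belongs to the instance; the resulting reference cycle
        # can be reclaimed by the garbage collector.
        self.compute = lru_cache(maxsize=None)(self._compute)

    def _compute(self, x):
        return x * 2


def survives_deletion(cls):
    """Return True if an instance is still alive after del + gc.collect()."""
    obj = cls()
    obj.compute(1)
    ref = weakref.ref(obj)
    del obj
    gc.collect()
    return ref() is not None
```

On CPython, survives_deletion(SharedCache) reports that the instance is still alive, while survives_deletion(PerInstanceCache) reports that it was collected.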
Development Changes
- Upgrade nbexec development requirement from 0.1.0 to 0.2.0. (30dac25)
v0.7.5
Added
Fixed
- Fixed issue where py.typed file was not included in PyPI distribution. (#698 + #703 + 6908487) [h/t @jhonatan-lopes]
- Reinstated the ability to call utils.cluster_objects(...) with any hashable value (str, int, tuple, etc.) as the key_fn parameter, reverting breaking change in 58b1ab1. (#691 + 1e97656) [h/t @jfuruness]
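Accepting either a callable or a hashable key for key_fn can be sketched as follows. Both resolve_key_fn and cluster_by are hypothetical simplifications written for this illustration; they are not the library's implementation of utils.cluster_objects(...).

```python
from operator import itemgetter


def resolve_key_fn(key_fn):
    """Use key_fn as-is if callable; otherwise treat it as a lookup key."""
    return key_fn if callable(key_fn) else itemgetter(key_fn)


def cluster_by(objs, key_fn, tolerance=0):
    """Group objects whose sorted key values are within `tolerance` of the previous one."""
    get_value = resolve_key_fn(key_fn)
    clusters = []
    for obj in sorted(objs, key=get_value):
        if clusters and get_value(obj) - get_value(clusters[-1][-1]) <= tolerance:
            clusters[-1].append(obj)
        else:
            clusters.append([obj])
    return clusters


# Made-up sample objects keyed by their "top" coordinate.
rows = [{"top": 30}, {"top": 10}, {"top": 11}]
by_key = cluster_by(rows, "top", tolerance=1)
by_callable = cluster_by(rows, lambda obj: obj["top"], tolerance=1)
```

Checking callable(key_fn) rather than testing for specific types is what lets any hashable value, including a tuple, serve as a lookup key.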