Add functionality to extract PDF text from specific regions #62

PavlosMelissinos · 2021-01-12T00:24:39Z

Description of your pull request

(Feel free to squash & merge and use this as a commit message!)

Add functionality in pdfboxing.text to extract PDF text from specific regions.

Large PDF documents can contain too much content to be properly parsed at once. It would often be preferable to locate the regions that contain the information and extract text from those instead, increasing parsing accuracy and retaining the semantics at the same time.

It's a rather small change that should not introduce a significant maintenance overhead.

Addresses #61

Pull request checklist

Before submitting the PR make sure the following things have been done
(and denote this by checking the relevant checkboxes):

The code is consistent with Clojure style guide.
All code passes the linter (clj-kondo --lint src).
You've added tests (if possible) to cover your change(s).
All tests are passing.
The commits are consistent with the Git commit style guide.
You've updated the changelog (if adding/changing user-visible functionality).

Thanks!


dotemacs · 2021-01-12T00:29:29Z

Thanks for your work on this @PavlosMelissinos!

I'll try to merge soon. If you don't see any movement on this, do ping me to remind me.

Thanks again!

PavlosMelissinos · 2021-01-13T22:59:51Z

I just pushed a tiny commit that updates the docstring of the function!

I also realized that:

* extract-by-areas is not very robust without specs (if the user omits a coordinate it crashes) and
* I haven't added a section in the README.

How do you feel about defaulting missing coordinates to 0?
As in:

{:w 280
 :h 100}

should give the same result as:

{:x           0
 :y           0
 :w           280
 :h           100
 :page-number 0}
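The defaulting proposed above could be sketched roughly as follows. This is only an illustration: `default-area` and `with-default-coords` are hypothetical names, not part of pdfboxing, and the choice to default all five keys (including `:w` and `:h`) is an assumption.

```clojure
;; Hypothetical helper, not pdfboxing's actual implementation.
;; Assumption: every key of an area map falls back to 0 when omitted.
(def default-area
  {:x 0 :y 0 :w 0 :h 0 :page-number 0})

(defn with-default-coords
  "Return `area` with any missing keys filled in from `default-area`."
  [area]
  (merge default-area area))

(with-default-coords {:w 280 :h 100})
;; => {:x 0, :y 0, :w 280, :h 100, :page-number 0}
```

Keys supplied by the caller win over the defaults, so a fully specified area map passes through unchanged.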


PavlosMelissinos · 2021-01-15T23:58:48Z

I've been thinking about this for a while and, well, having area-text throw an exception if a coordinate is missing doesn't sit right with me. So I've:

* made 0 the default value of coordinates, according to the example in my previous comment and
* added docs for the function

I think I like it better this way but let me know what you think and I'll revert if needed...

PavlosMelissinos · 2021-02-13T14:46:37Z

@dotemacs what do you say? 🙂

PavlosMelissinos · 2021-06-02T07:57:42Z

So...? 😄

dotemacs · 2021-06-02T08:01:25Z

Hey @PavlosMelissinos

Sorry for the delay.
Thanks for doing this!

Looking at it quickly, again, it looks good. But I want to look at it properly and try it before merging.

Thanks for your work :)

dotemacs · 2021-10-09T13:39:59Z

I'll merge this this weekend and I'll resolve the merge conflict in the CHANGELOG.

Sorry for the delay


PavlosMelissinos · 2021-10-09T16:49:31Z

No pressure at all, I think we have enough stress in our lives already!
FYI I've resolved the conflict.

dotemacs · 2021-10-10T09:14:51Z

src/pdfboxing/text.clj

+
+(defn extract-by-areas
+  "get text from specified areas of a PDF document"
+  [pdfdoc areas]


Hey @PavlosMelissinos

Can you tell me what your thinking was here?

Why is pdfdoc an argument on its own and areas is a map?

Why can't it all go into a map?

My thinking is that if you're passing a map around, where all the arguments are in the map, you don't have to think about the position of your arguments.

Thanks

PavlosMelissinos · Oct 11, 2021

I think it's clearer this way. extract-by-areas is an operation on a pdf document and the coordinates are just parameters. Sure, they're crucial, but they don't have the same weight as the actual document.

I don't have very strong feelings about this though, it's your library 🙂

dotemacs · Oct 12, 2021

I started off using mostly rest arguments for the functions in the library.

Then I accepted some PRs which used strict arity.

Let me think about this for a bit and see what option/approach to take, because once this is merged it'll be good to provide the least amount of surprise.

Thanks

PavlosMelissinos · Oct 13, 2021

Oh I see. Yeah I could make it variadic if you'd prefer that. That would be consistent with split-pdf and other functions!
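For reference, the three calling conventions discussed in this thread might look like the sketch below. All three functions are hypothetical stand-ins that only echo their arguments; none of them is pdfboxing's actual API, and the key names `:pdfdoc` and `:areas` are assumed for illustration.

```clojure
;; Hypothetical stand-ins; they return their arguments and never open a PDF.

;; Strict arity: the document is positional, the areas come second.
(defn extract-positional
  [pdfdoc areas]
  {:pdfdoc pdfdoc :areas areas})

;; Single map: every argument is a key, so position does not matter.
(defn extract-map
  [{:keys [pdfdoc areas]}]
  {:pdfdoc pdfdoc :areas areas})

;; Rest (keyword) arguments: same keys, but passed without the braces.
(defn extract-variadic
  [& {:keys [pdfdoc areas]}]
  {:pdfdoc pdfdoc :areas areas})

(def areas [{:x 0 :y 0 :w 280 :h 100 :page-number 0}])

(extract-positional "doc.pdf" areas)
(extract-map {:pdfdoc "doc.pdf" :areas areas})
(extract-variadic :pdfdoc "doc.pdf" :areas areas)
```

All three calls carry the same information; the difference is only whether the caller has to remember argument order or key names.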


dotemacs · 2021-10-10T09:17:26Z

> No pressure at all, I think we have enough stress in our lives already! FYI I've resolved the conflict.

Thanks for the kind words and for your work here :)

I left some comments, let me know what you think.

Thanks

Pavlos Melissinos added 4 commits January 12, 2021 01:40

Extract pdf text by areas

5e2a817

Appease the linter monster

2516b55

Update changelog

05a146e

Reorder changelog

a66aff3

I had added my change in the beginning of the changelog, incorrectly. This commit fixes that mistake.

PavlosMelissinos changed the title ~~Extract pdf text by areas~~ Extract PDF text by areas Jan 12, 2021

PavlosMelissinos changed the title ~~Extract PDF text by areas~~ Add functionality to extract PDF text from specific regions Jan 12, 2021

Update function docstring to reflect reality

133eee2

Pavlos Melissinos added 2 commits January 16, 2021 00:58

Make area-text function more robust

af01aa0

* Missing coordinates are now assumed 0
* Added new test case with missing coords

Add documentation for extracting text from regions

46b6aee

Merge branch 'master' of github.com:dotemacs/pdfboxing into extract-t…

bb25fc0

…ext-by-areas

dotemacs reviewed Oct 10, 2021


Make pdf area extraction eager with reduce

5d76933
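The commit title does not say why eagerness was needed. One common reason, assumed here, is that a lazy sequence built inside `with-open` is only realized after the resource has been closed. The sketch below shows that general pitfall with a plain `BufferedReader` instead of a PDF document; the function names are hypothetical.

```clojure
(require '[clojure.string :as str])

;; Lazy version: `map` returns an unrealized sequence, so most lines are
;; read only after `with-open` has already closed the reader.
(defn read-lines-lazy
  [text]
  (with-open [rdr (java.io.BufferedReader. (java.io.StringReader. text))]
    (map str/upper-case (line-seq rdr))))

;; Eager version: `reduce` consumes every line while the reader is open.
(defn read-lines-eager
  [text]
  (with-open [rdr (java.io.BufferedReader. (java.io.StringReader. text))]
    (reduce (fn [acc line] (conj acc (str/upper-case line)))
            []
            (line-seq rdr))))

(read-lines-eager "a\nb")
;; => ["A" "B"]
```

Forcing the lazy version, for example with `(doall (read-lines-lazy "a\nb"))`, fails with a stream-closed error because the second line is requested after the reader is gone.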
