Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add functionality to extract PDF text from specific regions #62

Open
wants to merge 9 commits into
base: master
Choose a base branch
from

Conversation

PavlosMelissinos
Copy link

@PavlosMelissinos PavlosMelissinos commented Jan 12, 2021

Description of your pull request

(Feel free to squash & merge and use this as a commit message!)

Add functionality in pdfboxing.text to extract pdf text from specific regions

Large PDF documents can contain too much content to be properly parsed at once. It would often be preferable to locate the regions that contain the information and extract text from those instead, increasing parsing accuracy and retaining the semantics at the same time.

It's a rather small change that should not introduce a significant maintenance overhead.

Addresses #61

Pull request checklist

Before submitting the PR make sure the following things have been done
(and denote this by checking the relevant checkboxes):

  • The code is consistent with Clojure style guide.
  • All code passes the linter (clj-kondo --lint src).
  • You've added tests (if possible) to cover your change(s).
  • All tests are passing.
  • The commits are consistent with the Git commit style guide.
  • You've updated the changelog (if adding/changing user-visible functionality).

Thanks!

Pavlos Melissinos added 4 commits January 12, 2021 01:40
I had added my change in the beginning of the changelog, incorrectly.

This commit fixes that mistake.
@PavlosMelissinos PavlosMelissinos changed the title Extract pdf text by areas Extract PDF text by areas Jan 12, 2021
@dotemacs
Copy link
Owner

Thanks for your work on this @PavlosMelissinos!

I'll try to merge soon. If you don't see any movement on this, do ping me to remind me.

Thanks again!

@PavlosMelissinos PavlosMelissinos changed the title Extract PDF text by areas Add functionality to extract PDF text from specific regions Jan 12, 2021
@PavlosMelissinos
Copy link
Author

PavlosMelissinos commented Jan 13, 2021

I just pushed a tiny commit that updates the docstring of the function!

I also realized that:

  1. extract-by-areas is not very robust without specs (if the user omits a coordinate it crashes) and
  2. I haven't added a section in the README.

How do you feel about defaulting missing coordinates to 0?
As in:

{:w 280
 :h 100}

should give the same result as:

{:x           0
 :y           0
 :w           280
 :h           100
 :page-number 0}

Pavlos Melissinos added 2 commits January 16, 2021 00:58
* Missing coordinates are now assumed 0
* Added new test case with missing coords
@PavlosMelissinos
Copy link
Author

I've been thinking about this for a while and, well, having area-text throw an exception if a coordinate is missing doesn't sit right with me. So I've:

  1. made 0 the default value of coordinates, according to the example in my previous comment and
  2. added docs for the function

I think I like it better this way but let me know what you think and I'll revert if needed...

@PavlosMelissinos
Copy link
Author

@dotemacs what do you say? 🙂

@PavlosMelissinos
Copy link
Author

So...? 😄

@dotemacs
Copy link
Owner

dotemacs commented Jun 2, 2021

Hey @PavlosMelissinos

Sorry for the delay.
Thanks for doing this!

Looking at it quickly, again, it looks good. But I want to look at it properly and try it before merging.

Thanks for your work :)

@dotemacs
Copy link
Owner

dotemacs commented Oct 9, 2021

I'll merge this this weekend and I'll resolve the merge conflict in the CHANGELOG.

Sorry for the delay

@PavlosMelissinos
Copy link
Author

PavlosMelissinos commented Oct 9, 2021

No pressure at all, I think we have enough stress in our lives already!
FYI I've resolved the conflict.


(defn extract-by-areas
"get text from specified areas of a PDF document"
[pdfdoc areas]
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @PavlosMelissinos

Can you tell me what was your thinking here?

Why is pdfdoc an argument on it's own and areas is a map?

Why can't it all go into a map?

My thinking is that if you're passing a map around, where all the arguments are in the map, you don't have to think about the position of your arguments.

Thanks

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's clearer this way. extract-by-areas is an operation on a pdf document and the coordinates are just parameters. Sure, they're crucial, but they don't have the same weight as the actual document.

I don't have very strong feelings about this though, it's your library 🙂

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I started off using mostly rest arguments for the functions in the library.

Then I accepted some PRs which used strict arity.

Let me think about this for a bit and see what option/approach to take, because once this is merged it'll be good to provide the least amount of surprise.

Thanks

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see. Yeah I could make it variadic if you'd prefer that. That would be consistent with split-pdf and other functions!

src/pdfboxing/text.clj Outdated Show resolved Hide resolved
@dotemacs
Copy link
Owner

No pressure at all, I think we have enough stress in our lives already! FYI I've resolved the conflict.

Thanks for the kind words and for your work here :)

I left some comments, let me know what you think.

Thanks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants