Skip items that cause errors #58

Jalkhov · 2020-10-21T11:46:31Z

Assuming that, for now, there is no way to process floating images, I would like to make a small recommendation for future updates as an enhancement. A parameter that allows omitting the images or objects that cause errors would be very useful: if the incoming PDF file has unprocessable elements, they can be skipped and the output file produced without them. The programmer is then responsible for explaining the omissions to the user.

dothinking · 2020-10-22T08:34:10Z

This makes sense. How about setting "omitting the images or objects that cause errors" as the default behavior, and showing log information when this happens? Thanks for your suggestion.
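The "skip by default and log" behavior proposed here can be sketched in a few lines. This is only an illustration of the idea, not pdf2docx code: `process_all`, the `process` callback, and the logger name `skip_demo` are hypothetical names chosen for the example.

```python
import logging

logger = logging.getLogger("skip_demo")  # hypothetical logger name


def process_all(items, process):
    """Apply `process` to each item; skip and log the ones that raise."""
    results, skipped = [], []
    for index, item in enumerate(items):
        try:
            results.append(process(item))
        except Exception as exc:  # deliberately broad: any failure means "skip"
            logger.warning("Skipping item %d: %s", index, exc)
            skipped.append((index, exc))
    return results, skipped
```

Returning the skipped items alongside the results lets the calling program tell its user what was left out, which is the responsibility the original request places on the programmer.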

Jalkhov · 2020-10-22T10:58:42Z

> How about setting "omitting the images or objects that cause errors" as the default behavior, and showing log information when this happens?

I think that's great, thanks for taking it into consideration. I'll be using this library a lot, so you'll see me around often; it's the best and easiest to use, and I feel it has a lot of potential for more features.

I think the log could show the following information about each omitted item: page and type (table, image...). Also, the respective blank space should somehow be left where the element was, so that even if elements have been omitted, neither the order nor the number of pages changes.
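As a rough sketch of what such a log entry might carry, the record below stores the page, the element type, and the bounding box whose area would be left blank. `SkippedItem`, its fields, and the message format are all assumptions made for illustration; none of them come from pdf2docx.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SkippedItem:
    """Hypothetical record describing one omitted element."""
    page: int                                # 1-based page number
    kind: str                                # element type, e.g. "image"
    bbox: Tuple[float, float, float, float]  # (x0, y0, x1, y1) in page units

    def log_line(self) -> str:
        # Report where the element was and how much blank area is reserved.
        x0, y0, x1, y1 = self.bbox
        return (f"page {self.page}: skipped {self.kind} "
                f"at ({x0:g}, {y0:g}, {x1:g}, {y1:g}); "
                f"reserved {x1 - x0:g} x {y1 - y0:g} blank area")
```

Keeping the bounding box in the record is what would let a converter reserve the same area in the output, so that later content does not shift.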

dothinking · 2020-10-22T14:08:22Z

> I'll be using this library a lot so you'll see me around a lot, it's the best and easiest to use and I feel it has a lot of potential for more features.

This library is rule-based: it maps PDF objects to docx, e.g. some text surrounded by horizontal/vertical lines -> a table in docx. The limited rules can never accommodate all cases, so there is definitely a lot of potential for features/enhancements. You're welcome, and thanks for helping it grow so that it can benefit more people.
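To make the "text surrounded by lines -> table" rule concrete, here is a toy version of such a check. It is a simplified sketch, not the library's actual rule: `is_enclosed_by_lines`, the tuple-based rectangle and line types, and the tolerance `tol` are assumptions, and the y axis is assumed to grow downward.

```python
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1), y grows downward
Line = Tuple[float, float, float, float]  # (x0, y0, x1, y1), axis-aligned


def is_enclosed_by_lines(text: Rect, lines: List[Line], tol: float = 1.0) -> bool:
    """Return True if the text box has a spanning line on all four sides."""
    tx0, ty0, tx1, ty1 = text
    above = below = left = right = False
    for x0, y0, x1, y1 in lines:
        lx0, lx1 = min(x0, x1), max(x0, x1)
        ly0, ly1 = min(y0, y1), max(y0, y1)
        if abs(y0 - y1) <= tol:  # horizontal line
            spans = lx0 <= tx0 + tol and lx1 >= tx1 - tol
            if spans and ly1 <= ty0 + tol:
                above = True
            if spans and ly0 >= ty1 - tol:
                below = True
        elif abs(x0 - x1) <= tol:  # vertical line
            spans = ly0 <= ty0 + tol and ly1 >= ty1 - tol
            if spans and lx1 <= tx0 + tol:
                left = True
            if spans and lx0 >= tx1 - tol:
                right = True
    return above and below and left and right
```

A real document breaks a rule like this easily (dashed borders, missing sides, rotated text), which is why a limited rule set can never cover every case.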

> Page, type (table, image...), and that somehow the respective blank space is left where the element was

Good point. Just one comment: since PDF is a layout format for printing, what we extract from a PDF is either text, an image, or a shape (like a line or a rectangle), along with their coordinates on the page. So, of course, the blank space is preserved, but regarding the type, I'm afraid it can report images only, since no 'table' exists in PDF.

Jalkhov · 2020-10-22T16:36:48Z

> You're welcome, and thanks for helping it grow so that it can benefit more people.

Thanks. I will be testing different files with different contents to see how the library reacts to each one, and if there is any failure I will report it here (in issues) with detailed information.

> I'm afraid it can report images only, since no 'table' exists in PDF.

When I said "table" I meant things like this:

Although I just realized that those count as simple lines; sorry, that was a bad way to refer to them. Either way, the idea is to report the type of element that has been omitted. Honestly, I don't know what type of element to mention apart from an image, but the idea is already clear, hehe.

dothinking · 2020-12-31T18:24:28Z

I haven't had time for this project for a long while. A new version was finally released just now, on the first day of the New Year. :) It improves image extraction, e.g. floating images, and paragraph formatting. I hope to make progress on this issue.

pip install --upgrade pdf2docx

dothinking self-assigned this Oct 22, 2020

dothinking added the enhancement New feature or request label Oct 22, 2020

dothinking added a commit that referenced this issue Dec 31, 2020

tweak Converter #58 #67

dcfdd56

dothinking closed this as completed Feb 24, 2021
