Use multiprocessing to process PDF pages in parallel #20

Open
vinayak-mehta opened this issue Jul 5, 2019 · 17 comments
Labels: enhancement (New feature or request)

@vinayak-mehta
Member

>>> camelot.read_pdf('filename.pdf', pages='all', parallel=True)

We could try and use all cores present on the machine using multiprocessing. More ideas are welcome.
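One way the idea could look from the outside, as a minimal sketch rather than the library's API: split the page range into one contiguous chunk per core and hand each chunk to a worker process through the existing `pages` argument of `camelot.read_pdf`. The helper names `chunk_pages` and `read_pdf_parallel` are hypothetical, and the caller is assumed to already know the page count (`n_pages`).

```python
import os
from concurrent.futures import ProcessPoolExecutor


def chunk_pages(n_pages, n_chunks):
    """Split pages 1..n_pages into at most n_chunks contiguous 'start-end' strings."""
    if n_pages < 1:
        return []
    n_chunks = max(1, min(n_chunks, n_pages))
    size, extra = divmod(n_pages, n_chunks)
    chunks, start = [], 1
    for i in range(n_chunks):
        # The first `extra` chunks get one additional page each.
        end = start + size - 1 + (1 if i < extra else 0)
        chunks.append(f"{start}-{end}")
        start = end + 1
    return chunks


def _extract(job):
    """Worker: read one page range and return plain DataFrames."""
    path, pages = job
    import camelot  # imported inside the worker process

    tables = camelot.read_pdf(path, pages=pages)
    return [table.df for table in tables]  # DataFrames pickle cleanly


def read_pdf_parallel(path, n_pages, workers=None):
    """Hypothetical wrapper: one page chunk per worker, results in page order."""
    workers = workers or os.cpu_count() or 1
    jobs = [(path, pages) for pages in chunk_pages(n_pages, workers)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        per_chunk = list(pool.map(_extract, jobs))
    return [df for chunk in per_chunk for df in chunk]
```

Call `read_pdf_parallel("filename.pdf", n_pages=...)` from under an `if __name__ == "__main__":` guard so that worker processes can re-import the module safely. Returning `table.df` instead of the table objects is a design choice that keeps the data sent between processes simple.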

@satheeshkatipomu

Hi @vinayak-mehta ,

I had thought of implementing this too. My suggestions for asynchronous processing of pages are dramatiq or celery.

@jontis

jontis commented Sep 21, 2019

I'm doing this with dask, but I only chose it out of habit.

@selcukusta

selcukusta commented Nov 7, 2019

Has there been any improvement on this? I have a file with only one page. The page has one table (25 rows x 13 columns). The read_pdf function takes 10 seconds; after that, to_excel takes only 100-150 ms. I think 10 seconds is too long. Am I wrong?
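Timings like the ones above can be reproduced with a small wrapper around `time.perf_counter`. This is only a sketch; `timed` is a hypothetical helper, not part of camelot.

```python
import time


def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


# Possible usage, assuming camelot is installed and the file exists:
#   import camelot
#   tables, t_read = timed(camelot.read_pdf, "filename.pdf", pages="1")
#   _, t_excel = timed(tables[0].df.to_excel, "out.xlsx")
#   print(f"read_pdf: {t_read:.2f}s, to_excel: {t_excel:.3f}s")
```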

@NixBiks

NixBiks commented Nov 11, 2019

@selcukusta I think this is more about running multiple pages at the same time rather than speeding up the extraction of a single page.

But does anyone have a solution for multiple pages in parallel?

@vinayak-mehta
Member Author

> @selcukusta I think this is more about running multiple pages at the same time rather than speeding up the extraction of a single page.

Yes!

> But does anyone have a solution for multiple pages in parallel?

Using multiprocessing, we should be able to distribute the pages across all cores and process them in parallel.

@NixBiks

NixBiks commented Nov 11, 2019

I get this though

objc[53475]: +[__NSPlaceholderDate initialize] may have been in progress in another thread when fork() was called. We cannot safely call it or ignore it in the fork() child process. Crashing instead. Set a breakpoint on objc_initializeAfterForkError to debug.

Oh, what is the difference between https://github.com/atlanhq/camelot and https://github.com/camelot-dev/camelot ? I didn't notice there were two repos until now...
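The `objc ... fork()` crash quoted in this comment is a macOS behaviour that shows up when worker processes are created with `fork()`. A workaround that is often suggested is to start workers with the `spawn` method instead (the default on macOS since Python 3.8), or to set the environment variable `OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES`. The sketch below assumes the first option; `spawn_context` and `map_with_spawn` are hypothetical helper names.

```python
import multiprocessing as mp


def spawn_context():
    """Return a context whose workers start as fresh interpreters, not fork() copies."""
    return mp.get_context("spawn")


def map_with_spawn(func, items, workers=2):
    """Apply func to each item in spawned worker processes.

    func must be defined at module top level, because spawned workers
    re-import the module instead of inheriting the parent's memory.
    """
    with spawn_context().Pool(processes=workers) as pool:
        return pool.map(func, items)
```

As with any `spawn`-based pool, the call to `map_with_spawn` should sit under an `if __name__ == "__main__":` guard in the calling script.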

@selcukusta

> @selcukusta I think this is more about running multiple pages at the same time rather than speeding up the extraction of a single page.
>
> But does anyone have a solution for multiple pages in parallel?

Yeah, I know. Actually, it's related to that, but that issue was closed with a reference to this one.

@rawsh-bt

rawsh-bt commented Jun 26, 2020

Does anyone have an update? I've tried inheriting from PageHandler to process pages across multiple threads / cores, and I've tried multithreading the processing of multiple PDFs, but I'm running into a Ghostscript error (it seems it's not thread-safe?)
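If Ghostscript is indeed not thread-safe, as this comment suspects, one option for the multiple-PDF case is to give each file its own process instead of its own thread, so no interpreter state is shared. The following is a sketch under that assumption; `extract_one` and `extract_many` are hypothetical names, and the `worker` and `executor_cls` parameters exist only so that the scheduling logic can be tried without camelot.

```python
from concurrent.futures import ProcessPoolExecutor, as_completed


def extract_one(path):
    """Run in a worker process: read every page of one PDF."""
    import camelot  # assumption: camelot is installed where the workers run

    return [table.df for table in camelot.read_pdf(path, pages="all")]


def extract_many(paths, worker=extract_one, executor_cls=ProcessPoolExecutor,
                 max_workers=None):
    """Map each path to its list of DataFrames, or to the exception it raised."""
    results = {}
    with executor_cls(max_workers=max_workers) as pool:
        futures = {pool.submit(worker, path): path for path in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:  # one failing PDF should not stop the rest
                results[path] = exc
    return results
```

Storing the exception per file, rather than re-raising it, makes it easier to see which PDFs failed in a large batch.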

@vinayak-mehta vinayak-mehta added this to Backlog in TODO! Jul 9, 2020
@vinayak-mehta vinayak-mehta moved this from Backlog to To do in TODO! Jul 9, 2020
@phoewass

phoewass commented Sep 2, 2020

I implemented a parallel-processing layer on top of camelot.read_pdf using the multiprocessing library.
I ran into a couple of pitfalls while doing it, so I can help with this if I may.

@vinayak-mehta
Member Author

@phoewass That would be awesome if you're still interested!

@RickyGunawan09

Can anyone tell me how to use multiprocessing in camelot? Or is this issue still in progress?

@phoewass

phoewass commented May 1, 2021

Hi all. Sorry it took me a while to publish the PR even though the code was already available.
Now that the PR is up for review, I'm looking forward to your feedback.

@vinayak-mehta
Member Author

👀

@Siddharth1India

Any update on this? My PDFs are hundreds of pages long and I could really use this feature.

@mlbrothers

@phoewass @vinayak-mehta Is this feature part of the library now? If not, is there any way I can use multiprocessing to read a multipage PDF?

@deepakagrawal

Any updates on this feature?

@bosd

bosd commented Mar 22, 2024

Hey!

As camelot is dead, we are trying to build a maintained fork at pypdf_table_extraction.

There is a discussion about this in:

py-pdf#8 (reply in thread)
