Handle duplicate documents during processing #2487

Open
JaCoB1123 opened this issue Feb 5, 2024 · 3 comments

Comments

@JaCoB1123 (Contributor) commented Feb 5, 2024

I have had duplicates in Docspell on some occasions, because the duplicate check only covers already processed documents. When I have a big document that takes a long time to process and re-run dsc to upload local files, the file is added to the processing queue again.

@eikek (Owner) commented Feb 5, 2024

Hi @JaCoB1123, this can be avoided by reducing the number of parallel processes to 1. Otherwise there will always be a race condition, which is of course more likely to occur with larger files.
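
For reference, a rough sketch of that setting in the joex config; the key path is an assumption, so verify it against the default configuration of your version:

```conf
# joex configuration (HOCON) -- key path assumed, check your version's default config
docspell.joex {
  scheduler {
    # run at most one processing job at a time, so the duplicate check
    # never races against another job still working on the same file
    pool-size = 1
  }
}
```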

This cannot "really" be fixed, imho, without doing some locking around obtaining the checksum of a file first, which would slow down processing quite a lot. It is often not so hard to make the ingestion part more robust and exclude duplicate files before they are even transferred to the server.
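
To illustrate the ingestion-side idea, here is a minimal Scala sketch that computes a file's SHA-256 checksum and asks the server whether the file is already known before uploading it. The checkfile route and auth header below are assumptions for the sketch, not the verified Docspell API:

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}
import java.nio.file.{Files, Path}
import java.security.MessageDigest

object PreUploadCheck {
  // Compute the SHA-256 checksum of a local file (hex encoded).
  // Reads the whole file into memory; a streaming digest is better for very large files.
  def sha256(file: Path): String = {
    val md = MessageDigest.getInstance("SHA-256")
    md.update(Files.readAllBytes(file))
    md.digest().map("%02x".format(_)).mkString
  }

  // Ask the server whether a file with this checksum already exists.
  // Endpoint, header and response shape are assumptions for this sketch.
  def existsOnServer(baseUrl: String, token: String, checksum: String): Boolean = {
    val client  = HttpClient.newHttpClient()
    val request = HttpRequest
      .newBuilder(URI.create(s"$baseUrl/api/v1/sec/checkfile/$checksum"))
      .header("X-Docspell-Auth", token)
      .GET()
      .build()
    val response = client.send(request, HttpResponse.BodyHandlers.ofString())
    // crude check: assume the JSON answer contains `"exists":true` when the file is known
    response.statusCode() == 200 && response.body().contains("\"exists\":true")
  }

  // Upload only the files the server does not know yet.
  def filesToUpload(baseUrl: String, token: String, files: List[Path]): List[Path] =
    files.filterNot(f => existsOnServer(baseUrl, token, sha256(f)))
}
```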

eikek added the "question" (Further information is requested) label on Feb 6, 2024
@JaCoB1123 (Contributor, issue author) commented

I guessed it wouldn't be that easy. I thought that if there were a table of running jobs that contained the hash, you could check it before running a job (or even make the hash the primary key?). That would probably break when a document consists of multiple files, though.

Another solution might be to check for duplicates after the job has run, right before inserting the result. The job's work might then have been wasted, but at least no duplicates would be introduced. Would that be a minor change and still handle most cases?
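
A minimal sketch of that idea: look for an existing item with the same checksum right before the processed result is stored. Plain JDBC, with made-up table and column names rather than Docspell's actual schema:

```scala
import java.sql.Connection

// Last-moment duplicate check: before persisting the processed item, look for
// an existing attachment with the same checksum in the same collective.
// Table and column names are illustrative only.
def insertIfNotDuplicate(conn: Connection, collective: String, checksum: String)(
    insert: Connection => Unit
): Boolean = {
  val stmt = conn.prepareStatement(
    "SELECT 1 FROM attachment_source WHERE collective_id = ? AND checksum = ? LIMIT 1"
  )
  stmt.setString(1, collective)
  stmt.setString(2, checksum)
  val rs = stmt.executeQuery()
  val duplicate = rs.next()
  rs.close()
  stmt.close()
  if (!duplicate) insert(conn) // only store the item when no duplicate was found
  !duplicate
}
```

Without a unique constraint this still leaves a small window between the check and the insert when two jobs finish at the same moment, but it would cover the common case.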

Decreasing the number of parallel processes to 1 is way easier, though. I'll try out how much it affects the time it takes to add documents.

@eikek (Owner) commented Feb 7, 2024

The hash is computed, but this can take a while, and there will be race conditions if multiple parallel processes are doing it. A job could be anything and could be tasked with processing multiple files (which means the check must be part of the task doing the work). But you are right, there are of course ways to prevent this more reliably.
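
One way to avoid the race without holding a long lock is to let the database decide: a claim table with a unique key on (collective, checksum) that each job inserts into once the hash is known, so only one of several parallel jobs wins. A sketch of the general pattern, not of Docspell's schema:

```scala
import java.sql.{Connection, SQLException}

// "Claim by unique constraint": each job inserts the checksum it is about to
// process into a claim table with PRIMARY KEY (collective_id, checksum).
// Only one of several parallel jobs succeeds; the others treat the file as a
// duplicate. Table and columns are illustrative only.
def claimChecksum(conn: Connection, collective: String, checksum: String, jobId: String): Boolean =
  try {
    val stmt = conn.prepareStatement(
      "INSERT INTO file_claim (collective_id, checksum, job_id) VALUES (?, ?, ?)"
    )
    stmt.setString(1, collective)
    stmt.setString(2, checksum)
    stmt.setString(3, jobId)
    stmt.executeUpdate()
    stmt.close()
    true // this job owns the checksum and may continue processing
  } catch {
    // a real implementation would check for the unique-violation SQLState (23505)
    // instead of treating every SQLException as a duplicate
    case _: SQLException => false
  }
```

A job that processes several files would have to claim each checksum separately, which matches the point that the check has to live inside the task doing the work.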

eikek removed the "question" (Further information is requested) label on Feb 7, 2024