Handle duplicate documents during processing #2487

Open
JaCoB1123 opened this issue Feb 5, 2024 · 3 comments

Comments

@JaCoB1123 (Contributor) commented Feb 5, 2024

I have had duplicates in Docspell on some occasions, because the duplicate check only covers already processed documents. When I have a big document that takes a long time to process and re-run dsc to upload local files, the file is added to the processing queue again.

@eikek (Owner) commented Feb 5, 2024

Hi @JaCoB1123, this can be avoided by reducing the number of parallel processes to 1. Otherwise there will always be a race condition, which is of course more likely to occur with larger files.
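
For reference, a rough sketch of that setting in the joex config; the key path is an assumption, so verify it against the default configuration of your version:

```conf
# joex configuration (HOCON) -- key path assumed, check your version's default config
docspell.joex {
  scheduler {
    # run at most one processing job at a time, so the duplicate check
    # never races against another job still working on the same file
    pool-size = 1
  }
}
```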

This cannot "really" be fixed, imho, without doing some locking around obtaining the checksum of a file first, which would slow down processing quite a lot. It is often not so hard to make the ingestion part more robust and exclude duplicate files before they are even transferred to the server.
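
To illustrate the ingestion-side idea, here is a minimal Scala sketch that computes a file's SHA-256 checksum and asks the server whether the file is already known before uploading it. The checkfile route and auth header below are assumptions for the sketch, not the verified Docspell API:

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}
import java.nio.file.{Files, Path}
import java.security.MessageDigest

object PreUploadCheck {
  // Compute the SHA-256 checksum of a local file (hex encoded).
  // Reads the whole file into memory; a streaming digest is better for very large files.
  def sha256(file: Path): String = {
    val md = MessageDigest.getInstance("SHA-256")
    md.update(Files.readAllBytes(file))
    md.digest().map("%02x".format(_)).mkString
  }

  // Ask the server whether a file with this checksum already exists.
  // Endpoint, header and response shape are assumptions for this sketch.
  def existsOnServer(baseUrl: String, token: String, checksum: String): Boolean = {
    val client  = HttpClient.newHttpClient()
    val request = HttpRequest
      .newBuilder(URI.create(s"$baseUrl/api/v1/sec/checkfile/$checksum"))
      .header("X-Docspell-Auth", token)
      .GET()
      .build()
    val response = client.send(request, HttpResponse.BodyHandlers.ofString())
    // crude check: assume the JSON answer contains `"exists":true` when the file is known
    response.statusCode() == 200 && response.body().contains("\"exists\":true")
  }

  // Upload only the files the server does not know yet.
  def filesToUpload(baseUrl: String, token: String, files: List[Path]): List[Path] =
    files.filterNot(f => existsOnServer(baseUrl, token, sha256(f)))
}
```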

eikek added the "question" (Further information is requested) label on Feb 6, 2024
@JaCoB1123 (Contributor, issue author) commented

I guessed it wouldn't be that easy. I thought that if there were a table of running jobs that contained the hash, you could check it before running a job (or even make the hash the primary key?). That would probably break when a document consists of multiple files, though.

Another solution might be to check for duplicates after the job has run, right before inserting the result. The job's work might then have been wasted, but at least no duplicates would be introduced. Would that be a minor change and still handle most cases?
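
A minimal sketch of that idea: look for an existing item with the same checksum right before the processed result is stored. Plain JDBC, with made-up table and column names rather than Docspell's actual schema:

```scala
import java.sql.Connection

// Last-moment duplicate check: before persisting the processed item, look for
// an existing attachment with the same checksum in the same collective.
// Table and column names are illustrative only.
def insertIfNotDuplicate(conn: Connection, collective: String, checksum: String)(
    insert: Connection => Unit
): Boolean = {
  val stmt = conn.prepareStatement(
    "SELECT 1 FROM attachment_source WHERE collective_id = ? AND checksum = ? LIMIT 1"
  )
  stmt.setString(1, collective)
  stmt.setString(2, checksum)
  val rs = stmt.executeQuery()
  val duplicate = rs.next()
  rs.close()
  stmt.close()
  if (!duplicate) insert(conn) // only store the item when no duplicate was found
  !duplicate
}
```

Without a unique constraint this still leaves a small window between the check and the insert when two jobs finish at the same moment, but it would cover the common case.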

Decreasing the number of parallel processes to 1 is way easier, though. I'll try out how much it affects the time it takes to add documents.

@eikek (Owner) commented Feb 7, 2024

The hash is computed, but this can take a while, and there will be race conditions if multiple parallel processes are doing it. A job could be anything and could be tasked with processing multiple files (which means the check must be part of the task doing the work). But you are right, there are of course ways to prevent this more reliably.
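
One way to avoid the race without holding a long lock is to let the database decide: a claim table with a unique key on (collective, checksum) that each job inserts into once the hash is known, so only one of several parallel jobs wins. A sketch of the general pattern, not of Docspell's schema:

```scala
import java.sql.{Connection, SQLException}

// "Claim by unique constraint": each job inserts the checksum it is about to
// process into a claim table with PRIMARY KEY (collective_id, checksum).
// Only one of several parallel jobs succeeds; the others treat the file as a
// duplicate. Table and columns are illustrative only.
def claimChecksum(conn: Connection, collective: String, checksum: String, jobId: String): Boolean =
  try {
    val stmt = conn.prepareStatement(
      "INSERT INTO file_claim (collective_id, checksum, job_id) VALUES (?, ?, ?)"
    )
    stmt.setString(1, collective)
    stmt.setString(2, checksum)
    stmt.setString(3, jobId)
    stmt.executeUpdate()
    stmt.close()
    true // this job owns the checksum and may continue processing
  } catch {
    // a real implementation would check for the unique-violation SQLState (23505)
    // instead of treating every SQLException as a duplicate
    case _: SQLException => false
  }
```

A job that processes several files would have to claim each checksum separately, which matches the point that the check has to live inside the task doing the work.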

eikek removed the "question" (Further information is requested) label on Feb 7, 2024