Copying multiple files to watchedFolder causes app to grab zero byte files #1214

Closed
seakrebel opened this issue May 15, 2024 · 6 comments · Fixed by #1282

@seakrebel

seakrebel commented May 15, 2024

I copied around 200 PDFs into the watchedFolder and noticed more than 350 PDFs in the processing folder, which I found weird. Then I saw that many of the PDFs were "duplicated" and some of them were zero bytes in size.

As I suspected, the app was starting to process files before they were completely copied over.
I confirmed this by copying only 20 PDFs into the watchedFolder: same behavior.

I wish there were a way to tell the app to wait a bit before processing a file, similar to the PAPERLESS_CONSUMER_INOTIFY_DELAY variable in paperless-ngx.
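
To illustrate the kind of delay meant here: one way to approximate it would be to skip any file that was modified within the last few seconds. This is only a minimal sketch. The class `QuietPeriodCheck`, the method `isQuiet`, and the 5-second quiet period are hypothetical names and values, not part of the app or of paperless-ngx.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper, not part of the app: treats a file as ready only
// after it has been left untouched for a configurable quiet period.
public class QuietPeriodCheck {

    // Returns true if the file is non-empty and its last-modified time is
    // at least quietPeriod before the given instant.
    static boolean isQuiet(Path file, Duration quietPeriod, Instant now) throws IOException {
        if (Files.size(file) == 0) {
            return false; // zero-byte files are never considered ready
        }
        Instant lastModified = Files.getLastModifiedTime(file).toInstant();
        return !lastModified.plus(quietPeriod).isAfter(now);
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("quiet-check", ".pdf");
        Files.write(file, new byte[] {1, 2, 3});
        Instant now = Instant.now();

        // Pretend the last write happened 30 seconds ago: the file is quiet.
        Files.setLastModifiedTime(file, FileTime.from(now.minusSeconds(30)));
        System.out.println(isQuiet(file, Duration.ofSeconds(5), now));

        // Pretend the file was written just now: it is still inside the quiet period.
        Files.setLastModifiedTime(file, FileTime.from(now));
        System.out.println(isQuiet(file, Duration.ofSeconds(5), now));

        Files.delete(file);
    }
}
```

This assumes the copying tool keeps updating the modification time while it writes, which may not hold for every tool or network share.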

The only workaround I have found so far is to stop the container, copy over the files, and then start the container again.

@Frooodle
Member

Good callout, and a real bug.
I will work on this over the weekend.

@kkdlau
Contributor

kkdlau commented May 18, 2024

Hi @Frooodle, I think this is a good issue to assign to someone like me who wishes to contribute to open source 😃

My initial idea for solving this is to update collectFilesForProcessing so that it only collects files that are fully copied, either by checking whether the size is still growing or by using an OS-level feature (e.g. lsof).

Let me know if you have any comments on this approach 😆

@Frooodle
Member

@kkdlau how's this going?

@seakrebel
Author

Here is an example. I haven't tested it, and I'm not a Java expert, but it could point in the right direction.

PipelineDirectoryProcessor.java

// [...]
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.TimeUnit;
import java.util.stream.Stream;
// [...]
public class PipelineDirectoryProcessor {
    // [...]

    private static final long STABILITY_CHECK_DELAY = 1000; // 1 second between size checks
    private static final int STABILITY_CHECK_COUNT = 5; // Check 5 times

    private File[] collectFilesForProcessing(Path dir, Path jsonFile, PipelineOperation operation) throws IOException {
        try (Stream<Path> paths = Files.list(dir)) {
            if ("automated".equals(operation.getParameters().get("fileInput"))) {
                // Skip directories, the pipeline JSON itself, and files that are still being written
                return paths.filter(path -> !Files.isDirectory(path) && !path.equals(jsonFile) && isFileStable(path))
                            .map(Path::toFile)
                            .toArray(File[]::new);
            } else {
                String fileInput = (String) operation.getParameters().get("fileInput");
                return new File[] { new File(fileInput) };
            }
        }
    }

    // Must not throw checked exceptions, because it is called from the filter lambda above
    private boolean isFileStable(Path path) {
        try {
            long initialSize = Files.size(path);
            for (int i = 0; i < STABILITY_CHECK_COUNT; i++) {
                TimeUnit.MILLISECONDS.sleep(STABILITY_CHECK_DELAY);
                if (Files.size(path) != initialSize) {
                    return false; // Size changed, so the copy is still in progress
                }
            }
            return initialSize > 0; // Also ensuring the file is not zero bytes
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // Preserve the interrupt for the caller
            return false;
        } catch (IOException e) {
            return false; // File vanished or is unreadable; leave it for a later scan
        }
    }
    // [...]
}
// [...]
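
One caveat with the example above: `isFileStable` sleeps inside the stream filter, so files are checked one after another and a folder of 200 files would take roughly 200 × 5 seconds to scan. A possible variation is to record every file size first, wait once, and then compare. The sketch below is standalone and uses hypothetical names (`BatchStabilityCheck`, `snapshotSizes`, `stableFiles`, `collectStable`); it is not the fix that was eventually merged.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical variation: one shared wait for the whole directory
// instead of one wait per file.
public class BatchStabilityCheck {

    // Records the current size of every regular file directly inside dir.
    static Map<Path, Long> snapshotSizes(Path dir) throws IOException {
        Map<Path, Long> sizes = new HashMap<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path entry : entries) {
                if (Files.isRegularFile(entry)) {
                    sizes.put(entry, Files.size(entry));
                }
            }
        }
        return sizes;
    }

    // Keeps files that appear in both snapshots with the same non-zero size.
    static List<Path> stableFiles(Map<Path, Long> before, Map<Path, Long> after) {
        List<Path> stable = new ArrayList<>();
        for (Map.Entry<Path, Long> entry : after.entrySet()) {
            Long previous = before.get(entry.getKey());
            if (previous != null && previous.equals(entry.getValue()) && entry.getValue() > 0) {
                stable.add(entry.getKey());
            }
        }
        return stable;
    }

    // Takes two snapshots delayMillis apart and returns the files that did not change.
    static List<Path> collectStable(Path dir, long delayMillis) throws IOException, InterruptedException {
        Map<Path, Long> before = snapshotSizes(dir);
        Thread.sleep(delayMillis);
        return stableFiles(before, snapshotSizes(dir));
    }

    public static void main(String[] args) throws Exception {
        Path dir = Files.createTempDirectory("watched");
        Files.write(dir.resolve("done.pdf"), new byte[] {1, 2, 3});
        Files.createFile(dir.resolve("empty.pdf")); // zero bytes, should be skipped
        System.out.println(collectStable(dir, 200)); // only done.pdf is expected
    }
}
```

With this shape the total wait is one delay per scan regardless of how many files are present, at the cost of holding two size maps in memory.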

@kkdlau
Contributor

kkdlau commented May 24, 2024

> @kkdlau how's this going?

Hi, I was busy with my full-time work 😞
But I already have a draft of the PR.
I just need to go through a couple of regression tests to make sure it doesn't break existing features.

Will create a PR tonight (APAC time) 👍🏻

@kkdlau
Contributor

kkdlau commented May 24, 2024

> Here is an example. Haven't tested it. Also not an java expert. But could be leading into right direction.
>
> PipelineDirectoryProcessor.java
> // [...]

Thanks for the idea 👍🏻
My draft is quite similar, except for the isFileStable implementation.
I will share more details when I open the PR.

kkdlau added a commit to kkdlau/Stirling-PDF that referenced this issue May 24, 2024
kkdlau added a commit to kkdlau/Stirling-PDF that referenced this issue May 29, 2024
Frooodle added a commit that referenced this issue May 30, 2024
#1214 Only take pdf that are good for processing