
batchAnnotateFiles failing silently (and taking php thread with it) #7288

Open · James-THEA opened this issue May 3, 2024 · 1 comment

James-THEA commented May 3, 2024

Environment details

  • OS: Amazon Linux 2023
  • PHP version: 8.2.15
  • Package name and version: v1.9.0

Steps to reproduce

  1. Use this file:
    faraone2005 (1).pdf

  2. Request pages 1-10
    a. Two batches of 5 pages. It works if I do only 1-9.

More context:
I have a setup to parse PDFs that relies on the Google Cloud Vision API. It has worked for the past several months, and anecdotally this is a new issue. There is no error thrown, and the PHP thread just dies.

Moreover, the issue doesn't occur in all of my environments. Locally everything works (PHP 8.2.4), and it also works on an Amazon Beanstalk server; the failing environment is the one listed above. The issue exists on both newly and previously provisioned servers, so one possible fix is to find the discrepancy between the servers and correct it; even so, I still think this should be filed as a bug.

I have added memory-usage logging, and nothing looks excessive (peaks above 100 MB). Usage spikes on the first request using batchAnnotateFiles and the thread dies on the second request, so it may be spiking again; I strongly suspect a memory limit is the problem.
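To make the memory-limit hypothesis testable, one can log peak usage against PHP's configured memory_limit around each request. This is an illustrative sketch only: memoryLimitBytes and logMemory are assumed helper names, not part of the original code, and the parsing of memory_limit shorthand ("128M", "1G", "-1") follows PHP's documented ini byte-shorthand rules.

```php
<?php
// Hypothetical helpers for the memory-limit hypothesis (names are illustrative).

// Convert PHP's memory_limit ini shorthand ("128M", "1G", "-1") to bytes.
function memoryLimitBytes(): int {
    $limit = ini_get('memory_limit');
    if ($limit === '-1') {
        return PHP_INT_MAX;          // -1 means no limit
    }
    $value = (int) $limit;           // leading number, e.g. 128 from "128M"
    return match (strtoupper(substr($limit, -1))) {
        'G' => $value * 1024 ** 3,
        'M' => $value * 1024 ** 2,
        'K' => $value * 1024,
        default => $value,           // plain byte count
    };
}

// Log peak usage vs. the limit, e.g. before/after each batchAnnotateFiles call.
function logMemory(string $label): void {
    $peakMb  = memory_get_peak_usage(true) / 1024 ** 2;
    $limitMb = memoryLimitBytes() / 1024 ** 2;
    error_log(sprintf('%s: peak %.1f MB of %.1f MB limit', $label, $peakMb, $limitMb));
}
```

Calling logMemory('chunk 0 done') after each chunk would show whether the second request pushes peak usage toward the limit before the thread dies.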

I found this bug report: https://www.googlecloudcommunity.com/gc/AI-ML/Vision-AI-OCR-Internal-server-error-Failed-to-process-features/m-p/735441

It looks almost identical to my issue, but it is for Vision AI, so the fix is not applicable.

Code example

The code is lightly edited for brevity, but I can confirm the problem still reproduces with this version.

private function myFunction($filePath, int $startingPage, int $lastPage): FileUploadResponse {
        $pdfContent = \Storage::get($filePath);
        $inputConfig = (new InputConfig())
            ->setMimeType('application/pdf')
            ->setContent($pdfContent);
        $feature = (new Feature())->setType(Type::DOCUMENT_TEXT_DETECTION);

        $totalPages = range($startingPage + 1, $lastPage + 1);
        $pageChunks = array_chunk($totalPages, 5);
        $overallText = '';
        $maxLength = self::MAX_UPLOAD_TEXT_LENGTH;

        for ($chunk = 0; $chunk < count($pageChunks); $chunk++) {
            // Construct the client outside the try block so that the finally
            // block never calls close() on an undefined variable.
            $imageAnnotator = new ImageAnnotatorClient(['credentials' => 'redacted']);
            try {
                $pages = $pageChunks[$chunk];
                $annotateFileRequest = (new AnnotateFileRequest())
                    ->setInputConfig($inputConfig)
                    ->setFeatures([$feature])
                    ->setPages($pages);
                try {
                    $response = $imageAnnotator->batchAnnotateFiles([$annotateFileRequest]); // request dies here
                } catch (\Exception $e) {
                    Logger($e->getMessage()); // json_encode() on an Exception just yields "{}"
                    continue;                 // skip this chunk; $response below would be undefined
                }
                $responses = $response->getResponses()[0]->getResponses();

                for ($x = 0; $x < min(count($pages), count($responses)); $x++) {
                    $pageResponse = $responses[$x];
                    if ($pageResponse->hasError()) {
                        continue;
                    }
                    if ($pageResponse->getFullTextAnnotation() !== null) {
                        $overallText .= $pageResponse->getFullTextAnnotation()->getText();
                    }
                }
            } finally {
                $imageAnnotator->close();
                gc_collect_cycles();
            }
        }
        return new FileUploadResponse(text: $overallText);
    }
James-THEA (Author) commented:
Adding some follow up investigation:

  • If we decrease the batch size to 1-4 pages, it works
  • If we don't chunk by pages and make one request, it works.
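Given those findings, a possible stopgap is to shrink each batchAnnotateFiles request from 5 pages to 4. This is a sketch under that assumption; CHUNK_SIZE and the hard-coded page range are illustrative, not from the original code:

```php
<?php
// Workaround sketch: chunk the page list into batches of 4 instead of 5,
// matching the observation that 1-4 page batches succeed.
const CHUNK_SIZE = 4;   // illustrative constant, not in the original code

$startingPage = 0;
$lastPage = 9;          // pages 1-10, as in the reproduction steps

$totalPages = range($startingPage + 1, $lastPage + 1);      // [1, 2, ..., 10]
$pageChunks = array_chunk($totalPages, CHUNK_SIZE);

// Ten pages now produce three requests of 4, 4, and 2 pages.
```

Each element of $pageChunks would then be passed to setPages() exactly as in the original loop, at the cost of one extra API round-trip per ten pages.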
