Idempotency Concern in DocumentBatchingFunction #11

Open
NullPointer4096 opened this issue Apr 16, 2023 · 0 comments

Description:
I would like to bring attention to a potential issue with the DocumentBatchingFunction, which batches articles into a single JSON file when there are more than five articles in the S3 bucket. The function downloads all articles and existing batched article lists to the function runtime's local file system, deletes the copies in the bucket, and then uploads the newly generated article list as one JSON file. This multi-step workflow is not idempotent: if the function runtime crashes after deleting any article from the S3 bucket but before the new JSON file is uploaded, the retried invocation will no longer find that article in the bucket, and the local copy may not survive the crash. As a result, some articles can be lost.

Suggested Fix:
To address this issue, please consider uploading the newly concatenated JSON file before deleting any old articles. If the runtime crashes before the new batch file is uploaded, no documents have been lost, so the retry can run normally against the original documents. If the runtime crashes after the batch has been uploaded, the retry only needs to delete whichever old unbatched articles still exist. Currently, the batch name is generated from the current time, so a retry cannot tell whether its batch already exists. If the name is instead derived from context.aws_request_id, which is constant across retries, the program can check whether the batch has been uploaded.
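
To make the ordering concrete, here is a minimal sketch of the upload-first, delete-second flow. It is not the project's actual code. `InMemoryBucket`, `batch_documents`, the `articles/` and `batches/` prefixes, and the `sources` field inside the batch are all hypothetical names I chose for illustration; in the real function the bucket operations would be S3 calls, and `request_id` would come from `context.aws_request_id`. The sketch also leaves out the more-than-five threshold and the merging of existing batch files.

```python
import json


class InMemoryBucket:
    """Hypothetical stand-in for the S3 bucket so the sketch runs without AWS."""

    def __init__(self):
        self.objects = {}

    def list_keys(self, prefix):
        return sorted(k for k in self.objects if k.startswith(prefix))

    def exists(self, key):
        return key in self.objects

    def get(self, key):
        return self.objects[key]

    def put(self, key, body):
        self.objects[key] = body

    def delete(self, key):
        # Deleting a missing key is treated as a no-op, so repeats are harmless.
        self.objects.pop(key, None)


def batch_documents(bucket, request_id):
    """Batch all unbatched articles; safe to re-run with the same request_id."""
    # The batch name depends only on the request id, so a retry computes the
    # same key and can detect a batch uploaded by an earlier attempt.
    batch_key = f"batches/{request_id}.json"

    if bucket.exists(batch_key):
        # An earlier attempt already uploaded the batch. Reuse its recorded
        # source list (an assumption of this sketch) so only articles that
        # are really in the batch get deleted.
        source_keys = json.loads(bucket.get(batch_key))["sources"]
    else:
        source_keys = bucket.list_keys("articles/")
        if not source_keys:
            return None
        batch = {
            "sources": source_keys,
            "articles": [json.loads(bucket.get(k)) for k in source_keys],
        }
        # Step 1: upload the batch while every original still exists.
        bucket.put(batch_key, json.dumps(batch))

    # Step 2: delete the originals only after the batch is safely stored.
    for key in source_keys:
        bucket.delete(key)
    return batch_key
```

With this ordering, a crash at any point leaves every article either in its original object, in the uploaded batch, or in both. Recording the source keys inside the batch is one possible way to keep a retry from deleting articles that arrived after the first attempt; the maintainers may prefer a different mechanism.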

Thank you for considering this feedback. I hope my suggestion proves helpful in enhancing the reliability and idempotency of the DocumentBatchingFunction. Please don't hesitate to reach out if you have any questions or concerns.
