Description:
I would like to bring attention to a potential issue with the DocumentBatchingFunction, which batches articles into a single JSON file when there are more than five articles in the S3 bucket. The function downloads all articles and previously batched article lists to the function runtime's local file system, deletes the copies in the bucket, and then uploads the newly generated article list as a single JSON file. However, this multi-step workflow is not idempotent: if the runtime crashes after deleting articles from the S3 bucket but before the new JSON file is uploaded, those articles will not be found when the function retries. In other words, articles can be permanently lost.
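The failure mode above can be reproduced with a minimal sketch. This is not the actual function code; it uses an in-memory dict as a stand-in for the S3 bucket, and the key names and helper are hypothetical, but the delete-before-upload ordering matches the workflow described:

```python
import json

# Hypothetical in-memory stand-in for the S3 bucket (the real function
# would use boto3 against an actual bucket).
bucket = {f"articles/a{i}.json": {"id": i} for i in range(6)}

def batch_articles(bucket, crash_before_upload=False):
    """Current (non-idempotent) ordering: fetch, delete, then upload."""
    articles = [bucket[k] for k in sorted(bucket) if k.startswith("articles/")]
    for k in list(bucket):
        if k.startswith("articles/"):
            del bucket[k]          # the originals are gone at this point
    if crash_before_upload:
        raise RuntimeError("runtime crashed before the batch was uploaded")
    bucket["batches/batch.json"] = json.dumps(articles)

try:
    batch_articles(bucket, crash_before_upload=True)
except RuntimeError:
    pass

# On retry there is nothing left to batch: all six articles are lost.
print(len(bucket))  # 0
```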
Suggested Fix:
To address this issue, please consider uploading the newly concatenated JSON file before deleting any old articles. If a retry happens before the batched file is uploaded, no documents have been lost, so the workflow can simply start over with the original documents. If the batch has already been uploaded when the runtime crashes, the retry only needs to delete any remaining unbatched articles. Currently, the batch name is generated from the current time; if the name instead used context.aws_request_id, which is constant across retries, the function could check whether the batch has already been uploaded.
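The reordered workflow can be sketched in the same in-memory model. Again, the bucket dict, key names, and `request_id` parameter are hypothetical; `request_id` stands in for context.aws_request_id, which the issue notes is constant across retries:

```python
import json

# Hypothetical in-memory stand-in for the S3 bucket.
bucket = {f"articles/a{i}.json": {"id": i} for i in range(6)}

def batch_articles_idempotent(bucket, request_id, crash_after_upload=False):
    """Suggested ordering: upload the batch first, then delete originals.

    Naming the batch after the (retry-stable) request id lets a retry
    detect that the upload already happened and skip straight to cleanup.
    """
    batch_key = f"batches/{request_id}.json"
    if batch_key not in bucket:    # not yet uploaded on this request
        articles = [bucket[k] for k in sorted(bucket)
                    if k.startswith("articles/")]
        bucket[batch_key] = json.dumps(articles)
    if crash_after_upload:
        raise RuntimeError("crashed after upload, before deletes")
    for k in list(bucket):         # safe to delete: batch is durable
        if k.startswith("articles/"):
            del bucket[k]

try:
    batch_articles_idempotent(bucket, "req-123", crash_after_upload=True)
except RuntimeError:
    pass

# The retry sees the batch already exists and just finishes the cleanup.
batch_articles_idempotent(bucket, "req-123")
print(list(bucket))  # ['batches/req-123.json']
```

Even if the crash happens mid-delete, every article is either still in the bucket or already captured in the uploaded batch, so no data is lost.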
Thank you for considering this feedback. I hope my suggestion proves helpful in enhancing the reliability and idempotency of the DocumentBatchingFunction. Please don't hesitate to reach out if you have any questions or concerns.