New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Controller][all] Do not truncate Kafka topic immediately for fatal d… #937
[Controller][all] Do not truncate Kafka topic immediately for fatal d… #937
Conversation
…ata validation error before EOP When the ingestion of a new store version is failed, today, we truncate the Kafka topic of the store version by updating its retention time to a small value (15 seconds), specified by DEPRECATED_TOPIC_RETENTION_MS. This is fine for regular failures but for fatal data validation errors, where it indicates critical issues happened during the batch push period (before EOP), truncating the topic too early can prevent us from finding the root cause, simply because the kafka data is gone. This change adjusts the retention time (to 5 days) when fatal data validation errors is identified, so that we can have enough time to investigate.
Thanks Lei! Did a quick scan on this diff. I wonder if these topics would eventually be picked up by TopicCleanUpService? Although they would become empty after 5 days, I am worried if these topics will be leaking since then and ignored by TopicCleanUpService even if the store has breached "minNumberOfUnusedKafkaTopicsToPreserve". I expected there should be some changes in "VeniceHelixAdmin#isTopicTruncatedBasedOnRetention" or "TopicManager#isRetentionBelowTruncatedThreshold". Please don't hesitate to correct me though. |
Thanks Min for the review, you are right on this point. But it seems not quite easy to integrate with the today's If we want to implement the Another simpler way is to temporarily increase |
@lluwm
Would the above strategy work? |
@gaojieliu, thanks for the suggestions and yeah I think it should work. For a failed version caused by fatal-DIV error that needs data preservation, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good overall, left a minor comment.
services/venice-controller/src/main/java/com/linkedin/venice/controller/Admin.java
Outdated
Show resolved
Hide resolved
…kaTopic interface.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the change!
@gaojieliu, thanks a lot for the review. Proceed to merge the code. |
…ata validation error before EOP
When the ingestion of a new store version is failed, today, we truncate the Kafka topic of the store version by updating its retention time to a small value (15 seconds), specified by
DEPRECATED_TOPIC_RETENTION_MS
.This is fine for regular failures but for fatal data validation errors, where it indicates critical issues happened during the batch push period (before EOP), truncating the topic too early can prevent us from finding the root cause, simply because the kafka data is gone.
This change adjusts the retention time (to 2 days) when fatal data validation errors is identified, so that we can have enough time to investigate.
Meanwhile, there is additional added to the
TopicCleanupService
in the following way, even if a topic's retention time is >DEPRECATED_TOPIC_MAX_RETENTION_MS
, it can still be consider for deletion, if:DEPRECATED_TOPIC_RETENTION_MS
) days.venice_system_store_push_job_details_store
) and check it's already more than 2 days (DEPRECATED_TOPIC_RETENTION_MS
), if yes, delete it.How was this PR tested?
Does this PR introduce any user-facing changes?