Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Controller][all] Do not truncate Kafka topic immediately for fatal d… #937

Conversation

lluwm
Copy link
Contributor

@lluwm lluwm commented Apr 9, 2024

…ata validation error before EOP

When the ingestion of a new store version is failed, today, we truncate the Kafka topic of the store version by updating its retention time to a small value (15 seconds), specified by DEPRECATED_TOPIC_RETENTION_MS.

This is fine for regular failures but for fatal data validation errors, where it indicates critical issues happened during the batch push period (before EOP), truncating the topic too early can prevent us from finding the root cause, simply because the kafka data is gone.

This change adjusts the retention time (to 2 days) when fatal data validation errors is identified, so that we can have enough time to investigate.

Meanwhile, there is additional added to the TopicCleanupService in the following way, even if a topic's retention time is > DEPRECATED_TOPIC_MAX_RETENTION_MS, it can still be consider for deletion, if:

  • If topic retention is 2 (DEPRECATED_TOPIC_RETENTION_MS) days.
  • If The topic is a version topic.
  • Get topic creation time (from venice_system_store_push_job_details_store) and check it's already more than 2 days (DEPRECATED_TOPIC_RETENTION_MS), if yes, delete it.

How was this PR tested?

  • CI
  • Added new integration and unit tests

Does this PR introduce any user-facing changes?

  • No. You can skip the rest of this section.
  • Yes. Make sure to explain your proposed changes and call out the behavior change.

…ata validation error before EOP

When the ingestion of a new store version is failed, today, we truncate the Kafka topic of the
store version by updating its retention time to a small value (15 seconds), specified by
DEPRECATED_TOPIC_RETENTION_MS.

This is fine for regular failures but for fatal data validation errors, where it indicates critical issues
happened during the batch push period (before EOP), truncating the topic too early can prevent us from
finding the root cause, simply because the kafka data is gone.

This change adjusts the retention time (to 5 days) when fatal data validation errors is identified, so
that we can have enough time to investigate.
@huangminchn
Copy link
Contributor

Thanks Lei! Did a quick scan on this diff. I wonder if these topics would eventually be picked up by TopicCleanUpService? Although they would become empty after 5 days, I am worried if these topics will be leaking since then and ignored by TopicCleanUpService even if the store has breached "minNumberOfUnusedKafkaTopicsToPreserve".

I expected there should be some changes in "VeniceHelixAdmin#isTopicTruncatedBasedOnRetention" or "TopicManager#isRetentionBelowTruncatedThreshold". Please don't hesitate to correct me though.

@lluwm
Copy link
Contributor Author

lluwm commented Apr 10, 2024

Thanks Min for the review, you are right on this point. But it seems not quite easy to integrate with the today's TopicCleanUpService and let me explain why here, the new requirement for the cleanup service is that we want to preserve the topic for longer time (e.g. 5 days) for debugging purpose and only after that period is passed, the topic is then permitted to be removed. Such wait-then-delete semantics seems missing from today's implementation.

If we want to implement the wait semantics then it means that we need to record the timestamp-to-update-retention in some non-volatile media (disk/zk etc.) to survive a controller crash.

Another simpler way is to temporarily increase DEPRECATED_TOPIC_MAX_RETENTION_MS to a larger value (e.g. 6 days) after our debugging, so that cleanup service can pick it up and delete the topic in the following round. Then we change it back today's value.

@gaojieliu
Copy link
Contributor

@lluwm
Purely reading the comments from you and @huangminchn, regarding the lingering empty topic issues, I guess we can enhance TopicCleanupService in the following way:

  1. If topic retention is 5 days.
  2. If The topic is a version topic.
  3. Check the topic creation time, and if it is beyond 5 days + buffer (1-2 days), delete it.

Would the above strategy work?

@lluwm
Copy link
Contributor Author

lluwm commented Apr 24, 2024

@gaojieliu, thanks for the suggestions and yeah I think it should work.

For a failed version caused by fatal-DIV error that needs data preservation, TopicCleanupService would not consider to delete it until condition 3 is true, which is (5 + buffer) days later. For a failed version not related to DIV, it is the same logic as today.

Copy link
Contributor

@gaojieliu gaojieliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall, left a minor comment.

Copy link
Contributor

@gaojieliu gaojieliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the change!

@lluwm
Copy link
Contributor Author

lluwm commented May 6, 2024

@gaojieliu, thanks a lot for the review. Proceed to merge the code.

@lluwm lluwm merged commit d7997d7 into linkedin:main May 6, 2024
32 checks passed
@lluwm lluwm deleted the do_not_truncate_topic_for_FatalDataValidationException branch May 6, 2024 22:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants