Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[HUDI-7652] Add new HoodieMergeKey API to support simple and composite keys #11077

Merged
merged 9 commits into from
May 17, 2024

Conversation

codope
Copy link
Member

@codope codope commented Apr 23, 2024

Change Logs

This PR introduces a new class hierarchy for handling merge keys in a more flexible and decoupled manner. It adds the HoodieMergeKey interface, along with two implementations: HoodieSimpleMergeKey and HoodieCompositeMergeKey. This design allows us to extend key-based merge strategies easily.

Motivation

The need for introducing a new merge key handling mechanism was driven by the requirement to support different types of keys (simple and complex) without overloading the existing HoodieKey class, which is central to the write path. By segregating merge key handling into its own hierarchy, we avoid potential conflicts and keep modifications localised, improving the maintainability of the code.

Changes

  1. HoodieMergeKey: New API to ensure consistent handling including simple keys and composite keys. It includes methods for retrieving the key and partition path.
  2. HoodieSimpleMergeKey: Wraps HoodieKey and implements the HoodieMergeKey interface for simple scenarios where the key is a string.
  3. HoodieCompositeMergeKey: Implements the HoodieMergeKey interface but allows for complex types as keys, enhancing flexibility for scenarios where a simple string key is not sufficient.
  4. HoodieMergeKeyBasedRecordMerger: A new implementation of HoodieRecordMerger based on HoodieMergeKey. If the merge keys are of type HoodieCompositeMergeKey, then it returns the older and newer records. Otherwise, it calls the merge method from the parent class.
  5. HoodieMergedLogRecordScanner: Changes to merge based on HoodieMergeKey.
  6. Unit tests for the new merger.

These changes do not affect existing functionalities that do not rely on merge keys. It introduces additional classes that are used explicitly for new functionalities involving various key types in merging operations. This ensures minimal to no risk for existing processes.

Impact

Enhancing the flexibility and robustness of our key-based merge strategies. It helps in keeping our codebase scalable and maintainable, allowing easy extensions and modifications in the future.

Risk level (write none, low medium or high below)

low

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".

  • The config description must be updated if new configs are added or the default value of the configs are changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here and follow the instruction to make
    changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@github-actions github-actions bot added the size:L PR with lines of changes in (300, 1000] label Apr 23, 2024
@codope
Copy link
Member Author

codope commented May 9, 2024

@danny0405 Based on our discussion, I have removed the HoodieMergeKey API and created a subclass HoodieMetadataMergedLogRecordScanner which works with ExternalSpillableMap<Serializable, HoodieRecord> (instead of string keys). It is used only for HoodieMetadataLogRecordReader and introduced a HoodieMetadataRecordMerger. The merger currently mimics the super class, but it will change for secondary index in subsequent PR. Please review this PR again.

Copy link
Contributor

@danny0405 danny0405 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution, I have left some comments.

@github-actions github-actions bot added size:XL PR with lines of changes > 1000 and removed size:L PR with lines of changes in (300, 1000] labels May 16, 2024
@github-actions github-actions bot added size:L PR with lines of changes in (300, 1000] and removed size:XL PR with lines of changes > 1000 labels May 17, 2024
@hudi-bot
Copy link

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@codope codope merged commit e0ca6dd into apache:master May 17, 2024
46 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release-1.0.0-beta2 size:L PR with lines of changes in (300, 1000]
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants