Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[HUDI-7713] Enforce ordering of fields during schema reconciliation #11154

Open
wants to merge 16 commits into
base: master
Choose a base branch
from

Conversation

the-other-tim-brown
Copy link
Contributor

Change Logs

  • Adds ability to get consistent ordering of fields during the schema reconciliation steps

Impact

  • Consistent ordering of fields within a schema can help users reduce potential issues with HMS or other metastores.

Risk level (write none, low medium or high below)

None, just reorders fields

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".

  • The config description must be updated if new configs are added or the default value of the configs are changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here and follow the instruction to make
    changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@the-other-tim-brown the-other-tim-brown changed the title Enforce ordering of fields during schema reconciliation [HUDI-7713] Enforce ordering of fields during schema reconciliation May 5, 2024
Schema expected = createRecord("reorderNestedFields",
createPrimitiveField("field1", Schema.Type.INT),
createPrimitiveField("field2", Schema.Type.INT),
createArrayField("field3", createRecord("reorderNestedFields.field3",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jonvex can you confirm this is the expected naming after reconcile is run?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, that doesn't look right

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What should it look like?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it should be "nestedRecord" I think

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@github-actions github-actions bot added the size:M PR with lines of changes in (100, 300] label May 5, 2024
@jonvex jonvex self-assigned this May 6, 2024
@github-actions github-actions bot added size:L PR with lines of changes in (300, 1000] and removed size:M PR with lines of changes in (100, 300] labels May 20, 2024
@the-other-tim-brown the-other-tim-brown marked this pull request as ready for review May 21, 2024 02:04
InternalSchema targetInternalSchema = convert(targetSchema);
// Use existing fieldIds for consistent field ordering between commits when shouldReorderColumns is true
InternalSchema sourceInternalSchema = convert(sourceSchema, shouldReorderColumns ? targetInternalSchema.getNameToPosition() : Collections.emptyMap());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why only source schema? wny not reorder target schema too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The target schema is the source for the ordering. In this code, the target schema is the existing table and the source is the incoming dataset

val canonicalizedSourceSchema = if (shouldCanonicalizeSchema) {
canonicalizeSchema(sourceSchema, latestTableSchema, opts)
canonicalizeSchema(sourceSchema, latestTableSchema, opts, !shouldReconcileSchema)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why shouldn't we reorder columns when reconcile schema is true? Can you please add a note in the comment regarding this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jonvex advised to do this. I think it is because reconcile is schema on read?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reconcile is not necessarily dependent on schema on read. I think the reason might have been to not conflict schema reconciliation rules incase that is enabled. @jonvex to clarify. Whatever be the reason, let's add a comment for reference.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reconcile has been deprecated, so we shouldn't modify it's behavior

import static org.apache.hudi.internal.schema.utils.InternalSchemaUtils.createFullName;

/**
* Schema visitor to produce name -> id map for internalSchema.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Schema visitor to produce name -> id map for internalSchema.
* Schema visitor to produce name -> position map for internalSchema, where position indicates position of the field in the schema.

@@ -67,6 +68,10 @@ public Map<String, Integer> buildNameToId(Type type) {
return visit(type, new NameToIDVisitor());
}

Map<String, Integer> buildNameToPosition(Type type) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

High level question: Do we use the InternalSchemaBuilder even when schema on read is disabled?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The internal schema seems to provide some nice utilities but I was not familiar with it before this change

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We use internal schema even when schema on read is disabled. We use it to add null for missing columns, promote incoming batch if it can be promoted to the table schema, and also to fix the ordering of unions

Copy link
Member

@codope codope left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.
@jonvex can you also review?

@@ -169,20 +169,35 @@ class TestBasicSchemaEvolution extends HoodieSparkClientTestBase with ScalaAsser
// 2. Write 2d batch with another schema (added column `age`)
//

val secondSchema = StructType(
val secondInputSchema = StructType(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't the behavior be the same when reconcile is enabled?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but reconcile is not always enabled

Copy link
Contributor

@jonvex jonvex left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added comments

@hudi-bot
Copy link

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
size:L PR with lines of changes in (300, 1000]
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants