
DBZ-6964 Temporary fix schema for MySQL "Geometry" types conflicting with Avro schema registries (Confluent, Apicurio). #5411

Open
wants to merge 2 commits into base: main
Conversation

rk3rn3r
Member

rk3rn3r commented Mar 20, 2024

It seems that schema registries don't incorporate schema parameters correctly, so only the first occurrence of a Geometry type is used as the schema reference. If a single topic/table has multiple fields/rows of a geo type, then only one schema is referenced for all fields, even when the real schemas of those fields differ.
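The collision described above can be sketched with a plain map-based cache keyed only by the logical schema name, ignoring parameters (a hypothetical simplification of the registry's deduplication; the class and the `srid` parameter are illustrative, only the logical name `io.debezium.data.geometry.Geometry` comes from Debezium):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the registry's schema cache: keyed by logical name only,
// so schema parameters are dropped for every occurrence after the first.
public class SchemaCacheSketch {
    static final Map<String, Map<String, String>> cache = new HashMap<>();

    // Returns the cached parameters for a logical name, storing the first set seen.
    static Map<String, String> resolve(String logicalName, Map<String, String> parameters) {
        return cache.computeIfAbsent(logicalName, k -> parameters);
    }

    public static void main(String[] args) {
        // Two geometry columns with different parameters...
        Map<String, String> pointParams = Map.of("srid", "4326");
        Map<String, String> polygonParams = Map.of("srid", "3857");

        // ...but the same logical name, so the second lookup returns the first schema.
        Map<String, String> first = resolve("io.debezium.data.geometry.Geometry", pointParams);
        Map<String, String> second = resolve("io.debezium.data.geometry.Geometry", polygonParams);
        System.out.println(first == second); // prints true: both fields see srid=4326
    }
}
```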

closes https://issues.redhat.com/browse/DBZ-6964

…d deduplication with a Map with a key based on the SMT's fully-qualified class name and version.

+ minor cleanup

closes https://issues.redhat.com/browse/DBZ-7416
@rk3rn3r rk3rn3r requested a review from jpechane March 20, 2024 08:09
@rk3rn3r
Member Author

rk3rn3r commented Mar 20, 2024

@jpechane Looking at the Cassandra PR that you mentioned in the last triaging call, I felt it could make sense to do this not only for MySQL Geometry but also for other fields that potentially conflict.
There are some schema parameters that are handled specially by the different Avro registries (it feels like one copied the other), like scale and precision values. The parameters we use to distinguish the schemas of different fields are not incorporated into the cache key and storage, which leads to these issues.
Maybe we can identify those datatypes for all connectors?

In general I suggest this as a temporary fix, as the schema names for those fields look ugly.
Maybe it is even better to use the logical coordinate (server.schema.table.field) as the schema name until it is fixed?
Wdyt?

What also might be missing is an Apicurio client test for Postgres. Should I add one?

@@ -218,7 +231,7 @@ else if (i == 2) {
}
}

private DatabaseGeoDifferences databaseGeoDifferences(boolean mySql5) {
public static DatabaseGeoDifferences databaseGeoDifferences(boolean mySql5) {
Member Author

I actually did not change this. This interface is still private; GitHub shows outdated commit content here. 🤷

@mfvitale
Member

mfvitale commented Mar 20, 2024

There are some schema parameters that are handled by the different Avro registries (feels like some copied the other) like scale and precision values.

Yes, the issue was reported in https://issues.redhat.com/browse/DBZ-6836. We opened issues with Confluent and Apicurio but got no answer.

return SchemaBuilder.struct()
.name(LOGICAL_NAME)
.name(LOGICAL_NAME + "__" + columnName) // temporary fix for DBZ-6964
Member Author

This is the essence of the change. I don't feel super happy about it, and looking at the Cassandra PR debezium/debezium-connector-cassandra#121 there might be other fields affected in general.
Looking at the registry code, it seems that all types that can collide do collide whenever parameters other than things like precision and scale are used. For example, DECIMAL types are handled here:
https://github.com/Apicurio/apicurio-registry/blob/f2d1f06cae0b709acf3f6d4edb982f5775b50fa9/utils/converter/src/main/java/io/apicurio/registry/utils/converter/avro/AvroData.java#L886-L896
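The essence of the fix is the per-column schema name from the diff above. A minimal sketch of the naming scheme (the logical name is Debezium's real geometry type name; the helper class is illustrative):

```java
// Illustrative sketch of the patched naming: each geometry column gets a unique
// schema name so the registry cannot deduplicate differing schemas (DBZ-6964).
public class GeometryNameFix {
    static final String LOGICAL_NAME = "io.debezium.data.geometry.Geometry";

    // Appends the column name so two geo columns no longer share one cache entry.
    static String schemaNameFor(String columnName) {
        return LOGICAL_NAME + "__" + columnName; // temporary fix for DBZ-6964
    }

    public static void main(String[] args) {
        System.out.println(schemaNameFor("location"));
        // prints io.debezium.data.geometry.Geometry__location
    }
}
```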

Member

Making logical names dynamic will break the JDBC sink connector. If we go with this approach, we'll need to add a check for "prefixed geometry logical names" to sanitize the incoming data and correctly look up and resolve target columns accordingly.
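The sanitizing step mentioned here could simply strip the column suffix before the type lookup. A hypothetical helper (not part of the PR) along those lines:

```java
// Hypothetical helper: strip the "__column" suffix added by the DBZ-6964 fix so
// the JDBC sink can still resolve the base logical name from its type registry.
public class LogicalNameSanitizer {
    static String sanitize(String schemaName) {
        int idx = schemaName.indexOf("__");
        return idx >= 0 ? schemaName.substring(0, idx) : schemaName;
    }

    public static void main(String[] args) {
        System.out.println(sanitize("io.debezium.data.geometry.Geometry__location"));
        // prints io.debezium.data.geometry.Geometry
    }
}
```

Such a check would run on every record's schema, which is exactly the performance concern raised below.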

Member Author

rk3rn3r Mar 21, 2024

@Naros I agree. But let me challenge this: is this the right way to do it? Couldn't we use the "__debezium.source.column.type" : "GEOMETRY" schema parameter instead? With this change we can guarantee that all parameters are stored in the registry. Then we can rely on our custom fields to decide on the Debezium datatype instead of relying on the default behavior of the registry. Reusing the name field from the schema is maybe not the best way to do it?
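The alternative suggested here would resolve the type from the schema parameters rather than the schema name. A sketch (the parameter key is the real one used by Debezium; the resolver class is illustrative):

```java
import java.util.Map;

// Sketch: resolve the source column type from schema parameters instead of
// relying on the registry-deduplicated schema name.
public class ParameterTypeResolver {
    static final String SOURCE_COLUMN_TYPE = "__debezium.source.column.type";

    // Returns the original column type carried in the schema parameters, or null.
    static String resolveType(Map<String, String> schemaParameters) {
        return schemaParameters.get(SOURCE_COLUMN_TYPE); // e.g. "GEOMETRY"
    }

    public static void main(String[] args) {
        System.out.println(resolveType(Map.of(SOURCE_COLUMN_TYPE, "GEOMETRY")));
        // prints GEOMETRY
    }
}
```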

Member

We could for Debezium events, but don't forget the JDBC sink can be used by non-Debezium producers.

We will need to look at the schema type no matter what for the latter use case; I just want to be mindful of any changes we make here and how that could impact performance, specifically given how hard Mario worked on improving the sink's performance metrics already; no need to walk that back.

Member Author

@Naros and resolve target columns accordingly

I think the JDBC sink resolves the column name from field.name(), not from the schema, doesn't it? The issue will be with identifying the correct data type:

Looking at the code in GeneralDatabaseDialect#getSchemaType():

if (!Objects.isNull(schema.name())) {
    final Type type = typeRegistry.get(schema.name());
    if (!Objects.isNull(type)) {
        LOGGER.trace("Schema '{}' resolved by name from registry to type '{}'", schema.name(), type);
        return type;
    }
}
if (!Objects.isNull(schema.parameters())) {
    final String columnType = schema.parameters().get("__debezium.source.column.type");
    if (!Objects.isNull(columnType)) {
        final Type type = typeRegistry.get(columnType);
The schema is loaded via the record's schema.name() (which will work because the schema comes from field.schema()).
But in that first case, typeRegistry.get(schema.name()), we would have to match only the first name or clean up the key in the cache. I don't know; it is an unfortunate fix.

@rk3rn3r
Member Author

rk3rn3r commented Mar 20, 2024

Thx @mfvitale! You can look at the linked code in my previous reply (the one after yours). It shows the code and which datatypes/fields get "special" handling. It should be possible to fix the issue somewhere around there too, incorporating the other parameters and not only those "magical" ones. Wdyt?

@mfvitale
Member

Thx @mfvitale! You can look at the linked code in my previous reply (the one after yours). It shows the code and which datatypes/fields get a "special" handling. It should be possible to fix the issue somewhere around there too to incorporate other parameters not only those "magical" ones. wdyt?

Honestly, I don't know. My first clue about the issue was these lines: https://github.com/confluentinc/schema-registry/blob/cede655d600161767be861f34c47516c76923594/avro-data/src/main/java/io/confluent/connect/avro/AvroData.java#L1172

@jpechane
Contributor

@rk3rn3r Thanks for the PR. It is definitely something that works, but I am not keen on having it in the core, given we are effectively patching a registry bug. Also, it is unnecessary for JSON. Is there a chance the implementation can be changed along the lines of https://debezium.zulipchat.com/#narrow/stream/348251-community-cassandra/topic/issues.20in.20pushing.20the.20row.20if.20we.20have.20Datatype.20.3A.20set.3Ctext.3E/near/421428439 so that instead of having it in the core it would be an SMT?

@rk3rn3r
Member Author

rk3rn3r commented Mar 21, 2024

@jpechane There is a MySQLSchemaFactory, similar to the fix done in the Cassandra PR, but imo it is not used at all at the moment. Should I move the fix there? Imo other data types will be affected too.
