[Feature][CDCSOURCE] source with kafka debezium json format #3341

Open
2 of 3 tasks
ysmintor opened this issue Mar 29, 2024 · 8 comments
Labels: Discussing, New Feature


@ysmintor

Search before asking

  • I have searched the issues and found no similar feature request.

Description

We currently have several thousand existing tables spread across multiple source databases. The Kafka cluster we connect to collects their incremental data via CDC, and most of it arrives in Debezium JSON format. Because there are so many tables, a single Kafka topic can carry a varying number of tables. We do not have permission to connect directly to the thousands of business databases, and they are not MySQL anyway, whereas the CDC sources in Dinky's official documentation are all database-specific, such as MySQLCDC and OracleCDC.

We need to implement whole-database synchronization by consuming from Kafka, where one topic holds multiple tables. We would like a Kafka source with Debezium JSON format to be added as a data source.
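
To make the request concrete, here is a minimal Python sketch of the routing step: splitting the messages of one mixed topic by originating table. It assumes the standard Debezium envelope (`before`, `after`, `source`, `op`), optionally wrapped in `{"schema": ..., "payload": ...}`, and that `source` carries `db` and `table` fields. The function names are hypothetical and are not part of Dinky.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable, List, Optional


def unwrap(event: dict) -> dict:
    """Return the change payload, with or without the schema wrapper."""
    # With schemas enabled, Debezium emits {"schema": ..., "payload": ...}.
    return event.get("payload", event)


def table_key(payload: dict) -> str:
    """Build a 'db.table' routing key from the Debezium source block."""
    source = payload["source"]
    return f'{source["db"]}.{source["table"]}'


def split_by_table(raw_messages: Iterable[Optional[str]]) -> Dict[str, List[dict]]:
    """Group Debezium JSON messages from one topic by originating table."""
    routed: Dict[str, List[dict]] = defaultdict(list)
    for raw in raw_messages:
        if raw is None:
            continue  # tombstone record: no value to route
        payload = unwrap(json.loads(raw))
        routed[table_key(payload)].append(payload)
    return dict(routed)
```

Each resulting group could then be handed to a sink for the matching target table. A real source would read from Kafka rather than from an in-memory list.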

Use case

No response

Related issues

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@ysmintor added the New Feature and Waiting for reply labels Mar 29, 2024

Hello @ysmintor, this issue is about CDC/CDCSOURCE, so I have assigned it to @aiwenmo. If you have any questions, you can comment and reply.

@Zzm0809
Contributor

Zzm0809 commented Apr 1, 2024

You can just use the Kafka connector directly; the data is already JSON.
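
As an illustration of this suggestion, the sketch below renders a Flink SQL `CREATE TABLE` statement that reads a topic with the Kafka connector and the `debezium-json` value format. The option keys and the `value.source.table` metadata column follow the Flink Kafka connector and Debezium format documentation. The function name and all argument values are hypothetical. One such table still describes a single column layout, so a topic holding several tables would need one definition per table plus a filter on `origin_table`.

```python
from typing import List, Tuple


def kafka_debezium_ddl(
    table_name: str,
    topic: str,
    columns: List[Tuple[str, str]],
    bootstrap_servers: str,
    group_id: str,
) -> str:
    """Render a Flink SQL CREATE TABLE statement for a Debezium JSON topic."""
    column_lines = [f"  `{name}` {sql_type}" for name, sql_type in columns]
    # Expose the originating table name so a mixed topic can be filtered per table.
    column_lines.append(
        "  `origin_table` STRING METADATA FROM 'value.source.table' VIRTUAL"
    )
    options = {
        "connector": "kafka",
        "topic": topic,
        "properties.bootstrap.servers": bootstrap_servers,
        "properties.group.id": group_id,
        "scan.startup.mode": "earliest-offset",
        "value.format": "debezium-json",
    }
    option_lines = [f"  '{key}' = '{value}'" for key, value in options.items()]
    return (
        f"CREATE TABLE `{table_name}` (\n"
        + ",\n".join(column_lines)
        + "\n) WITH (\n"
        + ",\n".join(option_lines)
        + "\n)"
    )
```

Calling `kafka_debezium_ddl("orders_src", "cdc_topic", [("id", "BIGINT")], "localhost:9092", "demo_group")` returns a statement that can be pasted into a Flink SQL job. With thousands of tables, generating these definitions from metadata is likely more practical than writing them by hand.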

@aiwenmo
Contributor

aiwenmo commented Apr 1, 2024

Is your requirement to split the data and write it to different tables?

@aiwenmo removed the Waiting for reply label Apr 1, 2024
@ysmintor
Author

ysmintor commented Apr 2, 2024

Is your requirement to split the data and write it to different tables?

@aiwenmo Yes. One Kafka topic may contain CDC data for multiple tables, and that data needs to be written to different tables. I also think we should support the case where multiple Kafka topics are consumed into one table.
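
A rough sketch of what "writing to different tables" means for Debezium events, whichever topic they arrive on: each event is applied to the target table named in its `source` block, using the `op` code (`c` create, `r` snapshot read, `u` update, `d` delete). The in-memory dictionaries stand in for real sink tables, and `key_field` is a hypothetical primary-key column name.

```python
from typing import Any, Dict

# Target tables modelled as {"db.table": {primary_key: row}}.
Tables = Dict[str, Dict[Any, dict]]


def apply_change(tables: Tables, payload: dict, key_field: str = "id") -> Tables:
    """Apply one Debezium change payload to the matching in-memory table."""
    source = payload["source"]
    rows = tables.setdefault(f'{source["db"]}.{source["table"]}', {})
    op = payload["op"]
    if op in ("c", "r", "u"):
        # Inserts, snapshot reads and updates all upsert the "after" image.
        row = payload["after"]
        rows[row[key_field]] = row
    elif op == "d":
        # Deletes carry the removed row in "before".
        rows.pop(payload["before"][key_field], None)
    return tables
```

Because routing depends only on the event's `source` block, events for the same table can come from one topic or from several, which covers both cases described above.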

@ysmintor
Author

ysmintor commented Apr 4, 2024

@aiwenmo @Zzm0809

I have looked at the Flink CDC and Hudi solutions, but implementing a connector from Kafka CDC (what I sometimes call Debezium JSON in Kafka) to Hudi or other databases looks hard for my team.

Recently I spent some time trying Apache Paimon's CDC ingestion from Kafka, and I think it might be a solution for us, especially as Apache Paimon graduated from incubation to become an Apache Top-Level Project a few days ago. So I wonder whether you could implement this Kafka CDC source connector, adopt their implementation of KafkaSyncDatabaseAction and KafkaSyncTableAction, or simply wrap it in a CDCSOURCE task on Dinky.

I know these features may require some code and structural changes, so please consider this as your schedule allows.

@Zzm0809
Contributor

Zzm0809 commented Apr 7, 2024

Do you have the energy to fulfill this requirement?

@ysmintor
Author

ysmintor commented Apr 7, 2024

Sorry, I do not have resources to implement this feature.

@Zzm0809 added this to the Roadmap milestone Apr 17, 2024
@Zzm0809 added the Discussing label Apr 17, 2024
@aiwenmo
Contributor

aiwenmo commented Apr 18, 2024

I am willing to submit a PR.

Projects
Status: Doing