Support for parquet format #143

Open
fuyi opened this issue Feb 3, 2020 · 1 comment

Comments


fuyi commented Feb 3, 2020

Is there any plan to add Apache Parquet file format besides Avro?

Collaborator

labianchin commented Feb 17, 2020

No plans yet.

It would be nice if DBeam supported more flexible output formats. A few examples other than Avro could be: Parquet, CSV, Proto, ...

DBeam is built on top of the Beam SDK and should support more formats and runners. So far, though, its main use case has been writing Avro files to GCS via DataflowRunner. Only recently has there been better support for Parquet and other columnar formats on GCS.

I think it is worth looking into supporting Parquet in the coming months or years.


A few open questions when designing parquet support:

  • Should there be equivalent JdbcParquetJob, JdbcParquetIO, etc. classes?
  • Should it be part of dbeam-core? Or a separate package? Or a separate project/repository?
  • Can Parquet support be built without parquet-mr and Hadoop? Those libraries bring in many dependencies, some with known vulnerabilities. It would be interesting to see whether Parquet support could be built using Arrow, some Trino libraries, or something else.
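
On the first question, one possible alternative to duplicating the job and IO classes would be a single job parameterized by output format. The sketch below is only an illustration of that idea: `OutputFormat`, `OutputFormatSketch`, `shardFileName`, and the `part-NNNNN` naming are hypothetical names made up for this example and are not taken from DBeam. It covers only format selection and file naming, not the actual Parquet writing.

```java
import java.util.Locale;

// Hypothetical sketch: these names are illustrative and not part of DBeam.
enum OutputFormat {
  AVRO("avro"),
  PARQUET("parquet");

  private final String suffix;

  OutputFormat(String suffix) {
    this.suffix = suffix;
  }

  String suffix() {
    return suffix;
  }

  // Parse a user-supplied value, e.g. from a hypothetical output-format option.
  // Throws IllegalArgumentException for unknown formats.
  static OutputFormat parse(String value) {
    return valueOf(value.trim().toUpperCase(Locale.ROOT));
  }
}

class OutputFormatSketch {
  // Build a shard file name. Here the format only changes the suffix;
  // a real job would also have to pick the matching sink for that format.
  static String shardFileName(OutputFormat format, int shard) {
    return String.format(Locale.ROOT, "part-%05d.%s", shard, format.suffix());
  }

  public static void main(String[] args) {
    OutputFormat format = OutputFormat.parse(args.length > 0 ? args[0] : "avro");
    System.out.println(shardFileName(format, 0));
  }
}
```

With this shape the JDBC read path would stay shared, and only the sink would vary by format. Whether that is preferable to separate JdbcParquetJob-style classes depends on how much the two write paths end up differing.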
