
[#3292] fix(spark-connector): passing Gravitino catalog properties to spark connector #3270

Merged
merged 5 commits into datastrato:main from insert May 13, 2024

Conversation

FANNG1
Contributor

@FANNG1 FANNG1 commented May 6, 2024

What changes were proposed in this pull request?

  1. Pass Gravitino catalog properties with the `spark.bypass.` prefix to the Spark connector.
  2. Refactor the catalog properties transform logic.
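The first change can be sketched as follows. This is a minimal illustration of the bypass-prefix idea, not the actual Gravitino implementation; the class and method names here are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: Gravitino catalog properties prefixed with "spark.bypass." are
// stripped of the prefix and forwarded to the underlying Spark connector;
// other keys are left to the catalog-specific converters.
public class BypassPropertiesSketch {
  static final String SPARK_BYPASS_PREFIX = "spark.bypass.";

  public static Map<String, String> extractBypassProperties(
      Map<String, String> gravitinoProperties) {
    Map<String, String> sparkProperties = new HashMap<>();
    gravitinoProperties.forEach(
        (key, value) -> {
          if (key.startsWith(SPARK_BYPASS_PREFIX)) {
            // e.g. "spark.bypass.hive.exec.dynamic.partition.mode"
            //   -> "hive.exec.dynamic.partition.mode"
            sparkProperties.put(key.substring(SPARK_BYPASS_PREFIX.length()), value);
          }
        });
    return sparkProperties;
  }
}
```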

Why are the changes needed?

Fix: #3292

Does this PR introduce any user-facing change?

Yes, documentation is added.

How was this patch tested?

  1. Added unit tests.
  2. Tested in a local environment.

@FANNG1 FANNG1 marked this pull request as draft May 6, 2024 00:34
@FANNG1 FANNG1 changed the title [#3181] fix(spark-connector): failed to insert data to hive table in spark-sql [#3181] fix(spark-connector): passing Gravitino catalog properties to spark connector May 6, 2024
@FANNG1 FANNG1 force-pushed the insert branch 3 times, most recently from c7de58c to d76c394 Compare May 7, 2024 01:08
@FANNG1 FANNG1 changed the title [#3181] fix(spark-connector): passing Gravitino catalog properties to spark connector [#3292] fix(spark-connector): passing Gravitino catalog properties to spark connector May 7, 2024
@FANNG1 FANNG1 self-assigned this May 7, 2024
@FANNG1 FANNG1 force-pushed the insert branch 2 times, most recently from e8d26af to 0c87206 Compare May 7, 2024 09:29
@FANNG1 FANNG1 marked this pull request as ready for review May 7, 2024 09:58
@FANNG1
Contributor Author

FANNG1 commented May 8, 2024

@jerryshao @qqqttt123 @yuqi1129 @diqiu50 @caican00 please help to review this PR when you have time, thanks

@FANNG1
Contributor Author

FANNG1 commented May 8, 2024

KyuubiHiveConnectorDelegationTokenProvider needs catalog properties injected into the Spark conf, but I'm not sure whether Gravitino should provide a self-managed token provider or reuse the Kyuubi token provider. @qqqttt123

static final String ICEBERG_CATALOG_BACKEND_HIVE = CatalogUtil.ICEBERG_CATALOG_TYPE_HIVE;

static final String GRAVITINO_ICEBERG_CATALOG_BACKEND_JDBC = "jdbc";
static final String ICEBERG_CATALOG_BACKEND_JDBC = "jdbc";

private IcebergPropertiesConstants() {}
}
Contributor

How to distinguish which constants here are for the catalog and which are for the table?

Contributor Author

they are all catalog properties

Contributor Author

Do you suggest adding `CATALOG` to all property names?

Contributor

In that case, the class name `IcebergCatalogPropertiesConstants` would be better.

Contributor Author

I prefer to keep the current file name, because it may also hold table property constants in the future.

Contributor

That works; find a way to distinguish them.

Contributor Author

That works; find a way to distinguish them.

Added `CATALOG` to the property names.

@qqqttt123
Contributor

KyuubiHiveConnectorDelegationTokenProvider needs catalog properties injected into the Spark conf, but I'm not sure whether Gravitino should provide a self-managed token provider or reuse the Kyuubi token provider. @qqqttt123

I'd prefer a self-managed tokenProvider, because the tokenProvider will be used for Iceberg, Hive, and Paimon.

@FANNG1
Contributor Author

FANNG1 commented May 10, 2024

self-managed tokenProvider

Does a self-managed tokenProvider need catalog properties from the Spark conf? If not, I will keep the current implementation and not inject catalog properties into the Spark conf.

@@ -42,6 +42,9 @@ The Hive catalog supports creating, updating, and deleting databases and tables

When you use Gravitino with Trino, you can pass the Trino Hive connector configuration using the prefix `trino.bypass.`. For example, use `trino.bypass.hive.config.resources` to pass `hive.config.resources` to the Gravitino Hive catalog in the Trino runtime.

When you use Gravitino with Spark, you can pass the Spark Hive connector configuration using the prefix `spark.bypass.`. For example, use `spark.bypass.hive.config.resources` to pass `hive.config.resources` to the Spark Hive connector in the Spark runtime.
Collaborator

Are you sure these two configs, spark.bypass.hive.config.resources and hive.config.resources, work in Spark?

Contributor Author

hive.config.resources is just an example config; should I change it to a real config like hive.exec.dynamic.partition.mode?

Collaborator

Why do you use a non-existent config as an example? This is really confusing. Please think as a user reading your document: would they really understand what you want to describe?

Contributor Author

got it, fixed

@@ -43,6 +43,9 @@ Any properties not defined by Gravitino with `gravitino.bypass.` prefix will pas

When you use Gravitino with Trino, you can pass the Trino Iceberg connector configuration using the prefix `trino.bypass.`. For example, use `trino.bypass.iceberg.table-statistics-enabled` to pass `iceberg.table-statistics-enabled` to the Gravitino Iceberg catalog in the Trino runtime.

When you use Gravitino with Spark, you can pass the Spark Iceberg connector configuration using the prefix `spark.bypass.`. For example, use `spark.bypass.iceberg.config.resources` to pass `iceberg.config.resources` to the Spark Iceberg connector in the Spark runtime.
Collaborator

Also here.

Contributor Author

done

| Gravitino catalog property name | Spark catalog property name | Description                | Since Version |
|---------------------------------|-----------------------------|----------------------------|---------------|
| `metastore.uris`                | `hive.metastore.uris`       | Hive metastore URI address | 0.5.0         |

Catalog properties with the prefix `spark.bypass.` are passed to the Spark Hive connector. For example, use `spark.bypass.config.resources` to pass `config.resources` to the Spark Hive connector.
Collaborator

@jerryshao jerryshao May 10, 2024

So should we use "Gravitino catalog property name" or "Spark catalog property name"? You bring in new concepts without any explanation; how do users use them?

Contributor Author

Replaced with "Gravitino catalog property name".

Collaborator

Again, as I mentioned before, you add two tables here and in the Iceberg doc for a bunch of properties without explaining what they are and how users use them. What do you actually want to deliver to users?

Contributor Author

Added some description; please help review again.

* @param properties Gravitino catalog properties.
* @return properties for the Spark connector.
*/
Map<String, String> transformSparkCatalogProperties(Map<String, String> properties);
Collaborator

What is the meaning of "transform"? Why don't you use `toXXX` to align with the interfaces below, if the semantics are the same?

Contributor Author

done
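To illustrate the renamed method, a hypothetical Hive-flavoured converter might look like the following. The key names follow the documentation table (`metastore.uris` on the Gravitino side, `hive.metastore.uris` on the Spark side); the class shape and constant names are assumptions for this sketch, not the actual Gravitino code.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a catalog-specific properties converter: merges Spark session
// options with Gravitino catalog properties, renaming Gravitino's
// "metastore.uris" to Spark's "hive.metastore.uris". Names are illustrative.
public class HivePropertiesConverterSketch {
  static final String GRAVITINO_METASTORE_URIS = "metastore.uris";
  static final String SPARK_HIVE_METASTORE_URI = "hive.metastore.uris";

  public static Map<String, String> toSparkCatalogProperties(
      Map<String, String> options, Map<String, String> properties) {
    // Start from the user-supplied Spark catalog options.
    Map<String, String> all = new HashMap<>(options);
    // Map the Gravitino property name to the name Spark's Hive catalog expects.
    String metastoreUri = properties.get(GRAVITINO_METASTORE_URIS);
    if (metastoreUri != null) {
      all.put(SPARK_HIVE_METASTORE_URI, metastoreUri);
    }
    return all;
  }
}
```

The design question raised above is naming only: whether the entry point is called `transformSparkCatalogProperties` or `toSparkCatalogProperties`, the merge logic stays the same.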

HashMap<String, String> all = new HashMap<>(options);
all.put(GravitinoSparkConfig.SPARK_HIVE_METASTORE_URI, metastoreUri);
Map<String, String> all =
HivePropertiesConverter.getInstance().toSparkCatalogProperties(options, properties);
Collaborator

Why don't you use getPropertiesConverter().toXXX here?

Contributor Author

done

@FANNG1
Contributor Author

FANNG1 commented May 11, 2024

@jerryshao @diqiu50, all comments are addressed; please help review again.

/**
* Transform properties from Gravitino catalog properties to Spark connector properties.
*
* <p>This interface focus on the catalog specific transform logic, the common logic are
Collaborator

"This interface focuses..."

Contributor Author

done

@jerryshao
Collaborator

@FANNG1 just one minor issue, please fix it.

@FANNG1
Contributor Author

FANNG1 commented May 11, 2024

@FANNG1 just one minor issue, please fix it.

fixed

@FANNG1
Contributor Author

FANNG1 commented May 13, 2024

@jerryshao could you help to review again? thanks

@jerryshao jerryshao merged commit 013021d into datastrato:main May 13, 2024
22 checks passed
github-actions bot pushed a commit that referenced this pull request May 13, 2024
… spark connector (#3270)

### What changes were proposed in this pull request?
1. passing Gravitino catalog properties with `spark.bypass.` prefix to
spark connector
2. refactor the catalog properties transform logic.

### Why are the changes needed?

Fix: #3292

### Does this PR introduce _any_ user-facing change?
yes, add document

### How was this patch tested?
1. add UT
2. test in a local environment

Successfully merging this pull request may close these issues.

[Subtask] [spark-connector] support passing catalog properties to spark connector
4 participants