[Spark] Support Glue catalog for iceberg tables using UniForm #2922

Open: maninc wants to merge 1 commit into master

@maninc maninc commented Apr 19, 2024

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (fill in here)

Description

  • Adds GlueCatalog support for Iceberg tables using UniForm.
  • Accepts GlueCatalog as the catalog-impl Spark configuration property.
spark-shell --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
--conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog \
--conf spark.databricks.delta.allowArbitraryProperties.enabled=true \
--conf spark.sql.catalog.spark_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog \
--conf spark.sql.catalog.spark_catalog.warehouse=s3://warehouse/path/delta/ \
--conf spark.jars=/home/hadoop/delta-iceberg_2.12-3.1.0.jar,/home/hadoop/url-connection-client-2.20.160.jar

How was this patch tested?

  • Created a Delta table with the UniForm property 'delta.universalFormat.enabledFormats' = 'iceberg'
  • Verified that the Iceberg table was created in Glue
  • Successfully read data from the Iceberg table
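
For anyone reproducing the first test step, the following Python sketch builds a CREATE TABLE statement that sets the UniForm property. It is only an illustration: the helper name, table name, and columns are hypothetical, and 'delta.enableIcebergCompatV2' is included on the assumption that it is needed, since it appears in the Glue output posted in this conversation. The helper only builds a string; running the statement requires a Spark session configured as in the description.

```python
def uniform_create_table_sql(table, columns):
    """Build a CREATE TABLE statement for a Delta table with UniForm (Iceberg) enabled.

    Hypothetical helper: `table` and `columns` are placeholders chosen by the caller.
    """
    cols = ", ".join(f"{name} {dtype}" for name, dtype in columns)
    # Properties taken from this PR: the first from the test notes,
    # the second from the Glue table output (assumed to be required).
    props = {
        "delta.universalFormat.enabledFormats": "iceberg",
        "delta.enableIcebergCompatV2": "true",
    }
    props_sql = ", ".join(f"'{key}' = '{value}'" for key, value in props.items())
    return f"CREATE TABLE {table} ({cols}) USING DELTA TBLPROPERTIES ({props_sql})"


# Hypothetical table name and schema, for illustration only.
sql = uniform_create_table_sql("demo_db.uniform_demo", [("id", "INT"), ("name", "STRING")])
print(sql)
```

The resulting string could be passed to spark.sql(...) in a session started with the Delta extension and catalog settings shown above.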

Does this PR introduce any user-facing changes?

Yes. Users can use GlueCatalog to save Iceberg tables to AWS Glue.

Contributor

lzlfred commented Apr 19, 2024

For context, the existing UniForm implementation writes a single table entry in HMS so that the table can be expressed as both Delta and Iceberg:
Iceberg uses the table properties "table_type" -> "iceberg" and "metadata_location" -> /foo/bar/metadata/v0-uuid.json.
Delta uses the table properties "provider" -> "delta" and "location" -> "foo/bar", where Delta looks for _delta_log under the location.

Does this PR achieve the same for Glue, or is the Glue table only an Iceberg table?

Author

maninc commented Apr 22, 2024

Thank you for looking into this PR!

Yes, it adds both sets of properties to Glue. To test this feature, I created a Delta table using Spark SQL and inserted a few rows into it. I was able to query the table through both the Delta and Iceberg catalogs.

Output of the Glue GetTable call for the table created using Delta UniForm (excerpt):

    "SerdeInfo": {
      "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
      "Parameters": {
        "path": "s3://s3_table_location/tableName"
      }
    },
    "SortColumns": [],
    "StoredAsSubDirectories": false
  },
  "TableType": "EXTERNAL_TABLE",
  "Parameters": {
    "previous_metadata_location": "s3://s3_table_location/tableName/metadata/00006-2842a9d3-6775-409d-ba43-f24d577446f0.metadata.json",
    "delta.enableIcebergCompatV2": "true",
    "spark.sql.create.version": "3.5.0-amzn-0",
    "totalSize": "0",
    "delta-timestamp": "1712267687000",
    "delta.universalFormat.enabledFormats": "iceberg",
    "numFiles": "0",
    "metadata_location": "s3://s3_table_location/tableName/metadata/00007-477d289f-e6c1-4cc9-8e31-d257951a621e.metadata.json",
    "spark.sql.sources.schema": "{\"type\":\"struct\",\"fields\":[]}",
    "spark.sql.partitionProvider": "filesystem",
    "delta-version": "6",
    "spark.sql.sources.provider": "delta",
    "table_type": "ICEBERG"
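
To make the dual-format check concrete, here is a small Python sketch that inspects a table's Parameters map, such as the one above, and reports which formats it advertises. The function name and the matching rules are assumptions for illustration only; this is not how the Delta or Iceberg readers actually resolve a table.

```python
def catalog_views(parameters):
    """Report which formats a catalog entry's Parameters map advertises.

    Hypothetical heuristic based on the keys discussed in this PR.
    """
    # Compare case-insensitively: the Glue output above stores "ICEBERG" in upper case.
    table_type = parameters.get("table_type", "").lower()
    provider = parameters.get("spark.sql.sources.provider", "").lower()
    return {
        # Iceberg needs both the table type marker and a metadata pointer.
        "iceberg": table_type == "iceberg" and "metadata_location" in parameters,
        # Spark records the Delta provider under this key in the output above.
        "delta": provider == "delta",
    }


# Subset of the Parameters shown in the Glue output above.
params = {
    "metadata_location": "s3://s3_table_location/tableName/metadata/00007-477d289f-e6c1-4cc9-8e31-d257951a621e.metadata.json",
    "spark.sql.sources.provider": "delta",
    "table_type": "ICEBERG",
}
print(catalog_views(params))
```

In practice the map could come from the Parameters field of a Glue GetTable response, but that wiring is left out here.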

Contributor

lzlfred commented Apr 23, 2024

Thanks, @maninc! The Glue table info is super helpful and I can see how it works.
I will do a more detailed code review, hopefully by the end of this week or next week.

Contributor

lzlfred commented Apr 30, 2024

Sorry that I have not had a chance to test and review this myself yet. I will try to get to this ASAP.

Contributor

lzlfred commented May 21, 2024

@maninc I am trying to get back to this. I hit a blocker and need more info: what are the full commands and the deployment for this Spark setup? Is it EMR, or a custom deployment on EC2?

I am missing the setup that hooks Glue up to Spark. The provided spark-shell command uses DeltaCatalog, which does not connect to Glue.
