Hive table and partition basic statistics are not correctly imported into AWS Glue Catalog #31

Open
simobatt opened this issue Sep 6, 2018 · 0 comments

simobatt commented Sep 6, 2018

Hive table and partition basic statistics are not correctly imported into the AWS Glue Catalog.
The statistics properties are included in the Glue table properties; however, it appears that Hive does not honor them.

The Glue migration script can migrate table and partition statistics from the Hive metastore; however, it appears to escape some characters, which makes the statistics unusable in the Glue catalog:
https://github.com/aws-samples/aws-glue-samples/blob/master/utilities/Hive_metastore_migration/src/hive_metastore_migration.py#L455


When I created a table in the Hive metastore, the table and column statistics looked like the following:

Table Parameters:
COLUMN_STATS_ACCURATE {"BASIC_STATS":"true","COLUMN_STATS":{"id":"true","name":"true"}}
EXTERNAL TRUE
numFiles 1
numRows 2
rawDataSize 14
totalSize 16
transient_lastDdlTime 1536141689


When I ran the migration script to migrate the Hive metastore to the Glue catalog, the same statistics became the following. I found that these statistics are unusable through Hive.

Table Parameters:
COLUMN_STATS_ACCURATE \{\"BASIC_STATS\"\:\"true\",\"COLUMN_STATS\"\:\{\"id\"\:\"true\",\"name\"\:\"true\"\}\}
EXTERNAL TRUE
numFiles 1
numRows 2
rawDataSize 14
totalSize 16
transient_lastDdlTime 1536141689
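
For illustration, here is a minimal sketch of how the escaped value could be turned back into valid JSON. It assumes the migration adds exactly one backslash in front of each escaped character, which is inferred from the output above and not confirmed against the script itself. The helper names `unescape_param` and `is_valid_json` are hypothetical and do not come from the migration script.

```python
import json
import re

# COLUMN_STATS_ACCURATE as it appeared in the Glue table parameters after migration.
ESCAPED = r'\{\"BASIC_STATS\"\:\"true\",\"COLUMN_STATS\"\:\{\"id\"\:\"true\",\"name\"\:\"true\"\}\}'


def unescape_param(value):
    """Remove one level of backslash escaping.

    Assumption: every escaped character is preceded by exactly one backslash,
    so a backslash followed by any character is replaced by that character.
    """
    return re.sub(r'\\(.)', r'\1', value, flags=re.DOTALL)


def is_valid_json(value):
    """Return True if the value parses as JSON, False otherwise."""
    try:
        json.loads(value)
        return True
    except ValueError:
        return False


restored = unescape_param(ESCAPED)

print(is_valid_json(ESCAPED))   # the escaped form does not parse as JSON
print(is_valid_json(restored))  # the restored form does
print(restored)
```

The restored string matches the value Hive originally stored, so a check like this could be used to spot affected parameters before or after migration.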

I then manually modified the table property (COLUMN_STATS_ACCURATE) in the Glue console and was able to convert it into a usable format.

I didn't check the compatibility of the migrated statistics with the other EMR tools (Spark, Presto) and AWS services (Glue ETL job, Athena, Redshift Spectrum).

Regards,
Simone

simobatt changed the title from "Hive table and partition basic statistics are not correctly imported into AWS Glue Catalog." to "Hive table and partition basic statistics are not correctly imported into AWS Glue Catalog" on Sep 6, 2018