
Glue catalog to Hive Metastore Migration script not working with partition table #15

Open
Kuntal-G opened this issue Feb 8, 2018 · 4 comments


Kuntal-G commented Feb 8, 2018

I'm running the script to migrate Glue catalog data (crawled from Hive-style key=value partitioned data on S3) to a Hive metastore (MySQL), and the partitions created in the Hive metastore are incorrect.
[+] https://github.com/awslabs/aws-glue-samples/tree/master/utilities/Hive_metastore_migration

Note: The Glue catalog data crawled from the partitioned S3 path looks fine; a new EMR cluster launched with the Glue catalog works correctly with the right partition information, and Athena (backed by Glue) finds the table's partitions properly. But the script that migrates table information from the Glue catalog to the metastore mangles the partitions, so completely wrong partition information ends up in the Hive metastore.

**Please find the steps carried out:**

  1. S3 path on which the crawler was executed (you can see the data is in the proper year=value/month=value layout; a short illustrative sketch follows the listing):
a0999b1381a5:~ kuntalg$ aws s3 ls --recursive s3://kg-practice/elb_logging/test
2018-02-06 15:54:06          0 elb_logging/test/
2018-02-06 15:54:35          0 elb_logging/test/year=2015/
2018-02-06 15:55:00          0 elb_logging/test/year=2015/month=01/
2018-02-06 16:31:43         22 elb_logging/test/year=2015/month=01/test2.csv
2018-02-06 15:55:08          0 elb_logging/test/year=2015/month=02/
2018-02-06 16:31:57         22 elb_logging/test/year=2015/month=02/test3.csv
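
For illustration only (a quick sketch, not part of the crawler or the migration script), this is how the Hive-style key=value segments of one of the keys above map to partition columns and values:

# Minimal sketch: split an S3 key (taken from the listing above) into its
# path segments and keep the key=value pairs as partition column -> value.
path = 'elb_logging/test/year=2015/month=01/test2.csv'
partitions = dict(seg.split('=', 1) for seg in path.split('/') if '=' in seg)
print(partitions)  # {'year': '2015', 'month': '01'}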

  2. Once the crawler was done, I launched a new EMR cluster pointed at the Glue Catalog. I then executed the following commands on the cluster (Hive & Spark), and they show the partitions in the proper format:
scala> spark.sql("SHOW PARTITIONS test").show(30,false)
18/02/06 16:27:59 WARN CredentialsLegacyConfigLocationProvider: Found the legacy config profiles file at [/home/hadoop/.aws/config]. Please move it to the latest default location [~/.aws/credentials].
+------------------+
|partition         |
+------------------+
|year=2015/month=01|
|year=2015/month=02|
+------------------+

hive> show partitions test;
OK
year=2015/month=02
year=2015/month=01
Time taken: 0.54 seconds, Fetched: 2 row(s)

hive> describe test;
OK
col0                	bigint              	                    
col1                	string              	                    
year                	string              	                    
month               	string              	                    
	 	 
# Partition Information	 	 
# col_name            	data_type           	comment             
	 	 
year                	string              	                    
month               	string              	                    
Time taken: 0.651 seconds, Fetched: 10 row(s)


hive> select * from test;
OK
1	Monty	2015	02
2	Trish	2015	02
5	Lisa	2015	02
1	kuntal	2015	01
2	Rock	2015	01
3	Cena	2015	01
Time taken: 3.047 seconds, Fetched: 6 row(s)

hive> select * from test where year='2015' and month='02';
OK
1	Monty	2015	02
2	Trish	2015	02
5	Lisa	2015	02
Time taken: 0.971 seconds, Fetched: 3 row(s)

================
  3. Kindly note that the catalog-to-metastore migration script (export_from_datacatalog.py) does not work out of the box, failing with the following key constraint error:

"duplicate entry for key 'UNIQUE_DATABASE'
.....
java.sql.BatchUpdateException: Field 'IS_REWRITE_ENABLED' doesn't have a default value"

I found the column 'IS_REWRITE_ENABLED' in table hive.TBLS. Strangely, this column is nullable in the table definition, yet the Spark job still complains about the missing default value. So I manually logged in to my Hive metastore and set a default value:
ALTER TABLE hive.TBLS ALTER IS_REWRITE_ENABLED SET DEFAULT 1;

After this small change, the Glue ETL job completed successfully, but the partitions generated by the script are totally incorrect.

Partitions are messed up:

hive> show partitions test;
OK
year(string),month(string)=2015,01
year(string),month(string)=2015,02
Time taken: 0.176 seconds, Fetched: 2 row(s)

Although the table description is the same:

hive> describe test;
OK
col0                	bigint              	                    
col1                	string              	                    
year                	string              	                    
month               	string              	                    
	 	 
# Partition Information	 	 
# col_name            	data_type           	comment             
	 	 
year                	string              	                    
month               	string              	                    
Time taken: 0.492 seconds, Fetched: 10 row(s)

So it's entirely an issue with the migration script, and I'm stuck in our migration process.

Kindly look into this issue urgently and fix the script, or provide a workaround or solution.


dichenli commented Feb 19, 2018

Thank you for identifying the issue. I reproduced the problem. It looks like there is a bug here: https://github.com/awslabs/aws-glue-samples/blob/fd8cab884e6f636be37f677cbfd7db7f6e9fc6ab/utilities/Hive_metastore_migration/src/hive_metastore_migration.py#L809

It parses partition keys and values from Glue into Hive partition names. The generated partName is like "year(string),month(string)=2015,02" whereas Hive expects something like "year=2015/month=02".

I found that "year(string),month(string)=2015,02" actually works with a "DESCRIBE test" query on Hive 1.0.0, so the author might have written the code against that version of Hive. But "year=2015/month=02" is the standard format, so I'll push a bug fix for it.
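
For reference, here is a rough illustration of the difference (this is a sketch built from the example partition in this issue, not the actual code at the linked line):

# Partition keys and values as they come from the Glue catalog for the example table.
keys = [{'name': 'year', 'type': 'string'}, {'name': 'month', 'type': 'string'}]
vals = ['2015', '02']

# Roughly the shape of the name the current logic produces:
buggy = '%s=%s' % (','.join('%s(%s)' % (k['name'], k['type']) for k in keys),
                   ','.join(vals))
print(buggy)     # year(string),month(string)=2015,02

# The partition name the Hive metastore actually expects:
expected = '/'.join('%s=%s' % (k['name'], v) for k, v in zip(keys, vals))
print(expected)  # year=2015/month=02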

@ultrasonex

Hi @Kuntal-G,
Thanks for replicating and posting the issue on our behalf.

Hi @dichenli ,

Thanks for picking this up. We are kind of blocked right now. Can you provide an ETA for this?
Really appreciate your help on this.

Thanks,
Niloy


dichenli commented Feb 20, 2018

I apologize that I can't provide an ETA yet. A thorough fix may take some time to be pushed to GitHub, but to unblock yourself immediately, you can paste the quick-fix snippet below in place of the problematic function. It should work most of the time.

@staticmethod
def udf_partition_name_from_keys_vals(table_name, keys, vals):
    if not keys:
        logging.error('Glue table is missing partition keys')  # TODO add table name in error msg
        return ''  # TODO should raise error? or filter out partitions with problem?
    if not vals:
        logging.error('Glue table is missing partition values')
        return ''
    if len(keys) != len(vals):
        logging.error('Glue table has a different number of partition keys in the table and values in the partition')
        return ''
    s_keys = []
    for k, v in zip(keys, vals):
        s_keys.append('%s=%s' % (k['name'], v))

    # TODO escape special chars in keys and values, see
    # https://github.com/apache/hive/blob/master/standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/utils/FileUtils.java#L256
    return '/'.join(s_keys)
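
The TODO before the final return points at Hive's FileUtils.escapePathName. A minimal sketch of that style of %XX escaping, assuming a simplified special-character set (the linked Java source is authoritative; this helper and its character list are only an illustration), would be applied to both k['name'] and v before joining:

# Hypothetical helper, loosely modeled on Hive's FileUtils.escapePathName;
# the set of escaped characters below is an assumption, not the exact Hive list.
def escape_path_name(value):
    special = set('"#%\'*/:=?\\')
    out = []
    for c in value:
        if c in special or ord(c) < 0x20:
            out.append('%%%02X' % ord(c))  # e.g. '/' -> '%2F', '=' -> '%3D'
        else:
            out.append(c)
    return ''.join(out)

print(escape_path_name('2015/02'))  # 2015%2F02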


jiajie999 commented Jun 20, 2018

Got the same problem.
Any update, @dichenli?

Thanks
