Hi.
I am developing a process to ingest data from my HDFS using Hudi. I want to partition the data using a custom key generator class where the partition key is a tuple columnName@NumPartitions; the custom key generator then uses the modulo function to send each row to one partition or another.
The initial load is the following:
spark.read.option("mergeSchema", "true").parquet("PATH").
  withColumn("_hoodie_is_deleted", lit(false)).
  write.format("hudi").
  option(OPERATION_OPT_KEY, "upsert").
  option(CDC_ENABLED.key(), "true").
  option(TABLE_NAME, tableName).
  option("hoodie.datasource.write.payload.class", "CustomOverwriteWithLatestAvroPayload").
  option("hoodie.avro.schema.validate", "false").
  option("hoodie.datasource.write.recordkey.field", "CID").
  option("hoodie.datasource.write.precombine.field", "sequential_total").
  option("hoodie.datasource.write.new.columns.nullable", "true").
  option("hoodie.datasource.write.reconcile.schema", "true").
  option("hoodie.metadata.enable", "false").
  option("hoodie.index.type", "SIMPLE").
  option("hoodie.datasource.write.table.type", "COPY_ON_WRITE").
  option("hoodie.datasource.write.keygenerator.class", "CustomKeyGenerator").
  option("hoodie.datasource.write.partitionpath.field", "CID@12").
  option("hoodie.datasource.write.drop.partition.columns", "true").
  mode(Overwrite).
  save("/tmp/hudi2")
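For reference, a minimal sketch of what such a key generator could look like, assuming it extends SimpleKeyGenerator and parses the bucket count out of the declared partition field (the class body here is an illustration, not the poster's actual code):

import org.apache.avro.generic.GenericRecord
import org.apache.hudi.common.config.TypedProperties
import org.apache.hudi.keygen.SimpleKeyGenerator

// Illustrative key generator for the "column@numPartitions" scheme: the
// declared partition field (e.g. "CID@12") is split into a column name and
// a bucket count, and each row is routed with a modulo over the column
// value's hash. Record keys are handled by the parent SimpleKeyGenerator.
class CustomKeyGenerator(props: TypedProperties) extends SimpleKeyGenerator(props) {

  private val parts = props.getString("hoodie.datasource.write.partitionpath.field").split("@")
  private val partitionColumn = parts(0)
  private val numBuckets = parts(1).toInt

  override def getPartitionPath(record: GenericRecord): String = {
    val value = String.valueOf(record.get(partitionColumn))
    // floorMod keeps the bucket id non-negative even for negative hash codes
    Math.floorMod(value.hashCode, numBuckets).toString
  }
}

Note that, depending on the Hudi version, the Row/InternalRow overloads of getPartitionPath may also need the same override for the Spark datasource write path to pick it up.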
I have added the property hoodie.datasource.write.drop.partition.columns because when I read the final path, Hudi throws the error: Cannot find columns: 'CID@12' in the schema
But with this property set, it does not work either. The error that appears is the following:
org.apache.hudi.internal.schema.HoodieSchemaException: Failed to fetch schema from the table
at org.apache.hudi.HoodieBaseRelation.$anonfun$x$2$10(HoodieBaseRelation.scala:179)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.hudi.HoodieBaseRelation.x$2$lzycompute(HoodieBaseRelation.scala:175)
at org.apache.hudi.HoodieBaseRelation.x$2(HoodieBaseRelation.scala:151)
at org.apache.hudi.HoodieBaseRelation.internalSchemaOpt$lzycompute(HoodieBaseRelation.scala:151)
at org.apache.hudi.HoodieBaseRelation.internalSchemaOpt(HoodieBaseRelation.scala:151)
at org.apache.hudi.BaseFileOnlyRelation.<init>(BaseFileOnlyRelation.scala:69)
at org.apache.hudi.DefaultSource$.resolveBaseFileOnlyRelation(DefaultSource.scala:321)
at org.apache.hudi.DefaultSource$.createRelation(DefaultSource.scala:262)
at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:118)
at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:74)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:350)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:274)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:245)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:245)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:188)
... 63 elided
hoodie.datasource.write.drop.partition.columns is false by default; when set to true, the partition columns are not written into the data files. Either way, the partition field you declare here should be a field name from the schema, not a value like CID@12.
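In config terms, that suggestion amounts to something like the following sketch, with df standing for the DataFrame read in the initial load above (CID is used purely to show the syntax; the bucket count would live inside the key generator rather than in the field name):

// The partition field names a real schema column, not a value; with
// drop.partition.columns set to true the column is then omitted from the
// written data files.
df.write.format("hudi").
  option("hoodie.datasource.write.recordkey.field", "CID").
  option("hoodie.datasource.write.precombine.field", "sequential_total").
  option("hoodie.datasource.write.partitionpath.field", "CID"). // a field name, not "CID@12"
  option("hoodie.datasource.write.drop.partition.columns", "true").
  option("hoodie.datasource.write.keygenerator.class", "CustomKeyGenerator").
  option(TABLE_NAME, tableName).
  mode(Overwrite).
  save("/tmp/hudi2")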
Also, is there any way of partitioning the data using a hash function of the row's primary key, to improve performance when updating rows? I have developed a custom BuiltinKeyGenerator overriding the method getPartitionPath (I take the partition path, which is the primary key, and apply the operation % numBuckets), but the problem is that when I read the data back, the value in the primary-key column is the result of that operation instead of the real value.
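A minimal sketch of that idea, under the assumption that only the partition path should be derived from the key's hash while getRecordKey keeps returning the real value (the bucket count is hard-coded here for brevity):

import org.apache.avro.generic.GenericRecord
import org.apache.hudi.common.config.TypedProperties
import org.apache.hudi.keygen.SimpleKeyGenerator

// Bucket by a hash of the record key: only the partition path is computed
// from the hash; the primary-key column itself is never rewritten.
class HashBucketKeyGenerator(props: TypedProperties) extends SimpleKeyGenerator(props) {

  private val numBuckets = 12 // assumed bucket count; could come from a custom property

  override def getPartitionPath(record: GenericRecord): String = {
    val key = getRecordKey(record) // the real primary-key value, left untouched
    Math.floorMod(key.hashCode, numBuckets).toString
  }
}

If the primary-key column comes back holding the modulo result, that suggests the generator (or the write config) is overwriting the key field itself rather than only the partition path, which a separation like the one above avoids.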