New Column addition in Spark Bigquery Connector 2.12_0.29.0 vs 2.12_0.20.0 #945
Comments
I think I'm seeing this exact issue right now, actually. I have the following options set in my Spark session config:

```python
# Allow new fields to be added to struct (RECORD in BQ) columns that
# differ from what is already present in the BQ table being written to.
spark.conf.set("allowFieldAddition", "true")
# Any fields that differ between the data being appended and the
# existing BQ table are filled with null.
spark.conf.set("allowFieldRelaxation", "true")
```

With Dataproc Serverless v1.0 (where we were using 2.12-0.23.2 of the connector) we could append a dataframe to a table where there was not a perfect overlap of columns, and the new columns would be added (with nulls for the pre-existing rows).
I can put together an MWE if it's helpful, but I think @bbarodia has probably explained the premise well.
Hi folks, we are observing similar behaviour with 0.30 too. Please let us know if this is expected or a bug. Also, the BigQuery Console can handle adding a column to the schema without deleting previous partitions. The only way we have found to write a dataframe that has a changed schema to BigQuery is to use the …
Yeah, this is a bug; we are investigating what has changed to cause this. Will update the ticket soon.
Hi @davidrabinowitz: when adding or deleting a field in an existing table for a particular partition day, when we use … Is that the expected behavior from BigQuery? Can we keep the old data in our partitions when changing schemas? Are there any flags that we can use for BigQuery? We have tried using the allowFieldAddition flag during the write, but that does not work.
Hey, so in … Also found a similar issue: googleapis/python-bigquery#1095
Hi @suryasoma, thank you for the clarification. I was expecting the same behavior as well. If possible, can you comment on what the timeline will look like? Will there be a new connector version released? Thanks
I am the primary caretaker of the Python BigQuery library and have been looking at googleapis/python-bigquery#1095, but I do not have a solution yet. I am going to try to adjust my schedule for next week to see what I can do to direct some attention to the issue in the Python BQ library.
Hi @suryasoma / @chalmerlowe: …
The fix for this is merged, @bbarodia. You can find it in the next release.
Amazing, thank you @suryasoma. Forgive my ignorance, but how often do releases typically happen?
@suryasoma: could you please comment on when this will be released?
We plan to have a release soon, in the upcoming weeks. I will update the ticket once the release is done.
@bbarodia, please find the fix in the latest release.
Thanks, will check it out.
@suryasoma I've just tried this and find it is still not behaving exactly as it was before. Running the below with Dataproc batches=2.0 and …

```python
from pyspark.sql import (
    SparkSession,
    types as T,
)

spark = SparkSession.builder.getOrCreate()
spark.conf.set("viewsEnabled", "true")
spark.conf.set("materializationDataset", 'test_space')
spark.conf.set('temporaryGcsBucket', staging_bucket)
spark.conf.set("intermediateFormat", "orc")
# Allow new fields to be added to struct(record in BQ) column that
# differ from what is already present in the BQ table being written to.
spark.conf.set("allowFieldAddition", "true")
# Any fields different between current data to be appended and already
# in BQ table are filled with Null.
spark.conf.set("allowFieldRelaxation", "true")
output_table_path = 'test_space.column_check'
print('Case 1')
df = spark.createDataFrame(
    data=[
        (1, 'a'),
        (2, 'b'),
    ],
    schema=T.StructType([
        T.StructField('number', T.IntegerType(), True),
        T.StructField('letter', T.StringType(), True),
    ])
)
df.show()
df.printSchema()
df.write.save(output_table_path, format="bigquery", mode='append')

print('Case 2')
df = spark.createDataFrame(
    data=[
        (3, 'c', True),
        (4, 'd', False),
    ],
    schema=T.StructType([
        T.StructField('number', T.IntegerType(), True),
        T.StructField('letter', T.StringType(), True),
        T.StructField('bool', T.BooleanType(), True),
    ])
)
df.show()
df.printSchema()
df.write.save(output_table_path, format="bigquery", mode='append')

print('Case 3')
df = spark.createDataFrame(
    data=[
        (5, 'e', (55, 'ee')),
        (6, 'f', (66, 'ff')),
    ],
    schema=T.StructType([
        T.StructField('number', T.IntegerType(), True),
        T.StructField('letter', T.StringType(), True),
        T.StructField(
            'struct_column',
            T.StructType([
                T.StructField('more_numbers', T.IntegerType(), True),
                T.StructField('more_letters', T.StringType(), True),
            ]),
            True
        ),
    ])
)
df.show()
df.printSchema()
df.write.save(output_table_path, format="bigquery", mode='append')

print('Case 4')
df = spark.createDataFrame(
    data=[
        (7, 'g', (77, 'gg', True)),
        (8, 'h', (88, 'hh', False)),
    ],
    schema=T.StructType([
        T.StructField('number', T.IntegerType(), True),
        T.StructField('letter', T.StringType(), True),
        T.StructField(
            'struct_column',
            T.StructType([
                T.StructField('more_numbers', T.IntegerType(), True),
                T.StructField('more_letters', T.StringType(), True),
                T.StructField('more_bools', T.BooleanType(), True),
            ]),
            True
        ),
    ])
)
df.show()
df.printSchema()
df.write.save(output_table_path, format="bigquery", mode='append')
```

This works for writing out case 1, and then case 2, which adds a new column; however, it then crashes when attempting to write case 3.
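One way to verify which of the appends actually changed the destination schema is to inspect the table with the google-cloud-bigquery client; a minimal sketch, not from the original comment (the table path is the output_table_path used above):

```python
# Sketch: print the destination table's schema, including RECORD sub-fields,
# to see which of the four cases actually added columns.
from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table("test_space.column_check")

def describe(fields, indent=0):
    # Recursively print column names and types, descending into RECORDs.
    for field in fields:
        print(" " * indent + f"{field.name}: {field.field_type}")
        if field.field_type == "RECORD":
            describe(field.fields, indent + 2)

describe(table.schema)
```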
Found an issue when adding a new numeric field: #997
FYI, this is still not working for me (new fields are not being created in the BigQuery table I'm loading into) with the latest version of the connector. I'm not sure if it should matter, but in addition to …
I have the same problem, also with … The LoadJobConfiguration in the logs shows …
I solved my issue when using pyspark: …
I have a similar case to @pricemg: any field addition to a "sub-level" fails.
@rui-castro-ebury, can you please create a new issue? You can add a reference to this issue if needed.
The following worked for me with the BigQuery pyspark connector: …
Hi,

We wanted to upgrade from 0.20.0 to 0.29.0 but noticed the following behavior.

Setup to write to BigQuery: …

Test case: …

Expected behaviour: the new column is added to the table.

With 0.29.0:
We do not see the new column in the table. We see this in the logs: `autodetect=null`

With 0.20.0:
We DO see the new column in the table. We see `autodetect=true` and `schema=null`.
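A minimal sketch of the kind of test case described (the report's exact setup is elided above, so the table, bucket, and column names here are hypothetical):

```python
# Hypothetical repro: append a dataframe containing one extra column and
# check whether the new column appears in the existing BigQuery table.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("temporaryGcsBucket", "my-staging-bucket")  # placeholder bucket

table = "my_dataset.upgrade_check"  # placeholder table

# First append: establishes the table with two columns.
spark.createDataFrame([(1, "a")], ["number", "letter"]) \
    .write.format("bigquery").mode("append").save(table)

# Second append: same columns plus a new 'flag' column. Per this report,
# 0.20.0 adds the column to the table, while 0.29.0 does not.
spark.createDataFrame([(2, "b", True)], ["number", "letter", "flag"]) \
    .write.format("bigquery").mode("append").save(table)
```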