
New Column addition in Spark Bigquery Connector 2.12_0.29.0 vs 2.12_0.20.0 #945

Closed
bbarodia opened this issue Apr 10, 2023 · 23 comments · Fixed by #958
@bbarodia

Hi,

We wanted to upgrade to 0.29.0 from 0.20.0 but noticed the following behavior.

Setup to write to BigQuery:

final_df.write.format("bigquery") \
            .option("temporaryGcsBucket", c.TEMPORARY_GCS_BUCKET) \
            .option("table", self.gbq_project + ":" + final_dataset_name + "." + output_short_table_name) \
            .option("createDisposition", "CREATE_IF_NEEDED") \
            .option("schemaUpdateOptions", "ALLOW_FIELD_ADDITION" if use_gbq_overwrite_mode else '') \
            .option("decimal", "NUMERIC") \
            .option("partitionField", partition_column) \
            .option("partitionType", "DAY") \
            .mode('overwrite' if use_gbq_overwrite_mode else 'append') \
            .save()

Test case:

  1. Write dataframe to table
  2. Add a column to the dataframe and write to the table again

Expected behaviour: the new column is added to the table.
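
For concreteness, a minimal sketch of this two-step test (illustrative only: the bucket and table names are placeholders, and the write mirrors the setup above with field addition enabled):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    def write_to_bq(df):
        # Same write setup as above: overwrite mode with field addition enabled.
        df.write.format("bigquery") \
            .option("temporaryGcsBucket", "my-temp-bucket") \
            .option("table", "my-project:my_dataset.my_table") \
            .option("createDisposition", "CREATE_IF_NEEDED") \
            .option("schemaUpdateOptions", "ALLOW_FIELD_ADDITION") \
            .mode("overwrite") \
            .save()

    # 1. Write the initial dataframe.
    write_to_bq(spark.createDataFrame([("a",)], ["field_1"]))

    # 2. Add a column and write again; the new column is expected to be
    #    added to the BigQuery table schema.
    write_to_bq(spark.createDataFrame([("a", "b")], ["field_1", "test_string_field"]))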

With 0.29.0:

We do not see the new column in the table.

In the logs we see autodetect=null and

    schema=Schema{fields=[Field{name=field_1, type=STRING, mode=NULLABLE, description=null, policyTags=null, maxLength=null, scale=null, precision=null, defaultValueExpression=null, collation=null}, Field{name=test_string_field, type=STRING, mode=REQUIRED, description=null, policyTags=null, maxLength=null, scale=null, precision=null, defaultValueExpression=null, collation=null},

With 0.20.0:
We DO see the new column in the table.

In the logs we see autodetect=true and schema=null.

@bbarodia changed the title from "Spark Bigquery Connector 2.12_0.29.0 vs 2.12_0.20.0" to "New Column addition in Spark Bigquery Connector 2.12_0.29.0 vs 2.12_0.20.0" on Apr 10, 2023
@pricemg

pricemg commented Apr 11, 2023

I think I'm seeing this exact issue right now actually.

Having the following options set in my spark session config

        # Allow new fields to be added to struct (RECORD in BQ) columns that
        # differ from what is already present in the BQ table being written to.
        spark.conf.set("allowFieldAddition", "true")
        # Any fields that differ between the data being appended and the schema
        # already in the BQ table are filled with Null.
        spark.conf.set("allowFieldRelaxation", "true")

With dataproc-serverless v1.0 (where we were using 2.12-0.23.2 of the connector) we could append a dataframe to a table when there was not a perfect overlap of columns, and the new columns would be added (with Null set in the rows where the new columns weren't present previously). In dataproc-serverless v2.0 (where we're now using the 2.13-0.28.0 connector) I am still seeing the rows being appended, but the additional columns are not present.

@suryasoma self-assigned this Apr 14, 2023
@pricemg

pricemg commented Apr 16, 2023

I can put together an MWE if it's helpful, but I think @bbarodia has probably explained the premise well.

@bbarodia
Author

Hi Folks, we are observing similar behaviour with 0.30 too. Please let us know if this is expected or a bug.

Also, the BigQuery console can handle adding a column to the schema without deleting previous partitions. Currently the only way to write a dataframe with a changed schema to BigQuery is to use mode: overwrite, which deletes the old partitions. Can this be handled with mode: append, so that we do not need to delete the old partitions?

@suryasoma
Contributor

Yes, this is a bug; we are investigating what changed to cause it and will update the ticket soon.
Thanks for pointing out the issue.

@bbarodia
Author

Hi @davidrabinowitz:

When adding or deleting a field in an existing table for a particular partition day, using mode: overwrite for BigQuery deletes the older partitions, and only the partition that we wrote remains.

Is that the expected behavior from BigQuery? Can we keep old data in our partitions when changing schemas? Are there any flags that we can use for BigQuery?

We have tried using the allowFieldAddition flag during the write, but that does not work.

@suryasoma
Contributor

Hey, with overwrite the data itself is overwritten, so if you want to keep the older data as well, mode: append is what you should use.
With append, to update the schema you also need to set the allowFieldAddition option.
The connector is passing the option to the BigQuery load job, but it looks like there is an issue in the load job. We have raised a bug on this and are tracking it.
We will update the issue as soon as the bug is resolved.

Also found a similar issue googleapis/python-bigquery#1095
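
For reference, a rough sketch of the usage described here, i.e. mode append plus the allowFieldAddition option (a dataframe df is assumed, and the bucket and table names are placeholders):

    # Append new rows and ask the BigQuery load job to add any columns that are
    # present in the dataframe but missing from the table.
    df.write.format("bigquery") \
        .option("temporaryGcsBucket", "my-temp-bucket") \
        .option("table", "my-project:my_dataset.my_table") \
        .option("allowFieldAddition", "true") \
        .mode("append") \
        .save()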

@bbarodia
Author

Hi @suryasoma,

Thank you for the clarification. I was expecting the same behavior as well. If possible, can you comment on what the timeline will look like? Will there be a new connector version released?

Thanks

@chalmerlowe

I am the primary caretaker of the Python BigQuery library and have been looking at googleapis/python-bigquery#1095, but I do not have a solution yet.

I am going to try and adjust my schedule for next week to see what I can do to direct some attention to the issue in the Python BQ library.

@bbarodia
Author

bbarodia commented May 3, 2023

Hi @suryasoma / @chalmerlowe:
Any updates on the timeline for these changes?

@suryasoma linked a pull request on May 3, 2023 that will close this issue
@suryasoma
Contributor

The fix for this is merged, @bbarodia. It will be available in the next release.

@pricemg

pricemg commented May 4, 2023

Amazing, thank you @suryasoma. Excuse my ignorance, but how often do releases typically happen?

@bbarodia
Author

@suryasoma: could you please comment on when this will be released?

@suryasoma
Contributor

We plan to have a release soon, in the upcoming weeks. I will update the ticket once the release is done.
Thanks :)

@suryasoma
Contributor

@bbarodia, please find the fix in the latest release, 0.31.0.

@bbarodia
Author

bbarodia commented Jun 2, 2023

Thanks, will check it out

@pricemg

pricemg commented Jun 7, 2023

@suryasoma I've just tried this and found it is still not behaving exactly as it did before.

Running the below with Dataproc batches=2.0 and --jars gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.13-0.31.0.jar

    from pyspark.sql import (
        SparkSession,
        types as T,
    )

    spark = SparkSession.builder.getOrCreate()
    spark.conf.set("viewsEnabled", "true")
    spark.conf.set("materializationDataset", 'test_space')
    spark.conf.set('temporaryGcsBucket', staging_bucket)
    spark.conf.set("intermediateFormat", "orc")
    # Allow new fields to be added to struct(record in BQ) column that
    # differ from what is already present in the BQ table being written to.
    spark.conf.set("allowFieldAddition", "true")
    # Any fields different between current data to be appended and already
    # in BQ table are filled with Null.
    spark.conf.set("allowFieldRelaxation", "true")

    output_table_path = 'test_space.column_check'

    print('Case 1')
    df = spark.createDataFrame(
        data=[
            (1, 'a'),
            (2, 'b'),
        ],
        schema=T.StructType([
            T.StructField('number', T.IntegerType(), True),
            T.StructField('letter', T.StringType(), True),
        ])
    )
    df.show()
    df.printSchema()
    df.write.save(output_table_path, format="bigquery", mode='append')

    print('Case 2')
    df = spark.createDataFrame(
        data=[
            (3, 'c', True),
            (4, 'd', False),
        ],
        schema=T.StructType([
            T.StructField('number', T.IntegerType(), True),
            T.StructField('letter', T.StringType(), True),
            T.StructField('bool', T.BooleanType(), True),
        ])
    )
    df.show()
    df.printSchema()
    df.write.save(output_table_path, format="bigquery", mode='append')

    print('Case 3')
    df = spark.createDataFrame(
        data=[
            (5, 'e', (55, 'ee')),
            (6, 'f', (66, 'ff')),
        ],
        schema=T.StructType([
            T.StructField('number', T.IntegerType(), True),
            T.StructField('letter', T.StringType(), True),
            T.StructField(
                'struct_column',
                T.StructType([
                    T.StructField('more_numbers', T.IntegerType(), True),
                    T.StructField('more_letters', T.StringType(), True),
                ]),
                True
            ),
        ])
    )
    df.show()
    df.printSchema()
    df.write.save(output_table_path, format="bigquery", mode='append')

    print('Case 4')
    df = spark.createDataFrame(
        data=[
            (7, 'g', (77, 'gg', True)),
            (8, 'h', (88, 'hh', False)),
        ],
        schema=T.StructType([
            T.StructField('number', T.IntegerType(), True),
            T.StructField('letter', T.StringType(), True),
            T.StructField(
                'struct_column',
                T.StructType([
                    T.StructField('more_numbers', T.IntegerType(), True),
                    T.StructField('more_letters', T.StringType(), True),
                    T.StructField('more_bools', T.BooleanType(), True),
                ]),
                True
            ),
        ])
    )
    df.show()
    df.printSchema()
    df.write.save(output_table_path, format="bigquery", mode='append')

This works for writing out case 1, and then case 2, which adds a new column; however, it then crashes when attempting to write case 3.

@bbarodia
Author

Found an issue when adding a new numeric field : #997

@bkinzle

bkinzle commented Sep 17, 2023

FYI, this is still not working for me (new fields are not being created in the BigQuery table I'm loading into) with the latest version of the connector:
spark-bigquery-with-dependencies_2.13-0.32.2.jar

I'm not sure if it should matter, but in addition to schemaUpdateOptions=[ALLOW_FIELD_ADDITION], my load job is also using the following (a rough sketch follows this list):

  • writeDisposition=WRITE_TRUNCATE
  • The target table is partitioned by DAY and a specific datePartition is being passed in.
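
Roughly, this is what that setup looks like through the connector (a sketch only: names and the partition date are placeholders, WRITE_TRUNCATE is what mode overwrite maps to, and allowFieldAddition is the option that sets schemaUpdateOptions=[ALLOW_FIELD_ADDITION]):

    # Overwrite (WRITE_TRUNCATE) a specific day partition of a DAY-partitioned
    # table while allowing new fields to be added; df is an existing dataframe.
    df.write.format("bigquery") \
        .option("temporaryGcsBucket", "my-temp-bucket") \
        .option("table", "my-project:my_dataset.my_table") \
        .option("partitionField", "event_date") \
        .option("partitionType", "DAY") \
        .option("datePartition", "20230917") \
        .option("allowFieldAddition", "true") \
        .mode("overwrite") \
        .save()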

@jherrmannNetfonds

jherrmannNetfonds commented Oct 18, 2023

I have the same problem with spark-bigquery-with-dependencies_2.12-0.32.2.jar: a table partitioned by date, intermediateFormat set to avro, useAvroLogicalTypes set to true, allowFieldAddition set to true, and using overwrite mode. I am using PySpark.
Getting:
com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryException: Provided Schema does not match Table XXXX. Cannot add fields (field: xxxxxxx)

The LoadJobConfiguration in the logs shows schemaUpdateOptions=null; I would expect ALLOW_FIELD_ADDITION.

@jherrmannNetfonds

I solved my issue when using PySpark.
I was using:
df.write.format("bigquery").option("table", "table_name").option("allowFieldAddition ", True)
This does not work. If I instead add allowFieldAddition to the Spark config via spark.conf.set("allowFieldAddition", "true"), it works.
The strange thing is, if I add allowFieldRelaxation as an option to the writer, schemaUpdateOptions correctly contains ALLOW_FIELD_RELAXATION, but it does not contain ALLOW_FIELD_ADDITION when that is added as an option to the writer.
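
A minimal sketch of that workaround (spark, df, the table name, and the bucket are placeholders; the overwrite mode follows the earlier comment):

    # Workaround: set allowFieldAddition on the Spark session config instead of
    # as a writer option; with this, schemaUpdateOptions contained ALLOW_FIELD_ADDITION.
    spark.conf.set("allowFieldAddition", "true")
    df.write.format("bigquery") \
        .option("table", "table_name") \
        .option("temporaryGcsBucket", "my-temp-bucket") \
        .mode("overwrite") \
        .save()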

@rui-castro-ebury

I have a similar case to @pricemg:
Adding new fields works if those new fields are at the dataframe root level!
If I have a struct field and I change it by adding a new field, it fails with:
Provided Schema does not match Table XXXX. Cannot add required fields to an existing schema. (field: struct.struct_c)

Any field addition to a "sub-level" fails.

@vishalkarve15
Contributor

@rui-castro-ebury can you please create a new issue? You can add a reference to this issue if needed.

@erajabi

erajabi commented Apr 25, 2024

The following worked for me with the BigQuery PySpark connector:
.option("allowFieldAddition ", True)
