
feat: Retry row insertion when BigQuery API returns NOT_FOUND dataset #30

Conversation

jclarysse

Every couple of days, the BigQuery API might return 404 / NOT_FOUND for a Dataset. This happens even though the Dataset exists in the same GCP region where the connector is running. This causes the task to fail, and restarting the task resumes without any error. This change retries row insertion when this error occurs, instead of failing fast.

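For illustration, here is a minimal sketch of how the spurious error could be recognized from the BigQueryException that the client library throws (HTTP 404, reason notFound, a message naming a Dataset). The helper below is hypothetical and not part of this PR's diff:

import com.google.cloud.bigquery.BigQueryException;

// Hypothetical helper, not in the connector: recognize the spurious error by its
// HTTP code, error reason, and a message that names a Dataset rather than a Table.
final class DatasetNotFoundCheck {
  static boolean isDatasetNotFound(BigQueryException e) {
    return e.getCode() == 404
        && e.getError() != null
        && "notFound".equals(e.getError().getReason())
        && e.getMessage() != null
        && e.getMessage().contains("Not found: Dataset");
  }
}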
@C0urante
Collaborator

Thanks @jclarysse, and sorry for the delay (the whole company is doing an off-site this week).

Do you have more information about the circumstances that might lead to these kinds of spurious dataset-not-found errors? It seems like this should be reported upstream as a bug in BigQuery.

As far as the fix goes, it looks like this will add latency to the time it takes the connector to fail if it tries to write to a dataset that really does not exist. Would a single retry be sufficient instead?

I also think we may want to add this logic to more places than just the AdaptiveBigQueryWriter, since IIRC that class is only used when table creation/updates are enabled.

@jclarysse
Author

Thanks @C0urante for following up on this.

The dataset-not-found error is a very infrequent result of BigQuery API tabledata.insertAll requests. The log is as follows:

[2024-04-12 23:02:32,513] WARN [tilbud-offers-sink|task-1] Could not write batch of size 1 to BigQuery. Error code: 404, underlying error (if present): BigQueryError{reason=notFound, location=null, message=Not found: Dataset some-project-id:some_dataset} (com.wepay.kafka.connect.bigquery.write.batch.TableWriter:97)
com.google.cloud.bigquery.BigQueryException: Not found: Dataset some-project-id:some_dataset
	at com.google.cloud.bigquery.spi.v2.HttpBigQueryRpc.translate(HttpBigQueryRpc.java:115)
	at com.google.cloud.bigquery.spi.v2.HttpBigQueryRpc.insertAll(HttpBigQueryRpc.java:494)
	at com.google.cloud.bigquery.BigQueryImpl$28.call(BigQueryImpl.java:1068)
	at com.google.cloud.bigquery.BigQueryImpl$28.call(BigQueryImpl.java:1065)
	at com.google.api.gax.retrying.DirectRetryingExecutor.submit(DirectRetryingExecutor.java:103)
	at com.google.cloud.RetryHelper.run(RetryHelper.java:76)
	at com.google.cloud.RetryHelper.runWithRetries(RetryHelper.java:50)
	at com.google.cloud.bigquery.BigQueryImpl.insertAll(BigQueryImpl.java:1064)
	at com.wepay.kafka.connect.bigquery.write.row.AdaptiveBigQueryWriter.performWriteRequest(AdaptiveBigQueryWriter.java:96)
	at com.wepay.kafka.connect.bigquery.write.row.BigQueryWriter.writeRows(BigQueryWriter.java:116)
	at com.wepay.kafka.connect.bigquery.write.batch.TableWriter.run(TableWriter.java:93)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: com.google.api.client.googleapis.json.GoogleJsonResponseException: 404 Not Found
POST https://www.googleapis.com/bigquery/v2/projects/some-project-id/datasets/some_dataset/tables/some_table$20240412/insertAll?prettyPrint=false
{
  "code" : 404,
  "errors" : [ {
    "domain" : "global",
    "message" : "Not found: Dataset some-project-id:some_dataset",
    "reason" : "notFound"
  } ],
  "message" : "Not found: Dataset some-project-id:some_dataset",
  "status" : "NOT_FOUND"
}
	at com.google.api.client.googleapis.json.GoogleJsonResponseException.from(GoogleJsonResponseException.java:146)
	at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:118)
	at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:37)
	at com.google.api.client.googleapis.services.AbstractGoogleClientRequest$1.interceptResponse(AbstractGoogleClientRequest.java:428)
	at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1111)
	at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:514)
	at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:455)
	at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:565)
	at com.google.cloud.bigquery.spi.v2.HttpBigQueryRpc.insertAll(HttpBigQueryRpc.java:492)
	... 12 more

The error does not depend on the batch size. As far as I am aware, it has only occurred with partitioned tables in datasets located in the EU multi-region (physically stored in GCP region europe-west1), so there might be an API bug related to this specific scenario.

In general, the problem felt similar to the long-known "BigQuery: 404 table not found even when the table exists" issue, and my understanding is that issues related to BigQuery's eventual consistency should be handled on the client side. In this case there is no backend error or quota limit, so the connector option bigQueryRetry doesn't apply. We were looking for another way to retry and noticed that the table-not-found scenario was already handled here. Apart from that, I agree that a single retry should be sufficient.
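To make this concrete, here is a rough sketch of the single-retry behaviour around the insertAll call. This is not the connector's actual source; the class and method names are made up and the wait time is a placeholder:

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryException;
import com.google.cloud.bigquery.InsertAllRequest;
import com.google.cloud.bigquery.InsertAllResponse;

// Rough sketch, not the connector's code: retry the same insertAll request once
// when the spurious dataset-not-found error occurs, instead of failing the task.
final class SingleRetryInsert {
  static InsertAllResponse insertWithOneRetry(BigQuery bigQuery, InsertAllRequest request)
      throws InterruptedException {
    try {
      return bigQuery.insertAll(request);
    } catch (BigQueryException e) {
      // A 404 whose message names the Dataset, as in the log above.
      boolean datasetNotFound = e.getCode() == 404
          && e.getMessage() != null
          && e.getMessage().contains("Not found: Dataset");
      if (!datasetNotFound) {
        throw e;
      }
      Thread.sleep(1000L);                 // brief back-off; a real value could come from config
      return bigQuery.insertAll(request);  // a second failure propagates normally
    }
  }
}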

Does this sound like a valid PR, or do you feel that we are going in the wrong direction?

@C0urante
Collaborator

Thanks for the clarification!

Regarding this question:

Does this sound like a valid PR, or do you feel that we are going in the wrong direction?

I think retrying to handle any poor backend behavior is fine, regardless of whether it's expected (e.g., a documented limitation due to eventual consistency) or unexpected (e.g., a bug that might be patched in the future). The only difference is that if it seems like a bug, it should be reported upstream.

One thing I'm still unclear about is whether this is related to recently-created tables or datasets (which would definitely fall under the umbrella of eventual consistency issues), or if it occurs for tables/datasets that have existed for a while.

If it's for newly-created entities, then I think this patch is in pretty good shape.

If it's for entities that have existed for a while (and/or for which at least one write has already succeeded), then I think the logic should be moved out of the AdaptiveBigQueryWriter class and into the parent class (so that retries occur regardless of whether automatic table creation/updates are enabled in the connector) and we should limit the number of retries that take place (so that we can fail faster when the dataset truly does not exist).
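Something along those lines would look roughly like this. Purely illustrative: the class, method, and field names are hypothetical and this is not the connector's actual code:

import com.google.cloud.bigquery.BigQueryException;

// Illustrative only: a bounded retry in a base writer class, so it applies whether
// or not automatic table creation/updates are enabled.
abstract class BoundedRetryWriter {
  private static final int MAX_DATASET_NOT_FOUND_RETRIES = 1;  // keep failures fast when the dataset truly is missing
  private static final long RETRY_WAIT_MS = 1000L;

  // Stands in for the subclass-specific write (e.g. a batched insertAll).
  protected abstract void performWriteRequest() throws BigQueryException;

  public void writeWithBoundedRetry() throws InterruptedException {
    int attempt = 0;
    while (true) {
      try {
        performWriteRequest();
        return;
      } catch (BigQueryException e) {
        boolean datasetNotFound = e.getCode() == 404
            && e.getMessage() != null
            && e.getMessage().contains("Not found: Dataset");
        if (!datasetNotFound || attempt >= MAX_DATASET_NOT_FOUND_RETRIES) {
          throw e;  // genuinely missing dataset, some other error, or retries exhausted
        }
        attempt++;
        Thread.sleep(RETRY_WAIT_MS);
      }
    }
  }
}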

How does that sound?

@jclarysse
Author

Thanks @C0urante, great advice!
Meanwhile, the user seems to have moved away from this issue by enabling the Storage Write API.
As a result, I suggest not investing further in this PR unless someone else faces the same issue.

@jclarysse jclarysse closed this May 23, 2024