Airbyte creating too many attempts and not terminating old ones #38187

Open
mjrlgue opened this issue May 14, 2024 · 2 comments
Labels
area/platform · community · team/platform-move · type/bug

Comments


mjrlgue commented May 14, 2024

Helm Chart Version

0.44.1

What step the error happened?

During the Sync

Relevant information

In a Kubernetes cluster managed by another team, we deployed Airbyte and created some pipelines (CSV to a ClickHouse database, and PostgreSQL to ClickHouse). The syncs ran daily for a couple of days but then started failing. After troubleshooting, we found that one of the orchestrator-repl-job-50-attempt-x pods, which is responsible for writing data to ClickHouse, had insufficient CPU:
[screenshot: orchestrator-repl-job-50-attempt-x pod showing insufficient CPU]

We could resolve it by adding more Kubernetes nodes or freeing some resources, but we found that many pods with names such as orchestrator-repl-job-50-attempt-X, destination-clickhouse-check-48-X-yzgdz, n-clickhouse-check-1dd1ea2d-a22d-4b9a-bc6d-828, rce-mysql-discover-09f290d4-c311-426e-bdc8-53f88f4059f1-0-eqymi, etc. were not being deleted by Airbyte. It seems that Airbyte keeps launching new attempts one after another:
[screenshots: leftover orchestrator, check, and discover pods accumulating in the namespace]

Checking the documentation about configuring job parameters, I would like to force the number of attempts to 2, for example via SYNC_JOB_MAX_ATTEMPTS, but I can't find where to configure it. Is it done by updating the airbyte-env ConfigMap, or in which section of values.yaml? I would appreciate a confirmation so I can experiment with it.
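For reference, this is what I was planning to try in values.yaml, as a rough sketch only. I'm assuming the chart exposes a worker.extraEnv list; please correct me if the key differs in chart 0.44.1 or if this belongs in the airbyte-env ConfigMap instead:

# values.yaml — sketch, not verified against chart 0.44.1
worker:
  extraEnv:
    # Guessing the placement: cap the number of attempts per sync job using
    # the SYNC_JOB_MAX_ATTEMPTS variable mentioned in the docs.
    - name: SYNC_JOB_MAX_ATTEMPTS
      value: "2"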

I'm new to Airbyte, and my main question is: why doesn't Airbyte delete the old pods when it runs many attempts? Is it a bug?

Thanks,

Marwane.

Relevant log output

"avgExecTimeInNanos" : "NaN"
    }
  }
}
2024-05-12 16:30:23 replication-orchestrator > failures: [ {
  "failureOrigin" : "replication",
  "internalMessage" : "io.airbyte.workers.exception.WorkerException: Failed to create pod for read step",
  "externalMessage" : "Something went wrong during replication",
  "metadata" : {
    "attemptNumber" : 4,
    "jobId" : 48
  },
  "stacktrace" : "java.lang.RuntimeException: io.airbyte.workers.exception.WorkerException: Failed to create pod for read step\n\tat io.airbyte.workers.general.ReplicationWorkerHelper.startSource(ReplicationWorkerHelper.kt:214)\n\tat io.airbyte.workers.general.BufferedReplicationWorker.lambda$run$1(BufferedReplicationWorker.java:177)\n\tat io.airbyte.workers.general.BufferedReplicationWorker.lambda$runAsync$2(BufferedReplicationWorker.java:252)\n\tat java.base/java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1804)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)\n\tat java.base/java.lang.Thread.run(Thread.java:1583)\nCaused by: io.airbyte.workers.exception.WorkerException: Failed to create pod for read step\n\tat io.airbyte.workers.process.KubeProcessFactory.create(KubeProcessFactory.java:197)\n\tat io.airbyte.workers.process.AirbyteIntegrationLauncher.read(AirbyteIntegrationLauncher.java:226)\n\tat io.airbyte.workers.internal.DefaultAirbyteSource.start(DefaultAirbyteSource.java:84)\n\tat io.airbyte.workers.general.ReplicationWorkerHelper.startSource(ReplicationWorkerHelper.kt:212)\n\t... 6 more\nCaused by: io.fabric8.kubernetes.client.KubernetesClientTimeoutException: Timed out waiting for [900000] milliseconds for [Pod] with name:[source-file-read-48-4-zjhxz] in namespace [airbyte].\n\tat io.fabric8.kubernetes.client.dsl.internal.BaseOperation.waitUntilCondition(BaseOperation.java:893)\n\tat io.fabric8.kubernetes.client.dsl.internal.BaseOperation.waitUntilCondition(BaseOperation.java:93)\n\tat io.airbyte.workers.process.KubePodProcess.waitForInitPodToRun(KubePodProcess.java:382)\n\tat io.airbyte.workers.process.KubePodProcess.<init>(KubePodProcess.java:652)\n\tat io.airbyte.workers.process.KubeProcessFactory.create(KubeProcessFactory.java:193)\n\t... 9 more\n",
  "timestamp" : 1715531423092
}, {
  "failureOrigin" : "replication",
  "internalMessage" : "io.airbyte.workers.exception.WorkerException: Failed to create pod for write step",
  "externalMessage" : "Something went wrong during replication",
  "metadata" : {
    "attemptNumber" : 4,
    "jobId" : 48
  },
  "stacktrace" : "java.lang.RuntimeException: io.airbyte.workers.exception.WorkerException: Failed to create pod for write step\n\tat io.airbyte.workers.general.ReplicationWorkerHelper.startDestination(ReplicationWorkerHelper.kt:196)\n\tat io.airbyte.workers.general.BufferedReplicationWorker.lambda$run$0(BufferedReplicationWorker.java:176)\n\tat io.airbyte.workers.general.BufferedReplicationWorker.lambda$runAsync$2(BufferedReplicationWorker.java:252)\n\tat java.base/java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1804)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)\n\tat java.base/java.lang.Thread.run(Thread.java:1583)\nCaused by: io.airbyte.workers.exception.WorkerException: Failed to create pod for write step\n\tat io.airbyte.workers.process.KubeProcessFactory.create(KubeProcessFactory.java:197)\n\tat io.airbyte.workers.process.AirbyteIntegrationLauncher.write(AirbyteIntegrationLauncher.java:264)\n\tat io.airbyte.workers.internal.DefaultAirbyteDestination.start(DefaultAirbyteDestination.java:101)\n\tat io.airbyte.workers.general.ReplicationWorkerHelper.startDestination(ReplicationWorkerHelper.kt:194)\n\t... 6 more\nCaused by: io.fabric8.kubernetes.client.KubernetesClientTimeoutException: Timed out waiting for [900000] milliseconds for [Pod] with name:[destination-clickhouse-write-48-4-zjeby] in namespace [airbyte].\n\tat io.fabric8.kubernetes.client.dsl.internal.BaseOperation.waitUntilCondition(BaseOperation.java:893)\n\tat io.fabric8.kubernetes.client.dsl.internal.BaseOperation.waitUntilCondition(BaseOperation.java:93)\n\tat io.airbyte.workers.process.KubePodProcess.waitForInitPodToRun(KubePodProcess.java:382)\n\tat io.airbyte.workers.process.KubePodProcess.<init>(KubePodProcess.java:652)\n\tat io.airbyte.workers.process.KubeProcessFactory.create(KubeProcessFactory.java:193)\n\t... 9 more\n",
  "timestamp" : 1715531423096
} ]
2024-05-12 16:30:23 replication-orchestrator > Returning output...
2024-05-12 16:30:23 replication-orchestrator > Writing async status SUCCEEDED for KubePodInfo[namespace=airbyte, name=orchestrator-repl-job-48-attempt-4, mainContainerInfo=KubeContainerInfo[image=airbyte/container-orchestrator:0.50.55, pullPolicy=IfNotPresent]]...
2024-05-12 16:30:23 replication-orchestrator > 
2024-05-12 16:30:23 replication-orchestrator > ----- END REPLICATION -----
2024-05-12 16:30:23 replication-orchestrator > 
2024-05-12 16:30:24 platform > State Store reports orchestrator pod orchestrator-repl-job-48-attempt-4 succeeded
2024-05-12 16:30:25 platform > Retry State: RetryManager(completeFailureBackoffPolicy=BackoffPolicy(minInterval=PT10S, maxInterval=PT30M, base=3), partialFailureBackoffPolicy=null, successiveCompleteFailureLimit=5, totalCompleteFailureLimit=10, successivePartialFailureLimit=1000, totalPartialFailureLimit=10, successiveCompleteFailures=5, totalCompleteFailures=5, successivePartialFailures=0, totalPartialFailures=0)
 Backoff before next attempt: 13 minutes 30 seconds
2024-05-12 16:30:25 platform > Failing job: 48, reason: Job failed after too many retries for connection 3d45ba7e-a227-4d44-bff5-b0521340bbd5
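Note: in the retry state above, successiveCompleteFailures has reached the successiveCompleteFailureLimit of 5, which is why job 48 is failed. The 13 minutes 30 seconds backoff is also consistent with the logged policy if it is applied as minInterval × base^(failures − 1) = 10 s × 3^4 = 810 s (an assumption about the formula on my part, inferred from the logged values).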
@marcosmarxm (Member)

@mjrlgue, the pod-sweeper service is in charge of cleaning up these pods over time.

@davinchia, any ideas on this? It could potentially exhaust the number of pods that can be created in the namespace (though I'm not certain about this).

@davinchia (Contributor)

Indeed, there should be a pod sweeper removing terminal pods, e.g. Completed or Failed ones.

I'm surprised it's the terminal pod buildup that is causing issues. Perhaps you are running into general resource issues? If you are, reducing the number of attempts won't really help since you are trading job reliability for resources.
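In the meantime, terminal pods can be cleared manually if the sweeper isn't keeping up. A generic sketch, assuming everything runs in the airbyte namespace:

# delete pods in terminal phases (completed successfully or failed)
kubectl delete pods -n airbyte --field-selector=status.phase==Succeeded
kubectl delete pods -n airbyte --field-selector=status.phase==Failed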
