Airbyte creating too many attempts and not terminating old ones #38187

Open
mjrlgue opened this issue May 14, 2024 · 2 comments
Labels
area/platform · community · team/platform-move · type/bug

Comments


mjrlgue commented May 14, 2024

Helm Chart Version

0.44.1

What step the error happened?

During the Sync

Relevant information

In a Kubernetes cluster managed by another team, we deployed Airbyte and created some pipelines (CSV to a ClickHouse database, and PostgreSQL to ClickHouse). The syncs ran daily for a couple of days but then started failing. After troubleshooting, we found that one of the orchestrator-repl-job-50-attempt-x pods, which is responsible for writing data to ClickHouse, had insufficient CPU:
[screenshot: orchestrator-repl-job-50-attempt-x pod showing insufficient CPU]

We could resolve it by adding more Kubernetes nodes or freeing some resources, but we found that many pods with names such as orchestrator-repl-job-50-attempt-X, destination-clickhouse-check-48-X-yzgdz, n-clickhouse-check-1dd1ea2d-a22d-4b9a-bc6d-828, rce-mysql-discover-09f290d4-c311-426e-bdc8-53f88f4059f1-0-eqymi, etc. were not being deleted by Airbyte. It seems that Airbyte keeps launching new attempts one after another:
[screenshots: leftover orchestrator, check, and discover pods accumulating in the namespace]

Checking the documentation about configuring job parameters, I would like to force the number of attempts to 2, for example via SYNC_JOB_MAX_ATTEMPTS, but I can't find where to configure it. Is it done by updating the airbyte-env ConfigMap, or in which section of values.yaml? I would appreciate a confirmation so I can experiment with it.
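For reference, this is what I was planning to try in values.yaml, as a rough sketch only. I'm assuming the chart exposes a worker.extraEnv list; please correct me if the key differs in chart 0.44.1 or if this belongs in the airbyte-env ConfigMap instead:

# values.yaml — sketch, not verified against chart 0.44.1
worker:
  extraEnv:
    # Guessing the placement: cap the number of attempts per sync job using
    # the SYNC_JOB_MAX_ATTEMPTS variable mentioned in the docs.
    - name: SYNC_JOB_MAX_ATTEMPTS
      value: "2"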

I'm new to Airbyte, and my main question is: why doesn't Airbyte delete the old pods when it runs many attempts? Is it a bug?

Thanks,

Marwane.

Relevant log output

"avgExecTimeInNanos" : "NaN"
    }
  }
}
2024-05-12 16:30:23 replication-orchestrator > failures: [ {
  "failureOrigin" : "replication",
  "internalMessage" : "io.airbyte.workers.exception.WorkerException: Failed to create pod for read step",
  "externalMessage" : "Something went wrong during replication",
  "metadata" : {
    "attemptNumber" : 4,
    "jobId" : 48
  },
  "stacktrace" : "java.lang.RuntimeException: io.airbyte.workers.exception.WorkerException: Failed to create pod for read step\n\tat io.airbyte.workers.general.ReplicationWorkerHelper.startSource(ReplicationWorkerHelper.kt:214)\n\tat io.airbyte.workers.general.BufferedReplicationWorker.lambda$run$1(BufferedReplicationWorker.java:177)\n\tat io.airbyte.workers.general.BufferedReplicationWorker.lambda$runAsync$2(BufferedReplicationWorker.java:252)\n\tat java.base/java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1804)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)\n\tat java.base/java.lang.Thread.run(Thread.java:1583)\nCaused by: io.airbyte.workers.exception.WorkerException: Failed to create pod for read step\n\tat io.airbyte.workers.process.KubeProcessFactory.create(KubeProcessFactory.java:197)\n\tat io.airbyte.workers.process.AirbyteIntegrationLauncher.read(AirbyteIntegrationLauncher.java:226)\n\tat io.airbyte.workers.internal.DefaultAirbyteSource.start(DefaultAirbyteSource.java:84)\n\tat io.airbyte.workers.general.ReplicationWorkerHelper.startSource(ReplicationWorkerHelper.kt:212)\n\t... 6 more\nCaused by: io.fabric8.kubernetes.client.KubernetesClientTimeoutException: Timed out waiting for [900000] milliseconds for [Pod] with name:[source-file-read-48-4-zjhxz] in namespace [airbyte].\n\tat io.fabric8.kubernetes.client.dsl.internal.BaseOperation.waitUntilCondition(BaseOperation.java:893)\n\tat io.fabric8.kubernetes.client.dsl.internal.BaseOperation.waitUntilCondition(BaseOperation.java:93)\n\tat io.airbyte.workers.process.KubePodProcess.waitForInitPodToRun(KubePodProcess.java:382)\n\tat io.airbyte.workers.process.KubePodProcess.<init>(KubePodProcess.java:652)\n\tat io.airbyte.workers.process.KubeProcessFactory.create(KubeProcessFactory.java:193)\n\t... 9 more\n",
  "timestamp" : 1715531423092
}, {
  "failureOrigin" : "replication",
  "internalMessage" : "io.airbyte.workers.exception.WorkerException: Failed to create pod for write step",
  "externalMessage" : "Something went wrong during replication",
  "metadata" : {
    "attemptNumber" : 4,
    "jobId" : 48
  },
  "stacktrace" : "java.lang.RuntimeException: io.airbyte.workers.exception.WorkerException: Failed to create pod for write step\n\tat io.airbyte.workers.general.ReplicationWorkerHelper.startDestination(ReplicationWorkerHelper.kt:196)\n\tat io.airbyte.workers.general.BufferedReplicationWorker.lambda$run$0(BufferedReplicationWorker.java:176)\n\tat io.airbyte.workers.general.BufferedReplicationWorker.lambda$runAsync$2(BufferedReplicationWorker.java:252)\n\tat java.base/java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1804)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)\n\tat java.base/java.lang.Thread.run(Thread.java:1583)\nCaused by: io.airbyte.workers.exception.WorkerException: Failed to create pod for write step\n\tat io.airbyte.workers.process.KubeProcessFactory.create(KubeProcessFactory.java:197)\n\tat io.airbyte.workers.process.AirbyteIntegrationLauncher.write(AirbyteIntegrationLauncher.java:264)\n\tat io.airbyte.workers.internal.DefaultAirbyteDestination.start(DefaultAirbyteDestination.java:101)\n\tat io.airbyte.workers.general.ReplicationWorkerHelper.startDestination(ReplicationWorkerHelper.kt:194)\n\t... 6 more\nCaused by: io.fabric8.kubernetes.client.KubernetesClientTimeoutException: Timed out waiting for [900000] milliseconds for [Pod] with name:[destination-clickhouse-write-48-4-zjeby] in namespace [airbyte].\n\tat io.fabric8.kubernetes.client.dsl.internal.BaseOperation.waitUntilCondition(BaseOperation.java:893)\n\tat io.fabric8.kubernetes.client.dsl.internal.BaseOperation.waitUntilCondition(BaseOperation.java:93)\n\tat io.airbyte.workers.process.KubePodProcess.waitForInitPodToRun(KubePodProcess.java:382)\n\tat io.airbyte.workers.process.KubePodProcess.<init>(KubePodProcess.java:652)\n\tat io.airbyte.workers.process.KubeProcessFactory.create(KubeProcessFactory.java:193)\n\t... 9 more\n",
  "timestamp" : 1715531423096
} ]
2024-05-12 16:30:23 replication-orchestrator > Returning output...
2024-05-12 16:30:23 replication-orchestrator > Writing async status SUCCEEDED for KubePodInfo[namespace=airbyte, name=orchestrator-repl-job-48-attempt-4, mainContainerInfo=KubeContainerInfo[image=airbyte/container-orchestrator:0.50.55, pullPolicy=IfNotPresent]]...
2024-05-12 16:30:23 replication-orchestrator > 
2024-05-12 16:30:23 replication-orchestrator > ----- END REPLICATION -----
2024-05-12 16:30:23 replication-orchestrator > 
2024-05-12 16:30:24 platform > State Store reports orchestrator pod orchestrator-repl-job-48-attempt-4 succeeded
2024-05-12 16:30:25 platform > Retry State: RetryManager(completeFailureBackoffPolicy=BackoffPolicy(minInterval=PT10S, maxInterval=PT30M, base=3), partialFailureBackoffPolicy=null, successiveCompleteFailureLimit=5, totalCompleteFailureLimit=10, successivePartialFailureLimit=1000, totalPartialFailureLimit=10, successiveCompleteFailures=5, totalCompleteFailures=5, successivePartialFailures=0, totalPartialFailures=0)
 Backoff before next attempt: 13 minutes 30 seconds
2024-05-12 16:30:25 platform > Failing job: 48, reason: Job failed after too many retries for connection 3d45ba7e-a227-4d44-bff5-b0521340bbd5
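Note: in the retry state above, successiveCompleteFailures has reached the successiveCompleteFailureLimit of 5, which is why job 48 is failed. The 13 minutes 30 seconds backoff is also consistent with the logged policy if it is applied as minInterval × base^(failures − 1) = 10 s × 3^4 = 810 s (an assumption about the formula on my part, inferred from the logged values).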
@marcosmarxm (Member)

@mjrlgue, the pod-sweeper service is in charge of cleaning up these pods over time.

@davinchia, any ideas on this? It could potentially exhaust the number of pods that can be created in the namespace (though I'm not certain about this).

@davinchia (Contributor)

Indeed, there should be a pod sweeper removing terminal pods, e.g. Completed or Failed ones.

I'm surprised it's the terminal pod buildup that is causing issues. Perhaps you are running into general resource issues? If you are, reducing the number of attempts won't really help since you are trading job reliability for resources.
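In the meantime, terminal pods can be cleared manually if the sweeper isn't keeping up. A generic sketch, assuming everything runs in the airbyte namespace:

# delete pods in terminal phases (completed successfully or failed)
kubectl delete pods -n airbyte --field-selector=status.phase==Succeeded
kubectl delete pods -n airbyte --field-selector=status.phase==Failed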
