
Scheduled long running test failed - Run ID: 8150844893 #7264

Closed
rad-ci-bot opened this issue Mar 5, 2024 · 12 comments
Labels
bug Something is broken or not working as expected test-failure A scheduled test run has failed and needs to be investigated

Comments

@rad-ci-bot
Collaborator

rad-ci-bot commented Mar 5, 2024

Bug information

This bug is generated automatically when the scheduled long running test fails. The Radius long running test runs on a schedule every 2 hours, every day. Keep in mind that the test may fail due to workflow infrastructure issues, like network problems, rather than flakiness in the test itself. For further investigation, please visit here.

AB#11363

rad-ci-bot added the bug and test-failure labels on Mar 5, 2024
@radius-triage-bot

👋 @rad-ci-bot Thanks for filing this bug report.

A project maintainer will review this report and get back to you soon. If you'd like immediate help troubleshooting, please visit our Discord server.

For more information on our triage process, please visit our triage overview.

@kachawla
Contributor

kachawla commented Mar 5, 2024

cli.go:418: [rad] Error: {
    cli.go:418: [rad]   "code": "Internal",
    cli.go:418: [rad]   "message": "could not find API version for type \"core/Secret\": unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request"
    cli.go:418: [rad] }
    cli.go:418: [rad] 
    cli.go:418: [rad] TraceId:  904c0ce5417e28c548ec4886eb67df17
    cli.go:418: [rad] 
    cli.go:418: [rad] 
    rptest.go:411: 
        	Error Trace:	/home/runner/work/radius/radius/test/functional/shared/rptest.go:411
        	Error:      	Received unexpected error:
        	            	command 'rad application delete --yes -a daprrp-rs-statestore-manual' had non-zero exit code: exit status 1
        	Test:       	Test_DaprStateStore_Manual
        	Messages:   	failed to delete daprrp-rs-statestore-manual
    rptest.go:312: running step 0 of 1: deploy testdata/daprrp-resources-statestore-manual.bicep

@kachawla
Contributor

kachawla commented Mar 6, 2024

I can't find this error in the RP/UCP logs, and the trace ID doesn't show up in the logs either.

Based on the log, it should be coming from either https://github.com/radius-project/radius/blob/main/pkg/portableresources/processors/resourceclient.go#L245 or https://github.com/radius-project/radius/blob/main/pkg/corerp/handlers/kubernetes.go#L169, so it should definitely show up in the RP logs.
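
For context on the error message itself: a single unavailable aggregated API group (here metrics.k8s.io/v1beta1) can make Kubernetes API discovery report a failure even when looking up an unrelated built-in type like core/Secret. A minimal sketch of that behavior using client-go discovery (not the actual Radius code path; the kubeconfig handling is an assumption for a local run):

package main

import (
    "fmt"

    "k8s.io/client-go/discovery"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    // Assumption: a local kubeconfig; the RP would use in-cluster config instead.
    cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        panic(err)
    }
    dc, err := discovery.NewDiscoveryClientForConfig(cfg)
    if err != nil {
        panic(err)
    }

    // ServerPreferredResources returns partial results plus an aggregate error
    // when any group (e.g. metrics.k8s.io/v1beta1 backed by an unavailable
    // APIService) cannot be listed. A caller that treats that error as fatal
    // fails to resolve even unrelated types like core/v1 Secret.
    lists, err := dc.ServerPreferredResources()
    if err != nil && !discovery.IsGroupDiscoveryFailedError(err) {
        panic(err)
    }
    if err != nil {
        fmt.Println("partial discovery failure:", err)
    }
    fmt.Println("discovered", len(lists), "resource lists")
}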

Will continue looking into this tomorrow.

@kachawla
Contributor

kachawla commented Mar 7, 2024

Ended up spending time debugging and fixing #7270 since that was blocking the long haul test workflow completely. Will get back to this today.

@kachawla
Contributor

kachawla commented Mar 9, 2024

Looks like we aren't logging errors on failure for multiple request paths; attaching logs for reference. I'm working on a fix for the logging so that we have sufficient information to work with for future debugging, but at this point there isn't enough information to identify the root cause here.

I did a general lookup of this error and found this as a potential cause/fix: helm/helm#6361 (comment). This is the first time we have seen this error; we should look into it further if it happens again.
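
If it does show up again, one way to check the helm/helm#6361 theory is to look for aggregated APIService objects reporting Available=False. A rough diagnostic sketch using the dynamic client (not part of the Radius codebase; the kubeconfig handling is an assumption for a one-off run):

package main

import (
    "context"
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
    "k8s.io/apimachinery/pkg/runtime/schema"
    "k8s.io/client-go/dynamic"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    // Assumption: a local kubeconfig for a one-off diagnostic run.
    cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        panic(err)
    }
    client := dynamic.NewForConfigOrDie(cfg)

    // List apiregistration.k8s.io/v1 APIServices and report any that are
    // not Available, since a single broken one poisons API discovery.
    gvr := schema.GroupVersionResource{
        Group:    "apiregistration.k8s.io",
        Version:  "v1",
        Resource: "apiservices",
    }
    list, err := client.Resource(gvr).List(context.Background(), metav1.ListOptions{})
    if err != nil {
        panic(err)
    }
    for _, item := range list.Items {
        conditions, _, _ := unstructured.NestedSlice(item.Object, "status", "conditions")
        for _, c := range conditions {
            cond, ok := c.(map[string]interface{})
            if ok && cond["type"] == "Available" && cond["status"] != "True" {
                fmt.Printf("APIService %s not available: %v\n", item.GetName(), cond["message"])
            }
        }
    }
}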

Also, some failures like this one are not reproducible without visibility into the cluster's state at the time of failure. To help with this, we should log the status of all the pods running in the cluster when a request fails. I'll create an issue for this.
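
A rough sketch of what that logging hook could look like, dumping pod phases across all namespaces with client-go; the function name and in-cluster config here are assumptions for illustration, not the actual Radius code:

package main

import (
    "context"
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
)

// logPodStatuses dumps the phase and reason of every pod in the cluster,
// giving a snapshot of cluster state at the moment a request fails.
func logPodStatuses(ctx context.Context, clientset kubernetes.Interface) error {
    pods, err := clientset.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{})
    if err != nil {
        return err
    }
    for _, p := range pods.Items {
        fmt.Printf("%s/%s phase=%s reason=%s\n", p.Namespace, p.Name, p.Status.Phase, p.Status.Reason)
    }
    return nil
}

func main() {
    // Assumption: running in-cluster, as the RP would be.
    cfg, err := rest.InClusterConfig()
    if err != nil {
        panic(err)
    }
    if err := logPodStatuses(context.Background(), kubernetes.NewForConfigOrDie(cfg)); err != nil {
        panic(err)
    }
}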

all_container_logs.zip

@kachawla
Contributor

kachawla commented Mar 9, 2024

#7296

@rynowak
Contributor

rynowak commented Mar 9, 2024

I did a general lookup of this error and found this as a potential cause/fix: helm/helm#6361 (comment). This is the first time we have seen this error; we should look into it further if it happens again.

This is relevant because WE install an APIService. Right now the way we handle auth is by tunneling requests to Radius through the Kubernetes API server.

If this is causing reliability problems or making upgrades difficult, then we should choose another approach. We did this because it provides authentication and reachability (DNS/TLS), but it isn't essential to the way Radius works and we could choose a different approach.

@kachawla
Contributor

I did a general lookup of this error and found this as a potential cause/fix: helm/helm#6361 (comment). This is the first time we have seen this error; we should look into it further if it happens again.

This is relevant because WE install an APIService. Right now the way we handle auth is by tunneling requests to Radius through the Kubernetes API server.

If this is causing reliability problems or making upgrades difficult, then we should choose another approach. We did this because it provides authentication and reachability (DNS/TLS), but it isn't essential to the way Radius works and we could choose a different approach.

I didn't say it was irrelevant, but as I mentioned above, I don't have enough data points to suggest what needs to be done. Please feel free to log an issue for what you are suggesting.

Going to close this test failure issue as per my notes above.

@rynowak
Contributor

rynowak commented Mar 10, 2024

That sounds right to me. We don't have a lot of evidence either way.

I wanted to mention it because we did this a while back and I'm not sure it's a good idea. Since we did that, I've noticed that a lot of other projects that expose an API from Kubernetes choose a port-forward approach.

@kachawla
Contributor

kachawla commented Mar 12, 2024

That sounds right to me. We don't have a lot of evidence either way.

I wanted to mention it because we did this a while back and I'm not sure it's a good idea. Since we did that, I've noticed that a lot of other projects that expose an API from Kubernetes choose a port-forward approach.

Makes sense. I agree we should revisit it, and if it starts creating operational pain then it definitely needs to be done sooner rather than later.

@rynowak
Contributor

rynowak commented Mar 12, 2024

We've run into two issues so far, and both were solvable. If we're causing unreliability of the API server, then that's strike three!

@kachawla
Contributor

kachawla commented Mar 12, 2024

We've run into two issues so far, and both were solvable. If we're causing unreliability of the API server, then that's strike three!

I was agreeing with your previous comment. Are you suggesting we revisit the approach now? Could you share a link to the second issue - this one?
