
Scheduled long running test failed - Run ID: 8150844893 #7264

Closed
rad-ci-bot opened this issue Mar 5, 2024 · 12 comments
Labels
bug Something is broken or not working as expected test-failure A scheduled test run has failed and needs to be investigated

Comments

@rad-ci-bot
Collaborator

rad-ci-bot commented Mar 5, 2024

Bug information

This bug is generated automatically when the scheduled long running test fails. The Radius long running test runs on a schedule every 2 hours, every day. Keep in mind that the test may fail due to workflow infrastructure issues, like network problems, rather than flakiness in the test itself. For further investigation, please visit here.

AB#11363

rad-ci-bot added the bug and test-failure labels on Mar 5, 2024
@radius-triage-bot

👋 @rad-ci-bot Thanks for filing this bug report.

A project maintainer will review this report and get back to you soon. If you'd like immediate help troubleshooting, please visit our Discord server.

For more information on our triage process, please visit our triage overview.

@kachawla
Contributor

kachawla commented Mar 5, 2024

cli.go:418: [rad] Error: {
    cli.go:418: [rad]   "code": "Internal",
    cli.go:418: [rad]   "message": "could not find API version for type \"core/Secret\": unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request"
    cli.go:418: [rad] }
    cli.go:418: [rad] 
    cli.go:418: [rad] TraceId:  904c0ce5417e28c548ec4886eb67df17
    cli.go:418: [rad] 
    cli.go:418: [rad] 
    rptest.go:411: 
        	Error Trace:	/home/runner/work/radius/radius/test/functional/shared/rptest.go:411
        	Error:      	Received unexpected error:
        	            	command 'rad application delete --yes -a daprrp-rs-statestore-manual' had non-zero exit code: exit status 1
        	Test:       	Test_DaprStateStore_Manual
        	Messages:   	failed to delete daprrp-rs-statestore-manual
    rptest.go:312: running step 0 of 1: deploy testdata/daprrp-resources-statestore-manual.bicep

@kachawla
Contributor

kachawla commented Mar 6, 2024

I can't find this error in the RP/UCP logs, and the trace ID doesn't show up in the logs either.

Based on the log, it should be coming from either https://github.com/radius-project/radius/blob/main/pkg/portableresources/processors/resourceclient.go#L245 or https://github.com/radius-project/radius/blob/main/pkg/corerp/handlers/kubernetes.go#L169, so it should definitely show up in the RP logs.
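
For context on the error message itself: a single unavailable aggregated API group (here metrics.k8s.io/v1beta1) can make Kubernetes API discovery report a failure even when looking up an unrelated built-in type like core/Secret. A minimal sketch of that behavior using client-go discovery (not the actual Radius code path; the kubeconfig handling is an assumption for a local run):

package main

import (
    "fmt"

    "k8s.io/client-go/discovery"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    // Assumption: a local kubeconfig; the RP would use in-cluster config instead.
    cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        panic(err)
    }
    dc, err := discovery.NewDiscoveryClientForConfig(cfg)
    if err != nil {
        panic(err)
    }

    // ServerPreferredResources returns partial results plus an aggregate error
    // when any group (e.g. metrics.k8s.io/v1beta1 backed by an unavailable
    // APIService) cannot be listed. A caller that treats that error as fatal
    // fails to resolve even unrelated types like core/v1 Secret.
    lists, err := dc.ServerPreferredResources()
    if err != nil && !discovery.IsGroupDiscoveryFailedError(err) {
        panic(err)
    }
    if err != nil {
        fmt.Println("partial discovery failure:", err)
    }
    fmt.Println("discovered", len(lists), "resource lists")
}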

Will continue looking into this tomorrow.

@kachawla
Contributor

kachawla commented Mar 7, 2024

Ended up spending time debugging and fixing #7270 since that was blocking the long haul test workflow completely. Will get back to this today.

@kachawla
Contributor

kachawla commented Mar 9, 2024

Looks like we aren't logging errors on failure for multiple request paths; attaching logs for reference. I'm working on a fix for the logging so that we have sufficient information to work with for future debugging, but at this point there isn't enough information to identify the root cause here.

I did a general lookup of this error and found this as a potential cause/fix: helm/helm#6361 (comment). This is the first time we have seen this error; we should look into it further if it happens again.
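
If it does show up again, one way to check the helm/helm#6361 theory is to look for aggregated APIService objects reporting Available=False. A rough diagnostic sketch using the dynamic client (not part of the Radius codebase; the kubeconfig handling is an assumption for a one-off run):

package main

import (
    "context"
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
    "k8s.io/apimachinery/pkg/runtime/schema"
    "k8s.io/client-go/dynamic"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    // Assumption: a local kubeconfig for a one-off diagnostic run.
    cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        panic(err)
    }
    client := dynamic.NewForConfigOrDie(cfg)

    // List apiregistration.k8s.io/v1 APIServices and report any that are
    // not Available, since a single broken one poisons API discovery.
    gvr := schema.GroupVersionResource{
        Group:    "apiregistration.k8s.io",
        Version:  "v1",
        Resource: "apiservices",
    }
    list, err := client.Resource(gvr).List(context.Background(), metav1.ListOptions{})
    if err != nil {
        panic(err)
    }
    for _, item := range list.Items {
        conditions, _, _ := unstructured.NestedSlice(item.Object, "status", "conditions")
        for _, c := range conditions {
            cond, ok := c.(map[string]interface{})
            if ok && cond["type"] == "Available" && cond["status"] != "True" {
                fmt.Printf("APIService %s not available: %v\n", item.GetName(), cond["message"])
            }
        }
    }
}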

Also, some failures like this one are not reproducible without visibility into the cluster's state at the time of failure. To help with this, we should log the status of all the pods running in the cluster when a request fails. I'll create an issue for this.
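
A rough sketch of what that logging hook could look like, dumping pod phases across all namespaces with client-go; the function name and in-cluster config here are assumptions for illustration, not the actual Radius code:

package main

import (
    "context"
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
)

// logPodStatuses dumps the phase and reason of every pod in the cluster,
// giving a snapshot of cluster state at the moment a request fails.
func logPodStatuses(ctx context.Context, clientset kubernetes.Interface) error {
    pods, err := clientset.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{})
    if err != nil {
        return err
    }
    for _, p := range pods.Items {
        fmt.Printf("%s/%s phase=%s reason=%s\n", p.Namespace, p.Name, p.Status.Phase, p.Status.Reason)
    }
    return nil
}

func main() {
    // Assumption: running in-cluster, as the RP would be.
    cfg, err := rest.InClusterConfig()
    if err != nil {
        panic(err)
    }
    if err := logPodStatuses(context.Background(), kubernetes.NewForConfigOrDie(cfg)); err != nil {
        panic(err)
    }
}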

all_container_logs.zip

@kachawla
Contributor

kachawla commented Mar 9, 2024

#7296

@rynowak
Contributor

rynowak commented Mar 9, 2024

I did a general lookup of this error and found this as a potential cause/fix: helm/helm#6361 (comment). This is the first time we have seen this error; we should look into it further if it happens again.

This is relevant because WE install an APIService. Right now the way we handle auth is by tunneling requests to Radius through the Kubernetes API server.

If this is causing reliability problems or making upgrades difficult, then we should choose another approach. We did this because it provides authentication and reachability (DNS/TLS), but it isn't essential to the way Radius works and we could choose a different approach.

@kachawla
Contributor

I did a general lookup of this error and found this as a potential cause/fix: helm/helm#6361 (comment). This is the first time we have seen this error; we should look into it further if it happens again.

This is relevant because WE install an APIService. Right now the way we handle auth is by tunneling requests to Radius through the Kubernetes API server.

If this is causing reliability problems or making upgrades difficult, then we should choose another approach. We did this because it provides authentication and reachability (DNS/TLS), but it isn't essential to the way Radius works and we could choose a different approach.

I didn't say it was irrelevant, but as I mentioned above, I don't have enough data points to suggest what needs to be done. Please feel free to log an issue for what you are suggesting.

Going to close this test failure issue as per my notes above.

@rynowak
Contributor

rynowak commented Mar 10, 2024

That sounds right to me. We don't have a lot of evidence either way.

I wanted to mention it because we did this a while back and I'm not sure it's a good idea. Since we did that, I've noticed that a lot of other projects that expose an API from Kubernetes choose a port-forward approach.

@kachawla
Contributor

kachawla commented Mar 12, 2024

That sounds right to me. We don't have a lot of evidence either way.

I wanted to mention it because we did this a while back and I'm not sure it's a good idea. Since we did that, I've noticed that a lot of other projects that expose an API from Kubernetes choose a port-forward approach.

Makes sense. I agree we should revisit it, and if it starts creating operational pain then it definitely needs to be done sooner rather than later.

@rynowak
Contributor

rynowak commented Mar 12, 2024

We've run into two issues so far, and both were solvable. If we're causing unreliability of the API server, then that's strike three!

@kachawla
Contributor

kachawla commented Mar 12, 2024

We've run into two issues so far, and both were solvable. If we're causing unreliability of the API server, then that's strike three!

I was agreeing with your previous comment. Are you suggesting we revisit the approach now? Could you share a link to the second issue - this one?
