Scheduled long running test failed - Run ID: 8150844893 #7264
Comments
👋 @rad-ci-bot Thanks for filing this bug report. A project maintainer will review this report and get back to you soon. If you'd like immediate help troubleshooting, please visit our Discord server. For more information on our triage process please visit our triage overview
I can't seem to find this error in the RP/UCP logs, and the trace ID doesn't exist in the logs either. Based on the log, it should be coming from either https://github.com/radius-project/radius/blob/main/pkg/portableresources/processors/resourceclient.go#L245 or https://github.com/radius-project/radius/blob/main/pkg/corerp/handlers/kubernetes.go#L169, so it should definitely show up in the RP logs. Will continue looking into this tomorrow.
Ended up spending time debugging and fixing #7270 since that was blocking the long-haul test workflow completely. Will get back to this today.
Looks like we aren't logging errors on failure for multiple request paths; attaching logs for reference. I'm working on a fix for the logging so that we have sufficient information to work with for future debugging, but at this point there isn't enough information to identify the root cause here. I did a general lookup of this error and found this as a potential cause/fix: helm/helm#6361 (comment). This is the first time we have seen this error; we should look into it further if it happens again. Also, some failures like this one are not reproducible without visibility into the cluster's state at the time of failure. To help with that, we should log the status of all the pods running in the cluster when a request fails. I'll create an issue for this.
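The "log pod status on request failure" idea above could be sketched roughly as follows. This is a hypothetical helper, not code from the Radius repo: it assumes input shaped like the output of `kubectl get pods -A -o json` and flags pods whose phase is neither `Running` nor `Succeeded`.

```go
// Sketch of a helper that summarizes cluster pod state for logging when a
// request fails. Names (podList, summarizePods) are illustrative only.
package main

import (
	"encoding/json"
	"fmt"
)

// podList mirrors just the fields we need from `kubectl get pods -o json`.
type podList struct {
	Items []struct {
		Metadata struct {
			Namespace string `json:"namespace"`
			Name      string `json:"name"`
		} `json:"metadata"`
		Status struct {
			Phase string `json:"phase"`
		} `json:"status"`
	} `json:"items"`
}

// summarizePods returns one log line per pod, marking any pod whose phase
// is not Running or Succeeded so failures stand out in the log.
func summarizePods(raw []byte) ([]string, error) {
	var list podList
	if err := json.Unmarshal(raw, &list); err != nil {
		return nil, err
	}
	var lines []string
	for _, p := range list.Items {
		marker := ""
		if p.Status.Phase != "Running" && p.Status.Phase != "Succeeded" {
			marker = " <-- unhealthy"
		}
		lines = append(lines, fmt.Sprintf("%s/%s: %s%s",
			p.Metadata.Namespace, p.Metadata.Name, p.Status.Phase, marker))
	}
	return lines, nil
}

func main() {
	// Example input; in practice this would come from the Kubernetes API
	// at the moment a request fails.
	raw := []byte(`{"items":[
		{"metadata":{"namespace":"radius-system","name":"applications-rp-0"},"status":{"phase":"Running"}},
		{"metadata":{"namespace":"default","name":"magpie-abc"},"status":{"phase":"Pending"}}]}`)
	lines, err := summarizePods(raw)
	if err != nil {
		panic(err)
	}
	for _, l := range lines {
		fmt.Println(l)
	}
}
```

In a real implementation this would be fed from a pod list client call rather than raw JSON, and the summary would be emitted alongside the failed request's trace ID so the two can be correlated.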
This is relevant because WE install an API service. Right now the way we handle auth is by tunneling requests to Radius through the Kubernetes API server. If this is causing reliability problems or making upgrades difficult, then we should choose another approach. We did this because it provides authentication and reachability (DNS/TLS). It isn't essential to the way Radius works, and we could choose a different approach.
I didn't say it was irrelevant, but I don't have enough data points to suggest what needs to be done as I mentioned above. Please feel free to log an issue for what you are suggesting. Going to close this test failure issue as per my notes above. |
That sounds right to me. We don't have a lot of evidence either way. I wanted to mention it, because we did this a while back and I'm not sure it's a good idea. Since we did that, I noticed a lot of other projects that expose an API from Kubernetes choose a port-forward approach. |
Makes sense. Agree, we should revisit it, and if it starts creating operational pain then definitely needs to be done sooner than later. |
We've run into two issues so far, and both were solvable. If we're causing unreliability of the API server, then that's strike three!
I was agreeing with your previous comment. Are you suggesting we revisit the approach now? Could you share a link to the second issue - is it this one?
Bug information
This bug is generated automatically if the scheduled long running test fails. The Radius long running test operates on a schedule of every 2 hours, every day. It's important to understand that the test may fail due to workflow infrastructure issues, like network problems, rather than flakiness of the test itself. For further investigation, please visit here.
AB#11363