fix: k8s agent fails to tail logs starving the cpu #3601
base: main
Conversation
Proposal to fix woodpecker-ci#2253.

We have observed several possibly-related issues on a Kubernetes backend:

1. Agents behave erratically when dealing with certain log payloads. A common observation is that steps producing a large volume of logs will cause some steps to be stuck "pending" forever.
2. Agents use far more CPU than expected; we often see 200-300 millicores of CPU per workflow per agent (as reported on woodpecker-ci#2253).
3. We commonly see agents emitting thousands of error lines about parsing logs, often with very close timestamps, which may explain issues 1 and 2 (as reported on woodpecker-ci#2253):

```
{"level":"error","error":"rpc error: code = Internal desc = grpc: error while marshaling: string field contains invalid UTF-8","time":"2024-04-05T21:32:25Z","caller":"/src/agent/rpc/client_grpc.go:335","message":"grpc error: log(): code: Internal"}
{"level":"error","error":"rpc error: code = Internal desc = grpc: error while marshaling: string field contains invalid UTF-8","time":"2024-04-05T21:32:25Z","caller":"/src/agent/rpc/client_grpc.go:335","message":"grpc error: log(): code: Internal"}
{"level":"error","error":"rpc error: code = Internal desc = grpc: error while marshaling: string field contains invalid UTF-8","time":"2024-04-05T21:32:25Z","caller":"/src/agent/rpc/client_grpc.go:335","message":"grpc error: log(): code: Internal"}
```

4. We have also observed that agents will sometimes drop out of the worker queue, as reported on woodpecker-ci#2253.

Seeing as the logs point to `client_grpc.go:335`, this pull request fixes the issue by:

1. Removing `codes.Internal` from the set of retryable gRPC statuses. Agent gRPC calls that fail with `codes.Internal` will no longer be retried. There is no consensus on which gRPC codes should be retried, but `Internal` does not seem to be a common one to retry, if ever.
2. Adding a 30-second timeout to any retries. Currently, the exponential retries have a maximum timeout of _15 minutes_. I assume this long window might be required by some other functions so agents resume operation in case the webserver restarts. Still, this is likely the cause of the large CPU increase, as agents can be stuck retrying thousands of requests for a large window of time. The previous change alone should be enough to solve this issue, but I think this is a good idea to prevent similar problems from arising in the future.

EDIT: one thing to note is that this does not correct the parsing errors themselves. It's not clear to me what causes them; I have seen them come from numerous different pipelines. All I'm doing here is reducing the blast radius by avoiding retries that don't seem to accomplish much. Another consideration is that the code I'm modifying is 7 years old, yet I never observed anything like this when running Drone 0.4 to 0.8.5 on EC2. So this could be an interaction with something new in the Woodpecker codebase, or with some particular Kubernetes infrastructure. Hard to say.

Some more information about the issue and testing: we deployed this in our infrastructure on Friday night (running Woodpecker 2.3.1 with 20 agents and 2 workflows per agent at the time). Previously, by the end of the day most agents would be stuck using 500mCPU (the limit we had configured in Kubernetes). As one can imagine, a total of 20 agents using 500mCPU each to run a maximum of 40 workflows is a huge cost to pay for infrastructure alone (and the CPU usage remained even when no pipelines were running; it was all caused by agents trying to process logs of steps that had already finished). After this fix, most agents now use less than 20mCPU, and none has peaked over 50m. That's roughly a 20x decrease in resource consumption. Screenshot of our internal Grafana before/after the fix (we've run dozens of very log-heavy pipelines after the fix to guarantee that it worked, plus we had a whole day of pipelines running this Monday, and the performance gains remain).
We noticed this fix also resolves another issue: #3468
@fernandrone Thanks for the deep dive and the work you put in! There's still a lint error in the tests, though. Overall, this needs a review from somebody who can judge the code changes better.
I get that, and feel the same way. The best I can say here is that we've been running this for a week and haven't experienced any bugs. In case anyone has cold feet about removing `codes.Internal`, there's a middle ground where we refrain from removing it but keep the new timeout. That might be enough (we'd have to test it). Will look into the lint errors.
If I get the issue correctly, sending a log line via gRPC is failing somehow and we get an internal error code. Do you have an idea why this error is originally thrown? Are those logs still saved in case we remove the check? Sounds to me like there is an issue with the way Woodpecker is dealing with logs in general 🤔
That's my interpretation too.
Not really. We've tried to track down which workflow could cause it, thinking it would be some kind of non-UTF-8 character that's breaking it, but we haven't been successful in finding the origin. There's a similar error that's happening somewhere else:
However, this error has not been affected by the fix. It seems to happen with the same frequency (so they are likely unrelated), and it seems to affect different kinds of steps/images/workflows at random (so we can't pin it to a specific malformed log). @eliasscosta found this explanation https://stackoverflow.com/a/76769569 ("When deserializing a string field the Go protobuf code validates that strings are UTF-8 ... As far as I am aware the simplest way around this is to edit the .proto and change the string fields to bytes"). It reinforces the notion that at some point there's a log line that can't be parsed as UTF-8, but I don't know why that is. If that's correct, then changing something from "string" to "bytes" in https://github.com/woodpecker-ci/woodpecker/blob/main/pipeline/rpc/proto/woodpecker.proto (likely LogEntry.data, given the problem) might fix it. But I don't have enough experience with RPC to know the possible implications of migrating string to bytes. This might be a better solution than mine, in fact. I also like that the woodpecker.proto file is relatively recent (lots of edits in the past year). So if it's an issue that can be tracked to the woodpecker.proto file, it would explain why this bug exists in Woodpecker but not in Drone 0.8 (the code I'm "fixing", however, is 7 years old...).
I haven't noticed any missing logs on our builds. However, I think the true test would be to run a build that outputs non-UTF-8 characters and see how Woodpecker behaves. Given the issue, I'd expect Woodpecker to not print them, neither before nor after the fix (the fix would just stop Woodpecker from retrying).
```diff
@@ -98,7 +98,6 @@ func (c *client) Next(ctx context.Context, f rpc.Filter) (*rpc.Workflow, error)
 		codes.Aborted,
 		codes.DataLoss,
 		codes.DeadlineExceeded,
-		codes.Internal,
```
why this change? 🤔
This is explained in the OP (not judging whether it's the right thing to do).
```diff
@@ -88,7 +88,11 @@ func (w *LineWriter) Write(p []byte) (n int, err error) {
 		Type: LogEntryStdout,
 		Line: w.num,
 	}
-	if err := w.peer.Log(context.Background(), line); err != nil {
+
+	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second) //nolint: gomnd
```
Hardcoded values should at least be their own const.
Not addressed yet
Codecov Report — Attention: patch coverage report.

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #3601      +/-   ##
==========================================
- Coverage   36.11%   35.89%   -0.23%
==========================================
  Files         255      231      -24
  Lines       16261    15603     -658
==========================================
- Hits         5872     5600     -272
+ Misses       9949     9586     -363
+ Partials      440      417      -23
```

View full report in Codecov by Sentry.
While I still don't have experience with gRPC and friends, the change in this PR essentially removes the code paths that cause the log output issue entirely. This will help in the short term, but I am not sure whether the underlying issue is then only masked and (as already pointed out) other messages/outputs will also be lost in the future. But again, I can't judge that. I also use the k8s backend a lot in a business context, and some builds have a lot of log output, but I haven't faced these issues so far (which doesn't mean they don't exist). @fernandrone Do you need PR images in this PR so you can use them for your instances? (I guess you already have your own custom images in use?) Overall I can't judge this PR. I guess this needs a decision from @anbraten or @6543 in the end, as it seems that nobody else is able to judge this PR with confidence?
Well, the review was not yet addressed.
After that, the log streaming in general seems to need some performance work, and #2072 by @6543 looks like a great start (probably solving other issues as well).
FYI I was a bit short on time these past few weeks. I'll try to go back to this next week and revert the `codes.Internal` change. I will also try out the idea above later and see if that fixes the error, instead of just suppressing it.
If you allow maintainers access to your PR we could also support you a bit. |
Unfortunately I can't seem to enable this, apparently because I created this fork from an organization: https://github.com/orgs/community/discussions/5634 FWIW I'm considering using a personal fork for future contributions. |