fix: k8s agent fails to tail logs starving the cpu #3601
base: main
Conversation
Proposal to fix woodpecker-ci#2253.

We have observed several possibly-related issues on a Kubernetes backend:

1. Agents behave erratically when dealing with certain log payloads. A common observation is that steps producing a large volume of logs will cause some steps to be stuck "pending" forever.
2. Agents use far more CPU than expected; we often see 200-300 millicores of CPU per workflow per agent (as reported on woodpecker-ci#2253).
3. We commonly see agents emitting thousands of error lines about parsing logs, often with very close timestamps, which may explain issues 1 and 2 (as reported on woodpecker-ci#2253):

```
{"level":"error","error":"rpc error: code = Internal desc = grpc: error while marshaling: string field contains invalid UTF-8","time":"2024-04-05T21:32:25Z","caller":"/src/agent/rpc/client_grpc.go:335","message":"grpc error: log(): code: Internal"}
{"level":"error","error":"rpc error: code = Internal desc = grpc: error while marshaling: string field contains invalid UTF-8","time":"2024-04-05T21:32:25Z","caller":"/src/agent/rpc/client_grpc.go:335","message":"grpc error: log(): code: Internal"}
{"level":"error","error":"rpc error: code = Internal desc = grpc: error while marshaling: string field contains invalid UTF-8","time":"2024-04-05T21:32:25Z","caller":"/src/agent/rpc/client_grpc.go:335","message":"grpc error: log(): code: Internal"}
```

4. We have also observed that agents will sometimes drop out of the worker queue, as reported on woodpecker-ci#2253.

Seeing as the logs point to `client_grpc.go:335`, this pull request fixes the issue by:

1. Removing `codes.Internal` from the set of retryable gRPC statuses. Agent gRPC calls that fail with `codes.Internal` will no longer be retried. There is no consensus on which gRPC codes should be retried, but `Internal` does not seem to be a common one to retry, if ever.
2. Adding a 30-second timeout to any retries. Currently, the exponential retries have a maximum timeout of _15 minutes_. I assume this long window might be required by some other functions so agents resume operation in case the webserver restarts. Still, this is likely the cause of the large CPU increase, as agents can be stuck retrying thousands of requests for a large window of time. The previous change alone should be enough to solve this issue, but I think this is a good idea to prevent similar problems from arising in the future.

EDIT: one thing to note is that this does not correct the parsing errors themselves. It's not clear to me what causes them; I have seen them come from numerous different pipelines. All I'm doing here is reducing the blast radius by avoiding retries that don't seem to accomplish much. Another consideration is that the code I'm modifying is 7 years old, yet I never observed anything like this when running Drone 0.4 to 0.8.5 on EC2. So this could be an interaction with something new in the Woodpecker codebase, or with some particular Kubernetes infrastructure. Hard to say.

Some more information about the issue and testing: we deployed this in our infrastructure on Friday night (running Woodpecker 2.3.1 with 20 agents and 2 workflows per agent at the time). Previously, by the end of the day most agents would be stuck using 500mCPU (the limit we had configured in Kubernetes). As one can imagine, a total of 20 agents using 500mCPU each to run a maximum of 40 workflows is a huge cost to pay for infrastructure alone (and the CPU usage remained even when no pipelines were running; it was all caused by agents trying to process logs of steps that had already finished). After this fix, most agents now use less than 20mCPU, and none has peaked over 50m. That's roughly a 20x decrease in resource consumption. Screenshot of our internal Grafana before/after the fix (we've run dozens of very log-heavy pipelines after the fix to guarantee that it worked, plus we had a whole day of pipelines running this Monday, and the performance gains remain).
We noticed this fix also resolves another issue: #3468
@fernandrone Thanks for the deep dive and the work you put in! There's still a lint error in the tests, though. Overall, this needs a review from somebody who can judge the code changes better.
I get that, and feel the same way. The best I can say here is that we've been running this for a week and haven't experienced any bugs. In case anyone has cold feet about removing `codes.Internal`, there's a middle ground where we refrain from removing it but keep the new timeout. That might be enough (we'd have to test it). Will look into the lint errors.
If I get the issue correctly, sending a log line via gRPC is failing somehow and we get an internal error code. Do you have an idea why this error is originally thrown? Are those logs still saved in case we remove the check? Sounds to me like there is an issue with the way Woodpecker is dealing with logs in general 🤔
That's my interpretation too.
Not really. We've tried to track down which workflow could cause it, thinking it would be some kind of non-UTF-8 character that's breaking it, but we haven't been successful in finding the origin. There's a similar error that's happening somewhere else:
However, this error has not been affected by the fix. It seems to happen with the same frequency (so they are likely unrelated), and it seems to affect different kinds of steps/images/workflows at random (so we can't pin it to a specific malformed log). @eliasscosta found this explanation https://stackoverflow.com/a/76769569 ("When deserializing a string field the Go protobuf code validates that strings are UTF-8 ... As far as I am aware the simplest way around this is to edit the .proto and change the string fields to bytes"). It reinforces the notion that at some point there's a log line that can't be parsed as UTF-8, but I don't know why that is. If that's correct, then changing something from "string" to "bytes" in https://github.com/woodpecker-ci/woodpecker/blob/main/pipeline/rpc/proto/woodpecker.proto (likely LogEntry.data, given the problem) might fix it. But I don't have enough experience with RPC to know the possible implications of migrating string to bytes. This might be a better solution than mine, in fact. I also like that the woodpecker.proto file is relatively recent (lots of edits in the past year). So if it's an issue that can be tracked to the woodpecker.proto file, it would explain why this bug exists in Woodpecker but not in Drone 0.8 (the code I'm "fixing", however, is 7 years old...).
I haven't noticed any missing logs on our builds. However, I think the true test would be to run a build that outputs non-UTF-8 characters and see how Woodpecker behaves. Given the issue, I'd expect Woodpecker to not print them, neither before nor after the fix (the fix would just stop Woodpecker from retrying).
```diff
@@ -98,7 +98,6 @@ func (c *client) Next(ctx context.Context, f rpc.Filter) (*rpc.Workflow, error)
 		codes.Aborted,
 		codes.DataLoss,
 		codes.DeadlineExceeded,
-		codes.Internal,
```
why this change? 🤔
This is explained in the OP (not judging whether it's the right thing to do).
```diff
@@ -88,7 +88,11 @@ func (w *LineWriter) Write(p []byte) (n int, err error) {
 		Type: LogEntryStdout,
 		Line: w.num,
 	}
-	if err := w.peer.Log(context.Background(), line); err != nil {
+
+	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second) //nolint: gomnd
```
Hardcoded values should at least be their own const.
Not addressed yet
Codecov Report — Attention: patch coverage report.

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #3601      +/-   ##
==========================================
- Coverage   36.11%   35.89%   -0.23%
==========================================
  Files         255      231      -24
  Lines       16261    15603     -658
==========================================
- Hits         5872     5600     -272
+ Misses       9949     9586     -363
+ Partials      440      417      -23
```

View full report in Codecov by Sentry.
While I still don't have experience with gRPC and friends, the change in this PR essentially removes the code paths that cause the log output issue entirely. This will help in the short term, but I am not sure whether the underlying issue is then only masked and (as already pointed out) other messages/outputs will also be lost in the future. But again, I can't judge that. I also use the k8s backend a lot in a business context, and some builds have a lot of log output, but I haven't faced these issues so far (which doesn't mean they don't exist). @fernandrone Do you need PR images in this PR so you can use them for your instances? (I guess you already have your own custom images in use?) Overall I can't judge this PR. I guess this needs a decision from @anbraten or @6543 in the end, as it seems that nobody else is able to judge this PR with confidence?
Well, the review was not yet addressed.
After that, the log streaming in general seems to need some performance work, and #2072 by @6543 looks like a great start (probably solving other issues as well).
FYI I was a bit short on time these past few weeks. I'll try to go back to this next week and revert the `codes.Internal` change. I will also try out the idea above later and see if that fixes the error, instead of just suppressing it.
If you allow maintainers access to your PR we could also support you a bit. |
Unfortunately I can't seem to enable this, apparently because I created this fork from an organization: https://github.com/orgs/community/discussions/5634 FWIW I'm considering using a personal fork for future contributions. |