
fix: k8s agent fails to tail logs starving the cpu #3601

Open · wants to merge 6 commits into main

Conversation

@fernandrone (Contributor) commented Apr 8, 2024

Proposal to fix two issues:

fixes #2253
fixes #3468

We have observed several possibly-related issues on a Kubernetes backend:

1. Agents behave erratically when dealing with certain log payloads. A common observation is that steps that produce a large volume of logs will cause some steps to be stuck "pending" forever.

2. Agents use far more CPU than should be expected; we often see 200-300 millicores of CPU per workflow per agent (as reported in #2253).

3. We commonly see agents displaying thousands of error lines about parsing logs, often with very close timestamps, which may explain issues 1 and 2 (as reported in #2253):

{"level":"error","error":"rpc error: code = Internal desc = grpc: error while marshaling: string field contains invalid UTF-8","time":"2024-04-05T21:32:25Z","caller":"/src/agent/rpc/client_grpc.go:335","message":"grpc error: log(): code: Internal"}
{"level":"error","error":"rpc error: code = Internal desc = grpc: error while marshaling: string field contains invalid UTF-8","time":"2024-04-05T21:32:25Z","caller":"/src/agent/rpc/client_grpc.go:335","message":"grpc error: log(): code: Internal"}
{"level":"error","error":"rpc error: code = Internal desc = grpc: error while marshaling: string field contains invalid UTF-8","time":"2024-04-05T21:32:25Z","caller":"/src/agent/rpc/client_grpc.go:335","message":"grpc error: log(): code: Internal"}
4. We've also observed that agents will sometimes drop out of the worker queue, also as reported in #2253.

Seeing as the logs point to `client_grpc.go:335`, this pull request fixes the issue by:

1. Removing `codes.Internal` from the set of retryable gRPC status codes. Agent gRPC calls that fail with `codes.Internal` will no longer be retried. There is no universal agreement on which gRPC codes should be retried, but `Internal` does not seem to be a common one to retry, if ever.

2. Adding a timeout of 30 seconds to any retries. Currently, the exponential retries have a maximum timeout of 15 minutes. I assume this might be required by some other functions so agents resume their operation in case the webserver restarts. Still, this is likely the cause behind the large CPU increase, as agents can be stuck retrying thousands of requests over a large window of time. The previous change alone should be enough to solve this issue, but I think this is a good idea to prevent similar problems from arising in the future. (A sketch of both changes, for illustration, follows this list.)
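A minimal Go sketch of both ideas, for illustration only and not the actual Woodpecker client code (the package name, helper, and backoff values here are hypothetical): `codes.Internal` is absent from the retryable set, and the whole retry loop is bounded by a 30-second context timeout.

```go
package agentretry // hypothetical package name, for illustration only

import (
	"context"
	"time"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// retryableCodes mirrors the intent of the change: codes.Internal is deliberately
// absent, so calls that fail with it (e.g. the UTF-8 marshaling error) fail fast.
var retryableCodes = map[codes.Code]bool{
	codes.Aborted:          true,
	codes.DataLoss:         true,
	codes.DeadlineExceeded: true,
	codes.Unavailable:      true,
}

// callWithRetry retries fn with exponential backoff, but only for retryable
// status codes, and gives up once a 30-second deadline is reached.
func callWithRetry(parent context.Context, fn func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(parent, 30*time.Second)
	defer cancel()

	backoff := 100 * time.Millisecond
	for {
		err := fn(ctx)
		if err == nil {
			return nil
		}
		if !retryableCodes[status.Code(err)] {
			return err // e.g. codes.Internal: do not keep hammering the server
		}
		select {
		case <-ctx.Done():
			return err
		case <-time.After(backoff):
			backoff *= 2
		}
	}
}
```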

@fernandrone (Contributor, Author) commented Apr 8, 2024

EDIT: one thing to note is that this does not correct the parsing errors. It's not clear to me what causes them; I have seen them come from numerous different pipelines. All I'm doing here is reducing the blast radius by avoiding retries that don't seem to accomplish much... Another consideration is that the code I'm modifying is 7 years old, yet I never observed anything like this when running Drone 0.4 to 0.8.5 on EC2. So I guess this could be either some interaction with something new introduced in the Woodpecker codebase or just some special interaction with our Kubernetes infrastructure. Hard to say.

Also, some more information about the issue and testing: we deployed this to our infrastructure on Friday night (running Woodpecker 2.3.1 with 20 agents and 2 workflows per agent at the time).

At the time, by the end of the day most agents would be stuck using 500mCPU (the limit we had configured for them in Kubernetes). As one can imagine, a total of 20 agents using 500mCPU each to run a maximum of 40 workflows is a huge cost to pay for infrastructure alone (and in this case the CPU usage persisted even when no pipelines were running; it was all caused by agents trying to process logs of steps that had already finished).

After this fix most of these agents now use less than 20mCPU, and none has peaked over 50m. That's roughly a 20x decrease in resource consumption.

Screenshot of our internal Grafana before/after the fix (we've run dozens of very log-heavy pipelines after the fix to confirm it works, plus we had a whole day of pipelines running this Monday, and the performance gains remain).


@qwerty287 added the bug (Something isn't working) and backend/kubernetes labels on Apr 9, 2024
@eliasscosta (Contributor) commented:

We noticed this fix also resolves another issue: #3468

@qwerty287 added this to the 2.5.0 milestone on Apr 9, 2024
@pat-s (Contributor) commented Apr 12, 2024

@fernandrone Thanks for the deep dive and the work you put in!

I don't have any experience with codes.Internal and hence can't really infer whether this change would also come with some downsides. Yet the upsides you report are tremendous, and solely from reading your description it all makes sense.

There's still a lint error in the tests though.

Overall this needs a review from somebody who can judge the code changes better.

@fernandrone (Contributor, Author) commented:

> I don't have any experience with codes.Internal and hence can't really infer whether this change would also come with some downsides

I get that -- and feel the same way. The best I can say here is that we've been running this for a week and haven't experienced any bugs.

In case anyone has any cold feet about removing codes.Internal, there's a middle ground where we refrain from removing codes.Internal but keep the new timeout. That might be enough (we'd have to test it).

Will look into the lint errors.

@anbraten (Member) commented Apr 15, 2024

> In case anyone has any cold feet about removing codes.Internal, there's a middle ground where we refrain from removing codes.Internal but keep the new timeout. That might be enough (we'd have to test it).

If I get the issue correctly, sending a log line via gRPC is failing somehow and we get an Internal error code. Do you have an idea why this error is originally thrown? Are those logs still saved in case we remove the check? Sounds to me like there is an issue with the way Woodpecker is dealing with logs in general 🤔

@fernandrone (Contributor, Author) commented:

> If I get the issue correctly, sending a log line via gRPC is failing somehow and we get an Internal error code.

That's my interpretation too.

> Do you have an idea why this error is originally thrown?

Not really. We've tried to track down which workflow could cause it, thinking it would be some kind of non-UTF-8 character breaking it, but we haven't been successful in finding the origin.

There's a similar error that's happening somewhere else:

{"level":"error","repo":"<redacted>","pipeline":"11591","id":"76915","image":"docker.io/woodpeckerci/plugin-git:2.4.0","workflowID":"76915","error":"io: read/write on closed pipe","time":"2024-04-15T14:02:22Z","caller":"/src/agent/logger.go:46","message":"copy limited logStream part"}

However, this error has not been affected by the fix; it seems to happen with the same frequency (so they are likely unrelated), and it seems to affect different kinds of steps/images/workflows at random (so we can't pin it to a specific malformed log).

@eliasscosta found this explanation https://stackoverflow.com/a/76769569 (When deserializing a string field the Go protobuf code validate that strings are UTF-8 ... As far as I am aware the simplest way around this is to edit the .proto and change the string fields to bytes).

But it just reinforces the notion that at some point there's a log line that isn't valid UTF-8, and I don't know why that is.

If that's correct, then changing something here from "string" to "bytes" might fix it: https://github.com/woodpecker-ci/woodpecker/blob/main/pipeline/rpc/proto/woodpecker.proto (likely LogEntry.data, given the problem). But I don't have enough experience with RPC to know the possible implications of migrating string to bytes.

This might be a better solution than mine, in fact. I also like that the woodpecker.proto file is relatively recent (lots of edits in the past year). So if the issue can be tracked to the woodpecker.proto file, it would explain why this bug exists in Woodpecker but not in Drone 0.8 (the code I'm "fixing", however, is 7 years old...). A short illustration of the UTF-8 constraint in question follows after this comment.

> Are those logs still saved in case we remove the check?

I haven't noticed any missing logs on our builds. However, I think the true test would be to run a build that outputs non-UTF-8 characters and see how Woodpecker behaves. Given the issue, I'd expect Woodpecker not to print them, neither before nor after the fix (the fix would just stop Woodpecker from retrying).
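To illustrate the UTF-8 constraint discussed above: the Go protobuf runtime refuses to marshal `string` fields that are not valid UTF-8, which matches the error in the agent logs. A minimal sketch, assuming a hypothetical agent-side helper (not existing Woodpecker code), of sanitizing a log line before it reaches a `string` field:

```go
package logutil // hypothetical package, for illustration only

import (
	"strings"
	"unicode/utf8"
)

// sanitizeUTF8 returns s unchanged when it is already valid UTF-8; otherwise it
// replaces invalid byte sequences with the Unicode replacement character so a
// protobuf string field can be marshaled without an Internal error.
func sanitizeUTF8(s string) string {
	if utf8.ValidString(s) {
		return s
	}
	return strings.ToValidUTF8(s, "\uFFFD")
}
```

Switching the proto field from string to bytes, as discussed, would avoid the validation entirely instead of rewriting the payload; the helper above is only one possible agent-side mitigation.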

@@ -98,7 +98,6 @@ func (c *client) Next(ctx context.Context, f rpc.Filter) (*rpc.Workflow, error)
codes.Aborted,
codes.DataLoss,
codes.DeadlineExceeded,
codes.Internal,
Contributor:

why this change? 🤔

Contributor:

This is explained in the OP (not judging whether it's the right thing to do).

@@ -88,7 +88,11 @@ func (w *LineWriter) Write(p []byte) (n int, err error) {
Type: LogEntryStdout,
Line: w.num,
}
if err := w.peer.Log(context.Background(), line); err != nil {

ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second) //nolint: gomnd
Member:

Hardcoded values should at least be their own const.

Member:

Not addressed yet.
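For reference, a minimal sketch of the const extraction requested above; the constant name and its placement are hypothetical, the point is only that the 30-second value gets one named definition instead of an inline magic number.

```go
package rpc // hypothetical placement, for illustration only

import (
	"context"
	"time"
)

// defaultLogTimeout bounds a single Log() gRPC call (the name is hypothetical).
const defaultLogTimeout = 30 * time.Second

// newLogContext replaces the inline 30*time.Second at the call site.
func newLogContext(parent context.Context) (context.Context, context.CancelFunc) {
	return context.WithTimeout(parent, defaultLogTimeout)
}
```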


codecov bot commented Apr 15, 2024

Codecov Report

Attention: Patch coverage is 0%, with 5 lines in your changes missing coverage. Please review.

Project coverage is 35.89%. Comparing base (225ddb5) to head (12b3f97).

❗ Current head 12b3f97 differs from pull request most recent head 6cf032f. Consider uploading reports for the commit 6cf032f to get more accurate results

| Files | Patch % | Lines |
|---|---|---|
| pipeline/rpc/log_entry.go | 0.00% | 5 Missing ⚠️ |
Additional details and impacted files

```
@@            Coverage Diff             @@
##             main    #3601      +/-   ##
==========================================
- Coverage   36.11%   35.89%   -0.23%     
==========================================
  Files         255      231      -24     
  Lines       16261    15603     -658     
==========================================
- Hits         5872     5600     -272     
+ Misses       9949     9586     -363     
+ Partials      440      417      -23     
```


@pat-s (Contributor) commented Apr 18, 2024

While I still don't have experience with gRPC and friends, the change in this PR essentially removes the code paths that cause the log output issue entirely. This will help in the short term, but I am not sure whether the underlying issue is then only masked and whether (as already pointed out) other messages/outputs will also be lost in the future. But again, I can't judge what codes.Internal is for and in which situations it would be important.

I use the k8s backend also a lot in a business context and some builds also have a lot of log output but I haven't faced these issues so far (which doesn't mean they don't exist).

@fernandrone Do you need PR images in this PR so you can use these for your instances? (I guess you already have your own custom images in use?).

Overall I can't judge this PR. I guess this needs a decision from @anbraten or @6543 in the end, as it seems that nobody else is able to judge this PR with confidence?

@6543 (Member) commented Apr 18, 2024

Well, the review has not yet been addressed.

@6543 modified the milestones: 2.5.0, 2.6.0 on Apr 18, 2024
@anbraten (Member) commented Apr 19, 2024

As removing codes.Internal just ignores the actual error instead of avoiding it, I would like to keep it. I think the timeout part can be added with a TODO / HACK comment as a "worst-case" loophole.

After that, the log streaming in general seems to need some performance work, and #2072 by @6543 looks like a great start (probably solving other issues as well).

@fernandrone (Contributor, Author) commented:

FYI I was a bit short on time these past few weeks. I'll try to get back to this next week and revert the codes.Internal change.

I will try out this idea later as well:

> If that's correct, then changing something here from "string" to "bytes" might fix it: https://github.com/woodpecker-ci/woodpecker/blob/main/pipeline/rpc/proto/woodpecker.proto (likely LogEntry.data, given the problem). But I don't have enough experience with RPC to know the possible implications of migrating string to bytes.

And see if that fixes the error instead of just suppressing it.

@anbraten (Member) commented May 2, 2024

If you allow maintainers access to your PR we could also support you a bit.

@fernandrone (Contributor, Author) commented May 2, 2024

Unfortunately I can't seem to enable this, apparently because I created this fork from an organization: https://github.com/orgs/community/discussions/5634

FWIW I'm considering using a personal fork for future contributions.


Labels
bug Something isn't working
Successfully merging this pull request may close these issues:

- Missing pipeline status when pod complete too fast on k8s backend
- k8s agent fails to tail step logs

7 participants