-
Notifications
You must be signed in to change notification settings - Fork 3.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add stats monitor for calculating UsageNanoCores periodically #10010
base: main
Are you sure you want to change the base?
Conversation
4714878
to
8a7dd89
Compare
Signed-off-by: James Sturtevant <jstur@microsoft.com> Signed-off-by: James Sturtevant <jsturtevant@gmail.com> Signed-off-by: James Sturtevant <jstur@microsoft.com>
8a7dd89
to
7fe6d03
Compare
I've run this against the failing windows k8s e2e test and it is passing:
|
/assign @kiashok |
|
||
func TestGetUsageNanoCores(t *testing.T) { | ||
timestamp := time.Now() | ||
secondAfterTimeStamp := timestamp.Add(time.Second) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we add another test for microseconds or emulate parallel requests to guarantee the old behavior do not appear as a regression?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can add a test where it is one milli second afterward to make sure the calculation would make that different value.
I thought I added a test where it made sure the cached value was returned, but maybe I missed it...
func newMetricMonitor(c *criService) *metricMonitor { | ||
return &metricMonitor{ | ||
c: c, | ||
collectionPeriod: 10 * time.Second, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's an arbitrary value (inherited from another scenario). Should this become configurable in the long term?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Feels like a breaking change when you consider n-3 versions of kubelet and though kubelet is the primary we have other users... let's do it with a config from the start pls.
For example monitor CPU Usage when set to 0 would be cache disabled on demand only as before. 10s default.. hmm
@@ -285,6 +288,9 @@ func (c *criService) Run(ready func()) error { | |||
}() | |||
} | |||
|
|||
log.L.Info("Starting metric cache provider") | |||
c.metricMonitor.Start() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fair to have this monitor.. but I think it should be configured...
Also not clear to me that the conversation in the kubelet issue came to agreement on implementation details.. IE. whether or not this would become an on update process (push model?) and/or how intermediate changes would affect in the field CRI users. Obviously if the stats are only updated once every 10s.. and they are pulling out of cycle every 10s they could be "behind" by 10s. Also if the pod only lives for 9s..
What if the user does not want to pay for this 10s heartbeat thread? May affect $$ when the monitoring thread keeps alive otherwise idle nodes without having some method to disable. Not to say we don't already have keep alive monitoring probe's just sayin, does this make sense to always be on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also not clear to me that the conversation in the kubelet issue came to agreement on implementation details..
that's fair, it seems this needs more discussion and I will add it to the sig-node agenda again.
Also if the pod only lives for 9s..
This is something that isn't handled today I think. The kubelet code has a comment that If no caller calls // this function, the cpu usage will stay nil
Obviously if the stats are only updated once every 10s.. and they are pulling out of cycle every 10s they could be "behind" by 10s.
The only stat that is "cached" now is the Usagenanocores
as the CRI spec defines this as
// Total CPU usage (sum of all cores) averaged over the sample window.
// The "core" unit can be interpreted as CPU core-nanoseconds per second.
The "sample window" was hard coded to 10s in kubelet but it isn't defined in the cri definition and because it isn't defined here it is the source of the problem. It needs to clearer who is responsible for generating the value over the "sample window".
Sounds good, wasn't sure if wanted to start with this but will make those updates, once we have more clarity |
Followed up with @mikebrow and @haircommander in a short call:
|
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
fixes: #9531 and kubernetes/kubernetes#122092
UsageNanoCores
should be calculated on a regular interval. This was previously maintained by kubelet but since moving the stats down to the CRI layer this causes issues. See kubernetes/kubernetes#122092 (comment) for details.