
Resctrl collector fixes #3326

Open · wants to merge 4 commits into base: master
Conversation

JulSenko

Resctrl collector fixes:

  • Resume stat collection once a collector error is resolved. From the code it looks like this was intended, but enabled=true was missing (a minimal sketch follows this list).
  • Rely on libcontainer funcs to collect pids/threads instead of executing a ps call, which is very expensive.
  • Collect threads instead of pids. During collector startup we collect threads, so the subsequent checks should also compare threads, not pids. The pid approach works for relatively simple processes, but for any complex one the pid list will never match the original thread list, resulting in an error on every collector run.
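A minimal sketch of the first fix, using made-up field and function names rather than cAdvisor's actual collector wiring: after a later check succeeds, the collector has to be switched back on, not just have its error cleared.

```go
// Illustrative only: re-enabling a collector after a transient error clears.
package main

import "fmt"

type collector struct {
	enabled bool
	err     error
}

// update is a hypothetical stand-in for the collector's periodic check.
func (c *collector) update(check func() error) {
	if err := check(); err != nil {
		c.enabled = false
		c.err = err
		return
	}
	// The missing piece described above: without setting enabled back to
	// true, the collector stays off even after the error is resolved.
	c.enabled = true
	c.err = nil
}

func main() {
	c := &collector{}
	c.update(func() error { return fmt.Errorf("transient failure") })
	fmt.Println("enabled:", c.enabled) // false while the error persists
	c.update(func() error { return nil })
	fmt.Println("enabled:", c.enabled) // true again once the check passes
}
```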

@k8s-ci-robot
Collaborator

Hi @JulSenko. Thanks for your PR.

I'm waiting for a google member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

monGroupPath = filepath.Join(controlGroupPath, monitoringGroupDir, properContainerName)
rmErr := os.Remove(monGroupPath)
if rmErr != nil && !os.IsNotExist(rmErr) {
return "", fmt.Errorf("couldn't clean up monitoring group matching %q container: %v", containerName, rmErr)
Collaborator

Wrap the error, please.

Author

Is Errorf not enough? Do you have a specific error type in mind? Please let me know how to improve this error.

Collaborator

Suggested change
return "", fmt.Errorf("couldn't clean up monitoring group matching %q container: %v", containerName, rmErr)
return "", fmt.Errorf("couldn't clean up monitoring group matching %q container: %w", containerName, rmErr)

if existingPath != monGroupPath {
rmErr = os.Remove(existingPath)
if rmErr != nil && !os.IsNotExist(rmErr) {
return "", fmt.Errorf("couldn't clean up monitoring group matching %q container: %v", containerName, rmErr)
Collaborator

Wrap the error, please.

Author

Do you have a specific error type in mind? Please let me know how to improve this error.

Collaborator

Suggested change
return "", fmt.Errorf("couldn't clean up monitoring group matching %q container: %v", containerName, rmErr)
return "", fmt.Errorf("couldn't clean up monitoring group matching %q container: %w", containerName, rmErr)

@@ -233,7 +251,7 @@ func findGroup(group string, pids []string, includeGroup bool, exclusive bool) (
for _, path := range availablePaths {
groupFound, err := arePIDsInGroup(path, pids, exclusive)
if err != nil {
return "", err
return path, err
Collaborator

Why do you want to return a real value instead of the zero value when you return an error?

Author

This is not ideal, but we need the path value for an additional check later if the error type is errNotEnoughPIDs or errTooManyPIDs. E.g. the group could have changed completely and we need to recreate it, or the path is correct but some threads died / new ones were spawned.

Collaborator

If this is supposed to be information about an error, then it should be part of an error struct rather than a value returned from the function. I would expect you to check if err == errNotEnoughPIDs or err == errTooManyPIDs and act accordingly. The interface that you propose is difficult to grasp, in my opinion.
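A sketch of the interface the reviewer is describing, with a hypothetical error type rather than anything from this PR: attach the path to the error itself so callers inspect the error chain instead of relying on a partially valid return value.

```go
// Hypothetical error type carrying the offending group path.
package main

import (
	"errors"
	"fmt"
)

var errNotEnoughPIDs = errors.New("not enough PIDs in group")

// groupError wraps a sentinel error and records which path triggered it.
type groupError struct {
	Path string
	Err  error
}

func (e *groupError) Error() string { return fmt.Sprintf("%s: %v", e.Path, e.Err) }
func (e *groupError) Unwrap() error { return e.Err }

// findGroup returns the zero value plus a rich error on mismatch
// (illustrative path, not a real resctrl layout).
func findGroup() (string, error) {
	return "", &groupError{Path: "/sys/fs/resctrl/mon_groups/example", Err: errNotEnoughPIDs}
}

func main() {
	_, err := findGroup()
	var ge *groupError
	if errors.As(err, &ge) && errors.Is(err, errNotEnoughPIDs) {
		fmt.Println("recreate or repair the group at", ge.Path)
	}
}
```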

Comment on lines -108 to +111
pids, err := getContainerPids()
pids, err := getPids(containerName)
Collaborator

Is getContainerPids() used (it does not seem to be)? If not, would it be possible to drop the argument?

Author

It should be, but it touches quite a few definitions. I'll follow up on this if that works for you :)
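For reference, one way the libcontainer-based PID collection mentioned in the PR description could look; the cgroup path and the helper wrapped by getPids are assumptions, not taken from this PR.

```go
// Sketch: read a container's PIDs from its cgroup via runc's libcontainer
// helpers instead of shelling out to `ps`.
package main

import (
	"fmt"

	"github.com/opencontainers/runc/libcontainer/cgroups"
)

func main() {
	// Hypothetical cgroup path, for illustration only.
	cgroupPath := "/sys/fs/cgroup/system.slice/example-container.scope"

	// GetAllPids walks the cgroup directory tree and returns every PID in it.
	pids, err := cgroups.GetAllPids(cgroupPath)
	if err != nil {
		fmt.Println("reading cgroup pids:", err)
		return
	}
	fmt.Println("container pids:", pids)
}
```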

resctrl/utils.go Outdated
Comment on lines 123 to 130
var processThreads []string
for _, pid := range pids {
processThreads, err = getAllProcessThreads(filepath.Join(processPath, strconv.Itoa(pid), processTask))
if err != nil {
return "", err
}
}

Collaborator

Can you explain what the benefits of this approach are? As far as I understand (and I might be wrong) you extracted obtaining the task IDs into a separate loop, but the result will be the same.

Author

Now the threads are extracted earlier and passed as a parameter to the findGroup func, which previously received pids, resulting in an incomplete list.
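A self-contained illustration of the point being made here, with made-up IDs rather than the PR's code: the monitoring group holds the thread IDs written at startup, so a later membership check against bare PIDs fails for any multi-threaded container.

```go
// Compares a monitoring group's task list against a PID list vs. a thread list.
package main

import "fmt"

// sameMembers reports whether two ID lists contain exactly the same entries.
func sameMembers(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	seen := make(map[string]bool, len(a))
	for _, id := range a {
		seen[id] = true
	}
	for _, id := range b {
		if !seen[id] {
			return false
		}
	}
	return true
}

func main() {
	groupTasks := []string{"100", "101", "102"} // TIDs written at collector startup
	pids := []string{"100"}                     // only the main PID
	threads := []string{"100", "101", "102"}    // the full thread list

	fmt.Println(sameMembers(groupTasks, pids))    // false: spurious error every run
	fmt.Println(sameMembers(groupTasks, threads)) // true
}
```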

@iwankgb
Collaborator

iwankgb commented Jun 18, 2023

@JulSenko, can you merge master to your branch, please?

Comment on lines -149 to +192

if !inHostNamespace {
processPath = "/rootfs/proc"
}

for _, pid := range pids {
processThreads, err := getAllProcessThreads(filepath.Join(processPath, pid, processTask))
for _, thread := range processThreads {
treadInt, err := strconv.Atoi(thread)
if err != nil {
return "", err
return "", fmt.Errorf("couldn't parse %q: %w", thread, err)
}
for _, thread := range processThreads {
err = intelrdt.WriteIntelRdtTasks(monGroupPath, thread)
if err != nil {
secondError := os.Remove(monGroupPath)
if secondError != nil {
return "", fmt.Errorf(
"coudn't assign pids to %q container monitoring group: %w \n couldn't clear %q monitoring group: %v",
containerName, err, containerName, secondError)
}
return "", fmt.Errorf("coudn't assign pids to %q container monitoring group: %w", containerName, err)
err = intelrdt.WriteIntelRdtTasks(monGroupPath, treadInt)
if err != nil {
secondError := os.Remove(monGroupPath)
if secondError != nil {
return "", fmt.Errorf(
"coudn't assign pids to %q container monitoring group: %w \n couldn't clear %q monitoring group: %v",
containerName, err, containerName, secondError)
}
return "", fmt.Errorf("coudn't assign pids to %q container monitoring group: %w", containerName, err)
Collaborator

Correct me if I'm wrong but:

  • The Old Way:
    1. Iterate over all pids.
    2. Fetch all the tids for a pid
    3. Iterate over all tids
    4. Write the tid to the monitoring group.
    5. Exit tids loop.
    6. Exit pids loop.
  • The New Way:
    1. Iterate over all pids.
    2. Save all the tids for a pid to a variable.
    3. Exit pids loop.
    4. Iterate over all tids.
    5. Write the tid to the monitoring group.
    6. Exit tids loop.

Functionally these two approaches are identical. In the PR description you wrote:

Collect threads instead of pids. During collector startup we collect threads, so the subsequent checks should also compare threads, not pids. The pid approach works for relatively simple processes, but for any complex one the pid list will never match the original thread list, resulting in an error on every collector run.

, but threads (tids) have always been collected and written to the monitoring group.

I might be missing something, but I can't see where the bug you are trying to fix is. I would appreciate a more detailed explanation that will help me understand your reasoning. A test case failing with The Old Way and passing with The New Way would be perfect!
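For reference, a simplified paraphrase of the two loop shapes being compared, with stand-in helpers rather than the PR's functions.

```go
// oldWay and newWay mirror the two orderings described above.
package main

import "fmt"

// fetchTIDs stands in for reading /proc/<pid>/task.
func fetchTIDs(pid int) []int { return []int{pid, pid + 1} }

// writeToGroup stands in for intelrdt.WriteIntelRdtTasks.
func writeToGroup(tid int) { fmt.Println("write", tid) }

// The Old Way: fetch and write TIDs inside the pid loop.
func oldWay(pids []int) {
	for _, pid := range pids {
		for _, tid := range fetchTIDs(pid) {
			writeToGroup(tid)
		}
	}
}

// The New Way, mirroring the quoted diff: fetch TIDs in one loop, write in another.
// Note: with plain assignment like this, only the last pid's TIDs reach the second loop.
func newWay(pids []int) {
	var tids []int
	for _, pid := range pids {
		tids = fetchTIDs(pid)
	}
	for _, tid := range tids {
		writeToGroup(tid)
	}
}

func main() {
	oldWay([]int{10, 20})
	newWay([]int{10, 20})
}
```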

@iwankgb
Collaborator

iwankgb commented Jul 20, 2023

/ok-to-test
