
Resctrl collector fixes #3326

Open · wants to merge 4 commits into base: master
Conversation

JulSenko

Resctrl collector fixes:

  • Resume stat collection once a collector error is resolved. From the code it looks like this was intended, but enabled=true was missing (a minimal sketch follows this list).
  • Rely on libcontainer funcs to collect pids/threads instead of executing a ps call, which is very expensive.
  • Collect threads instead of pids. During collector startup we collect threads, so the subsequent checks should also compare threads, not pids. The pid approach works for relatively simple processes, but for any complex one the pid list will never match the original thread list, resulting in an error on every collector run.
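A minimal sketch of the first fix, using made-up field and function names rather than cAdvisor's actual collector wiring: after a later check succeeds, the collector has to be switched back on, not just have its error cleared.

```go
// Illustrative only: re-enabling a collector after a transient error clears.
package main

import "fmt"

type collector struct {
	enabled bool
	err     error
}

// update is a hypothetical stand-in for the collector's periodic check.
func (c *collector) update(check func() error) {
	if err := check(); err != nil {
		c.enabled = false
		c.err = err
		return
	}
	// The missing piece described above: without setting enabled back to
	// true, the collector stays off even after the error is resolved.
	c.enabled = true
	c.err = nil
}

func main() {
	c := &collector{}
	c.update(func() error { return fmt.Errorf("transient failure") })
	fmt.Println("enabled:", c.enabled) // false while the error persists
	c.update(func() error { return nil })
	fmt.Println("enabled:", c.enabled) // true again once the check passes
}
```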

@k8s-ci-robot
Collaborator

Hi @JulSenko. Thanks for your PR.

I'm waiting for a google member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

monGroupPath = filepath.Join(controlGroupPath, monitoringGroupDir, properContainerName)
rmErr := os.Remove(monGroupPath)
if rmErr != nil && !os.IsNotExist(rmErr) {
return "", fmt.Errorf("couldn't clean up monitoring group matching %q container: %v", containerName, rmErr)
Collaborator

Wrap the error, please.

Author

Is Errorf not enough? Do you have a specific error type in mind? Please let me know how to improve this error.

Collaborator

Suggested change
return "", fmt.Errorf("couldn't clean up monitoring group matching %q container: %v", containerName, rmErr)
return "", fmt.Errorf("couldn't clean up monitoring group matching %q container: %w", containerName, rmErr)

if existingPath != monGroupPath {
rmErr = os.Remove(existingPath)
if rmErr != nil && !os.IsNotExist(rmErr) {
return "", fmt.Errorf("couldn't clean up monitoring group matching %q container: %v", containerName, rmErr)
Collaborator

Wrap the error, please.

Author

Do you have a specific error type in mind? Please let me know how to improve this error.

Collaborator

Suggested change
return "", fmt.Errorf("couldn't clean up monitoring group matching %q container: %v", containerName, rmErr)
return "", fmt.Errorf("couldn't clean up monitoring group matching %q container: %w", containerName, rmErr)

@@ -233,7 +251,7 @@ func findGroup(group string, pids []string, includeGroup bool, exclusive bool) (
for _, path := range availablePaths {
groupFound, err := arePIDsInGroup(path, pids, exclusive)
if err != nil {
return "", err
return path, err
Collaborator

Why do you want to return a real value instead of the zero value when you return an error?

Author

This is not ideal, but we need the path value for an additional check later if the error type is errNotEnoughPIDs or errTooManyPIDs. E.g. the group could have changed completely and we need to recreate it, or the path is correct but some threads died / new ones were spawned.

Collaborator

If this is supposed to be information about an error, then it should be part of an error struct rather than a value returned from the function. I would expect you to check if err == errNotEnoughPIDs or err == errTooManyPIDs and act accordingly. The interface that you propose is difficult to grasp, in my opinion.
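A sketch of the interface the reviewer is describing, with a hypothetical error type rather than anything from this PR: attach the path to the error itself so callers inspect the error chain instead of relying on a partially valid return value.

```go
// Hypothetical error type carrying the offending group path.
package main

import (
	"errors"
	"fmt"
)

var errNotEnoughPIDs = errors.New("not enough PIDs in group")

// groupError wraps a sentinel error and records which path triggered it.
type groupError struct {
	Path string
	Err  error
}

func (e *groupError) Error() string { return fmt.Sprintf("%s: %v", e.Path, e.Err) }
func (e *groupError) Unwrap() error { return e.Err }

// findGroup returns the zero value plus a rich error on mismatch
// (illustrative path, not a real resctrl layout).
func findGroup() (string, error) {
	return "", &groupError{Path: "/sys/fs/resctrl/mon_groups/example", Err: errNotEnoughPIDs}
}

func main() {
	_, err := findGroup()
	var ge *groupError
	if errors.As(err, &ge) && errors.Is(err, errNotEnoughPIDs) {
		fmt.Println("recreate or repair the group at", ge.Path)
	}
}
```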

Comment on lines -108 to +111
pids, err := getContainerPids()
pids, err := getPids(containerName)
Collaborator

Is getContainerPids() used (it does not seem to be)? If not, would it be possible to drop the argument?

Author

It should be, but it touches quite a few definitions. I'll follow up on this if that works for you :)
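For reference, one way the libcontainer-based PID collection mentioned in the PR description could look; the cgroup path and the helper wrapped by getPids are assumptions, not taken from this PR.

```go
// Sketch: read a container's PIDs from its cgroup via runc's libcontainer
// helpers instead of shelling out to `ps`.
package main

import (
	"fmt"

	"github.com/opencontainers/runc/libcontainer/cgroups"
)

func main() {
	// Hypothetical cgroup path, for illustration only.
	cgroupPath := "/sys/fs/cgroup/system.slice/example-container.scope"

	// GetAllPids walks the cgroup directory tree and returns every PID in it.
	pids, err := cgroups.GetAllPids(cgroupPath)
	if err != nil {
		fmt.Println("reading cgroup pids:", err)
		return
	}
	fmt.Println("container pids:", pids)
}
```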

resctrl/utils.go Outdated
Comment on lines 123 to 130
var processThreads []string
for _, pid := range pids {
processThreads, err = getAllProcessThreads(filepath.Join(processPath, strconv.Itoa(pid), processTask))
if err != nil {
return "", err
}
}

Collaborator

Can you explain what the benefits of this approach are? As far as I understand (and I might be wrong) you extracted obtaining the task IDs into a separate loop, but the result will be the same.

Author

Now the threads are extracted earlier and passed as a parameter to the findGroup func, which previously received pids, resulting in an incomplete list.
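A self-contained illustration of the point being made here, with made-up IDs rather than the PR's code: the monitoring group holds the thread IDs written at startup, so a later membership check against bare PIDs fails for any multi-threaded container.

```go
// Compares a monitoring group's task list against a PID list vs. a thread list.
package main

import "fmt"

// sameMembers reports whether two ID lists contain exactly the same entries.
func sameMembers(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	seen := make(map[string]bool, len(a))
	for _, id := range a {
		seen[id] = true
	}
	for _, id := range b {
		if !seen[id] {
			return false
		}
	}
	return true
}

func main() {
	groupTasks := []string{"100", "101", "102"} // TIDs written at collector startup
	pids := []string{"100"}                     // only the main PID
	threads := []string{"100", "101", "102"}    // the full thread list

	fmt.Println(sameMembers(groupTasks, pids))    // false: spurious error every run
	fmt.Println(sameMembers(groupTasks, threads)) // true
}
```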

@iwankgb
Collaborator

iwankgb commented Jun 18, 2023

@JulSenko, can you merge master to your branch, please?

Comment on lines -149 to +192

if !inHostNamespace {
processPath = "/rootfs/proc"
}

for _, pid := range pids {
processThreads, err := getAllProcessThreads(filepath.Join(processPath, pid, processTask))
for _, thread := range processThreads {
treadInt, err := strconv.Atoi(thread)
if err != nil {
return "", err
return "", fmt.Errorf("couldn't parse %q: %w", thread, err)
}
for _, thread := range processThreads {
err = intelrdt.WriteIntelRdtTasks(monGroupPath, thread)
if err != nil {
secondError := os.Remove(monGroupPath)
if secondError != nil {
return "", fmt.Errorf(
"coudn't assign pids to %q container monitoring group: %w \n couldn't clear %q monitoring group: %v",
containerName, err, containerName, secondError)
}
return "", fmt.Errorf("coudn't assign pids to %q container monitoring group: %w", containerName, err)
err = intelrdt.WriteIntelRdtTasks(monGroupPath, treadInt)
if err != nil {
secondError := os.Remove(monGroupPath)
if secondError != nil {
return "", fmt.Errorf(
"coudn't assign pids to %q container monitoring group: %w \n couldn't clear %q monitoring group: %v",
containerName, err, containerName, secondError)
}
return "", fmt.Errorf("coudn't assign pids to %q container monitoring group: %w", containerName, err)
Collaborator

Correct me if I'm wrong but:

  • The Old Way:
    1. Iterate over all pids.
    2. Fetch all the tids for a pid
    3. Iterate over all tids
    4. Write the tid to the monitoring group.
    5. Exit tids loop.
    6. Exit pids loop.
  • The New Way:
    1. Iterate over all pids.
    2. Save all the tids for a pid to a variable.
    3. Exit pids loop.
    4. Iterate over all tids.
    5. Write the tid to the monitoring group.
    6. Exit tids loop.

Functionally these two approaches are identical. In the PR description you wrote:

Collect threads instead of pids. During collector startup we collect threads, so the subsequent checks should also compare threads, not pids. The pid approach works for relatively simple processes, but for any complex one the pid list will never match the original thread list, resulting in an error on every collector run.

, but threads (tids) have always been collected and written to the monitoring group.

I might be missing something, but I can't see where the bug you are trying to fix is. I would appreciate a more detailed explanation that will help me understand your reasoning. A test case failing with The Old Way and passing with The New Way would be perfect!
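For reference, a simplified paraphrase of the two loop shapes being compared, with stand-in helpers rather than the PR's functions.

```go
// oldWay and newWay mirror the two orderings described above.
package main

import "fmt"

// fetchTIDs stands in for reading /proc/<pid>/task.
func fetchTIDs(pid int) []int { return []int{pid, pid + 1} }

// writeToGroup stands in for intelrdt.WriteIntelRdtTasks.
func writeToGroup(tid int) { fmt.Println("write", tid) }

// The Old Way: fetch and write TIDs inside the pid loop.
func oldWay(pids []int) {
	for _, pid := range pids {
		for _, tid := range fetchTIDs(pid) {
			writeToGroup(tid)
		}
	}
}

// The New Way, mirroring the quoted diff: fetch TIDs in one loop, write in another.
// Note: with plain assignment like this, only the last pid's TIDs reach the second loop.
func newWay(pids []int) {
	var tids []int
	for _, pid := range pids {
		tids = fetchTIDs(pid)
	}
	for _, tid := range tids {
		writeToGroup(tid)
	}
}

func main() {
	oldWay([]int{10, 20})
	newWay([]int{10, 20})
}
```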

@iwankgb
Collaborator

iwankgb commented Jul 20, 2023

/ok-to-test
