[DO NOT MERGE YET] Triage function improved for handling multiple packages with slicing. #254

jposwiata · 2023-06-22T13:47:56Z

If no CPUs are enabled for a particular package, that package is not triaged (to determine whether the fault is isolated to that package). As a result, the test does not succeed with a "no test executed" skip reason, which avoids confusion and unexpected reports suggesting regressions.

The proper number of test runs is logged (1/2/3/... instead of 2/4/6).

Fixes #250

thiagomacieira

Please also write "Fixes #250" in the commit message.

thiagomacieira · 2023-06-22T16:32:49Z

framework/sandstone.cpp

-        for (; eit != topo.packages.end(); ++eit)
-            run_cpus.add_package(*eit);
+        // merge all CPUs from all packages in the list (w/ respect to the removed packages)
+        for (eit = it; eit != topo.packages.end(); ++eit) {


Hint:

for (auto eit = it; eit != topo.packages.end(); ++eit)

thiagomacieira · 2023-06-22T18:28:44Z

framework/sandstone.cpp

-        do {
-            ret = run_tests_on_cpu_set(triage_tests, run_cpus);
-        } while (!ret && ++k < sApp->retest_count);
+        if (!run_cpus.empty()) {


Would it make sense to make this:

if (run_cpus.empty()) continue;

?

That would mean not running the code at 'find first non-empty package to "remove"'.

'break' instead of 'continue' would do, since the following packages will have empty lists as well. I decided to keep the flow more "linear", but I'm open to the change.
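
As a rough illustration of the guard discussed here, the sketch below skips the retry loop when the CPU set is empty and reports that nothing ran instead of reporting a pass. The name `retest_on`, the `std::function` runner, and the `std::optional` return value are hypothetical stand-ins, not the framework's API; only the shape of the `do`/`while` retry loop is taken from the diff above.

```cpp
#include <cassert>
#include <functional>
#include <optional>
#include <vector>

// Hypothetical helper: `run` stands in for run_tests_on_cpu_set() and
// returns 0 on success, non-zero on failure.
std::optional<int> retest_on(const std::vector<int> &cpus, int retest_count,
                             const std::function<int(const std::vector<int> &)> &run)
{
    assert(retest_count > 0);
    if (cpus.empty())
        return std::nullopt;    // nothing to run: neither a pass nor a failure

    int ret = 0;
    int k = 0;
    do {
        ret = run(cpus);        // stop at the first failure
    } while (ret == 0 && ++k < retest_count);
    return ret;
}
```

Returning "nothing ran" as a distinct state is one way to keep an empty package from being counted as a successful run; an early `continue` or `break` in the caller achieves the same effect.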

thiagomacieira · 2023-06-22T18:31:46Z

framework/sandstone.cpp

-            break;
+            if ((ret == EXIT_SUCCESS) && ever_failed) {
+                // the last socket removed is the main suspect
+                result.push_back((*removed_it).id);


Thinking... can we get removed_it == end here? By construction, we can't get here in the first loop iteration, because we either set ever_failed or get here. So if this is the second iteration, we must have run the code below.

No, the "removed" entry is set properly (those conditions are excluded). At least a second set must have been handled to get here (the first with FAIL, the second with SUCCESS).
I considered replacing 'ever_failed' with 'removed != end()', but I wasn't sure about that change and therefore left it. Will replace.

thiagomacieira · 2023-06-22T18:34:03Z

framework/sandstone.cpp

+        while (it != topo.packages.end()) {
+            if (!(*it).cores.empty()) {
+                removed_it = it;
+                it++;


Please write ++it for iterators. The two functions are slightly different:

_GLIBCXX20_CONSTEXPR __normal_iterator& operator++() _GLIBCXX_NOEXCEPT
{ ++_M_current; return *this; }

_GLIBCXX20_CONSTEXPR __normal_iterator operator++(int) _GLIBCXX_NOEXCEPT
{ return __normal_iterator(_M_current++); }

Though the compiler should optimise the difference out of existence.

Also, both sides of the if advance the iterator. Might it not be simpler to write a for loop?

for ( ; it != topo.packages.end(); ++it) {
    if (!it->cores.empty()) {
        removed_it = it;
        break;
    }
}

or if you like functional programming:

removed_it = std::find_if(it, topo.packages.end(), [](const auto &pkg) { return !pkg.cores.empty(); });

C++ construct is best here, thanks.
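
For reference, a minimal self-contained sketch of the `std::find_if` variant follows. The `Package` struct and the `first_non_empty` helper are hypothetical stand-ins for the framework's topology types. Note that the predicate receives the element itself, not an iterator.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical stand-in for an entry of topo.packages.
struct Package {
    int id;
    std::vector<int> cores;     // empty when no CPUs are enabled in the package
};

using PackageIt = std::vector<Package>::const_iterator;

// Returns the first package in [first, last) that still has cores,
// or `last` when every remaining package is empty.
PackageIt first_non_empty(PackageIt first, PackageIt last)
{
    assert(first <= last);      // both iterators must come from the same vector
    return std::find_if(first, last,
                        [](const Package &pkg) { return !pkg.cores.empty(); });
}
```

The caller still has to compare the result against `end()` before dereferencing it, which is the same check raised earlier for `removed_it`.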

thiagomacieira · 2023-06-22T18:39:17Z

framework/sandstone.cpp


-        run_cpus.add_package(topo.packages.at(0));
-        sApp->enabled_cpus = run_cpus;
+    if (ret > EXIT_SUCCESS) {


Please move up the comments from below, to explain that you are indeed checking the result of the last socket's run. This had me thinking if you didn't mean ever_failed here (you don't). The old code had the comments:

if (ret) {
    // failed on the last socket as well, so it's the main suspect
    // re-run on the first to make sure the last one is faulty

thiagomacieira · 2023-06-22T18:44:12Z

framework/sandstone.cpp

+                result.push_back((*removed_it).id);
+            }
+        } else {
+            // only one package has been checked


Is there any point in running triage for single-socket systems?

The function returns immediately if there's just one socket.

"Initial checking" is back (with improvements on "empty" packages).

I'm not sure I understand the whole idea. What are we trying to find? My best guess is that we want to report the smallest groups of packages that fail together. I haven't tested how slicing works on systems with more than two sockets, so I'm not sure what the implementation does there.
In general, a list of package groups describes the expected result best, but it could be quite time-consuming, as the number of non-empty groups to test is 2^n-1, and that might not suit our needs or constraints.

Currently the implementation verifies only a subset of the possibilities. On a 2S platform, packages 1+2, 2, and 1 are tested (which makes even less sense, as 1+2 is always expected to fail).

Currently the most common failures are on separate packages. Handling them can be implemented with little effort and without any changes to the interface.
The second group of failures involves inter-package communication, where pairs of packages that fail together should be reported.
I don't see any rationale for bigger groups.
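
To put the cost argument in numbers, the sketch below counts how many package sets each strategy would have to run for n packages: single packages, pairs, and every non-empty group. The function names are made up for illustration and are not part of the framework.

```cpp
#include <cassert>
#include <cstdint>

// One run per package.
std::uint64_t single_package_runs(unsigned n)
{
    return n;
}

// One run per unordered pair of packages: n * (n - 1) / 2.
std::uint64_t package_pair_runs(unsigned n)
{
    if (n < 2)
        return 0;
    return static_cast<std::uint64_t>(n) * (n - 1) / 2;
}

// One run per non-empty subset of packages: 2^n - 1.
std::uint64_t all_group_runs(unsigned n)
{
    assert(n < 64);             // keep the shift within 64 bits
    return (std::uint64_t{1} << n) - 1;
}
```

For 4 packages this gives 4 single runs, 6 pair runs, and 15 runs for all groups; for 8 packages it gives 8, 28, and 255.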

If no CPUs are enabled for a particular package, that package is not triaged (to determine whether the fault is isolated to that package). As a result, the test does not succeed with a "no test executed" skip reason, which avoids confusion and unexpected reports suggesting regressions.

The current pattern of triaged packages is: 2+..+n, ..., n, 1. For 2S (or any setup with 2 packages) the pattern is reduced to 2, 1.

The proper number of test runs is logged (1/2/3/... instead of 2/4/6).

Signed-off-by: Jarek Poswiata <jaroslaw.poswiata@intel.com>
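
The triage pattern described in the commit message can be sketched as follows. This is only an illustration: `triage_pattern` is a hypothetical helper, package numbers are 1-based as in the message, and the handling of packages with no enabled CPUs is left out.

```cpp
#include <cassert>
#include <vector>

// Builds the sequence of package sets in the described order:
// {2..n}, {3..n}, ..., {n}, and finally {1}.
// Assumption: with fewer than two packages there is nothing to triage,
// so the result is empty.
std::vector<std::vector<int>> triage_pattern(int n)
{
    assert(n >= 0);
    std::vector<std::vector<int>> pattern;
    for (int first = 2; first <= n; ++first) {
        std::vector<int> set;
        for (int p = first; p <= n; ++p)
            set.push_back(p);
        pattern.push_back(set);
    }
    if (n >= 2)
        pattern.push_back(std::vector<int>{1});   // the first package is checked last
    return pattern;
}
```

For n = 2 this yields {2}, {1}; for n = 3 it yields {2, 3}, {3}, {1}.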

thiagomacieira · 2023-08-11T21:18:25Z

Probably unnecessary after the socket-separation branch merges.

jposwiata requested review from thac0, busykai and thiagomacieira June 22, 2023 13:47

jposwiata linked an issue Jun 22, 2023 that may be closed by this pull request

Triage mode crashes when all CPUs in socket fail #250

Closed

thac0 removed their request for review June 22, 2023 15:11

thiagomacieira requested changes Jun 22, 2023

jposwiata changed the title ~~Triage function improved for handling multiple packages with slicing.~~ [DO NOT MERGE YET] Triage function improved for handling multiple packages with slicing. Jun 23, 2023

jposwiata force-pushed the jposwiata-triage_with_slicing branch from 20480db to 378e4b8 Compare June 23, 2023 17:37
