New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[DO NOT MERGE YET] Triage function improved for handling multiple packages with slicing. #254
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please also write "Fixes #250" in the commit message.
framework/sandstone.cpp
Outdated
for (; eit != topo.packages.end(); ++eit) | ||
run_cpus.add_package(*eit); | ||
// merge all CPUs from all packages in the list (w/ respect to the removed packages) | ||
for (eit = it; eit != topo.packages.end(); ++eit) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hint:
for (auto eit = it; eit != topo.packages.end(); ++eit)
framework/sandstone.cpp
Outdated
do { | ||
ret = run_tests_on_cpu_set(triage_tests, run_cpus); | ||
} while (!ret && ++k < sApp->retest_count); | ||
if (!run_cpus.empty()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to make this:
if (run.cpus.empty())
continue;
?
That would mean not running the code at 'find first non-empty package to "remove"'.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
'break' instead of 'continue' will do (as next packages will have empty lists as well). Decided to have more "linear", but I'm open for the change.
framework/sandstone.cpp
Outdated
break; | ||
if ((ret == EXIT_SUCCESS) && ever_failed) { | ||
// the last socket removed is the main suspect | ||
result.push_back((*removed_it).id); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking... can we get removed_it == end
here? By construction, we can't get here in the first loop iteration, because we can either set ever_failed
or get here. So if this is the second loop, we must have run the code below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, "removed" entry is set properly (excluded conditions). At least second set must be handled to get here (first with FAIL, second with SUCCESS).
I considered to replace 'ever_failed' with 'removed != end()'.. but I wasn't sure of this change (and therefore left it). Will replace.
framework/sandstone.cpp
Outdated
while (it != topo.packages.end()) { | ||
if (!(*it).cores.empty()) { | ||
removed_it = it; | ||
it++; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please write ++it
for iterators. The two functions are slightly different:
_GLIBCXX20_CONSTEXPR
__normal_iterator&
operator++() _GLIBCXX_NOEXCEPT
{
++_M_current;
return *this;
}
_GLIBCXX20_CONSTEXPR
__normal_iterator
operator++(int) _GLIBCXX_NOEXCEPT
{ return __normal_iterator(_M_current++); }
Though the compiler should optimise the difference out of existence.
Also, both sides of the if
advance the iterator. Might it not be simpler to write a for loop?
for ( ; it != topo.packages.end(); ++it) {
if (!it->cores.empty()) {
removed_it = it;
break;
}
or if you like functional programming:
removed_it = std::find_if(it, topo.packages.end(), [](auto eit) { return !eit->cores.empty(); });
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
C++ construct is best here, thanks.
framework/sandstone.cpp
Outdated
|
||
run_cpus.add_package(topo.packages.at(0)); | ||
sApp->enabled_cpus = run_cpus; | ||
if (ret > EXIT_SUCCESS) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please move up the comments from below, to explain that you are indeed checking the result of the last socket's run. This had me thinking if you didn't mean ever_failed
here (you don't). The old code had the comments:
if (ret) { // failed on the last socket as well, so it's the main suspect
// re-run on the first to make sure the last one is faulty
framework/sandstone.cpp
Outdated
result.push_back((*removed_it).id); | ||
} | ||
} else { | ||
// only one package has been checked |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any point in running triage for single-socket systems?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The function returns immediately if there's just one socket.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Initial checking" is back (with improvements on "empty" packages).
Not sure if I'm understanding the whole idea. What are we trying to find? My "best hit" is to report set of shorter groups of packages which fails together. I haven't tested how slicing works on more-than-2S, and therefore not sure what is the implementation on.
In general a list of couples describes the expected result best, but it could be quite time-expensive, as the complexity is 2^n-1, and might not suit our needs/constraints.
Currently the implementation verifies only a subset of possibilities. For 2S platform packages 1+2, 2, 1 are being tested (what makes even less sense, as 1+2 is always expected to fail).
Currently "most common" are failures on separate packages, it can be implemented with small effort and without any changes in the interface.
Second group of failures is for inter-package communication, where pairs of packages (which fails together) should be reported.
Don't see any rational for bigger groups.
20480db
to
378e4b8
Compare
If no CPUs are enabled for particular package, the package is not being triaged (if the fault is isolated to this particular package). As a result, the test doesn't success (with "no test executed" skip reason) and doesn't introduce any confusion and unexpected reports suggesting regressions. Current pattern of triaged packages is: 2+..+n, ..., n, 1. For 2S (or any setups with 2 packages) the pattern is reduced to 2, 1. Proper number of test runs is logged (1/2/3/.. instead of 2/4/6). Signed-off-by: Jarek Poswiata <jaroslaw.poswiata@intel.com>
Probably unnecessary after the socket-separation branch merges. |
If no CPUs are enabled for particular package, the package is not being triaged (if the fault is isolated to this particular package). As a result, the test doesn't success (with "no test executed" skip reason) and doesn't introduce any confusion and unexpected reports suggesting regressions.
Proper number of test runs is logged (1/2/3/.. instead of 2/4/6).
Fixes #250