Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DO NOT MERGE YET] Triage function improved for handling multiple packages with slicing. #254

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

jposwiata
Copy link
Contributor

@jposwiata jposwiata commented Jun 22, 2023

If no CPUs are enabled for particular package, the package is not being triaged (if the fault is isolated to this particular package). As a result, the test doesn't success (with "no test executed" skip reason) and doesn't introduce any confusion and unexpected reports suggesting regressions.

Proper number of test runs is logged (1/2/3/.. instead of 2/4/6).

Fixes #250

@jposwiata jposwiata linked an issue Jun 22, 2023 that may be closed by this pull request
@thac0 thac0 removed their request for review June 22, 2023 15:11
Copy link
Contributor

@thiagomacieira thiagomacieira left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please also write "Fixes #250" in the commit message.

framework/sandstone.cpp Outdated Show resolved Hide resolved
for (; eit != topo.packages.end(); ++eit)
run_cpus.add_package(*eit);
// merge all CPUs from all packages in the list (w/ respect to the removed packages)
for (eit = it; eit != topo.packages.end(); ++eit) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hint:

    for (auto eit = it; eit != topo.packages.end(); ++eit)

do {
ret = run_tests_on_cpu_set(triage_tests, run_cpus);
} while (!ret && ++k < sApp->retest_count);
if (!run_cpus.empty()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to make this:

    if (run.cpus.empty())
        continue;

?

That would mean not running the code at 'find first non-empty package to "remove"'.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'break' instead of 'continue' will do (as next packages will have empty lists as well). Decided to have more "linear", but I'm open for the change.

break;
if ((ret == EXIT_SUCCESS) && ever_failed) {
// the last socket removed is the main suspect
result.push_back((*removed_it).id);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking... can we get removed_it == end here? By construction, we can't get here in the first loop iteration, because we can either set ever_failed or get here. So if this is the second loop, we must have run the code below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, "removed" entry is set properly (excluded conditions). At least second set must be handled to get here (first with FAIL, second with SUCCESS).
I considered to replace 'ever_failed' with 'removed != end()'.. but I wasn't sure of this change (and therefore left it). Will replace.

while (it != topo.packages.end()) {
if (!(*it).cores.empty()) {
removed_it = it;
it++;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please write ++it for iterators. The two functions are slightly different:

      _GLIBCXX20_CONSTEXPR
      __normal_iterator&
      operator++() _GLIBCXX_NOEXCEPT
      {
	++_M_current;
	return *this;
      }

      _GLIBCXX20_CONSTEXPR
      __normal_iterator
      operator++(int) _GLIBCXX_NOEXCEPT
      { return __normal_iterator(_M_current++); }

Though the compiler should optimise the difference out of existence.

Also, both sides of the if advance the iterator. Might it not be simpler to write a for loop?

for ( ; it != topo.packages.end(); ++it) {
    if (!it->cores.empty()) {
        removed_it = it;
        break;
}

or if you like functional programming:

removed_it = std::find_if(it, topo.packages.end(), [](auto eit) { return !eit->cores.empty(); });

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

C++ construct is best here, thanks.


run_cpus.add_package(topo.packages.at(0));
sApp->enabled_cpus = run_cpus;
if (ret > EXIT_SUCCESS) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please move up the comments from below, to explain that you are indeed checking the result of the last socket's run. This had me thinking if you didn't mean ever_failed here (you don't). The old code had the comments:

    if (ret) { // failed on the last socket as well, so it's the main suspect
        // re-run on the first to make sure the last one is faulty

framework/sandstone.cpp Show resolved Hide resolved
result.push_back((*removed_it).id);
}
} else {
// only one package has been checked
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any point in running triage for single-socket systems?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function returns immediately if there's just one socket.

Copy link
Contributor Author

@jposwiata jposwiata Jun 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Initial checking" is back (with improvements on "empty" packages).

Not sure if I'm understanding the whole idea. What are we trying to find? My "best hit" is to report set of shorter groups of packages which fails together. I haven't tested how slicing works on more-than-2S, and therefore not sure what is the implementation on.
In general a list of couples describes the expected result best, but it could be quite time-expensive, as the complexity is 2^n-1, and might not suit our needs/constraints.

Currently the implementation verifies only a subset of possibilities. For 2S platform packages 1+2, 2, 1 are being tested (what makes even less sense, as 1+2 is always expected to fail).

Currently "most common" are failures on separate packages, it can be implemented with small effort and without any changes in the interface.
Second group of failures is for inter-package communication, where pairs of packages (which fails together) should be reported.
Don't see any rational for bigger groups.

@jposwiata jposwiata changed the title Triage function improved for handling multiple packages with slicing. [DO NOT MERGE YET] Triage function improved for handling multiple packages with slicing. Jun 23, 2023
@jposwiata jposwiata force-pushed the jposwiata-triage_with_slicing branch from 20480db to 378e4b8 Compare June 23, 2023 17:37
If no CPUs are enabled for particular package, the package is not being
triaged (if the fault is isolated to this particular package).
As a result, the test doesn't success (with "no test executed" skip reason)
and doesn't introduce any confusion and unexpected reports suggesting
regressions.
Current pattern of triaged packages is: 2+..+n, ..., n, 1. For 2S (or
any setups with 2 packages) the pattern is reduced to 2, 1.
Proper number of test runs is logged (1/2/3/.. instead of 2/4/6).

Signed-off-by: Jarek Poswiata <jaroslaw.poswiata@intel.com>
@thiagomacieira
Copy link
Contributor

Probably unnecessary after the socket-separation branch merges.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Triage mode crashes when all CPUs in socket fail
3 participants