Workaround for independent writes to Iterations in parallel, better detection of BP5 which in turn uncovers more instances of the first issue #1619

franzpoeschel · 2024-05-08T13:22:44Z

This partially fixes #1616 until we add a better solution. With this PR, seriesFlush() always flushes the containing Iteration when called from within an Iteration, ignoring missing dirty annotations.
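To illustrate why ignoring the dirty annotation matters in parallel, here is a minimal toy model in plain C++. It is not the openPMD-api implementation: `ToyIteration`, `flush_if_dirty`, `flush_from_within`, `ranks_agree` and `simulate` are hypothetical names, and the "ranks" are just vector entries rather than MPI processes. The sketch assumes that a backend flush is a collective operation, so every rank has to issue it the same number of times.

```cpp
#include <cassert>
#include <vector>

// Toy stand-in for the per-rank bookkeeping of one iteration
// (hypothetical type, not an openPMD-api class).
struct ToyIteration
{
    bool dirty = false;  // set when this rank wrote something
    int flush_count = 0; // how often the backend was asked to flush
};

// Behaviour without the workaround in this model:
// an iteration without a dirty annotation is skipped.
void flush_if_dirty(ToyIteration &it)
{
    if (it.dirty)
    {
        ++it.flush_count;
        it.dirty = false;
    }
}

// Workaround in this model: a flush issued from within the iteration
// marks it dirty first, so it is flushed on every rank.
void flush_from_within(ToyIteration &it)
{
    it.dirty = true;
    flush_if_dirty(it);
}

// True if all ranks issued the same number of backend flushes.
bool ranks_agree(std::vector<ToyIteration> const &ranks)
{
    for (auto const &r : ranks)
    {
        if (r.flush_count != ranks.front().flush_count)
        {
            return false;
        }
    }
    return true;
}

// Two ranks, only rank 0 writes; `force` selects the workaround.
bool simulate(bool force)
{
    std::vector<ToyIteration> ranks(2);
    ranks[0].dirty = true;
    for (auto &r : ranks)
    {
        if (force)
        {
            flush_from_within(r);
        }
        else
        {
            flush_if_dirty(r);
        }
    }
    return ranks_agree(ranks);
}
```

In this model, `simulate(false)` reports a mismatch because only the writing rank reaches the backend, while `simulate(true)` has both ranks flush once.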

At the same time, I added a better detection for BP5-specific features. Since this means that adios2::Engine::PerformDataWrite() is used automatically more often, this uncovers further parallel flushing bugs. So, these two items are treated together in this PR.

In a follow-up PR later on, as a more breaking change, we would also flush all open iterations in MPI-parallel contexts on series.flush(), but for this we will first need functionality to reopen iterations after close #1592.

TODO:

documentation
testing

franzpoeschel · 2024-06-06T13:18:06Z

test/ParallelIOTest.cpp

@@ -946,10 +946,16 @@ void hipace_like_write(std::string const &file_ending)
    int const last_step = 100;
    int const my_first_step = i_mpi_rank * int(local_Nz);
    int const all_last_step = last_step + (i_mpi_size - 1) * int(local_Nz);
+
+    bool participate_in_barrier = true;


@ax3l Can you please check if this bug also affects Hipace? Currently, the sequences of barriers and flushes don't match from rank to rank. This was uncovered only now, since flushing is effectively not collective in many situations, but this test now uses adios2::Engine::PerformDataWrite() of BP5, which is a bit stricter there.
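The mismatch can be sketched with a small stand-alone model that records which collective calls each rank would issue. This is a simplified assumption about the loop shape, not the actual test code: only the names `my_first_step`, `all_last_step` and `last_step` come from the diff above, while `collective_trace`, `traces_agree` and the `always_flush` switch are hypothetical and no MPI is involved.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Sequence of collective calls that one rank issues in the toy loop.
// Each rank has data only inside its own window of global steps.
std::vector<std::string> collective_trace(
    int rank, int size, int local_nz, int last_step, bool always_flush)
{
    int const my_first_step = rank * local_nz;
    int const all_last_step = last_step + (size - 1) * local_nz;
    std::vector<std::string> trace;
    for (int global = 0; global < all_last_step; ++global)
    {
        trace.push_back("barrier");
        int const step = global - my_first_step;
        bool const has_data = step >= 0 && step < last_step;
        // Without always_flush, only ranks inside their window flush.
        if (has_data || always_flush)
        {
            trace.push_back("flush");
        }
    }
    return trace;
}

// True if every rank issues exactly the same sequence as rank 0.
bool traces_agree(int size, int local_nz, int last_step, bool always_flush)
{
    auto const reference =
        collective_trace(0, size, local_nz, last_step, always_flush);
    for (int rank = 1; rank < size; ++rank)
    {
        if (collective_trace(rank, size, local_nz, last_step, always_flush) !=
            reference)
        {
            return false;
        }
    }
    return true;
}
```

With four ranks, `traces_agree(4, 2, 10, false)` is false because the flushes land at different positions between the barriers, whereas `traces_agree(4, 2, 10, true)` is true once every rank flushes in every step.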

test/ParallelIOTest.cpp

+        "adios2.engine.preferred_flush_target = \"buffer\"");
+    int size, rank;
+    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
+    MPI_Comm_size(MPI_COMM_WORLD, &size);


franzpoeschel added bug internal workaround labels May 8, 2024

ax3l added the affects latest release label May 31, 2024

ax3l self-requested a review May 31, 2024 18:04

ax3l self-assigned this May 31, 2024

github-advanced-security bot found potential problems Jun 3, 2024

franzpoeschel force-pushed the fix-iteration-flush branch from ce5bfc3 to 50fc0d1 Compare June 4, 2024 10:48

franzpoeschel added 5 commits June 5, 2024 11:42

retrieveIteration: return both Series and Iteration

4962161

Optimize implementation

26d2da1

seriesFlush(): Mark containing Iteration as dirty

7eb7e63

Add failing test

63706fe

Backend implementation

17c42f3

franzpoeschel force-pushed the fix-iteration-flush branch from 50fc0d1 to 17c42f3 Compare June 5, 2024 09:42

Add documentation

8bbc170

franzpoeschel force-pushed the fix-iteration-flush branch from 237b394 to 8bbc170 Compare June 6, 2024 13:08

franzpoeschel added 6 commits June 6, 2024 15:08

Add ADIOS2 v2.10 define and use that for BP5 check

50e371c

Ask the engine if it is BP5 for BP5-specific features

ae8030c

write_test_zero_extent: require flush to buffer

5eb1e8f

Somehow PerformDataWrite() leads to trouble with this pattern.

Revert "write_test_zero_extent: require flush to buffer"

b010832

This reverts commit 36597bd. No longer needed after rebasing on fix-iteration-flush

Fix hipace_like_write test

6dad31e

It used Series::flush non-collectively

Also ensure all ranks flush in group/variable encoding

92c5de8

franzpoeschel changed the title ~~Workaround for independent writes to Iterations in parallel~~ Workaround for independent writes to Iterations in parallel, better detection of BP5 which in turn uncovers more instances of the first issue Jun 6, 2024

franzpoeschel commented Jun 6, 2024

github-advanced-security bot found potential problems Jun 6, 2024

test/ParallelIOTest.cpp

        "adios2.engine.preferred_flush_target = \"buffer\"");
    int size, rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

Check notice: Code scanning / CodeQL
Commented-out code (Note): This comment appears to contain commented-out code.

franzpoeschel force-pushed the fix-iteration-flush branch from 7b9d624 to 9dd2a14 Compare June 6, 2024 15:05

Seems we need the MPI workaround for Conda now too.....

72a465c

franzpoeschel force-pushed the fix-iteration-flush branch from 9dd2a14 to 72a465c Compare June 6, 2024 15:54
