Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use serial ZOLTAN load balancer by default. #5214

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

blattms
Copy link
Member

@blattms blattms commented Feb 19, 2024

We have experienced rather poor partitioning when using the parallel version of ZOLTAN. We are not sure what the cause for this is and wheher we can fix this by using different defaults or different underlying partioners.

For the time being we simply change the default. This will cause some time increase when loadbalancing the grid but the rest of the simulator might actually be faster and compensate for this.

We have experienced rather poor partitioning when using the parallel
version of ZOLTAN. We are not sure what the cause for this is and wheher we can
fix this by using different defaults or different underlying
partioners.

For the time being we simply change the default. This will cause some
time increase when loadbalancing the grid but the rest of the
simulator might actually be faster and compensate for this.
@blattms
Copy link
Member Author

blattms commented Feb 19, 2024

jenkins build this please

@alfbr
Copy link
Member

alfbr commented Feb 19, 2024

This is a welcome change until we figure out how to make good partitions in parallel.

@blattms
Copy link
Member Author

blattms commented Feb 19, 2024

the error for SPE1CASE1 looks a bit suspicious

=== Executing comparison for files if these exists in reference folder ===
	 - SPE1CASE1_It0_Pr0.h5
dataset: </GLOBAL_CELL_INDEX/P0> and </GLOBAL_CELL_INDEX/P0>
105 differences found
dataset: </PRESSURE/P0> and </PRESSURE/P0>
105 differences found
dataset: </topologies/topo/elements/connectivity/P0> and </topologies/topo/elements/connectivity/P0>
828 differences found

I need to find out what this means. Topology should definitely not change here.
This seems to be damaris output, though. Might actually point to an issue there.

@akva2 How can I see the full output of ACTIONX_M1, the interesting part is cut because of all the time steps.

@akva2
Copy link
Member

akva2 commented Feb 19, 2024

that's the damaris test so topology arrays changes since the partitioning changes. I'll provide an update for the files once this is good to go. Alternatively you can switch the test to use parallel partitioning.

i don't know where the test output threshold is configured so I've sent you the output on slack.

@blattms
Copy link
Member Author

blattms commented Feb 19, 2024

Thanks the change in ACTION_M1 is a bit concerning, PBUB for restart is different for one cell:

329: Keyword: PBUB, origin Restart, sequence 15
329: Global index (zero based)   = 519
329: Grid coordinate             = (4, 8, 5)
329: (first value, second value) = (190.788, 187.465)
329: 
329: Program threw an exception: [/var/lib/jenkins/workspace/opm-simulators-PR-builder/deps/opm-common/test_util/EclRegressionTest.cpp:215] Deviations exceed tolerances.
329: The absolute deviation is 3.32342529296875, and the tolerance limit is 0.02.
329: The relative deviation is 0.017419448128747354, and the tolerance limit is 0.01.
329: Comparing '/.../results/parallel/flow+actionx_m1/ACTIONX_M1' to '/.../results/parallel/flow+actionx_m1/mpi/ACTIONX_M1'.

I might need to run this on my system so, what exactly changes. Maybe the actions are triggered at different times now.

@blattms
Copy link
Member Author

blattms commented Feb 19, 2024

I have no real clue. The reported difference is for restart of report step 15 it seems.

That is a bit strange because:
When comparing parallel simulations of master and this branch first and last time step in during report step 3 are slightly different (40 seconds) for case ACTION_M1. Actions during that report step are triggered in the same time steps, just the steps are 40 seconds later.

@blattms
Copy link
Member Author

blattms commented Feb 21, 2024

@bska Do you have a hint where the difference for the restart output of the last report step might come from?

@bska
Copy link
Member

bska commented Feb 22, 2024

Do you have a hint where the difference for the restart output of the last report step might come from?

I'm afraid I don't. This test has been unstable/sensitive for a long time. I reduced its maximum timestep size in commit a2fa381 (PR #4749), but I guess that just hid the problem instead of actually solving it. Add to that that we way we compute PBUB is also somewhat unstable and we have a recipe for problems which are hard to track down. In this case I'd be tempted to just remove the PBPD property from RPTRST which will remove the PBUB and PDEW restart file arrays.

@blattms
Copy link
Member Author

blattms commented Feb 23, 2024

In this case I'd be tempted to just remove the PBPD property from RPTRST which will remove the PBUB and PDEW restart file arrays.

Shall we do that then? @tskille, as the original case is from you: Would removing be OK with you?

@bska
Copy link
Member

bska commented Mar 7, 2024

I'm rerunning the build check here, mostly to recreate the detailed failure descriptions on the CI system.

@bska
Copy link
Member

bska commented Mar 7, 2024

jenkins build this please

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants