
Overwrite interior and priorities #170

Open
nicolasaunai opened this issue Jun 22, 2021 · 5 comments
@nicolasaunai

Hi,

We have a geometry similar to that of NodeData, meaning that some nodes are shared by adjacent patches on borders/corners, and the value at those shared nodes should be equal.

In the code, these border/corner nodes are assigned values from large summations over floating-point numbers (particle data), and although their final values should be identical, they are not, because of accumulated truncation errors.

In serial executions this is not a problem, because overlaps are processed sequentially and a single value prevails on all patches.

In parallel, however, if we unconditionally overwrite interior nodes when exchanging data with schedules, the border nodes basically get swapped between the two PatchDatas concerned with the processed overlap. So if they have slightly different values as a result of truncation errors, they still do afterwards. If we unconditionally set overwrite_interior to false, then border nodes are simply not assigned and keep their slightly different values.

Over time, this slight mismatch appears to grow until shared nodes have totally different values which crashes the model.

How should we deal with this?
We were hoping that conditionally setting overwrite_interior to true or false would let a single value prevail.

The documentation says:

The concept of ``overlap'' or data dependency is more complex for generic box geometry objects than for just cell-centered box indices in the abstract AMR index space. Problems arise in cases where data lies on the outside corners, faces, or edges of a box. For these data types, it is likely that there will exist duplicate data values on different patches.

The solution implemented here introduces the concept of ``priority'' between patches. Data of patches with higher priority can overwrite the interiors (face, node, or edge values associated with cells that constitute the interior of the patch) of patches with lower priorities, but lower priority patches can never overwrite the interiors of higher priority patches. This scheme introduces a total ordering of data and therefore eliminates the duplicate information problem.

In practice, this protocol means two things: (1) the communication routines must always process copies from low priority sources to high priority sources, and (2) patches must be given special permission to overwrite their interior values during a write. All destinations are therefore represented by three quantities: (1) the box geometry of the destination (which encodes the box, ghost cells, and geometry), (2) the box geometry of the source, and (3) a flag indicating whether the source has a higher priority than the destination (that is, whether the source can overwrite the interior of the destination). If the overwrite flag is set, then data will be copied over the specified box domain and may write into the interior of the destination. If the overwrite flag is not set, then data will be copied only into the ghost cell values and not the interior values of the patch.

However, in our override of BoxGeometry, we don't really understand how this condition should be set.
In 1D, where only 2 patches can share the same node, we could say that the lower rank is always overwritten by the larger rank.
But in 2D it seems such a condition would end up being a race condition, since a node can be shared by 3 or 4 patches, and the assignment would depend on the order in which overlaps are processed.

Is there some example or general advice as to how to set the "priority between patches" as the doc refers to?

@nselliott
Collaborator

Your observations here make sense. My suggestion would be to use the method setDeterministicUnpackOrderingFlag() which exists in both RefineSchedule and CoarsenSchedule. When set to true this causes the processing of incoming data to happen in a deterministic sequence. Combined with a priority implementation in your BoxGeometry override class that gives priority to the data from the higher rank, this should cause a single value to end up on the shared nodes.

@nicolasaunai
Author

Thanks for your quick response.
We have tried your suggestion. We may not have understood it fully, because it does not solve the problem entirely.
See the following image:

[image: patch layout]

This is a simple patch layout for a 1-level-only simulation. Each rectangle represents a patch, the run uses 10 MPI processes, and the global ID (rank#localID) of the patch box is written in each patch.

We have 5 ghost nodes in each direction.

If you look more specifically at patches p0#1 and p1#4: these patches share border nodes that are part of their respective interiors. Using one schedule that sets overwrite_interior to true only when source globalID > dest globalID makes this border strictly equal on both cores, as you suggested.

However, this border line extends on 5 ghost nodes on the p0#3 / p1#6 border. After the schedule is applied, these 5 ghosts of p0#1 and p1#4 still mismatch slightly. Our interpretation is that they do because they are processed by the schedule at a point where p0#3 / p1#6 have not yet exchanged/overwritten their own domain border nodes.

As a result, we do not see how your suggestion alone can fix the mismatch for both the shared border nodes and the ghost nodes overlapping other domain border nodes; in our opinion, a single schedule cannot do it.

So what we did is:

  • apply a first schedule for which the overwrite_interior flag is set to true only when source globalID > dest globalID AND the interior of the source (border excluded) is removed from the overlap;

  • apply a second schedule where overwrite_interior is unconditionally set to false, so that only the 5 ghost nodes overlapping the source interior (border excluded) are set.

The first schedule makes sure that, once done, there is no domain-border mismatch. The second schedule updates ghost nodes, which now get the same value on borders.

We were wondering whether you would see a cleaner/simpler way to do this.

Also, we have done this by implementing a custom VariableFillPattern which is pretty much a copy-paste of BoxGeometryVariableFillPattern, except that the calculateOverlap method is our own. We assumed this function is not used to calculate the overlaps needed for refining data between levels, and that this is done exclusively by the computeFillBoxesOverlap method, which we have left untouched from BoxGeometryVariableFillPattern. It seems OK, but we would feel better with your confirmation.

@nselliott
Collaborator

> However, this border line extends on 5 ghost nodes on the p0#3 / p1#6 border. After the schedule is applied, these 5 ghosts of p0#1 and p1#4 still mismatch slightly. Our interpretation is that they do because they are processed by the schedule at a point where p0#3 / p1#6 have not yet exchanged/overwritten their own domain border nodes.

I can see how this is possible for the nodes specifically on p0#3 / p1#6 border.

> As a result, we do not see how simply following your suggestion can fix the mismatch for both shared border nodes and ghost border nodes overlapping other domain border nodes, probably because, in our opinion, applying only 1 schedule cannot make it?
>
> So what we did is:
>
> * apply a first schedule for which the overwrite_interior flag is set to true only when source globalID > dest globalID AND the interior of the source (border excluded) is removed from the overlap;
>
> * apply a second schedule, where overwrite_interior is unconditionally set to false, so that only the 5 ghost nodes overlapping the source interior (border excluded) are set.
>
> The first schedule makes sure that once done, there is no domain border mismatch. The second schedule updates ghost nodes which will now get the same value on borders.
>
> We were wondering whether you would see a cleaner/simpler way to do this.

I think you have a reasonable approach. Other applications I have worked with have done something like this to separate the operations on patch boundaries from the operations in the ghost regions. One thing I can suggest is that you could use PatchLevelInteriorFillPattern for your first schedule, so that it only exchanges data on the patch boundaries. Since your second schedule writes into all of the ghosts, you don't need the first schedule to duplicate that.

> Also, we have done this by implementing a custom VariableFillPattern which is pretty much a copy-paste of BoxGeometryVariableFillPattern, except the calculateOverlap method is our own. We assumed this function was not used to calculate overlaps needed for refining data between levels and that was exclusively done by the computeFillBoxesOverlap method, which we have left untouched from BoxGeometryVariableFillPattern. It seems OK, but we would feel better with your confirmation.

This is correct; the calculateOverlap method is for overlaps within the same level of resolution.

@nicolasaunai
Author

thanks

@PhilipDeegan
Contributor

We have decided to no longer group multiple components with potentially disparate geometries, because we have seen some discrepancies with refinement schedules.

Notably, this interface:
https://github.com/LLNL/SAMRAI/blob/master/source/SAMRAI/xfer/RefineAlgorithm.C#L327

A bit of debugging of the "equivalence classes" during registration of the schedule shows some "true" comparisons. I can't say whether that has anything to do with the discrepancy we see, but I find it a bit odd that the fill pattern we are using is ignored and replaced with a new BoxGeometryVariableFillPattern:
https://github.com/LLNL/SAMRAI/blob/master/source/SAMRAI/xfer/RefineSchedule.C#L995

The overlaps we receive from these operations are correct for the first registered item, but not for subsequent ones, as they have unique geometries potentially different from the first's.

If you would like to see a reproduction of this issue just let me know and I'll put something together.
