
destroy pool fails with seg fault for multi-block #788

Open
mark-petersen opened this issue Jan 13, 2021 · 5 comments

mark-petersen (Contributor) commented Jan 13, 2021

If we run with two blocks per core in RK4, the pool destroy fails at the end of the first time step with a segmentation fault. Traceback:

deallocate(f_cursor)
deallocate(dptr % r2 % array, stat=local_err)
call mpas_pool_destroy_pool(provisStatePool)

This was introduced in #578.

We can invoke this error in the shallow water core, for example, by setting config_number_of_blocks = 36 but running on 18 cores.
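
For illustration, a minimal sketch of that setup follows; this assumes config_number_of_blocks is read from the &decomposition namelist group (check the core's Registry for the exact group name):

&decomposition
    config_number_of_blocks = 36
/

Launching on 18 MPI ranks then gives each rank two blocks.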

mark-petersen (Contributor, Author) commented:

@amametjanov and @mgduda, I think this problem was introduced inadvertently; looking through the comments in #578, I found no discussion or testing of multiple blocks.

@mgduda, if you have a little time to triage, it would help to know whether this multi-block failure is fundamental to the design or just an oversight.

mark-petersen (Contributor, Author) commented:

@philipwjones, FYI: it turns out multiple blocks per core are probably important for local time stepping. @gcapodag and I have been experimenting with load balancing for local time stepping, and giving each core one low-resolution block and one high-resolution block is probably the best strategy. That is how we found this bug.

mgduda (Contributor) commented Jan 13, 2021

I've been able to reproduce this issue in the SW core. If I'm not mistaken, the problem is that when we destroy a pool in one block, the storage for all blocks of the fields in that pool is deallocated; then, when destroying the same pool in the next block, we attempt to deallocate those same field blocks a second time.

I'll give this some thought to see if there's a clean solution. The core problem is that if we have two pointers, say A and B, to the same memory and we deallocate that memory through one of them (A), we have no way to know that B now points to memory that is no longer allocated.

Here's a demonstration of the issue:

program ptrfoo

    real, pointer :: a, b

    allocate(a)      ! a is associated with fresh storage
    b => a           ! b now aliases the same storage

    deallocate(a)    ! frees the storage; b is left dangling
    deallocate(b)    ! error: b's target was already deallocated

    stop

end program ptrfoo
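
One conventional workaround in plain Fortran (illustrative only; owner and alias are hypothetical names, not pool code) is to designate a single owning pointer and have every alias merely drop its association:

program ptr_owner

    real, pointer :: owner, alias

    allocate(owner)     ! the only owning reference
    alias => owner      ! alias shares the same target

    nullify(alias)      ! aliases drop their association without freeing
    deallocate(owner)   ! the owner deallocates exactly once

end program ptr_owner

The difficulty in the pool code is that each block's copy of the pool is effectively an alias, with no record of which copy owns the storage.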

mark-petersen (Contributor, Author) commented:

@mgduda thanks for your input on this. I had thought the same thing, but the allocation is within a block loop:

block => domain % blocklist
do while (associated(block))
    call mpas_pool_get_subpool(block % structs, 'mesh', meshPool)
    call mpas_pool_get_subpool(block % structs, 'state', statePool)
    allocate(provisStatePool)

so we allocate once per block. Wouldn't it make sense to also deallocate once per block? Or are you saying that a second allocate on a pointer behaves like b => a? I'm used to allocate statements for allocatable arrays, not pointers.
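
For reference, a second allocate on a pointer does not behave like b => a: each allocate associates the pointer with fresh storage. A minimal sketch (variable names are illustrative):

program ptr_alloc

    real, pointer :: a, b

    allocate(a)       ! a -> new storage, call it T1
    a = 1.0

    b => a            ! b aliases T1
    allocate(b)       ! legal for pointers: b -> fresh storage T2; a still -> T1
    b = 2.0

    print *, a, b    ! prints 1.0 and 2.0: two distinct targets

end program ptr_alloc

So the per-block allocate(provisStatePool) does create a distinct pool header for each block; the aliasing mgduda describes would have to come from pointers stored inside the pools.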

philipwjones (Contributor) commented:

@mark-petersen, if we really do need this functionality, we're going to need to implement it differently from the current linked-list method, which is incompatible with what we are doing on the GPU. Even just exposing the arrays with the extra block index at the end would be preferable.
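
A rough sketch of that layout (nCells, nBlocks, and field below are illustrative names, not MPAS identifiers): one contiguous allocation with a trailing block dimension replaces the per-block linked list, so there is exactly one allocate and one deallocate regardless of the number of blocks:

program block_index_layout

    integer, parameter :: nCells = 1000, nBlocks = 2
    real, allocatable :: field(:,:)   ! (cell, block): contiguous and GPU-friendly
    integer :: iBlock

    allocate(field(nCells, nBlocks))

    do iBlock = 1, nBlocks
        field(:, iBlock) = real(iBlock)   ! per-block work operates on a slice
    end do

    deallocate(field)   ! single deallocation; no aliasing across blocks

end program block_index_layout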
