Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wrap _local_size into atomic to avoid unwanted modifications while operator() is used with set() in petsc vector. #3809

Closed
wants to merge 2 commits into from

Conversation

grmnptr
Copy link
Contributor

@grmnptr grmnptr commented Mar 19, 2024

Closes #3808

@roystgnr
Copy link
Member

I'm scared of whether these are the only fixes we need, but if this fixes your MOOSE case then it's definitely a step in the right direction.

Could you add something like your MOOSE failure case to a new unit test in tests/numerics/numeric_vector_test.h? Threading bugs are annoyingly hard to reproduce and if there's a chance of regression then I'd hope to hit it as far upstream as possible. I don't see any good threaded unit tests to point you to for an example, but git grep Threads::parallel_for points to a dozen-odd use cases in the library itself.

@grmnptr
Copy link
Contributor Author

grmnptr commented Mar 19, 2024

I can confirm that this change fixes the issue, based on 100 runs with 10 threads. The issue usually pooped up in 5-10 runs with this many threads.

@jwpeterson
Copy link
Member

I'm curious how you figured out that this was the fix, for example did you use a debugging tool that suggested it? Also, you made _first and _last std::atomic but we don't .load() or store them anywhere?

Finally, general question: is this (declaring a variable atomic) something that needs to be done differently depending on if TBB is used vs. std::threads?

@grmnptr
Copy link
Contributor Author

grmnptr commented Mar 19, 2024

To be honest I am not a threading expert, but I'll try to answer your questions as accurately as I can.
I got the following error from libmesh in debug mode:

libMesh terminating:
Assertion `local_index < _local_size' failed.
local_index = 24
_local_size = 0

The debugger pointed to the operator() line discussed in #3808 .
In set() , _local_size is set to 0 (through restore_array) and that can be the source of the problem. To be honest, @roystgnr pointed these out and I just followed his suggestions.

The reason I used load() is because make_range() has a T range(const T & range) underneath and this constructor has been deleted for atomic-wrapped quantities.

I have no idea if this would change based on different threading libraries, I was wrapping the member variables in atomic based on the examples on the c++ standard website so I hope oneTBB respects that.

I am also open to other suggestions for assigning a value to an already existing petsc vector that doesn't rely on other operations which manipulate the size.

@grmnptr
Copy link
Contributor Author

grmnptr commented Mar 19, 2024

I created a test for this and it exhibits out of memory access in the petsc vector. It is not deterministic, so I think this change did not solve all the problems. The new error is coming from

template <typename T>
inline
T PetscVector<T>::operator() (const numeric_index_type i) const
{
  this->_get_array(true);

  const numeric_index_type local_index = this->map_global_to_local_index(i);

#ifndef NDEBUG
  if (this->type() == GHOSTED)
    {
      libmesh_assert_less (local_index, _local_size);
    }
#endif

  return static_cast<T>(_read_only_values[local_index]); <<<< This line
}

with

(gdb) p _read_only_values
$1 = (const PetscScalar *) 0x0

In restore array, this gets set to nullptr too. I am not sure I can pull an atomic here (I am fairly certain I cant considering that this member is used with petsc functions directly). I'll push the test just in case.

@roystgnr
Copy link
Member

a debugging tool that suggested it

I prefer the phrase "a debugging assistant", thank you very much!

But seriously, this whole section of code terrifies me; I don't understand the C++ memory-order standards well at all. I tried to figure out all the variables that might get caught out by a read-only get-array transitioning to a read-write one, but aside from the miss we've already caught, I'm not even 100% certain std::atomic is a guaranteed fix; one reason I was begging for a unit test was so we'd get a chance to see if OSX and Linux behave differently.

I'm not yet sure what to do about _read_only_values.

@lindsayad
Copy link
Member

lindsayad commented Mar 20, 2024

Sigh, we'll have to dig into the double checked locking pattern again. With this PR, I still get these errors out of thread sanitizer on the MOOSE tests (using grmnptr/moose@63f614deae)

  Read of size 8 at 0x7b6c00009a90 by thread T25 (mutexes: write M0):
    #0 PetscObjectStateGet /home/lindad/projects/thread-debugging/petsc/src/sys/objects/state.c:35 (libpetsc.so.3.20+0x2202aa) (BuildId: d25bcc0da88dba4779b6e3da251e6ef6ff8744b7)                                                                   
    #1 VecGhostStateSync_Private /home/lindad/projects/thread-debugging/petsc/src/vec/vec/impls/mpi/commonmpvec.c:22 (libpetsc.so.3.20+0x47c0ae) (BuildId: d25bcc0da88dba4779b6e3da251e6ef6ff8744b7)                                                 
    #2 VecGhostGetLocalForm /home/lindad/projects/thread-debugging/petsc/src/vec/vec/impls/mpi/commonmpvec.c:86 (libpetsc.so.3.20+0x47c32b) (BuildId: d25bcc0da88dba4779b6e3da251e6ef6ff8744b7)                                                      
    #3 libMesh::PetscVector<double>::_get_array(bool) const ../src/numerics/petsc_vector.C:1310 (libmesh_devel.so.0+0x175c79b) (BuildId: cf69db45820d76391fe47cf14d2e8de17bd59c2e)                                                                   
    #4 libMesh::PetscVector<double>::operator()(unsigned long) const /home/lindad/projects/thread-debugging/scripts/../libmesh/installed/include/libmesh/petsc_vector.h:1117 (libmoose-devel.so.0+0x17bb825) (BuildId: 4340c78b15653aafb14ee0b9de7d8\
cf4dcc67937)                                                                                                                                                                                                                                         
    #5 ComputeLinearFVGreenGaussGradientVolumeThread::operator()(libMesh::StoredRange<MooseMesh::const_elem_info_iterator, ElemInfo const*> const&) /home/lindad/projects/thread-debugging/framework/src/loops/ComputeLinearFVGreenGaussGradientVolu\
meThread.C:68 (libmoose-devel.so.0+0x28599bc) (BuildId: 4340c78b15653aafb14ee0b9de7d8cf4dcc67937)                                                                                                                                                    
    #6 void* libMesh::Threads::run_body<libMesh::StoredRange<MooseMesh::const_elem_info_iterator, ElemInfo const*>, ComputeLinearFVGreenGaussGradientVolumeThread>(void*) /home/lindad/projects/thread-debugging/scripts/../libmesh/installed/includ\
e/libmesh/threads_pthread.h:235 (libmoose-devel.so.0+0x3833526) (BuildId: 4340c78b15653aafb14ee0b9de7d8cf4dcc67937)                                                                                                                                  

  Previous write of size 8 at 0x7b6c00009a90 by thread T24 (mutexes: write M1):
    #0 VecRestoreArray /home/lindad/projects/thread-debugging/petsc/src/vec/vec/interface/rvector.c:2015 (libpetsc.so.3.20+0x4a2f49) (BuildId: d25bcc0da88dba4779b6e3da251e6ef6ff8744b7)                                                             
    #1 VecSetValues_MPI /home/lindad/projects/thread-debugging/petsc/src/vec/vec/impls/mpi/pdvec.c:862 (libpetsc.so.3.20+0x48f1c5) (BuildId: d25bcc0da88dba4779b6e3da251e6ef6ff8744b7)                                                               
    #2 VecSetValues /home/lindad/projects/thread-debugging/petsc/src/vec/vec/interface/rvector.c:911 (libpetsc.so.3.20+0x49eca2) (BuildId: d25bcc0da88dba4779b6e3da251e6ef6ff8744b7)                                                                 
    #3 libMesh::PetscVector<double>::set(unsigned long, double) ../src/numerics/petsc_vector.C:159 (libmesh_devel.so.0+0x175b612) (BuildId: cf69db45820d76391fe47cf14d2e8de17bd59c2e)                                                                
    #4 ComputeLinearFVGreenGaussGradientVolumeThread::operator()(libMesh::StoredRange<MooseMesh::const_elem_info_iterator, ElemInfo const*> const&) /home/lindad/projects/thread-debugging/framework/src/loops/ComputeLinearFVGreenGaussGradientVolu\
meThread.C:69 (libmoose-devel.so.0+0x2859a0f) (BuildId: 4340c78b15653aafb14ee0b9de7d8cf4dcc67937)                                                                                                                                                    
    #5 void* libMesh::Threads::run_body<libMesh::StoredRange<MooseMesh::const_elem_info_iterator, ElemInfo const*>, ComputeLinearFVGreenGaussGradientVolumeThread>(void*) /home/lindad/projects/thread-debugging/scripts/../libmesh/installed/includ\
e/libmesh/threads_pthread.h:235 (libmoose-devel.so.0+0x3833526) (BuildId: 4340c78b15653aafb14ee0b9de7d8cf4dcc67937)

@grmnptr
Copy link
Contributor Author

grmnptr commented Mar 20, 2024

Thanks for running the sanitizer on the PR. I disabled threading on that new capability for the time being.

@grmnptr
Copy link
Contributor Author

grmnptr commented Mar 21, 2024

I'll close this PR for now. I don't think I'll dive deeper, threading is not yet a requirement for the capability I am working on.

@grmnptr grmnptr closed this Mar 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Operator () of a petsc vector is not safe if used together with set() in a threaded loop.
4 participants