
Td mpi #1274

Open · thomas-niehaus wants to merge 24 commits into main

Conversation

@thomas-niehaus (Contributor) commented Jul 6, 2023

First attempt at an MPI version of Casida. It currently supports the Arpack solver and can do spectra for singlets/triplets/temperature/windows/spin polarized/onsite corrected cases. It is based on the MPI version (not BLACS) of arpack-ng.
To build, first compile arpack-ng (works out of the box) and then pass the path to the library via CMAKE_PREFIX_PATH.

  • The Arpack section of the code is fully parallel, not only the matrix-vector product.
  • The code seems to scale for a couple of processors. I could not test with nProcs > 6, because our HPC center doesn't have the compilers dftb+ requires ;-) UPDATE: the code now compiles on the HPC with ifort 2021, but spin-polarized calculations crash. An issue with the MPI communicators?
  • Since I use the Arpack MPI library, I make no use of the new BLACS excVectorGrid feature. Distribution is done in the routine localSizeCasidaVectors (see the sketch at the end of this comment).
  • The bare routine transq can no longer be called under MPI (it is simply too costly to evaluate with global indices). Consequence: CachedCharges is mandatory for MPI. The routine calcTransitionDipoles was using transq, so TTransCharges_init now has to be called earlier, which makes a second call to TTransCharges_init necessary when windowing is used.
  • Please have a look at how the code could be made more elegant; to my taste there are too many preprocessor directives, for example in linrespgrad.F90 line 639. This is the first time I have written MPI code, so my approach might be cumbersome/laughable.
  • I have started to parallelize rangeSep, which should pose no problems. It needs more memory though: a global nAtoms x nOcc x nVir array. Since the ground state is currently not MPI parallel, I will not continue with this right now.
  • Can you have a look at transcharges.F90 line 182?
  • getAtomicGammaMatrixBlacs is maybe not what you want.
  • local2GlobalBlacsArray should maybe go to a different module.
  • The tests contain some MPI cases.

Let me know what you think.....
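For reference, here is a minimal sketch of the kind of contiguous block distribution of the RPA/Casida vector that localSizeCasidaVectors provides. The routine name, argument names and exact splitting rule are illustrative only, not the actual implementation:

```fortran
!> Illustrative only: splits the nMatrix elements of the RPA/Casida vector into
!> contiguous blocks over the MPI ranks (hypothetical names, not the code in
!> localSizeCasidaVectors itself).
subroutine getLocalBlock(nMatrix, nProcs, iProc, locSize, iGlobStart)
  implicit none
  integer, intent(in) :: nMatrix     !> global length of the RPA vector
  integer, intent(in) :: nProcs      !> number of MPI processes
  integer, intent(in) :: iProc       !> rank of this process (0-based)
  integer, intent(out) :: locSize    !> number of elements held locally
  integer, intent(out) :: iGlobStart !> global index of the first local element

  integer :: base, rest

  base = nMatrix / nProcs
  rest = mod(nMatrix, nProcs)
  if (iProc < rest) then
    ! The first 'rest' ranks carry one extra element each
    locSize = base + 1
    iGlobStart = iProc * (base + 1) + 1
  else
    locSize = base
    iGlobStart = rest * (base + 1) + (iProc - rest) * base + 1
  end if
end subroutine getLocalBlock
```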

@bhourahine (Member)

Looks like the arpack-ng finder isn't working with a CMAKE_MODULE_PATH pointing to the location of arpackng-config.cmake. Probably should also cope with old style parpack install, as that's more likely to be in the usual deb/rpm installs.

@thomas-niehaus (Contributor, Author)

> Looks like the arpack-ng finder isn't working with a CMAKE_MODULE_PATH pointing to the location of arpackng-config.cmake. Probably should also cope with old style parpack install, as that's more likely to be in the usual deb/rpm installs.

But you can compile on dpdevel? Works for me using env_dftb+/gcc12-openmpi. I get yet another error (compared to my HPC center) with the NO example. Will look into this.

@bhourahine (Member)

> But you can compile on dpdevel? Works for me using env_dftb+/gcc12-openmpi. I get yet another error (compared to my HPC center) with the NO example. Will look into this.

Exactly which CMake options are you using? With that environment on buildbot, I get undefined references to pdsaupd_.

@thomas-niehaus (Contributor, Author)

Hi Ben,
I compiled ARPACK on dpdevel. You need to set
CMAKE_PREFIX_PATH="/home/niehaus/lib:${CMAKE_PREFIX_PATH}"
and then compile as usual.

@thomas-niehaus (Contributor, Author)

Now I also have a test with more CPUs. Looks reasonable, I think.
plot.pdf

@bhourahine (Member) commented Feb 5, 2024 via email

@thomas-niehaus (Contributor, Author)

But you had some problems with missing symbols, didn't you? I don't see arpack in the module list, but I vaguely remember that Balint had arpack installed.

@bhourahine (Member) commented Feb 5, 2024 via email

@bhourahine (Member) left a comment

The biggest issue I'm noticing is that the CMake export for arpack/parpack is now needed, so getting this onto systems that use packaged installs probably requires building them from source.

@:ASSERT(this%elstatType == elstatTypes%gammaFunc)

gammamat(:,:) = 0.0_dp
do ii = 1, this%coulomb%descInvRMat_(M_)
@bhourahine (Member)

Is this copying the distributed matrix to a local matrix? The current structure behaves like a serial part of the code, as every process loops over the global matrix size. The gemr2d routine might be faster; alternatively, loop over the local size of invRMat and use scalafx_indxl2g to find the global location (see calls in coulomb.F90).
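A minimal sketch of the suggested local-loop alternative, with hypothetical routine and array names. It uses the raw ScaLAPACK tool functions numroc/indxl2g (which dftb+ wraps, e.g. via scalafx_indxl2g); this is just an illustration of the pattern, not code from the PR:

```fortran
!> Illustrative sketch only (hypothetical names): build a replicated copy of a
!> block-cyclically distributed matrix by looping over the local storage,
!> mapping local -> global indices, and summing the per-process pieces.
module globalcopy_sketch
  use mpi
  implicit none
  integer, parameter :: dp = kind(1.0d0)

contains

  subroutine makeGlobalCopy(locMat, glbMat, desc, myRow, myCol, nProw, nPcol, comm)
    real(dp), intent(in) :: locMat(:,:)   !> local block of the distributed matrix
    real(dp), intent(out) :: glbMat(:,:)  !> full matrix, identical on all ranks on return
    integer, intent(in) :: desc(9)        !> BLACS array descriptor of the distributed matrix
    integer, intent(in) :: myRow, myCol   !> coordinates of this process in the BLACS grid
    integer, intent(in) :: nProw, nPcol   !> dimensions of the BLACS grid
    integer, intent(in) :: comm           !> MPI communicator spanning the grid

    integer :: iLoc, jLoc, iGlb, jGlb, nLocRow, nLocCol, ierr
    integer, external :: numroc, indxl2g

    glbMat(:,:) = 0.0_dp
    ! Local extents of the distributed matrix; desc(3:8) = M, N, MB, NB, RSRC, CSRC
    nLocRow = numroc(desc(3), desc(5), myRow, desc(7), nProw)
    nLocCol = numroc(desc(4), desc(6), myCol, desc(8), nPcol)
    do jLoc = 1, nLocCol
      jGlb = indxl2g(jLoc, desc(6), myCol, desc(8), nPcol)
      do iLoc = 1, nLocRow
        iGlb = indxl2g(iLoc, desc(5), myRow, desc(7), nProw)
        glbMat(iGlb, jGlb) = locMat(iLoc, jLoc)
      end do
    end do
    ! Every global element is owned by exactly one process, so a sum reduction
    ! assembles the full matrix on all ranks.
    call mpi_allreduce(MPI_IN_PLACE, glbMat, size(glbMat), MPI_DOUBLE_PRECISION,&
        & MPI_SUM, comm, ierr)
  end subroutine makeGlobalCopy

end module globalcopy_sketch
```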

@thomas-niehaus (Contributor, Author)

Yes, this is to get a global copy of Gamma. The partitioning of RPA vectors is not done with PBLAS, hence there is no point in distributing gamma. Can you come up with a nicer way to build it? I tried a lot of different things. This solution worked at least.

@bhourahine (Member)

Any reason to drop the finder? This could cope with both arpack and arpackng.

@thomas-niehaus (Contributor, Author)

Maybe Balint can comment?

@bhourahine (Member)

@aradi any views?

Outdated review threads (resolved) on:
  • src/dftbp/timedep/linrespcommon.F90 (3)
  • src/dftbp/timedep/transcharges.F90 (1)
  • src/dftbp/timedep/linrespgrad.F90 (4)
@thomas-niehaus self-assigned this Feb 8, 2024
@bhourahine (Member)

@thomas-niehaus Running on buildbot with 2 MPI procs fails for several tests (block sizes are too small), but timedep/C66O10N4H44_Ewindow also fails on the exc_oscillator strength values. I used the intel22 MPI with your arpack-ng libraries.

@bhourahine (Member)

@thomas-niehaus Something got broken for both the xlbomd and derivatives tests, which appears for 2 processors:
cmake -DTEST_MPI_PROCS=2 ..; ctest -R deriv && ctest -R xlbomd

@bhourahine (Member)

> @thomas-niehaus Something got broken for both the xlbomd and derivatives tests, which appears for 2 processors: cmake -DTEST_MPI_PROCS=2 ..; ctest -R deriv && ctest -R xlbomd

Might be specific to one test machine, I'll get back to you about this.

Commits added to the branch:
  • Parallelism for the short gamma contribution
  • Add timers for some of linear response process
  • Move BLACS matrix reorder routine and generalize
  • Enable MPI Stratmann build without ARPACK solver

@bhourahine (Member)

@thomas-niehaus Branch changes now merged in. I noticed that dipole intensities for propadiene_OscWindow-Stratmann are OK for 1 processor, but differ for 2.

@thomas-niehaus (Contributor, Author) commented Feb 16, 2024

  • The bug for propadiene_OscWindow-Stratmann is related to the Blacs BlockSize setting. The result depends on the block size that I choose, which should not be the case. This example has few atoms and the default block size returns a parser error, so I chose BlockSize = 1 such that the code runs. Any ideas why this could lead to errors?
  • Another question: none of the tests has the condition MPI_PROCS > 1. Is there a reason why? It seems this will miss a lot of bugs, since most people (including me) keep the default in config.cmake.
  • I cannot reproduce the xlbomd bug on my machine.

@bhourahine (Member)

@thomas-niehaus the N2_onsite test (and a few surrounding ones) doesn't have a trap for the number of processors. That specific one fails with 16 processors (probably because by then some matrix isn't distributed properly). Can you add an MPI_PROCS <= 16 condition or similar to all of timedep/2CH3-Temp, timedep/2CH3-Triplet-Temp, timedep/NO, timedep/NO_onsite, timedep/N2_onsite, timedep/propadiene_OscWindow, and have a look at why N2_onsite is the only one failing on 16 procs?

Note: of these, only N2_onsite fails on 16 procs; the others still run.
@bhourahine (Member)

@thomas-niehaus I've just checked in a limit to the tests to <=8 procs and will see if I can find out why the N2_onsite was different.

@bhourahine (Member)

> @thomas-niehaus I've just checked in a limit to the tests to <=8 procs and will see if I can find out why the N2_onsite was different.

The error is inside the call to pdsaupd, so probably not much we can do.

@thomas-niehaus marked this pull request as ready for review February 23, 2024 09:35
@thomas-niehaus (Contributor, Author)

All tests (serial/parallel) pass locally; the buildbot errors seem to be due to the CMake arpack-ng issue. @balint: could you have a look at it? Otherwise the PR seems to be ready.

@thomas-niehaus (Contributor, Author)

One final thing. If I compile the serial version with ifort and all checks on, I get a runtime error "Attempt to fetch from allocatable variable LOCSIZE when it is not allocated". In the serial version the locSize array is never allocated, used or touched. It is just passed (unallocated) to buildAndDiagExcMatrixArpack as integer, intent(in) :: locSize(:). This is sufficient to throw the error. Any ideas why this happens?

@bhourahine (Member)

@thomas-niehaus That's a missing allocatable attribute in the variable declaration of the affected subroutine; more recent Intel versions test for this. I'll push changes to the affected variables in a couple of minutes.
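For what it's worth, a minimal sketch of the pattern being described (hypothetical, shortened names): passing an unallocated array to a plain assumed-shape dummy such as integer, intent(in) :: locSize(:) is non-conforming, which is what the Intel runtime check reports; giving the dummy the allocatable attribute makes the unallocated serial case legal and testable.

```fortran
!> Illustrative only; buildAndDiagSketch is a hypothetical stand-in for
!> buildAndDiagExcMatrixArpack. With the allocatable attribute on the dummy
!> argument, the serial code path may pass locSize without ever allocating it.
module locsize_sketch
  implicit none

contains

  subroutine buildAndDiagSketch(locSize)
    integer, allocatable, intent(in) :: locSize(:)

    if (allocated(locSize)) then
      ! MPI case: the per-process vector sizes are available here
    else
      ! Serial case: locSize was never needed; passing it unallocated is now legal
    end if
  end subroutine buildAndDiagSketch

end module locsize_sketch
```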

@bhourahine (Member)

> @thomas-niehaus Something got broken for both the xlbomd and derivatives tests, which appears for 2 processors: cmake -DTEST_MPI_PROCS=2 ..; ctest -R deriv && ctest -R xlbomd

> Might be specific to one test machine, I'll get back to you about this.

Yes, machine specific, so not a problem for your changes (we will need to investigate why the main branch was failing for these tests).

@bhourahine added the enhancement and MPI (Relates to distributed parallel) labels Mar 12, 2024
@bhourahine added this to the 24.2 milestone Mar 12, 2024
Labels: enhancement, MPI (Relates to distributed parallel)
3 participants