
Td mpi #1274

Open · thomas-niehaus wants to merge 24 commits into main

Conversation

@thomas-niehaus (Contributor) commented Jul 6, 2023

First attempt at an MPI version of Casida. It currently supports the Arpack solver and can do spectra for singlets/triplets/temperature/windows/spin polarized/onsite corrected cases. It is based on the MPI version (not BLACS) of arpack-ng.
To build, first compile arpack-ng (works out of the box) and then pass the path to the library via CMAKE_PREFIX_PATH.

  • The Arpack section of the code is fully parallel, not only the matrix-vector product.
  • The code seems to scale for a couple of processors. I could not test with nProcs > 6, because our HPC center doesn't have the compilers dftb+ requires ;-) UPDATE: the code now compiles on the HPC with ifort 2021, but spin-polarized calculations crash. An issue with the MPI communicators?
  • Since I use the Arpack MPI library, I make no use of the new BLACS excVectorGrid feature. Distribution is done in the routine localSizeCasidaVectors (see the sketch at the end of this comment).
  • The bare routine transq can no longer be called under MPI (it is simply too costly to evaluate with global indices). Consequence: CachedCharges is mandatory for MPI. The routine calcTransitionDipoles was using transq, so TTransCharges_init now has to be called earlier, which makes a second call to TTransCharges_init necessary when windowing is used.
  • Please have a look at how the code could be made more elegant; to my taste there are too many preprocessor directives, for example in linrespgrad.F90 line 639. This is the first time I have written MPI code, so my approach might be cumbersome/laughable.
  • I have started to parallelize rangeSep, which should pose no problems. It needs more memory though: a global nAtoms x nOcc x nVir array. Since the ground state is currently not MPI parallel, I will not continue with this right now.
  • Can you have a look at transcharges.F90 line 182?
  • getAtomicGammaMatrixBlacs is maybe not what you want.
  • local2GlobalBlacsArray should maybe go to a different module.
  • The tests contain some MPI cases.

Let me know what you think.....
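For reference, here is a minimal sketch of the kind of contiguous block distribution of the RPA/Casida vector that localSizeCasidaVectors provides. The routine name, argument names and exact splitting rule are illustrative only, not the actual implementation:

```fortran
!> Illustrative only: splits the nMatrix elements of the RPA/Casida vector into
!> contiguous blocks over the MPI ranks (hypothetical names, not the code in
!> localSizeCasidaVectors itself).
subroutine getLocalBlock(nMatrix, nProcs, iProc, locSize, iGlobStart)
  implicit none
  integer, intent(in) :: nMatrix     !> global length of the RPA vector
  integer, intent(in) :: nProcs      !> number of MPI processes
  integer, intent(in) :: iProc       !> rank of this process (0-based)
  integer, intent(out) :: locSize    !> number of elements held locally
  integer, intent(out) :: iGlobStart !> global index of the first local element

  integer :: base, rest

  base = nMatrix / nProcs
  rest = mod(nMatrix, nProcs)
  if (iProc < rest) then
    ! The first 'rest' ranks carry one extra element each
    locSize = base + 1
    iGlobStart = iProc * (base + 1) + 1
  else
    locSize = base
    iGlobStart = rest * (base + 1) + (iProc - rest) * base + 1
  end if
end subroutine getLocalBlock
```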

@bhourahine (Member)

Looks like the arpack-ng finder isn't working with a CMAKE_MODULE_PATH pointing to the location of arpackng-config.cmake. Probably should also cope with old style parpack install, as that's more likely to be in the usual deb/rpm installs.

@thomas-niehaus (Contributor, Author)

> Looks like the arpack-ng finder isn't working with a CMAKE_MODULE_PATH pointing to the location of arpackng-config.cmake. Probably should also cope with old style parpack install, as that's more likely to be in the usual deb/rpm installs.

But you can compile on dpdevel? Works for me using env_dftb+/gcc12-openmpi. I get yet another error (compared to my HPC center) with the NO example. Will look into this.

@bhourahine (Member)

> But you can compile on dpdevel? Works for me using env_dftb+/gcc12-openmpi. I get yet another error (compared to my HPC center) with the NO example. Will look into this.

Exactly which CMake options are you using? With that environment on buildbot, I get undefined references to pdsaupd_.

@thomas-niehaus (Contributor, Author)

Hi Ben,
I compiled ARPACK on dpdevel. You need to set
CMAKE_PREFIX_PATH="/home/niehaus/lib:${CMAKE_PREFIX_PATH}"
and then compile as usual.

@thomas-niehaus (Contributor, Author)

Now I also have a test with more CPUs. Looks reasonable, I think.
plot.pdf

@bhourahine (Member) commented Feb 5, 2024 via email

@thomas-niehaus (Contributor, Author)

But you had some problems with missing symbols, didn't you? I don't see arpack in the module list, but I vaguely remember that Balint had arpack installed.

@bhourahine (Member) commented Feb 5, 2024 via email

@bhourahine (Member) left a comment

The biggest issue I'm noticing is that the CMake export for arpack/parpack is now needed, so getting this onto systems that use packaged installs probably requires building them from source.

@:ASSERT(this%elstatType == elstatTypes%gammaFunc)

gammamat(:,:) = 0.0_dp
do ii = 1, this%coulomb%descInvRMat_(M_)
@bhourahine (Member)

Is this copying the distributed matrix to a local matrix? The current structure behaves like a serial part of the code, as every process loops over the global matrix size. The gemr2d routine might be faster; alternatively, loop over the local size of invRMat and use scalafx_indxl2g to find the global location (see calls in coulomb.F90).
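A minimal sketch of the suggested local-loop alternative, with hypothetical routine and array names. It uses the raw ScaLAPACK tool functions numroc/indxl2g (which dftb+ wraps, e.g. via scalafx_indxl2g); this is just an illustration of the pattern, not code from the PR:

```fortran
!> Illustrative sketch only (hypothetical names): build a replicated copy of a
!> block-cyclically distributed matrix by looping over the local storage,
!> mapping local -> global indices, and summing the per-process pieces.
module globalcopy_sketch
  use mpi
  implicit none
  integer, parameter :: dp = kind(1.0d0)

contains

  subroutine makeGlobalCopy(locMat, glbMat, desc, myRow, myCol, nProw, nPcol, comm)
    real(dp), intent(in) :: locMat(:,:)   !> local block of the distributed matrix
    real(dp), intent(out) :: glbMat(:,:)  !> full matrix, identical on all ranks on return
    integer, intent(in) :: desc(9)        !> BLACS array descriptor of the distributed matrix
    integer, intent(in) :: myRow, myCol   !> coordinates of this process in the BLACS grid
    integer, intent(in) :: nProw, nPcol   !> dimensions of the BLACS grid
    integer, intent(in) :: comm           !> MPI communicator spanning the grid

    integer :: iLoc, jLoc, iGlb, jGlb, nLocRow, nLocCol, ierr
    integer, external :: numroc, indxl2g

    glbMat(:,:) = 0.0_dp
    ! Local extents of the distributed matrix; desc(3:8) = M, N, MB, NB, RSRC, CSRC
    nLocRow = numroc(desc(3), desc(5), myRow, desc(7), nProw)
    nLocCol = numroc(desc(4), desc(6), myCol, desc(8), nPcol)
    do jLoc = 1, nLocCol
      jGlb = indxl2g(jLoc, desc(6), myCol, desc(8), nPcol)
      do iLoc = 1, nLocRow
        iGlb = indxl2g(iLoc, desc(5), myRow, desc(7), nProw)
        glbMat(iGlb, jGlb) = locMat(iLoc, jLoc)
      end do
    end do
    ! Every global element is owned by exactly one process, so a sum reduction
    ! assembles the full matrix on all ranks.
    call mpi_allreduce(MPI_IN_PLACE, glbMat, size(glbMat), MPI_DOUBLE_PRECISION,&
        & MPI_SUM, comm, ierr)
  end subroutine makeGlobalCopy

end module globalcopy_sketch
```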

@thomas-niehaus (Contributor, Author)

Yes, this is to get a global copy of Gamma. The partitioning of RPA vectors is not done with PBLAS, hence there is no point in distributing gamma. Can you come up with a nicer way to build it? I tried a lot of different things. This solution worked at least.

@bhourahine (Member)

Any reason to drop the finder? This could cope with both arpack and arpackng.

@thomas-niehaus (Contributor, Author)

Maybe Balint can comment?

@bhourahine (Member)

@aradi any views?

Outdated review threads (resolved) on:
  • src/dftbp/timedep/linrespcommon.F90 (3)
  • src/dftbp/timedep/transcharges.F90 (1)
  • src/dftbp/timedep/linrespgrad.F90 (4)
@thomas-niehaus self-assigned this Feb 8, 2024
@bhourahine (Member)

@thomas-niehaus Running on buildbot with 2 MPI procs fails for several tests (block sizes are too small), but timedep/C66O10N4H44_Ewindow also fails on the exc_oscillator strength values. I used the intel22 MPI with your arpack-ng libraries.

@bhourahine (Member)

@thomas-niehaus Something got broken for both the xlbomd and derivatives tests, which appears for 2 processors:
cmake -DTEST_MPI_PROCS=2 ..; ctest -R deriv && ctest -R xlbomd

@bhourahine (Member)

> @thomas-niehaus Something got broken for both the xlbomd and derivatives tests, which appears for 2 processors: cmake -DTEST_MPI_PROCS=2 ..; ctest -R deriv && ctest -R xlbomd

Might be specific to one test machine, I'll get back to you about this.

Commits added to the branch:
  • Parallelism for the short gamma contribution
  • Add timers for some of linear response process
  • Move BLACS matrix reorder routine and generalize
  • Enable MPI Stratmann build without ARPACK solver

@bhourahine (Member)

@thomas-niehaus Branch changes now merged in. I noticed that dipole intensities for propadiene_OscWindow-Stratmann are OK for 1 processor, but differ for 2.

@thomas-niehaus (Contributor, Author) commented Feb 16, 2024

  • The bug for propadiene_OscWindow-Stratmann is related to the Blacs BlockSize setting. The result depends on the block size that I choose, which should not be the case. This example has few atoms and the default block size returns a parser error, so I chose BlockSize = 1 such that the code runs. Any ideas why this could lead to errors?
  • Another question: none of the tests has the condition MPI_PROCS > 1. Is there a reason why? It seems this will miss a lot of bugs, since most people (including me) keep the default in config.cmake.
  • I cannot reproduce the xlbomd bug on my machine.

@bhourahine (Member)

@thomas-niehaus the N2_onsite test (and a few surrounding ones) doesn't have a trap for the number of processors. That specific one fails with 16 processors (probably because by then some matrix isn't distributed properly). Can you add an MPI_PROCS <= 16 condition or similar to all of timedep/2CH3-Temp, timedep/2CH3-Triplet-Temp, timedep/NO, timedep/NO_onsite, timedep/N2_onsite, timedep/propadiene_OscWindow, and have a look at why N2_onsite is the only one failing on 16 procs?

Note: of these, only N2_onsite fails on 16 procs; the others still run.
@bhourahine (Member)

@thomas-niehaus I've just checked in a limit to the tests to <=8 procs and will see if I can find out why the N2_onsite was different.

@bhourahine (Member)

> @thomas-niehaus I've just checked in a limit to the tests to <=8 procs and will see if I can find out why the N2_onsite was different.

The error is inside the call to pdsaupd, so probably not much we can do.

@thomas-niehaus marked this pull request as ready for review February 23, 2024 09:35
@thomas-niehaus (Contributor, Author)

All tests (serial/parallel) pass locally; the buildbot errors seem to be due to the CMake arpack-ng issue. @balint: could you have a look at it? Otherwise the PR seems to be ready.

@thomas-niehaus (Contributor, Author)

One final thing. If I compile the serial version with ifort and all checks on, I get a runtime error "Attempt to fetch from allocatable variable LOCSIZE when it is not allocated". In the serial version the locSize array is never allocated, used or touched. It is just passed (unallocated) to buildAndDiagExcMatrixArpack as integer, intent(in) :: locSize(:). This is sufficient to throw the error. Any ideas why this happens?

@bhourahine (Member)

@thomas-niehaus That's a missing allocatable attribute in the variable declaration of the affected subroutine; more recent Intel versions test for this. I'll push changes to the affected variables in a couple of minutes.
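For what it's worth, a minimal sketch of the pattern being described (hypothetical, shortened names): passing an unallocated array to a plain assumed-shape dummy such as integer, intent(in) :: locSize(:) is non-conforming, which is what the Intel runtime check reports; giving the dummy the allocatable attribute makes the unallocated serial case legal and testable.

```fortran
!> Illustrative only; buildAndDiagSketch is a hypothetical stand-in for
!> buildAndDiagExcMatrixArpack. With the allocatable attribute on the dummy
!> argument, the serial code path may pass locSize without ever allocating it.
module locsize_sketch
  implicit none

contains

  subroutine buildAndDiagSketch(locSize)
    integer, allocatable, intent(in) :: locSize(:)

    if (allocated(locSize)) then
      ! MPI case: the per-process vector sizes are available here
    else
      ! Serial case: locSize was never needed; passing it unallocated is now legal
    end if
  end subroutine buildAndDiagSketch

end module locsize_sketch
```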

@bhourahine (Member)

> @thomas-niehaus Something got broken for both the xlbomd and derivatives tests, which appears for 2 processors: cmake -DTEST_MPI_PROCS=2 ..; ctest -R deriv && ctest -R xlbomd

> Might be specific to one test machine, I'll get back to you about this.

Yes, machine specific, so not a problem for your changes (we will need to investigate why the main branch was failing for these tests).

@bhourahine added the enhancement and MPI (Relates to distributed parallel) labels Mar 12, 2024
@bhourahine added this to the 24.2 milestone Mar 12, 2024
Labels: enhancement, MPI (Relates to distributed parallel)
3 participants