Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Detecting and linking against tcmalloc if present #961

Open
ipelupessy opened this issue Jan 25, 2022 · 14 comments
Open

Detecting and linking against tcmalloc if present #961

ipelupessy opened this issue Jan 25, 2022 · 14 comments

Comments

@ipelupessy
Copy link
Contributor

This issue serves as a reminder for a possible addition to the build system in the form of (optionally) detecting and
linking against the tcmalloc library (google thread aware allocator); as such it should probably wait for #950.

The reason for this is the improved performance for some parts of the xtp code (GW).
This can be seen in the attached timings.
alloc.pdf
(this plot shows timings of a xtp_tools -e dftgwbse task run for benzene molecule as a function of number of threads. Different
colors indicate different tasks in the run (DFT, GW or BSE) and linestyles indicate different allocater (default (ptmalloc), tcmalloc or jemalloc - though tcmalloc and jemalloc are mostly very similar. As can be seen the GW task in particular is sensistive to the
malloc used)

As an alternative; the code involved can be examined in more detail and maybe the offending allocation refactored out.

@JensWehner
Copy link
Member

out of curiosty, which options did you use to run GW

@ipelupessy
Copy link
Contributor Author

I used the same setup I think as in your benchmarks:
-c gwbse.gw.mode=G0W0 dftpackage.auxbasisset=aux-def2-tzvp gwbse.ranges=full

@JensWehner
Copy link
Member

okay very interesting.

@JensWehner
Copy link
Member

JensWehner commented Jan 27, 2022

I did not know GW had that many allocations especially with the ppm

@ipelupessy
Copy link
Contributor Author

just changing the gw.mode to evGW slows it down tremendously, but similar speedup for tcmalloc, eg. 1250 vs 1900 sec (for 2 core test)

@ipelupessy
Copy link
Contributor Author

but there are a lot of other options (some of which I am testing) - any particular suggestions for things to try?

@baumeier
Copy link
Member

evGW is "just" repeating the G0W0 steps several times, so I guess the scaling is the same-ish just overall N-times slower

@baumeier
Copy link
Member

but there are a lot of other options (some of which I am testing) - any particular suggestions for things to try?

have you tried exact instead of ppm? (before getting crazy with cda...)

@JensWehner
Copy link
Member

btw, which system size are you using?

@ipelupessy
Copy link
Contributor Author

I am testing with benzene here (also doing tests with naphtalene);
(indeed I figured that evGW has lot of repeat of the same calls)

exact has same pattern (550s vs 400s with tcmalloc);

cda is a lot slower again (5500 sec) - I think I also had to change to gwbse.gw.qp_solver=cda for this (otherwise it seemed to freeze at some point, but maybe I was too impatient). There is no difference in this case with tcmalloc

I also tried gwbse.gw.qp_solver=fixedpoint - the other option besides cda and grid for this) (with the default ppm, G0W0) this is a faster, and shows some speedup with tcmalloc (9.2 and 7.3 sec)

@baumeier
Copy link
Member

I am not sure, but qp_solver=cda should not be there. @JensWehner ?

From what you see there, the alloc business seems to be related to solving the qp equations on grid. The fixed point solver is expected to require by far fewer evaluations of sigma (but can find "wrong" solutions).

And as you saw yourself, CDA is a disaster. Grid is practically impossible to run. Maybe you can try with a very small molecule, like CO.

@JensWehner
Copy link
Member

qp_solver=cda should not be valid, I am surprised it worked. Yes CDA is not very performant, but that was expected.

@ipelupessy
Copy link
Contributor Author

yea, I think the option is fixed now

@junghans
Copy link
Member

junghans commented Jan 8, 2023

@ipelupessy the cmake refactor is done, find_package(Threads REQUIRED) only shows up once in the main CMakeLists.txt file, so adding tcmalloc support should be easier.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants