
Bad Performance of Parallelization with On-the-fly Transformation #144

yuxuanzhuang opened this issue Sep 18, 2020 · 10 comments

@yuxuanzhuang
Contributor

yuxuanzhuang commented Sep 18, 2020

Expected behaviour

Analysis of a Universe with an on-the-fly transformation scales reasonably well.

Actual behaviour

The scaling performance is really poor, even with just two cores.

Code

import MDAnalysis as mda
from MDAnalysis import transformations as trans
from pmda.rms.rmsd import RMSD as parallel_rmsd

u = mda.Universe(files['PDB'], files['LONG_TRAJ'])  # 9000 frames

fit_trans = trans.fit_rot_trans(u.atoms, u.atoms)
u.trajectory.add_transformations(fit_trans)

n_jobs = [1, 2, 4, 8, 16, 32, 64]

rmsd = parallel_rmsd(u.atoms, u.atoms)
for nj in n_jobs:
    rmsd.run(n_blocks=nj, n_jobs=nj)  # timed with %timeit for each nj

Reason

Some Transformations include numpy.dot, which is itself multi-threaded, so the cores become oversubscribed.
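
A quick way to see the oversubscription (an illustrative sketch, not part of the original report) is to list the native thread pools that numpy has loaded, using threadpoolctl:

# Illustrative sketch: inspect the BLAS/OpenMP thread pools numpy has loaded.
# With N Dask workers and M BLAS threads each, up to N * M threads end up
# competing for the available cores.
from threadpoolctl import threadpool_info

import numpy as np  # importing numpy loads its BLAS library

for pool in threadpool_info():
    print(pool["internal_api"], pool["num_threads"])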

Possible solution

Benchmarking result

[Figure: linear scaling reference]
[Figure: RMSD with transformation comparison]

  • Benchmarking system:
    • AMD EPYC 7551 32-Core Processor
    • RTX 2080 Ti
    • cephfs file system

Current versions:

  • MDAnalysis (python -c "import MDAnalysis as mda; print(mda.__version__)"): 2.0.0-dev
  • pmda (python -c "import pmda; print(pmda.__version__)"):
  • dask (python -c "import dask; print(dask.__version__)"):

@orbeckst
Member

Your results are interesting.

What is the single blue bar "RMSD" result? Is this running the code in serial but allowing multi-threaded operations? I don't understand why it improves with n_cores.

Interestingly, cupy does not do better than single-threaded numpy. This reminds me of results from an REU where Robert Delgado found the same thing (http://doi.org/10.6084/m9.figshare.3823293.v1) because the data transfer to/from the GPU was expensive.

@yuxuanzhuang
Contributor Author

The blue bar is the RMSD without any Transformation, just as a reference.

@yuxuanzhuang
Contributor Author

So ideally we would use only one thread for numpy here. But since the thread count is a global setting, I am not sure it can be set locally.

@orbeckst
Member

If you set it as an environment variable from Python, does it work?

Is there a with block (context manager) available to do it?
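
For reference, a minimal sketch of the environment-variable route (not code from this thread); the catch is that the variables only take effect if set before numpy first loads its BLAS library, which makes it hard to apply locally:

# Illustrative sketch: cap BLAS/OpenMP threads via environment variables.
# These must be set before numpy is imported for the first time in the
# process; which variable matters depends on the BLAS backend in use.
import os

os.environ["OMP_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"

import numpy as np  # BLAS is initialized here and picks up the limits above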

@yuxuanzhuang
Contributor Author

yuxuanzhuang commented Sep 19, 2020

Actually there is one: threadpoolctl.
I am even able to use it as a decorator. The issue is that, ideally, it should decorate _single_frame (so everything inside would be limited to one thread), but I am not sure there is a way to do it only once (e.g. through inheritance from the base class).
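
A minimal sketch of that idea, assuming pmda's ParallelAnalysisBase and its _single_frame(ts, atomgroups) hook; the subclass and helper names are hypothetical, not the actual PR code:

from threadpoolctl import threadpool_limits
from pmda.parallel import ParallelAnalysisBase


class SingleThreadedAnalysis(ParallelAnalysisBase):  # hypothetical subclass
    def _single_frame(self, ts, atomgroups):
        # Cap the BLAS/OpenMP pools to one thread while this frame is
        # processed, so each Dask worker does not oversubscribe the cores
        # through numpy.dot.
        with threadpool_limits(limits=1):
            return self._per_frame_work(ts, atomgroups)  # hypothetical helper

    def _per_frame_work(self, ts, atomgroups):
        ...  # the actual per-frame computation would go here

As noted above, the same limiter can also be applied as a decorator; whether it can be wired in once in the base class's frame loop, instead of in every subclass, is the open question here.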

For the Transformation case, I guess the fix shouldn't be limited to the parallel code, as it also improves the serial code a lot. So I am starting a PR (MDAnalysis/mdanalysis#2950) for that.

Also, I should double-check the benchmark with another benchmarking system.

@orbeckst
Member

orbeckst commented May 4, 2021

@yuxuanzhuang have you tried PMDA again since PR MDAnalysis/mdanalysis#2950 got merged?

@yuxuanzhuang
Contributor Author

yuxuanzhuang commented May 12, 2021

EDIT:
Okay, just ignore the following benchmark. I realized I was using the current pmda master branch, which does not support on-the-fly transformations (it just ignores them when reconstructing the Universe). Maybe we should push #132 forward, since we will soon have a beta 2.0.0 MDAnalysis release.

Sorry, I need to find some time to look into it. It is really weird to see the parallel RMSD perform better with a single core... maybe I messed something up.

import MDAnalysis as mda
from MDAnalysis import transformations as trans
from MDAnalysis.analysis.rms import RMSD as serial_rmsd
from pmda.rms.rmsd import RMSD as parallel_rmsd

u = mda.Universe(files['PDB'], files['SHORT_TRAJ'])
fit_trans = trans.fit_rot_trans(u.atoms, u.atoms)
u.trajectory.add_transformations(fit_trans)

rmsd = serial_rmsd(u.atoms, u.atoms)
%time rmsd.run()

CPU times: user 20.6 s, sys: 961 ms, total: 21.5 s
Wall time: 19.7 s

rmsd = parallel_rmsd(u.atoms, u.atoms)
%time rmsd.run(n_blocks=1, n_jobs=1)

CPU times: user 11.4 s, sys: 0 ns, total: 11.4 s
Wall time: 11.4 s

@yuxuanzhuang
Contributor Author

The performance on our cluster has been pretty bumpy recently, so I am using my NUC for benchmarking instead (with the code above); here is the result.
[Figure: benchmarking results on the NUC]
The bottleneck from the transformations no longer exists.

@orbeckst
Member

Which versions of PMDA and MDA did you use for the above benchmark that shows similar scaling for RMSD and RMSD+fit+rot trans?

@yuxuanzhuang
Contributor Author

> Which versions of PMDA and MDA did you use for the above benchmark that shows similar scaling for RMSD and RMSD+fit+rot trans?

The develop branch of MDA and PR #132 branch of PMDA.
