RMSD with multicore and timeout functionality #113
base: develop
Conversation
…tionality Note: MyPy and Flake8 don't pass (yet)
Thanks @Jnelen for the nice contribution! This is just a first quick pass (it's late...), I'll have a deeper look tomorrow.
@Jnelen sorry, I clicked "Update Branch" without too much thinking. You'll have to pull before pushing further changes, otherwise you will get an error. Sorry about that!
General cleanup and improvements Co-authored-by: Rocco Meli <r.meli@bluemail.ch>
Make rmsd_timeout a private function. MyPy still complains when comparing num_workers with os.cpu_count().
It now supports mol-mol, mol-list and list-list as input. In the case of mol-list, all calculations are performed separately to take advantage of multiple cores.
Please test the suggested changes. I'm a bit short in time these days so I did not test them myself, sorry.
Co-authored-by: Rocco Meli <r.meli@bluemail.ch>
Refactor to prmsdwrapper. Some improvements to typing.
For the Value ctype I picked a regular, single-precision float. However, we could also use a double-precision float if you think this is more appropriate!
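For reference, the precision choice only changes the ctype handed to `multiprocessing.Value`; a minimal sketch (the variable name is illustrative, not the PR's actual code):

```python
import ctypes
from multiprocessing import Value

# Shared single-precision float; swap in ctypes.c_double for double precision.
rmsd_result = Value(ctypes.c_float, 0.0)

# Value carries its own lock for safe cross-process updates.
with rmsd_result.get_lock():
    rmsd_result.value = 1.5  # exactly representable in single precision
```

Single precision is usually plenty for an RMSD in Ångström, but values not exactly representable in 32 bits will be rounded when read back.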
I just suppressed the 2 remaining MyPy errors. One of them should hopefully get fixed by MyPy (see the earlier issue I referred to); the other is the problem that os.cpu_count() can return None.

Functionally I think it's basically how we want it, but perhaps some parts of the implementation can still be cleaned up? Besides that, I suppose some tests should be added as well? The timeout feature should be relatively easy to test (by using one of the "problematic" compounds that made me start looking into this in the first place). But I'm not sure how to properly test the multicore feature? By the way, there is no rush in getting this finished quickly, so please only review it if you actually have the time!
Yes, we need tests. =)
Yes. I think the worst one you reported is a great candidate.
I think GitHub Actions have access to multiple cores, but I should double check. In any case, the test would be on correctness. Eventually we could also add a benchmark test, but it's not a priority IMO. If you have some results from your work (where you see an improvement), you can just post the timings here for future reference. But not a priority, the important thing is that it's correct.
Thanks. You are doing great work, so I don't want it to stall. ;)
spyrmsd/rmsd.py
```python
# Cast the molecules to lists if they aren't already
if not isinstance(molrefs, list):
    molrefs = [molrefs]
if not isinstance(mols, list):
    mols = [mols]

# Match the length of the molref
if len(molrefs) == 1 and len(molrefs) < len(mols):
    molrefs = molrefs * len(mols)
```
What we want to do is to parallelise on graph isomorphisms, since computing the RMSD is rather fast with NumPy vectorisation. Therefore, I think the use case of multiple references each with multiple different poses (i.e. in the case of docking), is an important use case that is not covered, is it?
The current behaviour is like this: if a user inputs one molref and a list of mols (`[mol1, mol2, ...]`) to compare it to (so these could be different docked poses), multiple `rmsdwrapper` calculations are performed across different cores like this:

```python
rmsdwrapper(molref, mol1)  # core 1
rmsdwrapper(molref, mol2)  # core 2
...
```

So this is just an embarrassingly parallel approach using the "regular" `rmsdwrapper`; the graph isomorphism calculations aren't even really being parallelized explicitly. Thinking more about it, I'm honestly not even sure how this interacts with the graph caching either (because of the multicore aspect).
My implementation had mostly my own use-case in mind, which I thought would generalize decently well. Basically I wanted to accelerate the workload of calculating the RMSD of a few ligandposes from different docking methods after a virtual screening run. However, considering your insights, maybe my approach was/is too simplistic and needs to be improved. I do think this would introduce even more complexity, so we will need to think things through very carefully (so basically continue at it like we have been since the start haha).
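A minimal sketch of that embarrassingly parallel pattern with the standard library (a toy `pair_rmsd` stands in for the real `rmsdwrapper`; all names here are illustrative, not the PR's code):

```python
from concurrent.futures import ProcessPoolExecutor

def pair_rmsd(pair):
    # Toy stand-in for rmsdwrapper(molref, mol): just a scalar distance here.
    ref, mol = pair
    return abs(ref - mol)

def parallel_rmsd(molref, mols, num_workers=2):
    # One independent (molref, mol) task per molecule; the pool spreads
    # them across worker processes, i.e. across cores.
    pairs = [(molref, mol) for mol in mols]
    with ProcessPoolExecutor(max_workers=num_workers) as executor:
        return list(executor.map(pair_rmsd, pairs))
```

Because each task is a full, independent `rmsdwrapper` call, the graph isomorphism inside each call is recomputed per task, which matches the caching concern raised above.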
I added some very basic tests for the prmsdwrapper function. (I basically copied the tests with the rmsdwrapper function and used the prsmdwrapper function instead). I also added a test for the timeout function. However one thing I noticed is that for some reason this runs twice: once with a True and once with a False variable. When the tests were giving errors I got this:
I fixed the error of the assert, but the test still runs twice. I'm still confused since there isn't a parametrize, or even an input variable. I couldn't immediately figure it out, but I assume it's getting this from some earlier statement? Of course there is still some work left improving the actual prmsdwrapper function, but I suppose some basic tests can also add sanity checks to ensure we're not breaking stuff.
Co-authored-by: Rocco Meli <r.meli@bluemail.ch>
Thanks @Jnelen! I'm a bit swamped this week too, but I'll try to have a look and reply ASAP. Sorry for the delays.
Yes, this is a bit confusing. I think the two runs are coming from the following fixture, which is automatically applied to all the tests of the file (Lines 11 to 15 in 08764e1).
It's used to trigger a separate code path deep inside the minimized RMSD calculation. So the two runs are normal and you should not worry about them, but it's great that you look this closely at what's happening!
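For context, a parametrised `autouse` fixture is exactly the kind of thing that makes every test in a module run once per parameter value without any visible `parametrize` on the test itself; a hypothetical sketch (not spyrmsd's actual fixture):

```python
import pytest

@pytest.fixture(autouse=True, params=[True, False])
def minimize_flag(request):
    # autouse: applied to every test in the module without being requested
    # explicitly; params: each test runs once with True and once with False.
    return request.param
```

Any test collected in the same module would then appear twice in the pytest output, tagged `[True]` and `[False]`.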
No problem at all. Actually, I will probably also be quite inactive until Wednesday next week. I should still be able to respond to some comments, but I probably won't be able to change or test any new code until then. Just as a heads-up!
Ah that explains it, thanks for the clarification!
I'm back from my break! Right before that I actually updated my use-case (ESSENCE-Dock) to work with a time-out and multicore, but I ended up using Pebble since this handles both at the same time. For my specific use-case I think it was the most elegant solution, and I use a Singularity image anyway, so I didn't mind adding an additional dependency.

However, this got me thinking about whether we should change the current multicore and timeout implementation. Maybe I had my (specific) use-case too much in mind? It definitely is functional, and will work just fine to process a large number of molecules across different CPUs. However, the calculation of graph isomorphisms isn't currently accelerated; for that I assume we need to look into specific solutions for the different backends (since I assume they can take care of the multicore support)? I could also see how this could lead to problems with the current multicore implementation, since it will want to use multiple cores in each process. That's why I'm wondering if we should make larger changes to the current version.

Finally, as for a benchmark: I haven't run a proper one yet, but I'll try to run one later with the current implementation.
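For reference, the "manual" pre-Pebble approach amounts to spawning a process per task and hard-killing it when it exceeds the timeout (which is also what Pebble does for you on timeout). A minimal stdlib sketch, with illustrative names and a `time.sleep` standing in for a slow RMSD calculation:

```python
import math
import multiprocessing as mp
import time

def _work(conn, seconds):
    # Stand-in for a potentially slow RMSD calculation.
    time.sleep(seconds)
    conn.send(seconds)

def run_with_timeout(seconds, timeout):
    parent_conn, child_conn = mp.Pipe()
    proc = mp.Process(target=_work, args=(child_conn, seconds))
    proc.start()
    proc.join(timeout)
    if proc.is_alive():
        # Worker exceeded the timeout: terminate it and signal failure.
        proc.terminate()
        proc.join()
        return math.nan
    return parent_conn.recv()
```

The bookkeeping above (pipes, join, terminate) per task is what libraries like Pebble wrap into a pool with a per-item `timeout`.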
Hi, I finally got around to working on a simple benchmark, but I have started to notice quite a notable difference between running the calculations the first time versus subsequent times. Here is a concrete example for the same dataset.
Versus the second (and subsequent) time(s):
Do you know why this is? I'm not sure, but I think this is because of the loading of the molecules, which are maybe cached in subsequent runs? Also note the discrepancy between the real time and the user time in the first run. I tried several different datasets, and this pattern seems to be very common. Do you know what could be happening? I will try to follow up later with some concrete "benchmarks" and more context!
Alright, I did some testing, and the performance of the parallel implementation is a lot worse than I thought it would be, to be honest. In the edge-case tests I was doing it seemed good, because even if some "problematic" molecules were present it would still finish in a reasonable amount of time, whereas with the "regular" implementation it would get killed or take a very long time. However, in more "normal" scenarios, the current prmsdwrapper seems to degrade the performance by a lot. I used some docking examples from DUD-E to test. This example is for HS90A, where I used the LeadFinder (LF) and Gnina (GN) results to calculate the RMSD for 4938 compounds. You can find the scripts I used here: benchmark_scripts.zip. Anyway, here are my numbers:
Something else I noticed is that error handling in the
There is no rush at all! Please only do it when you have the time!
Perfect, I just added it!
Thanks @Jnelen, looking good! I left some comments, but we are getting there, thanks for your patience.
```python
num_workers = min(num_workers, os.cpu_count()) if os.cpu_count() is not None else 1  # type: ignore[type-var]

# Cast the molecules to lists if they aren't already
if not isinstance(molrefs, list):
```
Parallelisation is performed over graph isomorphisms, so I don't think a single reference structure is particularly interesting to have. Is this for the timeout functionality only? If yes, it might be worth adding a note in the docstring, explaining the two (non-exclusive) functionalities of this function and that parallelisation only makes sense for multiple molecules.
Maybe it adds useless overhead, but it might be worth checking `num_workers` and raising a warning if a single molecule is passed with `num_workers > 1`? But maybe it's not worth it, because it would prevent setting `num_workers = None` by default.
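The `os.cpu_count()` wrinkle behind the MyPy suppression above is that it can return `None`; a hedged sketch of a `None`-safe clamp (illustrative, not the PR's exact code):

```python
import os

def resolve_num_workers(num_workers=None):
    # os.cpu_count() may return None when the count is undetermined;
    # fall back to 1 in that case.
    available = os.cpu_count() or 1
    if num_workers is None:
        # None means "use everything available".
        return available
    # Never request more workers than there are cores.
    return min(num_workers, available)
```

Resolving `None` up front like this also sidesteps the `type: ignore[type-var]` on the `min(...)` comparison.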
Yes, if there is only one structure it makes less sense to use this method, unless you want to use the timeout feature like you mentioned. Another scenario could be where you want to check just one structure against a lot of reference structures, which would also be executed in parallel. (For example, if you docked a structure 100 times and want to compare the top structure directly with all the other compounds, it would treat every RMSD calculation separately and run in parallel across multiple cores.) In this case, it would basically match the single input structure to all the reference molecules for you, and calculate each pair separately using `rmsdwrapper`. So `rmsdwrapper(mol, refmol1)`, `rmsdwrapper(mol, refmol2)`, ... is computed across all the available cores.
Another scenario could be where you want to check just one structure against a lot of reference structures, which would also be executed in parallel.
Yes, but in this case you would have to re-compute the graph isomorphism each time, which might be the most expensive thing. I'm starting to wonder if these two cases (one reference with many corresponding structures, or multiple references with few corresponding structures) are different enough to distinguish? The first case would definitely benefit from computing the graph isomorphisms once, and then parallelising over the RMSD calculations (although these are quite fast, so the overhead of parallelisation might be too much...).
What do you think? Do you have a feel about how much is the parallel overhead compared to just the RMSD calculation?
Yeah, so the parallel method is "stupid" in the sense that it re-computes the graph isomorphism each time (which I agree is not efficient). However, I think that would need some more changes (maybe cache it somehow so it can be reused?). Also, the graph isomorphism calculations aren't performed in a multicore manner either. In an ideal world the isomorphisms would be calculated using all the assigned num_workers and be cached. After that, the RMSDs could be computed across all num_workers (single core) and reuse the cached isomorphisms until all calculations are finished. However, I don't have the bandwidth to work on that. The initial goal for me was to have a function with a timeout so it can run more stably on libraries that have "problematic" molecules (like DrugBank).
Also, for your question about overhead: with Pebble there still is some overhead, but compared to spawning the processes manually as I did in the first try, it looks to be relatively small (but of course not 0). I would say around 10-20% overhead maybe? Here are the benchmarks I shared before:
- regular `rmsdwrapper` function: 12.00s
- `pebblermsdwrapper` with num_worker=1: 14.27s
- `pebblermsdwrapper` with num_worker=2: 7.69s
- `pebblermsdwrapper` with num_worker=4: 4.93s
So using 2 cores already gives faster performance than 1 (which in my opinion already seems worth it, especially with the timeout feature).
However I think that for this there would need to be made some more changes (maybe cache it somehow so it can be reused?).
Yes, this is how a standard RMSD calculation already works, if you are passing a single reference and many structures to compare it with. The isomorphism is computed only once and cached:
Lines 239 to 240 in 62c5df3:

```python
cache: bool
    Cache graph isomorphisms
```
So one would have to use `_rmsd_isomorphic_core` in a similar way in order to implement parallelisation over RMSD calculations with a single graph isomorphism.
However I don't have the bandwidth to work on that.
Totally understandable. And so that there are no misunderstandings: I'm not suggesting to implement both directly, but to narrow the scope of the one you are implementing (parallelisation on graph isomorphism, and timeout), and leave the other less interesting case (parallelisation over RMSD calculations with the same isomorphism) as a TODO.
Also for your question about overhead: with Pebble there still is some overhead, but compared to spawning the processes manually as I did in the first try, it looks to be relatively small (but of course not 0).
Yes, I was wondering how the overhead compares with RMSD calculations only, instead of graph isomorphism + RMSD. The former is vectorised with numpy, so it should be very fast (and the overhead would have much more impact).
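The "compute the isomorphism once, parallelise only the cheap RMSD evaluations" idea could be sketched like this. All names are toy stand-ins (an identity atom mapping and a scalar RMSD), not spyrmsd's `_rmsd_isomorphic_core`:

```python
from concurrent.futures import ProcessPoolExecutor
from functools import partial

def atom_mapping(ref_atoms, mol_atoms):
    # Stand-in for the expensive graph-isomorphism search:
    # here simply an identity mapping between atom indices.
    return list(zip(range(len(ref_atoms)), range(len(mol_atoms))))

def rmsd_with_mapping(coords_mol, mapping, coords_ref):
    # Fast per-pose RMSD once the atom mapping is known (1D toy coordinates).
    n = len(mapping)
    return (sum((coords_ref[i] - coords_mol[j]) ** 2 for i, j in mapping) / n) ** 0.5

def prmsd_single_reference(coords_ref, poses, num_workers=2):
    # All poses share the reference topology: compute the mapping once,
    # then parallelise only the cheap per-pose RMSD evaluations.
    mapping = atom_mapping(coords_ref, poses[0])
    f = partial(rmsd_with_mapping, mapping=mapping, coords_ref=coords_ref)
    with ProcessPoolExecutor(max_workers=num_workers) as executor:
        return list(executor.map(f, poses))
```

Since the per-pose step is fast, whether the pool overhead pays off here is exactly the open question in this thread.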
Improve prmsdwrapper feedback
I just pushed an update where I tried to address most of your comments.
add shorter warning when not all compounds were successfully processed
Alright, I made some minor changes (I added a check as you suggested), and also changed the printing to a more compact warning.
That would be great!
I think we can simplify the interface to accept a list of references and a list, or a list of lists, of molecules?

```python
molrefs: List[molecule.Molecule],
mols: Union[List[molecule.Molecule], List[List[molecule.Molecule]]],
```

Is that compatible with your use case? Otherwise we can keep it as-is, and I can eventually clean it up later if needed.
Alright, I'll try to think about it and see what I can come up with.
Yes it is, but in that case, if you pass a single molecule, the user just has to pass it as a list (which shouldn't be a problem if it's documented properly). I also got a response on the issue I opened. However, I'm quite busy this week, so I'm not sure when I will get around to it. Maybe I'll find some time tomorrow, otherwise it'll be later this week.
Thanks for the update @Jnelen, sounds great.
No problem, don't sweat it. There is no rush, and there is no timer attached to an open PR! ;)
Check if proper warnings are raised (on some compound failures as well as setting chunksize and timeout). Check if setting a different chunksize works in the test scenario (with and without timeout).
I just added some more tests for Also for the
@Jnelen apologies for the massive delay here! The kid was sick, then I was away for work, and finally I was sick too. I should have some time to have a proper and final look at this towards the end of the week.
No need to apologize, friend. Hope you and your kid are feeling better now! Take your time and I'll get back to it when you have found the time to take a look!
Just a few comments while I'm slowly getting back into this. I'm really sorry for the massive delay!
Don't worry about it! I'll try to take a detailed look at your comments tomorrow evening.
Make warning message more comprehensive
I simplified the num_workers aspect of the function, thanks for pointing that out! Also, tests are failing on GitHub CI, but the problem seems to be with the setup and installation of the packages? Did something about the configuration/setup change?
Cycling to restart CI.
Sorry again for the delay @Jnelen, and thank you for your patience. I've been swamped both at work and outside of work.
I had another good look at the PR and left quite a few comments, but I think we are getting there. I'll play around a bit more with it in the coming days and eventually add more comments as needed.
Thanks a lot for your review! I'll try to work on and improve it later this week(end).
Thanks for your thorough review @RMeli! I tried to address most issues. I added an example test for the more realistic test case (with chunksize=2 and a timeout), but I made it skip, since in the current implementation we force the chunksize to be 1 when a timeout is set. However, I think that for debugging/developing purposes this can still be interesting, which is why I added it anyway (and made it skip so it doesn't give an error for now).
I agree on the interest, but I think it would be better to leave it as a comment here in this PR instead of within the codebase. As I mentioned in the last comment, I completely forgot we went for exclusivity of the two features, so some of my comments were a bit confusing/confused. Apologies for the confusion.
Test for the hypothetical case where a timeout on chunks of size 2 makes the whole chunk fail. The test checks that the whole chunk is set to `np.nan`. Currently, specifying a timeout will automatically change the chunk size to 1, so the test is skipped:

```python
@pytest.mark.skip(
    reason="Chunksize is currently automatically set to 1 when a timeout is set"
)
def test_prmsdwrapper_mixed_timeout(muparfostat, benzene) -> None:
    muparfostat_mol = copy.deepcopy(muparfostat)
    benzene_mol = copy.deepcopy(benzene)

    lst_1 = [
        muparfostat_mol,
        benzene_mol,
        benzene_mol,
        benzene_mol,
        benzene_mol,
        muparfostat_mol,
    ]
    lst_2 = [
        muparfostat_mol,
        benzene_mol,
        benzene_mol,
        benzene_mol,
        benzene_mol,
        muparfostat_mol,
    ]

    # Currently we force the num_workers to be 1 when there is a timeout set,
    # but this could make it easier to debug things in the future
    RMSDlist = prmsdwrapper(lst_1, lst_2, timeout=1e-3, num_workers=1, chunksize=2)

    expectedResult = [np.nan, np.nan, 0, 0, np.nan, np.nan]
    np.testing.assert_allclose(RMSDlist, expectedResult, atol=1e-5)
```
I added some clarifications in the code! I saw we still had some conversations open, but they were mostly discussions about things that can be improved in the future.
As we discussed in #105: here is my first shot at implementation of RMSD functions with multicore and timeout functionalities.
For now, I kept the `rmsd_timeout` function public, however we could also just make it private. This function supports a molref with multiple mols, something that the multicore function doesn't support directly (everything is done pairwise here). Although I guess there is an easy workaround for the user? Maybe it's even better to implement this ourselves: if only one mol is supplied as the ref mol, extend the list to match the length of the mol list (like this: `[refmol] * len(mols)`). This would also give each molecule a separate timeout. Having the possibility of a mol list in the `rmsd_timeout` function also made me return the raw result of `rmsdwrapper` (a list) instead of the value (because otherwise only one value would be returned). So if we make it private, it could clean things up and also emulate the regular `rmsdwrapper` functionality better! (I just came up with this while writing this description; if you like it, I will implement it tomorrow!)

I also still need to add tests. I guess we could add a test for a timeout, a test for matching mols and molrefs lists, ... But multicore might be harder to test?
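The `[refmol] * len(mols)` idea from the description could be sketched as a small input-normalisation helper (the name `broadcast_refs` is hypothetical):

```python
def broadcast_refs(molrefs, mols):
    # Accept single molecules or lists for both arguments.
    if not isinstance(molrefs, list):
        molrefs = [molrefs]
    if not isinstance(mols, list):
        mols = [mols]
    # A single reference is repeated to match the number of molecules,
    # so each pair gets its own calculation (and its own timeout).
    if len(molrefs) == 1 and len(mols) > 1:
        molrefs = molrefs * len(mols)
    if len(molrefs) != len(mols):
        raise ValueError("Mismatch between number of references and molecules")
    return molrefs, mols
```

Repeating the reference this way only duplicates list entries (references to the same object), so it costs essentially nothing even for large inputs.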
Any feedback welcome as always!
Note: I intentionally kept all the new functions and imports together at the bottom so it's clearer to see what's added. This means Flake8 doesn't pass. MyPy also doesn't pass yet, but I'll get to fixing that tomorrow as well!