
RMSD with multicore and timeout functionality #113

Status: Open. Wants to merge 25 commits into base: develop.
Conversation

@Jnelen (Contributor) commented Mar 11, 2024

As we discussed in #105: here is my first shot at an implementation of RMSD functions with multicore and timeout functionality.

For now, I kept the rmsd_timeout function public, but we could also make it private. This function supports a molref with multiple mols, something the multicore function doesn't support directly (everything is done pairwise there), although I guess there is an easy workaround for the user?

Maybe it's even better to implement this ourselves: if only one mol is supplied as the ref mol, extend the list to match the length of the mol list (like this: [refmol] * len(mols)). This would also give each molecule a separate timeout. Supporting a mol list in the rmsd_timeout function also made me return the raw result of rmsdwrapper (a list) instead of the value (because otherwise only one value would be returned). So if we make it private, it could clean things up and also emulate the regular rmsdwrapper behaviour better! (I just came up with this while writing this description; if you like it, I will implement it tomorrow!)
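For illustration, a minimal sketch of that broadcasting idea (the function name is a placeholder, not the PR's actual code):

```python
from typing import Any, List

def broadcast_refs(molrefs: List[Any], mols: List[Any]) -> List[Any]:
    """Repeat a single reference so it pairs one-to-one with ``mols``."""
    if len(molrefs) == 1 and len(mols) > 1:
        return molrefs * len(mols)  # i.e. [refmol] * len(mols)
    return molrefs
```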

I also still need to add tests. I guess we could add a test for a timeout, a test for matching mols and molrefs list, ... But multicore might be harder to test?

Any feedback welcome as always!

Note: I intentionally kept all the new functions and imports together at the bottom so it's easier to see what's added. This means Flake8 doesn't pass. MyPy also doesn't pass yet, but I'll get to fixing that tomorrow as well!

…tionality

Note: MyPy and Flake8 don't pass (yet)
codecov bot commented Mar 11, 2024

Codecov Report

Attention: Patch coverage is 94.11765%, with 3 lines in your changes missing coverage. Please review.

Project coverage is 98.65%. Comparing base (da6be7f) to head (5729f6b).
Report is 11 commits behind head on develop.


@RMeli (Owner) left a comment

Thanks @Jnelen for the nice contribution! This is just a first quick pass (it's late...), I'll have a deeper look tomorrow.

(10 resolved review comments on spyrmsd/rmsd.py, now outdated)
@RMeli (Owner) commented Mar 11, 2024

@Jnelen sorry, I clicked "Update Branch" without too much thinking. You'll have to pull before pushing further changes, otherwise you will get an error. Sorry about that!

Jnelen and others added 2 commits March 12, 2024 21:38
General cleanup and improvements

Co-authored-by: Rocco Meli <r.meli@bluemail.ch>
(2 resolved review comments on spyrmsd/rmsd.py, now outdated)
Make rmsd_timeout a private function
MyPy still complains when comparing num_workers with os.cpu_count()
It now supports mol-mol, mol-list, and list-list as input.
In the case of mol-list, all calculations are performed separately to take advantage of multiple cores.
@RMeli (Owner) left a comment

Please test the suggested changes. I'm a bit short on time these days, so I did not test them myself, sorry.

(5 resolved review comments on spyrmsd/rmsd.py, now outdated)
Co-authored-by: Rocco Meli <r.meli@bluemail.ch>
(1 resolved review comment on spyrmsd/rmsd.py, now outdated)
Refactor to prmsdwrapper
Some improvements to typing
@Jnelen (Contributor, Author) commented Mar 13, 2024

For the Value ctype I picked a regular single-precision float, but we could also use a double-precision float if you think that's more appropriate!
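For context, assuming the Value here is multiprocessing.Value, a minimal sketch of the two options (illustrative only, not the PR's actual code):

```python
import multiprocessing

rmsd_single = multiprocessing.Value("f", 0.0)  # "f": single-precision C float
rmsd_double = multiprocessing.Value("d", 0.0)  # "d": double-precision C float

with rmsd_single.get_lock():  # Value carries a lock for safe cross-process updates
    rmsd_single.value = 1.2345678
```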

@Jnelen (Contributor, Author) commented Mar 14, 2024

> Thanks! Please don't waste too much time on mypy issues. It's supposed to be helpful, not annoying. If it becomes too much, we can simply turn it off for the offending line(s) and deal with it another time.

I just suppressed the 2 remaining MyPy errors. One of them should hopefully get fixed in MyPy itself (see the earlier issue I referred to); the other is the problem that os.cpu_count() can return None sometimes, which causes problems in min(num_workers, os.cpu_count()). I tried your patch, but it was still complaining.
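For reference, one pattern that typically satisfies mypy here (a sketch, assuming num_workers is an int) is binding os.cpu_count() to a local so the Optional[int] can be narrowed before the min() call:

```python
import os

num_workers = 8  # example value
cpus = os.cpu_count()  # Optional[int]: may be None on some platforms
num_workers = min(num_workers, cpus) if cpus is not None else 1
```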

Functionally I think it's basically how we want it, but perhaps some parts of the implementation can still be cleaned up? Besides that, I suppose some tests should be added as well? The timeout feature should be relatively easy to test (by using one of the "problematic" compounds that made me start looking into this in the first place). But I'm not sure how to properly test the multicore feature?

By the way, there is no rush in getting this finished quickly, so please only review it if you actually have the time!

@RMeli (Owner) commented Mar 14, 2024

> Besides that, I suppose some tests should be added as well?

Yes, we need tests. =)

> The timeout feature should be relatively easy to test (by using one of the "problematic" compounds that made me start looking into this in the first place).

Yes. I think the worst one you reported is a great candidate.

> But I'm not sure how to properly test the multicore feature?

I think GitHub Actions have access to multiple cores, but I should double-check. In any case, the test would be on correctness.

Eventually we could also add a benchmark test, but it's not a priority IMO. If you have some results from your work (where you see an improvement), you can just post the timings here for future reference. But it's not a priority; the important thing is that it's correct.

> By the way, there is no rush in getting this finished quickly, so please only review it if you actually have the time!

Thanks. You are doing great work, so I don't want it to stall. ;)

(2 resolved review comments on spyrmsd/rmsd.py, now outdated)
Review thread on spyrmsd/rmsd.py (outdated), comment on lines 540 to 548:
```python
# Cast the molecules to lists if they aren't already
if not isinstance(molrefs, list):
    molrefs = [molrefs]
if not isinstance(mols, list):
    mols = [mols]

# Match the length of the molref
if len(molrefs) == 1 and len(molrefs) < len(mols):
    molrefs = molrefs * len(mols)
```
@RMeli (Owner) commented:

What we want to do is to parallelise on graph isomorphisms, since computing the RMSD is rather fast with NumPy vectorisation. Therefore, I think the use case of multiple references, each with multiple different poses (i.e. in the case of docking), is an important use case that is not covered, is it?

@Jnelen (Contributor, Author) commented Mar 14, 2024:

The current behaviour is like this: if a user inputs 1 molref and a list of mols ([mol1, mol2, ...]) to compare it to (so these could be different docked poses), multiple rmsdwrapper calculations are performed across different cores like this:

```python
rmsdwrapper(molref, mol1)  # core 1
rmsdwrapper(molref, mol2)  # core 2
...
```

So this is just an embarrassingly parallel approach using the "regular" rmsdwrapper; the graph isomorphism calculations aren't really being parallelized explicitly. Thinking more about it, I'm honestly not even sure how this interacts with the graph caching either (because of the multicore aspect).

My implementation mostly had my own use case in mind, which I thought would generalize decently well: basically, I wanted to accelerate calculating the RMSD of a few ligand poses from different docking methods after a virtual screening run. However, considering your insights, maybe my approach was/is too simplistic and needs to be improved. I do think this would introduce even more complexity, so we will need to think things through very carefully (so basically continue at it like we have been since the start, haha).
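For clarity, the pattern described above is roughly the following (a sketch using the standard library, not the PR's actual implementation):

```python
from concurrent.futures import ProcessPoolExecutor
from functools import partial

from spyrmsd.rmsd import rmsdwrapper  # import path assumed

def pairwise_rmsd(molref, mols, num_workers=4):
    """Run one rmsdwrapper(molref, mol) call per molecule, each in its own process."""
    with ProcessPoolExecutor(max_workers=num_workers) as pool:
        # Each worker recomputes the graph isomorphism for its own pair,
        # and per-process work bypasses any in-process graph cache
        return list(pool.map(partial(rmsdwrapper, molref), mols))
```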

(1 resolved review comment on spyrmsd/rmsd.py, now outdated)
@Jnelen (Contributor, Author) commented Mar 17, 2024

I added some very basic tests for the prmsdwrapper function (I basically copied the tests for the rmsdwrapper function and used the prmsdwrapper function instead). I also added a test for the timeout function. However, one thing I noticed is that for some reason it runs twice: once with a True and once with a False variable. When the tests were giving errors I got this:

```
FAILED tests/test_rmsd.py::test_prmsdwrapper_single_molecule_timeout[True] - assert nan is nan
FAILED tests/test_rmsd.py::test_prmsdwrapper_single_molecule_timeout[False] - assert nan is nan
```

I fixed the assert error, but the test still runs twice. I'm still confused, since there isn't a parametrize or even an input variable. I couldn't immediately figure it out, but I assume it's picking this up from some earlier statement?

Of course there is still some work left to improve the actual prmsdwrapper function, but I suppose some basic tests can also serve as sanity checks to ensure we're not breaking stuff.

Co-authored-by: Rocco Meli <r.meli@bluemail.ch>
@RMeli (Owner) commented Mar 18, 2024

Thanks @Jnelen! I'm a bit swamped this week too, but I'll try to have a look and reply ASAP. Sorry for the delays.

> I'm still confused, since there isn't a parametrize or even an input variable. I couldn't immediately figure it out, but I assume it's picking this up from some earlier statement?

Yes, this is a bit confusing. I think the two runs are coming from the following fixture, which is automatically applied to all the tests of the file (autouse=True).

```python
@pytest.fixture(autouse=True, params=[True, False])
def lambda_max_failure(monkeypatch, request):
    """
    Monkey patch fixture for :code:`lambda_max` function to simulate convergence
    failures.
    """
```

It's used to trigger a separate code path deep inside the minimized RMSD calculation. So the two runs are normal and you should not worry about them, but it's great that you look this closely at what's happening!
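To see the mechanism in isolation, here is a self-contained toy example (not spyrmsd's actual test code) showing how an autouse, parametrized fixture multiplies every test in its scope, even tests that never request it:

```python
import pytest

@pytest.fixture(autouse=True, params=[True, False])
def failure_mode(request):
    # autouse: applied to every test in scope; params: each test is
    # collected once per parameter, e.g. test_x[True] and test_x[False]
    return request.param

def test_x():  # runs twice despite taking no arguments
    assert 1 + 1 == 2
```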

@Jnelen (Contributor, Author) commented Mar 18, 2024

> Thanks @Jnelen! I'm a bit swamped this week too, but I'll try to have a look and reply ASAP. Sorry for the delays.

No problem at all. Actually, I will probably also be quite inactive until Wednesday next week. I should still be able to respond to some comments, but I probably won't be able to change or test any new code until then. Just as a heads-up!

> > I'm still confused, since there isn't a parametrize or even an input variable. I couldn't immediately figure it out, but I assume it's picking this up from some earlier statement?
>
> Yes, this is a bit confusing. I think the two runs are coming from the following fixture, which is automatically applied to all the tests of the file (autouse=True).
>
> ```python
> @pytest.fixture(autouse=True, params=[True, False])
> def lambda_max_failure(monkeypatch, request):
>     """
>     Monkey patch fixture for :code:`lambda_max` function to simulate convergence
>     failures.
>     """
> ```
>
> It's used to trigger a separate code path deep inside the minimized RMSD calculation. So the two runs are normal and you should not worry about them, but it's great that you look this closely at what's happening!

Ah that explains it, thanks for the clarification!

@Jnelen (Contributor, Author) commented Mar 28, 2024

I'm back from my break! Right before it, I actually updated my use case (ESSENCE-Dock) to work with a timeout and multicore, but I ended up using Pebble, since it handles both at the same time. For my specific use case I think it was the most elegant solution, and I use a Singularity image anyway, so I didn't mind adding an additional dependency.

However, this got me thinking about whether we should change the current multicore and timeout implementation. Maybe I had my (specific) use case too much in mind? It definitely is functional, and will work just fine to process a large number of molecules across different CPUs. However, the calculation of graph isomorphisms isn't currently accelerated; for that, I assume we need to look into specific solutions for the different backends (since I assume they can take care of the multicore support)? I could also see how this could lead to problems with the current multicore implementation, since each process would want to use multiple cores itself. That's why I'm wondering if we should make larger changes to the current version.

Finally, as for a benchmark, I haven't run a proper one yet, but I'll try to run one later with the current implementation.

@Jnelen (Contributor, Author) commented Mar 30, 2024

Hi,

I finally got around to working on a simple benchmark, but I noticed quite a notable difference between running the calculations the first time versus subsequent times. Here is a concrete example for the same dataset.

Running it the first time:

```
real    0m30.536s
user    0m17.145s
sys     0m2.811s
```

Versus the second (and subsequent) time(s):

```
real    0m11.677s
user    0m11.957s
sys     0m2.657s
```

I'm not sure why this is; I think it's because of the loading of the molecules, which are maybe cached for subsequent runs? Also note the discrepancy between the real time and the user time in the first run. I tried several different datasets, and this pattern seems to be very common. Do you know what could be happening?

I will try to follow-up later with some concrete "benchmarks" and more context!

@Jnelen (Contributor, Author) commented Mar 30, 2024

Alright, I did some testing, and the performance of the parallel implementation is honestly a lot worse than I thought it would be. In the edge-case tests I was doing it seemed good, because even if some "problematic" molecules were present it would still finish in a reasonable amount of time, whereas the "regular" implementation would get killed or take a very long time. However, in more "normal" scenarios, the current prmsdwrapper seems to degrade performance by a lot. I used some docking examples from DUD-E to test. This example is for HS90A, where I used the LeadFinder (LF) and Gnina (GN) results to calculate the RMSD for 4938 compounds. You can find the scripts I used here: benchmark_scripts.zip.

Anyway, here are my numbers:

| Configuration | real | user | sys |
| --- | --- | --- | --- |
| Regular (rmsdwrapper) | 0m11.990s | 0m12.207s | 0m2.473s |
| Parallel (1 core) | 2m56.928s | 0m25.358s | 2m39.194s |
| Parallel (4 cores) | 0m51.843s | 0m24.107s | 2m45.873s |
| Parallel (8 cores) | 0m29.346s | 0m23.753s | 2m40.650s |
| Parallel (16 cores) | 0m18.365s | 0m21.712s | 2m37.021s |

Something else I noticed is that the error handling in prmsdwrapper should be improved as well. In the current implementation it seems almost never worth it to use prmsdwrapper, except when there are many edge cases that can take a very long time or even kill the Python process; at that point, it might be better for the user to implement specific changes/checks themselves? I think our concept of prmsdwrapper is still a good idea, but my implementation seems to perform very suboptimally. I would love to have a clean and efficient prmsdwrapper function merged, but this definitely isn't it. I also have many other projects coming up, so unfortunately I think I will have less time to allocate to this as well. What do you think?

@Jnelen (Contributor, Author) commented Apr 10, 2024

> Thanks @Jnelen! I'll review the main function ASAP, sorry for the delay.

There is no rush at all! Please only do it when you have the time!

> Yes, it needs to be tested in CI. Since it's an optional dependency, we need CI configurations both with and without it. But since your design is completely orthogonal, I don't think we need to add too many. I'd suggest using pebble in the RDKit configurations and not using it with the OpenBabel configurations, so that we don't expand the CI matrix. What do you think? The files to add it to would be spyrmsd-test-rdkit-all.yaml and spyrmsd-test-rdkit-nogt.yaml.

Perfect, I just added it!

@RMeli (Owner) left a comment

Thanks @Jnelen, looking good! I left some comments, but we are getting there, thanks for your patience.

(1 resolved review comment on spyrmsd/parallel.py, now outdated)
```python
num_workers = min(num_workers, os.cpu_count()) if os.cpu_count() is not None else 1  # type: ignore[type-var]

# Cast the molecules to lists if they aren't already
if not isinstance(molrefs, list):
```
@RMeli (Owner) commented:

Parallelisation is performed over graph isomorphisms, so I don't think a single reference structure is particularly interesting to have. Is this for the timeout functionality only? If yes, it might be worth adding a note in the docstring explaining the two (non-exclusive) functionalities of this function, and that parallelisation only makes sense for multiple molecules.

Maybe it adds useless overhead, but it might be worth checking num_workers and raising a warning if a single molecule is passed with num_workers > 1? But maybe it's not worth it, because it would prevent setting num_workers = None by default.

@Jnelen (Contributor, Author) commented:

Yes, if there is only one structure it makes less sense to use this method, unless you want to use the timeout feature like you mentioned. Another scenario could be where you want to check just one structure against a lot of reference structures, which would also be executed in parallel (for example, if you docked a structure 100 times and want to compare the top structure directly with all the other compounds, every RMSD calculation would be treated separately and run in parallel across multiple cores). In this case, it would basically match the single input structure to all the reference molecules for you, and calculate each one separately using rmsdwrapper. So rmsdwrapper(mol, refmol1), rmsdwrapper(mol, refmol2), ... is computed across all the available cores.

@RMeli (Owner) commented:

> Another scenario could be where you want to check just one structure against a lot of reference structures, which would also be executed in parallel.

Yes, but in this case you would have to re-compute the graph isomorphism each time, which might be the most expensive part. I'm starting to wonder if these two cases (one reference with many corresponding structures, or multiple references with few corresponding structures each) are different enough to distinguish them? The first case would definitely benefit from computing the graph isomorphisms once and then parallelising over the RMSD calculations (although these are quite fast, so the overhead of parallelisation might be too much...).

@RMeli (Owner) commented:

What do you think? Do you have a feel for how large the parallel overhead is compared to just the RMSD calculation?

@Jnelen (Contributor, Author) commented:

Yeah, so the parallel method is "stupid" in the sense that it re-computes the graph isomorphism each time (which I agree is not efficient). I think this would need some more changes (maybe cache it somehow so it can be reused?). The graph isomorphism calculations aren't performed in a multicore manner either. In an ideal world the isomorphisms would be calculated using all the assigned num_workers and be cached; after that, the RMSDs could be computed across all num_workers (single core each), reusing the cached isomorphisms until all calculations are finished. However, I don't have the bandwidth to work on that. The initial goal for me was to have a function with a timeout so it can run more stably on libraries that have "problematic" molecules (like DrugBank).

As for your question about overhead: with Pebble there is still some overhead, but compared to spawning the processes manually, as I did in the first try, it looks relatively small (but of course not 0). I would say around 10-20% overhead maybe? Here are the benchmarks I shared before:

- regular rmsdwrapper function: 12.00s
- pebblermsdwrapper with num_workers=1: 14.27s
- pebblermsdwrapper with num_workers=2: 7.69s
- pebblermsdwrapper with num_workers=4: 4.93s

So using 2 cores already gives you faster performance than 1 core (which in my opinion already seems worth it, especially with the timeout feature)

@RMeli (Owner) commented:

> I think this would need some more changes (maybe cache it somehow so it can be reused?).

Yes, this is how a standard RMSD calculation already works, if you are passing a single reference and many structures to compare it with. The isomorphism is computed only once and cached:

spyrmsd/spyrmsd/rmsd.py, lines 239 to 240 in 62c5df3:

```
cache: bool
    Cache graph isomorphisms
```

So one would have to use _rmsd_isomorphic_core in a similar way in order to implement parallelisation over RMSD calculations with a single graph isomorphism.
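As a rough sketch of that idea (the two helpers below are hypothetical placeholders standing in for spyrmsd internals, not actual functions):

```python
from concurrent.futures import ProcessPoolExecutor
from functools import partial

def compute_isomorphisms(ref, mol):
    """Placeholder for the expensive graph-isomorphism search."""
    raise NotImplementedError

def rmsd_from_isomorphisms(ref, isomorphisms, mol):
    """Placeholder for the cheap, vectorised RMSD evaluation."""
    raise NotImplementedError

def rmsd_shared_isomorphism(molref, mols, num_workers=4):
    # Compute the isomorphism once, in the parent process
    # (assuming all poses share the reference's topology)
    isomorphisms = compute_isomorphisms(molref, mols[0])
    # Fan out only the RMSD evaluations to the workers
    worker = partial(rmsd_from_isomorphisms, molref, isomorphisms)
    with ProcessPoolExecutor(max_workers=num_workers) as pool:
        return list(pool.map(worker, mols))
```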

> However, I don't have the bandwidth to work on that.

Totally understandable. And so that there are no misunderstandings: I'm not suggesting implementing both directly, but narrowing the scope of the one you are implementing (parallelisation on graph isomorphism, and timeout), and leaving the other less interesting case (parallelisation over RMSD calculations with the same isomorphism) as a TODO.

> As for your question about overhead: with Pebble there is still some overhead, but compared to spawning the processes manually, as I did in the first try, it looks relatively small (but of course not 0).

Yes, I was wondering how the overhead compares with the RMSD calculation only, instead of graph isomorphism + RMSD. The former is vectorised with NumPy, so it should be very fast (and the overhead would have much more impact).

(5 resolved review comments on spyrmsd/parallel.py and 2 on tests/test_parallel.py)
Improve prmsdwrapper feedback
@Jnelen (Contributor, Author) commented Apr 11, 2024

I just pushed an update where I tried to address most of your comments.
I think the main things remaining are fixing the (potential) issues with chunksize > 1 and adding more tests for prmsdwrapper (especially regarding chunksize...). I also opened an issue in the Pebble repository to see if the author can give some input (there was no discussions page), since it's quite confusing. In the absolute worst case, where we can't make it work properly, we can restrict the chunksize, I suppose, and make it flexible once we find a (proper) fix.

add shorter warning when not all compounds were successfully processed
@Jnelen (Contributor, Author) commented Apr 14, 2024

Alright, I made some minor changes (I added a check as you suggested), and also changed the printing to a more compact warning.
I'm thinking maybe add a test where the chunksize is tested (with and without a timeout)?
What other concrete changes should be made in your opinion for our redefined scope?

@RMeli (Owner) commented Apr 15, 2024

> I'm thinking maybe add a test where the chunksize is tested (with and without a timeout)?

That would be great!

> What other concrete changes should be made in your opinion for our redefined scope?

I think we can simplify the interface to accept a list of references, and a list (or a list of lists) of molecules?

```python
    molrefs: List[molecule.Molecule],
    mols: Union[List[molecule.Molecule], List[List[molecule.Molecule]]],
```

Is that compatible with your use case? Otherwise we can keep it as-is, and I can eventually clean it up later if needed.

@Jnelen (Contributor, Author) commented Apr 15, 2024

> > I'm thinking maybe add a test where the chunksize is tested (with and without a timeout)?
>
> That would be great!

Alright, I'll try to think about it and see what I can come up with.

> > What other concrete changes should be made in your opinion for our redefined scope?
>
> I think we can simplify the interface to accept a list of references, and a list (or a list of lists) of molecules?
>
> ```python
>     molrefs: List[molecule.Molecule],
>     mols: Union[List[molecule.Molecule], List[List[molecule.Molecule]]],
> ```
>
> Is that compatible with your use case? Otherwise we can keep it as-is, and I can eventually clean it up later if needed.

Yes it is, but in that case, if you pass a single molecule, the user just has to pass it as a list (which shouldn't be a problem if it's documented properly).

I also got a response on the issue I opened on Pebble. When an error or a timeout occurs, the whole chunk should fail, according to the author. This is also how I initially thought it worked, so I will try to test it again with a fresh mind (and maybe also with a simplified example without spyrmsd) to be sure, and debug it. If that's the case, I can hopefully restore the flexibility of the chunksize parameter.
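For reference, the chunk-level behaviour being discussed matches Pebble's documented map pattern; here's a minimal sketch (fn and the inputs are stand-ins, not spyrmsd code):

```python
from concurrent.futures import TimeoutError

from pebble import ProcessPool

def fn(x):
    return x * x  # stand-in for the per-molecule RMSD work

with ProcessPool(max_workers=2) as pool:
    future = pool.map(fn, range(10), chunksize=2, timeout=5)
    results = future.result()
    while True:
        try:
            print(next(results))
        except StopIteration:
            break
        except TimeoutError:
            # The timeout is enforced per chunk: when it fires, every
            # result in that chunk is lost together
            print("chunk timed out")
```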

However, I'm quite busy this week, so I'm not sure when I will get around to it. Maybe I'll find some time tomorrow; otherwise it'll be later this week.

@RMeli (Owner) commented Apr 15, 2024

Thanks for the update @Jnelen, sounds great.

> However, I'm quite busy this week, so I'm not sure when I will get around to it. Maybe I'll find some time tomorrow; otherwise it'll be later this week.

No problem, don't sweat it. There is no rush, and there is no timer attached to an open PR! ;) spyrmsd isn't going anywhere anytime soon (I hope!).

Check if proper warnings are raised (on some compound failures as well as setting chunksize and timeout)
Check if setting different chunksize works in test scenario (with and without timeout)
@Jnelen (Contributor, Author) commented Apr 22, 2024

I just added some more tests for prmsdwrapper, mainly checking that proper warnings are raised and that different chunksizes work (when no timeout is set). What else do you feel should be improved before we can merge? I feel we're mostly there, but I'd like your input!

Also, regarding the Pebble interaction with errors and timeouts: I updated my initial issue with a simpler minimal working example. I'm curious what the response will be, but for now I'd like to try and get this PR finished up!

@RMeli (Owner) commented Apr 29, 2024

@Jnelen apologies for the massive delay here! The kid was sick, then I was away for work, and finally I was sick too. I should have some time to have a proper and final look at this towards the end of the week.

@Jnelen (Contributor, Author) commented Apr 29, 2024

> @Jnelen apologies for the massive delay here! The kid was sick, then I was away for work, and finally I was sick too. I should have some time to have a proper and final look at this towards the end of the week.

No need to apologize, friend; I hope you and your kid are feeling better now! Take your time, and I'll get back to it once you have found the time to take a look!

@RMeli (Owner) left a comment

Just a few comments while I'm slowly getting back into this. I'm really sorry for the massive delay!

(2 resolved review comments on spyrmsd/parallel.py)
@Jnelen (Contributor, Author) commented May 13, 2024

> Just a few comments while I'm slowly getting back into this. I'm really sorry for the massive delay!

Don't worry about it! I'll try to take a detailed look at your comments tomorrow evening.

Make warning message more comprehensive
@Jnelen (Contributor, Author) commented May 16, 2024

I simplified the num_workers aspect of the function, thanks for pointing that out!
Can you think of anything else that should be added? For example, is using all the threads as the default option fine, or should that be changed?

Also, tests are failing in the GitHub CI, but the problem seems to be with the setup and installation of the packages. Did something about the configuration/setup change?

@RMeli closed this May 28, 2024
@RMeli reopened this May 28, 2024
@RMeli (Owner) commented May 28, 2024

Cycling to restart CI.

@RMeli (Owner) left a comment

Sorry again for the delay @Jnelen, and thank you for your patience. I've been swamped both at work and outside of work.

I had another good look at the PR and left quite a few comments, but I think we are getting there. I'll play around a bit more with it in the coming days and eventually add more comments as needed.

(5 review comments on tests/test_parallel.py and 5 on spyrmsd/parallel.py, mostly resolved)
@Jnelen (Contributor, Author) commented May 29, 2024

> Sorry again for the delay @Jnelen, and thank you for your patience. I've been swamped both at work and outside of work.
>
> I had another good look at the PR and left quite a few comments, but I think we are getting there. I'll play around a bit more with it in the coming days and eventually add more comments as needed.

Thanks a lot for your review! I'll try to work on and improve it later this week(end).

@Jnelen (Contributor, Author) commented May 30, 2024

Thanks for your thorough review @RMeli! I tried to address most issues. I added an example test for the more realistic case (with chunksize=2 and a timeout), but I made it skip, since in the current implementation we force the chunksize to be 1 when a timeout is set. However, I think that for debugging/developing purposes this can still be interesting, which is why I added it anyway (and made it skip so it doesn't give an error for now).

@RMeli (Owner) commented May 30, 2024

> However, I think that for debugging/developing purposes this can still be interesting, which is why I added it anyway (and made it skip so it doesn't give an error for now).

I agree on the interest, but I think it would be better to leave it as a comment here in this PR instead of within the codebase.

As I mentioned in the last comment, I completely forgot we went for exclusivity of the two features, so some of my comments were a bit confusing/confused. Apologies for the confusion.

@RMeli (Owner) commented May 30, 2024

Test for the hypothetical case where a timeout on chunks of size 2 makes the whole chunk fail. The test checks that the whole chunk is set to np.nan.

Currently, specifying a timeout will automatically change the chunk size to 1.

```python
import copy

import numpy as np
import pytest

from spyrmsd.parallel import prmsdwrapper  # import path assumed from this PR


@pytest.mark.skip(
    reason="Chunksize is currently automatically set to 1 when a timeout is set"
)
def test_prmsdwrapper_mixed_timeout(muparfostat, benzene) -> None:
    muparfostat_mol = copy.deepcopy(muparfostat)
    benzene_mol = copy.deepcopy(benzene)

    lst_1 = [
        muparfostat_mol,
        benzene_mol,
        benzene_mol,
        benzene_mol,
        benzene_mol,
        muparfostat_mol,
    ]
    lst_2 = [
        muparfostat_mol,
        benzene_mol,
        benzene_mol,
        benzene_mol,
        benzene_mol,
        muparfostat_mol,
    ]

    # Currently we force the chunksize to be 1 when a timeout is set, but this
    # test could make it easier to debug things in the future
    RMSDlist = prmsdwrapper(lst_1, lst_2, timeout=1e-3, num_workers=1, chunksize=2)

    expectedResult = [np.nan, np.nan, 0, 0, np.nan, np.nan]
    np.testing.assert_allclose(RMSDlist, expectedResult, atol=1e-5)
```

@Jnelen (Contributor, Author) commented May 31, 2024

I added some clarifications in the code! I saw we still have some conversations open, but they are mostly discussions about things that can be improved in the future.
