
Multiprocessing is slow #329

Open
JulienT01 opened this issue Jul 7, 2023 · 4 comments
Assignees: riiswa
Labels: discussion (This issue needs further discussion), enhancement (New feature or request), Marathon (To do during Marathon)

JulienT01 (Collaborator) commented Jul 7, 2023

running "ltest_dqn_vs_mdqn_acrobot.py" with 10000 budget.

doing n_fit=4 is longer than 2* n_fit=2 when using parallelization="process"

TODO : add regression test 2fit faster than 2*1fit (with multiprocessing)
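A minimal sketch of such a regression test, assuming a hypothetical `manager_factory` helper that builds an already-configured AgentManager (the actual agent, environment, and budget arguments depend on the rlberry test setup):

```python
import time


def time_fit(manager_factory, n_fit):
    """Build a manager with the given n_fit, fit it, and return the wall-clock time."""
    manager = manager_factory(n_fit=n_fit, parallelization="process")
    start = time.perf_counter()
    manager.fit()
    return time.perf_counter() - start


def test_process_parallelization_speedup(manager_factory):
    # One parallel run with two fits should beat two sequential single fits.
    t_parallel = time_fit(manager_factory, n_fit=2)
    t_sequential = time_fit(manager_factory, n_fit=1) + time_fit(manager_factory, n_fit=1)
    assert t_parallel < t_sequential
```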

TimotheeMathieu (Collaborator) commented

From the tests I ran, it seems to be a conflict between Python's multiprocessing and PyTorch's multiprocessing.

I just tried replacing all uses of multiprocessing in AgentManager with joblib, and the problem disappears: `n_fit=4` is faster than two runs with `n_fit=2`.
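To illustrate, here is a minimal sketch of what a joblib-based dispatch could look like (the `fit_one_agent` function is a hypothetical stand-in for the per-agent training code, not the actual AgentManager internals):

```python
from joblib import Parallel, delayed


def fit_one_agent(seed):
    # Placeholder for the actual per-agent training code.
    return seed


def fit_all(n_fit):
    # The loky backend runs each fit in a separate worker process,
    # similar in spirit to parallelization="process".
    return Parallel(n_jobs=n_fit, backend="loky")(
        delayed(fit_one_agent)(seed) for seed in range(n_fit)
    )
```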

@omardrwch: why did you choose not to use joblib before? It is a lot simpler to code, and I don't see why multiprocessing would be needed instead.

omardrwch (Member) commented

Hello! Actually, in the very first implementation of AgentManager, I was using joblib. But, at least at that time, there was a problem with jobs that create subprocesses themselves (i.e., if an Agent created by an AgentManager creates new processes). If I remember correctly, I got the error `daemonic processes are not allowed to have children`.

Another advantage of multiprocessing is the possibility of using spawn, which is more robust (each agent basically gets its own interpreter), see e.g. https://stackoverflow.com/a/66113051.
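For reference, a minimal illustration of the spawn start method with the standard library (the worker function is just a placeholder for an agent's fit):

```python
import multiprocessing


def fit_worker(seed):
    # Placeholder for an agent's fit(); runs in a fresh interpreter under spawn.
    return seed


if __name__ == "__main__":
    ctx = multiprocessing.get_context("spawn")
    with ctx.Pool(processes=2) as pool:
        print(pool.map(fit_worker, range(2)))
```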

We could maybe add a `parallelization="joblib"` option in AgentManager, but I think it's important to keep Python's multiprocessing as an option for those reasons.

riiswa (Collaborator) commented Jul 10, 2023

Another suggestion would be to use PyTorch's own multiprocessing subpackage (https://pytorch.org/docs/stable/multiprocessing.html#module-torch.multiprocessing) instead of the standard one.

There is also a short document about multiprocessing best practices in PyTorch: https://pytorch.org/docs/stable/notes/multiprocessing.html
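A minimal sketch of that suggestion, assuming PyTorch is installed (torch.multiprocessing is documented as a drop-in replacement for the standard module; the worker below is only a placeholder):

```python
import torch.multiprocessing as mp


def fit_worker(rank):
    # Placeholder for an agent's fit(); rank is the worker index.
    print(f"worker {rank} done")


if __name__ == "__main__":
    # Start two fresh worker processes using the "spawn" start method.
    mp.spawn(fit_worker, nprocs=2)
```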

KohlerHECTOR added the enhancement (New feature or request), discussion (This issue needs further discussion), and Marathon (To do during Marathon) labels on Jul 13, 2023
KohlerHECTOR added this to To do in Marathon rlberry on Jul 13, 2023
riiswa moved this from To do to In progress in Marathon rlberry on Jul 24, 2023
riiswa self-assigned this on Jul 24, 2023
RockmanZheng commented Oct 24, 2023

Hi @omardrwch. I have been using rlberry for some time and have encountered an issue that could be related to the multiprocessing problem discussed here. I was running 20 simple bandit experiments with a horizon of 250 and 1000 workers (simulations). The simulations took longer and longer: in the first experiment one simulation took 7 s, while in the 20th experiment one run took 137 s.

From the time_elapsed data I recorded in my local database, there are noticeable gaps within a single simulation run. For example, time_elapsed can jump by several seconds instead of increasing steadily as expected. Does this have something to do with the conflict between the multiprocessing modules of different packages mentioned by @TimotheeMathieu?

To help illustrate this, I have uploaded a snapshot of the data I recorded. Thanks in advance.
snapshot.csv
