
Multiprocessing is slow #329

Open
JulienT01 opened this issue Jul 7, 2023 · 4 comments
Assignees: riiswa
Labels: discussion (This issue needs further discussion), enhancement (New feature or request), Marathon (To do during Marathon)

JulienT01 (Collaborator) commented Jul 7, 2023

running "ltest_dqn_vs_mdqn_acrobot.py" with 10000 budget.

doing n_fit=4 is longer than 2* n_fit=2 when using parallelization="process"

TODO : add regression test 2fit faster than 2*1fit (with multiprocessing)
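A minimal sketch of such a regression test, assuming a hypothetical `manager_factory` helper that builds an already-configured AgentManager (the actual agent, environment, and budget arguments depend on the rlberry test setup):

```python
import time


def time_fit(manager_factory, n_fit):
    """Build a manager with the given n_fit, fit it, and return the wall-clock time."""
    manager = manager_factory(n_fit=n_fit, parallelization="process")
    start = time.perf_counter()
    manager.fit()
    return time.perf_counter() - start


def test_process_parallelization_speedup(manager_factory):
    # One parallel run with two fits should beat two sequential single fits.
    t_parallel = time_fit(manager_factory, n_fit=2)
    t_sequential = time_fit(manager_factory, n_fit=1) + time_fit(manager_factory, n_fit=1)
    assert t_parallel < t_sequential
```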

TimotheeMathieu (Collaborator) commented

From the tests I ran, it seems to be a conflict between Python's multiprocessing and PyTorch's multiprocessing.

I just tried replacing all uses of multiprocessing in AgentManager with joblib, and the problem disappears: `n_fit=4` is faster than two runs with `n_fit=2`.
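To illustrate, here is a minimal sketch of what a joblib-based dispatch could look like (the `fit_one_agent` function is a hypothetical stand-in for the per-agent training code, not the actual AgentManager internals):

```python
from joblib import Parallel, delayed


def fit_one_agent(seed):
    # Placeholder for the actual per-agent training code.
    return seed


def fit_all(n_fit):
    # The loky backend runs each fit in a separate worker process,
    # similar in spirit to parallelization="process".
    return Parallel(n_jobs=n_fit, backend="loky")(
        delayed(fit_one_agent)(seed) for seed in range(n_fit)
    )
```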

@omardrwch: why did you choose not to use joblib before? It is a lot simpler to code, and I don't see why multiprocessing would be needed instead.

omardrwch (Member) commented

Hello! Actually, in the very first implementation of AgentManager, I was using joblib. But, at least at that time, there was a problem with jobs that create subprocesses themselves (i.e., if an Agent created by an AgentManager creates new processes). If I remember correctly, I got the error `daemonic processes are not allowed to have children`.

Another advantage of multiprocessing is the possibility of using spawn, which is more robust (each agent basically gets its own interpreter), see e.g. https://stackoverflow.com/a/66113051.
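For reference, a minimal illustration of the spawn start method with the standard library (the worker function is just a placeholder for an agent's fit):

```python
import multiprocessing


def fit_worker(seed):
    # Placeholder for an agent's fit(); runs in a fresh interpreter under spawn.
    return seed


if __name__ == "__main__":
    ctx = multiprocessing.get_context("spawn")
    with ctx.Pool(processes=2) as pool:
        print(pool.map(fit_worker, range(2)))
```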

We could maybe add a `parallelization="joblib"` option in AgentManager, but I think it's important to keep Python's multiprocessing as an option for those reasons.

riiswa (Collaborator) commented Jul 10, 2023

Another suggestion would be to use PyTorch's own multiprocessing subpackage (https://pytorch.org/docs/stable/multiprocessing.html#module-torch.multiprocessing) instead of the standard one.

There is also a short document about multiprocessing best practices in PyTorch: https://pytorch.org/docs/stable/notes/multiprocessing.html
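A minimal sketch of that suggestion, assuming PyTorch is installed (torch.multiprocessing is documented as a drop-in replacement for the standard module; the worker below is only a placeholder):

```python
import torch.multiprocessing as mp


def fit_worker(rank):
    # Placeholder for an agent's fit(); rank is the worker index.
    print(f"worker {rank} done")


if __name__ == "__main__":
    # Start two fresh worker processes using the "spawn" start method.
    mp.spawn(fit_worker, nprocs=2)
```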

KohlerHECTOR added the enhancement (New feature or request), discussion (This issue needs further discussion), and Marathon (To do during Marathon) labels on Jul 13, 2023
KohlerHECTOR added this to To do in Marathon rlberry on Jul 13, 2023
riiswa moved this from To do to In progress in Marathon rlberry on Jul 24, 2023
riiswa self-assigned this on Jul 24, 2023
RockmanZheng commented Oct 24, 2023

Hi @omardrwch. I have been using rlberry for some time and have encountered an issue that could be related to the multiprocessing problem discussed here. I was running 20 simple bandit experiments with a horizon of 250 and 1000 workers (simulations). The simulations took longer and longer: in the first experiment one simulation took 7 s, while in the 20th experiment one run took 137 s.

From the time_elapsed data I recorded in my local database, there are noticeable gaps within a single simulation run. For example, time_elapsed can jump by several seconds instead of increasing steadily as expected. Does this have something to do with the conflict between the multiprocessing modules of different packages mentioned by @TimotheeMathieu?

To help illustrate this, I have uploaded a snapshot of the data I recorded. Thanks in advance.
snapshot.csv
