
Researcher component needs better memory management #1101

Closed
srcansiz opened this issue May 2, 2024 · 2 comments
Labels: attic (the entry is not completed, but is now considered obsolete and closed)

Comments

@srcansiz
Member

srcansiz commented May 2, 2024

While testing large neural network models over many training rounds, I have encountered memory issues. Please see the details below.

Model size: 600MB
Number of rounds of training: 100
Number of nodes: 3
Dry run: True
Operating System: Mac M3
Tested: using the Pytest end-to-end machinery, a Jupyter Notebook, and plain python3.

After reaching round 22, the memory usage of the program (researcher) starts to go over 32 GB, which ends up with the following errors when using plain python or the pytest end-to-end facility:

envs/fedbiomed-researcher/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 2 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

In the Jupyter Notebook, the kernel dies after reaching round 22.

This means that the researcher component can only handle about 200 rounds for an average 60MB model, or 2000 rounds for a 6MB model. This is due to training_replies, which keeps all the previous aggregated and individual model weights across the training rounds. It also means that the researcher component can handle fewer rounds as the number of nodes increases, because more model weights are kept in the training replies object. Additionally, the numbers may vary depending on whether secure aggregation is enabled, since it may increase the volume of the individual encrypted model weights. A rough back-of-envelope estimate of this growth is sketched below.
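A minimal sketch of that estimate (the function name and the assumption that each round retains one aggregated copy of the weights plus one copy per node reply are mine, not Fed-BioMed internals; the real footprint also depends on serialization and on what each reply actually holds):

```python
def retained_history_upper_bound_gb(model_size_mb: float, num_rounds: int, num_nodes: int) -> float:
    """Rough upper bound on memory held by training_replies when the full history is retained.

    Assumes each round keeps one aggregated copy of the model weights plus one copy per node reply.
    """
    copies_per_round = 1 + num_nodes
    return model_size_mb * copies_per_round * num_rounds / 1024

# Scenario from this issue: 600 MB model, 3 nodes
print(retained_history_upper_bound_gb(600, 100, 3))  # upper bound if all 100 planned rounds were kept
print(retained_history_upper_bound_gb(600, 22, 3))   # upper bound around the round where the run fails
```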

Redesigning the training replies object to keep only the last aggregated model weights in memory, and to load the other model weights from the file system when they are needed, could solve a big part of the memory issue; a rough sketch of such a spill-to-disk store follows.
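A minimal sketch of the idea (class and method names are hypothetical, not actual Fed-BioMed API): only the latest round stays resident, and requesting an older round pays a disk read instead of a memory cost.

```python
import pickle
from pathlib import Path
from typing import Any, Dict


class SpillToDiskTrainingReplies:
    """Keeps only the latest round's replies in memory; earlier rounds are written to disk."""

    def __init__(self, cache_dir: str) -> None:
        self._dir = Path(cache_dir)
        self._dir.mkdir(parents=True, exist_ok=True)
        self._latest_round: int | None = None
        self._latest_replies: Dict[str, Any] | None = None

    def add_round(self, round_id: int, replies: Dict[str, Any]) -> None:
        # Spill the previously latest round to disk before replacing it in memory.
        if self._latest_round is not None:
            with open(self._dir / f"round_{self._latest_round}.pkl", "wb") as f:
                pickle.dump(self._latest_replies, f)
        self._latest_round = round_id
        self._latest_replies = replies

    def get_round(self, round_id: int) -> Dict[str, Any]:
        # Serve the in-memory round directly, reload older rounds from disk on demand.
        if round_id == self._latest_round:
            return self._latest_replies
        with open(self._dir / f"round_{round_id}.pkl", "rb") as f:
            return pickle.load(f)
```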

@srcansiz srcansiz added candidate an individual developer submits a work request to the team (extension proposal, bug, other request) needs-triage a user opens an issue, that needs to be tagged appropriatly by the development team and removed needs-triage a user opens an issue, that needs to be tagged appropriatly by the development team labels May 2, 2024
@srcansiz srcansiz changed the title Researcher component need better memory management Researcher component needs better memory management May 2, 2024
@mvesin
Member

mvesin commented May 2, 2024

Hi @srcansiz

This is a known behaviour/limitation :-)

To avoid this issue you can use the following (currently in develop, not in master) to keep only the training_replies for the last round:

exp.set_retain_full_history(False)
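For context, a minimal usage sketch (assuming exp is an already configured Experiment, with its training plan, tags and number of rounds set up as usual):

```python
exp.set_retain_full_history(False)  # keep only the last round's training_replies in memory
exp.run()                           # memory held by the researcher should then stay roughly bounded across rounds
```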

We also noted in #207 a point to (possibly) re-implement the training replies so as to keep only the last round in memory (and the other rounds on disk, for minimal memory impact).

@srcansiz
Member Author

srcansiz commented May 2, 2024

Hi @mvesin,

Thank you very much. I totally forgot that this method exists to avoid the issue. I am going to re-run the tests with retain_full_history disabled.

@mvesin mvesin self-assigned this May 27, 2024
@mvesin mvesin added attic the entry is not completed, but is now considered obsolete and closed and removed candidate an individual developer submits a work request to the team (extension proposal, bug, other request) labels May 27, 2024
@mvesin mvesin closed this as completed May 27, 2024