
Researcher component needs better memory management #1101

Closed
srcansiz opened this issue May 2, 2024 · 2 comments
Labels: attic (the entry is not completed, but is now considered obsolete and closed)

Comments

@srcansiz
Member

srcansiz commented May 2, 2024

While testing large neural network models over many training rounds, I have encountered memory issues. Please see the details below.

Model size: 600MB
Number of rounds of training: 100
Number of nodes: 3
Dry run: True
Operating System: Mac M3
Tested: using the Pytest end-to-end machinery, a Jupyter Notebook, and plain python3.

After reaching round 22, the memory usage of the program (researcher) starts to go over 32 GB, which ends up with the following errors when using plain python or the pytest end-to-end facility:

envs/fedbiomed-researcher/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 2 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

In the Jupyter Notebook, the kernel dies after reaching round 22.

This means that the researcher component can only handle about 200 rounds for an average 60MB model, or 2000 rounds for a 6MB model. This is due to training_replies, which keeps all the previous aggregated and individual model weights across the training rounds. It also means that the researcher component can handle fewer rounds as the number of nodes increases, because more model weights are kept in the training replies object. Additionally, the numbers may vary depending on whether secure aggregation is enabled, since it may increase the volume of the individual encrypted model weights. A rough back-of-envelope estimate of this growth is sketched below.
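A minimal sketch of that estimate (the function name and the assumption that each round retains one aggregated copy of the weights plus one copy per node reply are mine, not Fed-BioMed internals; the real footprint also depends on serialization and on what each reply actually holds):

```python
def retained_history_upper_bound_gb(model_size_mb: float, num_rounds: int, num_nodes: int) -> float:
    """Rough upper bound on memory held by training_replies when the full history is retained.

    Assumes each round keeps one aggregated copy of the model weights plus one copy per node reply.
    """
    copies_per_round = 1 + num_nodes
    return model_size_mb * copies_per_round * num_rounds / 1024

# Scenario from this issue: 600 MB model, 3 nodes
print(retained_history_upper_bound_gb(600, 100, 3))  # upper bound if all 100 planned rounds were kept
print(retained_history_upper_bound_gb(600, 22, 3))   # upper bound around the round where the run fails
```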

Redesigning the training replies object to keep only the last aggregated model weights in memory, and to load the other model weights from the file system when they are needed, could solve a big part of the memory issue; a rough sketch of such a spill-to-disk store follows.
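A minimal sketch of the idea (class and method names are hypothetical, not actual Fed-BioMed API): only the latest round stays resident, and requesting an older round pays a disk read instead of a memory cost.

```python
import pickle
from pathlib import Path
from typing import Any, Dict


class SpillToDiskTrainingReplies:
    """Keeps only the latest round's replies in memory; earlier rounds are written to disk."""

    def __init__(self, cache_dir: str) -> None:
        self._dir = Path(cache_dir)
        self._dir.mkdir(parents=True, exist_ok=True)
        self._latest_round: int | None = None
        self._latest_replies: Dict[str, Any] | None = None

    def add_round(self, round_id: int, replies: Dict[str, Any]) -> None:
        # Spill the previously latest round to disk before replacing it in memory.
        if self._latest_round is not None:
            with open(self._dir / f"round_{self._latest_round}.pkl", "wb") as f:
                pickle.dump(self._latest_replies, f)
        self._latest_round = round_id
        self._latest_replies = replies

    def get_round(self, round_id: int) -> Dict[str, Any]:
        # Serve the in-memory round directly, reload older rounds from disk on demand.
        if round_id == self._latest_round:
            return self._latest_replies
        with open(self._dir / f"round_{round_id}.pkl", "rb") as f:
            return pickle.load(f)
```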

@srcansiz srcansiz added candidate an individual developer submits a work request to the team (extension proposal, bug, other request) needs-triage a user opens an issue, that needs to be tagged appropriatly by the development team and removed needs-triage a user opens an issue, that needs to be tagged appropriatly by the development team labels May 2, 2024
@srcansiz srcansiz changed the title Researcher component need better memory management Researcher component needs better memory management May 2, 2024
@mvesin
Member

mvesin commented May 2, 2024

Hi @srcansiz

This is a known behaviour/limitation :-)

To avoid this issue you can use the following (currently in develop, not in master) to keep only the training_replies for the last round:

exp.set_retain_full_history(False)
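For context, a minimal usage sketch (assuming exp is an already configured Experiment, with its training plan, tags and number of rounds set up as usual):

```python
exp.set_retain_full_history(False)  # keep only the last round's training_replies in memory
exp.run()                           # memory held by the researcher should then stay roughly bounded across rounds
```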

We also noted in #207 a point to (possibly) re-implement the training replies so as to keep only the last round in memory (and the other rounds on disk, for minimal memory impact).

@srcansiz
Member Author

srcansiz commented May 2, 2024

Hi @mvesin,

Thank you very much. I totally forgot that this method exists to avoid the issue. I am going to re-run the tests with retain_full_history disabled.

@mvesin mvesin self-assigned this May 27, 2024
@mvesin mvesin added attic the entry is not completed, but is now considered obsolete and closed and removed candidate an individual developer submits a work request to the team (extension proposal, bug, other request) labels May 27, 2024
@mvesin mvesin closed this as completed May 27, 2024