While testing big neural network models over many training rounds I have encountered memory issues. Please see details below.
Model size: 600MB
Number of rounds of training: 100
Number of nodes: 3
Dry run: True
Operating System: Mac M3
Tested: using Pytest end-to-end machinery, Jupyter Notebook, and plain python3.
After reaching round 22, the memory usage of the program (researcher) starts to exceed 32 GB, which ends in the following errors when using plain python or the pytest end-to-end facility:
envs/fedbiomed-researcher/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 2 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
On Jupyter Notebook, the kernel dies after reaching round 22.
This means that the researcher component can only handle about 200 rounds for an average 60 MB model, or about 2000 rounds for a 6 MB model. This is due to training_replies, which keeps all the previous aggregated and individual model weights across the rounds of training. It also means the researcher component can handle fewer rounds as the number of nodes increases, because more model weights are kept in the training replies object. Additionally, the numbers may vary depending on whether secure aggregation is active, since it may increase the volume of the individual encrypted model weights.
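The growth can be estimated with simple arithmetic. The sketch below is illustrative only, assuming each round retains one aggregated copy of the model plus one copy per node:

```python
def replies_memory_mb(model_mb: float, rounds: int, nodes: int) -> float:
    """Rough upper bound on memory held by training_replies:
    per round, one aggregated model plus one model per node.
    (Illustrative assumption, not the exact internal layout.)"""
    per_round = model_mb * (1 + nodes)
    return per_round * rounds

# 600 MB model, 3 nodes: 2400 MB per round, 52800 MB (~52 GB) after 22 rounds
print(replies_memory_mb(600, 22, 3))
```

Under this assumption, a 600 MB model with 3 nodes accumulates roughly 2.4 GB per round, which is consistent with the crash observed around round 22 on a machine with tens of GB of RAM.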
Redesigning the training replies object to keep only the last aggregated model weights in memory, and to load other model weights from the file system when needed, would solve a large part of the memory issue.
srcansiz added the candidate (an individual developer submits a work request to the team: extension proposal, bug, other request) and needs-triage (a user opens an issue that needs to be tagged appropriately by the development team) labels, and removed the needs-triage label, on May 2, 2024.

srcansiz changed the title from "Researcher component need better memory management" to "Researcher component needs better memory management" on May 2, 2024.
To avoid this issue, you can use the following (currently in develop, not in master) to keep only the training_replies for the last round:
exp.set_retain_full_history(False)
We also noted in #207 a point to (possibly) re-implement training replies so as to keep only the last round in memory (and the other rounds on disk, for minimal memory impact).
mvesin added the attic label (the entry is not completed, but is now considered obsolete and closed) and removed the candidate label (an individual developer submits a work request to the team: extension proposal, bug, other request) on May 27, 2024.