srcansiz opened this issue on Apr 9, 2024 · 0 comments
Labels
bug: this issue is about reporting and resolving a suspected bug
candidate: an individual developer submits a work request to the team (extension proposal, bug, other request)
With the default strategy, a training round should fail if any node returns an unsuccessful reply. However, the `DefaultStrategy.refine` call in `Experiment.run_once` only receives successful training replies, because `Job._get_training_result` does not add replies with `success = False` to the `training_replies` dictionary. As a result, the round is treated as successful even when one or more nodes return `success=False`, as long as at least one node returns a successful training reply.
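The effect can be reproduced with a toy model of the reply flow. This is only a sketch, not Fed-BioMed code: `collect_replies_filtered` and `refine_default` are hypothetical stand-ins for `Job._get_training_result` and `DefaultStrategy.refine`, and the reply dictionaries are assumed to carry nothing but a `success` flag.

```python
from typing import Dict


def collect_replies_filtered(raw_replies: Dict[str, dict]) -> Dict[str, dict]:
    """Hypothetical stand-in for the collection step: failed replies are dropped."""
    return {node_id: reply for node_id, reply in raw_replies.items() if reply["success"]}


def refine_default(training_replies: Dict[str, dict]) -> bool:
    """Hypothetical stand-in for the strategy check.

    The round passes only if every reply the strategy can see succeeded.
    """
    return all(reply["success"] for reply in training_replies.values())


raw = {
    "node-1": {"success": True},
    "node-2": {"success": False},  # this failure never reaches the strategy
}

round_ok = refine_default(collect_replies_filtered(raw))
print(round_ok)  # True, although node-2 failed
```

Passing `raw` directly to `refine_default` returns `False`, which is the outcome the default strategy is supposed to produce.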
A simple solution would be to add unsuccessful replies to the training replies and let the strategy class handle them. However, this also requires modifying `_extract_received_optimizer_aux_var_from_round` so that it only extracts auxiliary variables from successful replies.
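A minimal sketch of how that proposal could look, under stated assumptions: `failed_nodes` and `extract_aux_vars` are hypothetical helpers, not the actual Fed-BioMed functions, and the `aux_var` key is an invented placeholder for whatever the real replies carry. All replies are kept, the strategy-side check sees the failure, and the extraction step skips unsuccessful replies.

```python
from typing import Any, Dict, List


def failed_nodes(training_replies: Dict[str, dict]) -> List[str]:
    """Return the ids of nodes whose reply is not marked successful."""
    return [
        node_id
        for node_id, reply in training_replies.items()
        if not reply.get("success", False)
    ]


def extract_aux_vars(training_replies: Dict[str, dict]) -> Dict[str, Any]:
    """Collect auxiliary variables from successful replies only.

    The "aux_var" key is an assumption made for this illustration.
    """
    return {
        node_id: reply["aux_var"]
        for node_id, reply in training_replies.items()
        if reply.get("success", False) and "aux_var" in reply
    }


# Unfiltered replies: the failed one is kept so the strategy can react to it.
replies = {
    "node-1": {"success": True, "aux_var": {"state": [0.1, 0.2]}},
    "node-2": {"success": False},
}

failures = failed_nodes(replies)
round_ok = not failures            # the round fails if any node failed
aux_vars = extract_aux_vars(replies)  # node-2 is skipped here
```

With this split, the decision to fail the round stays in the strategy, while the extraction step no longer assumes that every reply it receives is successful.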
srcansiz added the bug and candidate labels on Apr 9, 2024
srcansiz changed the title from "DefaultStrategy only receives succesfull training replies" to "DefaultStrategy only receives successful training replies" on Apr 9, 2024