Inconsistant final models trained by Swarm Learning #240

Open
Ultimate-Storm opened this issue Feb 29, 2024 · 1 comment

@Ultimate-Storm

Issue description

  • issue description: We observed inconsistency in the final models trained by Swarm Learning. Two nodes take part in the swarm learning run. However, whether we pick the last or the best checkpoint for prediction, the results differ significantly from each other.
  • occurrence - consistent or rare: Consistent
  • error messages: None
  • commands used for starting containers:
  • docker logs [APLS, SPIRE, SN, SL, SWCI]:
    swop_u.log
    swci_u.log
    sn_u.log
    sl_u.log
    ml_u.log

Python scripts used to reproduce this problem:
base_model.txt
main.txt
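
One way to quantify the inconsistency is to compare the final checkpoints from the two nodes parameter by parameter. The sketch below is only an illustration: it assumes the nodes are expected to hold the same weights after the final merge (not confirmed in this thread), and that each checkpoint's parameters have already been flattened to plain lists of floats, for example with `tensor.flatten().tolist()` in PyTorch. The names `compare_state_dicts`, `models_match`, and the parameter names in the example are hypothetical and do not come from the attached scripts.

```python
import math


def compare_state_dicts(a, b):
    """Return the max absolute difference per parameter between two models.

    `a` and `b` map parameter names to flat lists of floats (assumed format).
    A value of None means the parameter exists on only one node;
    math.inf means the two nodes disagree on the parameter's size.
    """
    report = {}
    for name in sorted(set(a) | set(b)):
        if name not in a or name not in b:
            report[name] = None
            continue
        if len(a[name]) != len(b[name]):
            report[name] = math.inf
            continue
        report[name] = max(
            (abs(x - y) for x, y in zip(a[name], b[name])), default=0.0
        )
    return report


def models_match(a, b, tol=1e-6):
    """True if every parameter is present on both nodes and within `tol`."""
    return all(
        diff is not None and diff <= tol
        for diff in compare_state_dicts(a, b).values()
    )


# Hypothetical example values standing in for the two nodes' final weights.
node1 = {"fc.weight": [0.10, -0.20], "fc.bias": [0.05]}
node2 = {"fc.weight": [0.10, -0.25], "fc.bias": [0.05]}
report = compare_state_dicts(node1, node2)
```

If `models_match` returns False for the last checkpoints, the per-parameter report shows which layers diverge, which may help separate a merge problem from a checkpoint-selection problem.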

Swarm Learning Version:

2.2.0

  • Find the docker tag of the Swarm images ( $ docker images | grep hub.myenterpriselicense.hpe.com/hpe_eval/swarm-learning )
    2.2.0

OS and ML Platform

  • details of host OS:
  • details of ML platform used: PyTorch Lightning
  • details of Swarm learning Cluster (Number of machines, SL nodes, SN nodes):
    2 machines, 2 sl-ml node pairs

Quick Checklist: Respond [Yes/No]

  • APLS server web GUI shows available Licenses?
  • If Multiple systems are used, can each system access every other system?
  • Is Password-less SSH configuration setup for all the systems?
  • If GPU or other protected resources are used, does the account have sufficient privileges to access and use them?
  • Is the user id a member of the docker group?

Additional notes

  • Are you running documented example without any modification?
  • Add any additional information about use case or any notes which supports for issue investigation:

NOTE: Create an archive with supporting artifacts and attach to issue, whenever applicable.

@Ultimate-Storm Ultimate-Storm changed the title Unconsistant final models trained by Swarm Learning Inconsistant final models trained by Swarm Learning Mar 5, 2024
@htjain
Collaborator

htjain commented Mar 13, 2024

@Ultimate-Storm Can you please re-upload the logs? I am getting a 404 when downloading the attached logs.
