Docker container hello world error socket name resolution [BUG] #2377

Open
rachelglenn opened this issue Feb 23, 2024 · 3 comments

@rachelglenn

I am trying to run the NVFlare examples inside a Docker container. I built the image from an edited copy of the Dockerfile provided on the main branch of NVFlare:

https://github.com/NVIDIA/NVFlare/blob/main/docker/Dockerfile

I built the image and started a container with the command below. I am now trying to run the hello-pt example inside that container.
podman run --rm -it --security-opt label=disable --gpus all -p 8888:8888 --ulimit stack=67108864 --device nvidia.com/gpu=all -v /workspace/:/workspace localhost/nvflare/nvflare /bin/bash

nvflare simulator -w /tmp/nvflare/test -n 2 -t 2 /workspace/NVFlare_example/jobs/hello-pt

2024-02-23 12:22:37,484 - SimulatorRunner - INFO - Create the Simulator Server.
2024-02-23 12:22:37,486 - CoreCell - INFO - server: creating listener on tcp://0:38967
2024-02-23 12:22:37,510 - CoreCell - INFO - server: created backbone external listener for tcp://0:38967
2024-02-23 12:22:37,510 - ConnectorManager - INFO - 66: Try start_listener Listener resources: {'secure': False, 'host': 'localhost'}
2024-02-23 12:22:37,511 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connector [CH00002 PASSIVE tcp://0:51937] is starting
2024-02-23 12:22:38,012 - CoreCell - INFO - server: created backbone internal listener for tcp://localhost:51937
2024-02-23 12:22:38,013 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connector [CH00001 PASSIVE tcp://0:38967] is starting
2024-02-23 12:22:38,092 - nvflare.fuel.hci.server.hci - INFO - Starting Admin Server localhost on Port 53769
2024-02-23 12:22:38,092 - SimulatorRunner - INFO - Deploy the Apps.
2024-02-23 12:22:38,101 - SimulatorRunner - INFO - Create the simulate clients.
2024-02-23 12:22:38,105 - ClientManager - INFO - Client: New client site-1@10.0.2.100 joined. Sent token: 30d9d61e-a6d2-4892-8706-11ed31417cb7.  Total clients: 1
2024-02-23 12:22:38,105 - FederatedClient - INFO - Successfully registered client:site-1 for project simulator_server. Token:30d9d61e-a6d2-4892-8706-11ed31417cb7 SSID:
2024-02-23 12:22:38,106 - ClientManager - INFO - Client: New client site-2@10.0.2.100 joined. Sent token: 826ff296-60ba-4b85-91e3-8fec007dcf20.  Total clients: 2
2024-02-23 12:22:38,106 - FederatedClient - INFO - Successfully registered client:site-2 for project simulator_server. Token:826ff296-60ba-4b85-91e3-8fec007dcf20 SSID:
2024-02-23 12:22:38,106 - SimulatorRunner - INFO - Set the client status ready.
2024-02-23 12:22:38,106 - SimulatorRunner - INFO - Deploy and start the Server App.
2024-02-23 12:22:38,107 - Cell - INFO - Register blob CB for channel='server_command', topic='*'
2024-02-23 12:22:38,108 - Cell - INFO - Register blob CB for channel='aux_communication', topic='*'
2024-02-23 12:22:38,108 - ServerCommandAgent - INFO - ServerCommandAgent cell register_request_cb: server.simulate_job
2024-02-23 12:22:40,378 - matplotlib.font_manager - INFO - generated new fontManager
2024-02-23 12:22:41,672 - ServerRunner - INFO - [identity=simulator_server, run=simulate_job]: Server runner starting ...
2024-02-23 12:22:41,673 - ServerRunner - INFO - [identity=simulator_server, run=simulate_job]: starting workflow pre_train (<class 'nvflare.app_common.workflows.initialize_global_weights.InitializeGlobalWeights'>) ...
2024-02-23 12:22:41,673 - InitializeGlobalWeights - INFO - [identity=simulator_server, run=simulate_job, wf=pre_train]: Initializing BroadcastAndProcess.
2024-02-23 12:22:41,673 - ServerRunner - INFO - [identity=simulator_server, run=simulate_job, wf=pre_train]: Workflow pre_train (<class 'nvflare.app_common.workflows.initialize_global_weights.InitializeGlobalWeights'>) started
2024-02-23 12:22:41,674 - InitializeGlobalWeights - INFO - [identity=simulator_server, run=simulate_job, wf=pre_train]: scheduled task get_weights
2024-02-23 12:22:42,112 - SimulatorClientRunner - INFO - Start the clients run simulation.
2024-02-23 12:22:43,114 - SimulatorClientRunner - INFO - Simulate Run client: site-1 on GPU group: None
2024-02-23 12:22:43,114 - SimulatorClientRunner - INFO - Simulate Run client: site-2 on GPU group: None
2024-02-23 12:22:44,138 - ClientTaskWorker - INFO - ClientTaskWorker started to run
2024-02-23 12:22:44,145 - ClientTaskWorker - INFO - ClientTaskWorker started to run
2024-02-23 12:22:44,193 - CoreCell - INFO - site-1.simulate_job: created backbone external connector to tcp://localhost:38967
2024-02-23 12:22:44,194 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connector [CH00001 ACTIVE tcp://localhost:38967] is starting
2024-02-23 12:22:44,194 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connection [CN00002 127.0.0.1:46358 => 127.0.0.1:38967] is created: PID: 89
2024-02-23 12:22:44,195 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connection [CN00005 127.0.0.1:38967 <= 127.0.0.1:46358] is created: PID: 66
2024-02-23 12:22:44,200 - CoreCell - INFO - site-2.simulate_job: created backbone external connector to tcp://localhost:38967
2024-02-23 12:22:44,200 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connector [CH00001 ACTIVE tcp://localhost:38967] is starting
2024-02-23 12:22:44,201 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connection [CN00002 127.0.0.1:46370 => 127.0.0.1:38967] is created: PID: 90
2024-02-23 12:22:44,201 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connection [CN00006 127.0.0.1:38967 <= 127.0.0.1:46370] is created: PID: 66
2024-02-23 12:22:47,375 - JsonScanner - ERROR - Traceback (most recent call last):
  File "/usr/local/lib/python3.8/urllib/request.py", line 1354, in do_open
    h.request(req.get_method(), req.selector, req.data, headers,
  File "/usr/local/lib/python3.8/http/client.py", line 1256, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/local/lib/python3.8/http/client.py", line 1302, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/local/lib/python3.8/http/client.py", line 1251, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/local/lib/python3.8/http/client.py", line 1011, in _send_output
    self.send(msg)
  File "/usr/local/lib/python3.8/http/client.py", line 951, in send
    self.connect()
  File "/usr/local/lib/python3.8/http/client.py", line 1418, in connect
    super().connect()
  File "/usr/local/lib/python3.8/http/client.py", line 922, in connect
    self.sock = self._create_connection(
  File "/usr/local/lib/python3.8/socket.py", line 787, in create_connection
    for res in getaddrinfo(host, port, 0, SOCK_STREAM):
  File "/usr/local/lib/python3.8/socket.py", line 918, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -3] Temporary failure in name resolution
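
The last frames of the traceback show the failure happening in `getaddrinfo`, that is, while resolving a hostname, before any HTTP request is sent. A minimal sketch for checking name resolution from inside the container, using only the standard library. The hostnames in the comment are assumptions, since the log does not say which host was being contacted:

```python
import socket


def can_resolve(host: str, port: int = 443) -> bool:
    """Return True if the system resolver finds at least one address for host."""
    try:
        # Same call that fails in the traceback (socket.getaddrinfo).
        results = socket.getaddrinfo(host, port, 0, socket.SOCK_STREAM)
        return len(results) > 0
    except socket.gaierror:
        # Raised for errors such as "Temporary failure in name resolution".
        return False


# Example use inside the container (hostnames are assumptions, not from the log):
#   can_resolve("localhost")
#   can_resolve("www.cs.toronto.edu")
```

If a public hostname fails to resolve while a numeric address such as `127.0.0.1` succeeds, the problem is likely in the container's DNS or network configuration rather than in NVFlare itself.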


@rachelglenn added the bug (Something isn't working) label on Feb 23, 2024
@YuanTingHsieh
Collaborator

Hi @rachelglenn, thanks for your interest!

Did you run prepare_data.sh first? (bash ./prepare_data.sh)

If your Docker container can't connect to the outside network, you can download the data before you start the container and then mount the data directory into it.

Be sure to modify the data root in https://github.com/NVIDIA/NVFlare/blob/main/examples/hello-world/hello-pt/jobs/hello-pt/app/custom/cifar10trainer.py#L40
and https://github.com/NVIDIA/NVFlare/blob/main/examples/hello-world/hello-pt/jobs/hello-pt/app/custom/cifar10validator.py#L31
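
A rough sketch of the pre-download step, to be run on a machine with network access before starting the container. It assumes the example loads CIFAR-10 through torchvision, which extracts the data into a `cifar-10-batches-py` folder under the chosen root. The helper names and the `/workspace/data` path are hypothetical; use whatever directory you mount and set as the data root in the two files above:

```python
import os

# Folder that torchvision's CIFAR10 dataset extracts into under its root.
CIFAR10_FOLDER = "cifar-10-batches-py"


def cifar10_is_cached(data_root: str) -> bool:
    """Return True if an extracted CIFAR-10 folder already exists under data_root."""
    return os.path.isdir(os.path.join(data_root, CIFAR10_FOLDER))


def ensure_cifar10(data_root: str) -> None:
    """Download CIFAR-10 into data_root unless it is already there (hypothetical helper)."""
    if cifar10_is_cached(data_root):
        return
    # Imported lazily so the cache check works without torchvision installed.
    from torchvision.datasets import CIFAR10

    CIFAR10(root=data_root, train=True, download=True)
    CIFAR10(root=data_root, train=False, download=True)


# Example (path is an assumption): ensure_cifar10("/workspace/data")
```

With the data already in the mounted directory and the data root pointing at it, the trainer and validator should not need to reach the network for the dataset.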

I would actually suggest you go through these examples first: https://github.com/NVIDIA/NVFlare/tree/main/examples/hello-world/ml-to-fl/pt

@chesterxgchen
Collaborator

@IsaacYangSLA can you help with some insight?

@YuanTingHsieh
Collaborator

@rachelglenn Can you also try running the hello-numpy-sag example inside the container and see whether it works?
