ZMQ port hardcoded and not editable via the GUI #1774

Closed
2 tasks
shrivaths16 opened this issue May 15, 2024 · 2 comments · Fixed by #1780

Comments

@shrivaths16
Contributor

[WIP]

Currently there is no option to choose the ZMQ ports via the GUI. They are hardcoded to tcp://127.0.0.1:9000 for the controller address and tcp://127.0.0.1:9001 for the publish address. This causes problems when multiple SLEAP applications are open and training at the same time, since they all try to bind the same ports, leading to "ZMQError: Address already in use" as mentioned in discussion #1751.

To solve this, we need to make the following changes:

  • Update the frontend loss viewer GUI, which has the ports hardcoded here
  • Update controller_address and publish_address in the ZMQ section of the training job config via the training editor GUI (just ask users to specify ports and assume that the base of the address is always tcp://127.0.0.1)
@talmo
Collaborator

talmo commented May 17, 2024

The ZMQConfig is used here to set up the TrainingControllerZMQ and ProgressReporterZMQ:

sleap/sleap/nn/training.py

Lines 396 to 412 in 18aad91

def setup_zmq_callbacks(zmq_config: ZMQConfig) -> List[tf.keras.callbacks.Callback]:
    """Set up ZeroMQ callbacks from config."""
    callbacks = []
    if zmq_config.subscribe_to_controller:
        callbacks.append(
            TrainingControllerZMQ(
                address=zmq_config.controller_address,
                poll_timeout=zmq_config.controller_polling_timeout,
            )
        )
        logger.info(f" ZMQ controller subcribed to: {zmq_config.controller_address}")
    if zmq_config.publish_updates:
        callbacks.append(ProgressReporterZMQ(address=zmq_config.publish_address))
        logger.info(f" ZMQ progress reporter publish on: {zmq_config.publish_address}")
    return callbacks

Which is called from setup_output_callbacks here:

callbacks.extend(setup_zmq_callbacks(config.zmq))

Which is called from Trainer._setup_outputs here:

sleap/sleap/nn/training.py

Lines 851 to 853 in 18aad91

self.output_callbacks = setup_output_callbacks(
    self.config.outputs, run_path=self.run_path
)

It uses the TrainingJobConfig to specify the ZMQ address/port, which derives from the loaded config file.

When calling from the CLI, we do already overwrite some parts of the config with the CLI provided options, for example, to enable/disable ZMQ entirely:

sleap/sleap/nn/training.py

Lines 1924 to 1928 in 18aad91

# Override config settings for CLI-based training.
job_config.outputs.save_outputs = True
job_config.outputs.tensorboard.write_logs |= args.tensorboard
job_config.outputs.zmq.publish_updates |= args.zmq
job_config.outputs.zmq.subscribe_to_controller |= args.zmq

Next to this block, we should also support specifying the ZMQ port explicitly and overwriting the appropriate config fields:

job_config.outputs.zmq.controller_address
job_config.outputs.zmq.publish_address
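
A minimal sketch of what such an override could look like. The `--controller_port` and `--publish_port` flags and the `apply_zmq_port_overrides` helper are hypothetical names, not part of the existing CLI, and `ZMQConfigStub` is only a stand-in for the real `ZMQConfig` so that the example runs on its own. It assumes the base of the address is always tcp://127.0.0.1, as proposed in the issue description.

```python
import argparse
from dataclasses import dataclass
from typing import Optional

BASE_ADDRESS = "tcp://127.0.0.1"  # assumed fixed base, per the issue description


@dataclass
class ZMQConfigStub:
    """Stand-in for the two address fields of the real ZMQConfig."""

    controller_address: str = "tcp://127.0.0.1:9000"
    publish_address: str = "tcp://127.0.0.1:9001"


def apply_zmq_port_overrides(
    zmq_config: ZMQConfigStub,
    controller_port: Optional[int] = None,
    publish_port: Optional[int] = None,
) -> ZMQConfigStub:
    """Overwrite the address fields only for the ports that were provided."""
    if controller_port is not None:
        zmq_config.controller_address = f"{BASE_ADDRESS}:{controller_port}"
    if publish_port is not None:
        zmq_config.publish_address = f"{BASE_ADDRESS}:{publish_port}"
    return zmq_config


# Hypothetical CLI flags; the real argument names may differ.
parser = argparse.ArgumentParser()
parser.add_argument("--controller_port", type=int, default=None)
parser.add_argument("--publish_port", type=int, default=None)

# Simulate a command line that only overrides the controller port.
args = parser.parse_args(["--controller_port", "9100"])
zmq_config = apply_zmq_port_overrides(
    ZMQConfigStub(), args.controller_port, args.publish_port
)
```

Leaving a flag unset keeps whatever address came from the loaded config file, which mirrors how the existing `|=` overrides only ever add to the config.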

@talmo
Collaborator

talmo commented May 17, 2024

Another nice option could be to try to automatically detect a free port using Socket.bind_to_random_port().

We still need to know what the port is in order to pass it to the backend, so just calling this by itself wouldn't work, but we could use it to write a utility function to discover a free port, e.g.:

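import zmq

def find_free_port():
    """Bind a throwaway socket to a random free port and return the port number."""
    ctx = zmq.Context.instance()
    socket = ctx.socket(zmq.REP)  # a socket type is required; any bindable type works
    port = socket.bind_to_random_port("tcp://127.0.0.1")
    socket.close()  # release the port so the real socket can bind to it
    return port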
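
Since both a controller port and a publish port are needed, the same idea can be sketched with only the standard library so that two distinct ports are reserved at once. This is an illustrative alternative, not what the linked pull request necessarily does; `find_free_ports` is a hypothetical helper name. Note that any approach of this kind has a small race window: another process could claim a port between the moment it is released here and the moment the backend binds it.

```python
import socket
from typing import List


def find_free_ports(count: int = 2, host: str = "127.0.0.1") -> List[int]:
    """Return `count` distinct TCP ports that were free at the time of the call."""
    sockets = []
    try:
        for _ in range(count):
            s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            s.bind((host, 0))  # port 0 asks the OS for any free port
            sockets.append(s)
        # All sockets are still bound at this point, so the ports are distinct.
        return [s.getsockname()[1] for s in sockets]
    finally:
        # Release the ports so the training backend can bind them.
        for s in sockets:
            s.close()


controller_port, publish_port = find_free_ports(2)
controller_address = f"tcp://127.0.0.1:{controller_port}"
publish_address = f"tcp://127.0.0.1:{publish_port}"
```

The resulting addresses could then be written into `controller_address` and `publish_address` before the training process is launched.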

@shrivaths16 shrivaths16 linked a pull request May 23, 2024 that will close this issue
11 tasks