Question

The code in launch_triton_server.py:

def get_cmd(world_size, tritonserver, grpc_port, http_port, metrics_port,
            model_repo, log, log_file, tensorrt_llm_model_name):
    cmd = ['mpirun', '--allow-run-as-root']
    for i in range(world_size):
        cmd += ['-n', '1', tritonserver, f'--model-repository={model_repo}']
        if log and (i == 0):
            cmd += ['--log-verbose=3', f'--log-file={log_file}']
        # If rank is not 0, skip loading of models other than `tensorrt_llm_model_name`
        if (i != 0):
            cmd += ['--model-control-mode=explicit']
            model_names = tensorrt_llm_model_name.split(',')
            for name in model_names:
                cmd += [f'--load-model={name}']
        cmd += [
            f'--grpc-port={grpc_port}', f'--http-port={http_port}',
            f'--metrics-port={metrics_port}', '--disable-auto-complete-config',
            f'--backend-config=python,shm-region-prefix-name=prefix{i}_', ':'
        ]
    return cmd
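To make the port sharing concrete, here is a small sketch that reproduces get_cmd from the snippet above and prints the command it builds for world_size = 2. The port numbers, model-repository path, and model name are hypothetical placeholders, not values from the original post:

```python
# Sketch: inspect the mpirun command that get_cmd builds for world_size = 2.
# All argument values below are hypothetical.

def get_cmd(world_size, tritonserver, grpc_port, http_port, metrics_port,
            model_repo, log, log_file, tensorrt_llm_model_name):
    cmd = ['mpirun', '--allow-run-as-root']
    for i in range(world_size):
        cmd += ['-n', '1', tritonserver, f'--model-repository={model_repo}']
        if log and (i == 0):
            cmd += ['--log-verbose=3', f'--log-file={log_file}']
        if i != 0:
            # Non-zero ranks only load the named model(s) explicitly.
            cmd += ['--model-control-mode=explicit']
            for name in tensorrt_llm_model_name.split(','):
                cmd += [f'--load-model={name}']
        cmd += [
            f'--grpc-port={grpc_port}', f'--http-port={http_port}',
            f'--metrics-port={metrics_port}', '--disable-auto-complete-config',
            f'--backend-config=python,shm-region-prefix-name=prefix{i}_',
            ':'  # mpirun's MPMD separator between per-rank app contexts
        ]
    return cmd

cmd = get_cmd(2, 'tritonserver', 8001, 8000, 8002,
              '/models', False, None, 'tensorrt_llm')
print(' '.join(cmd))
```

Note that both per-rank app contexts receive the same `--grpc-port=8001`, which is exactly the situation the question is about: every rank is handed identical port flags, separated by mpirun's `:` delimiter.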
When world_size = 2, for example, two Triton servers will be launched using the same gRPC port (e.g., 8001). But how could this be possible?

When I tried to do something similar, I got the following error while launching the second server:
I0513 03:43:28.353306 21205 grpc_server.cc:2466] Started GRPCInferenceService at 0.0.0.0:8001
I0513 03:43:28.353458 21205 http_server.cc:4636] Started HTTPService at 0.0.0.0:8000
E0513 03:43:28.353559006 21206 chttp2_server.cc:1080] UNKNOWN:No address added out of total 1 resolved for '0.0.0.0:8001' {created_time:"2024-05-13T03:43:28.353510541+00:00", children:[UNKNOWN:Failed to add any wildcard listeners {created_time:"2024-05-13T03:43:28.353503146+00:00", children:[UNKNOWN:Address family not supported by protocol {target_address:"[::]:8001", syscall:"socket", os_error:"Address family not supported by protocol", errno:97, created_time:"2024-05-13T03:43:28.353465612+00:00"}, UNKNOWN:Unable to configure socket {fd:6, created_time:"2024-05-13T03:43:28.353493367+00:00", children:[UNKNOWN:Address already in use {syscall:"bind", os_error:"Address already in use", errno:98, created_time:"2024-05-13T03:43:28.353488259+00:00"}]}]}]}
E0513 03:43:28.353650 21206 main.cc:245] failed to start GRPC service: Unavailable - Socket '0.0.0.0:8001' already in use
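The errno 98 ("Address already in use", EADDRINUSE on Linux) buried in that log is the generic result of two processes trying to bind the same TCP port. A minimal, Triton-independent sketch of the same failure:

```python
# Minimal reproduction of the "Address already in use" bind error from the
# log above: a second plain TCP socket cannot bind a port that a first
# socket is already listening on.
import errno
import socket

s1 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s1.bind(('127.0.0.1', 0))         # let the OS pick a free port
port = s1.getsockname()[1]
s1.listen()

s2 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
bind_errno = None
try:
    s2.bind(('127.0.0.1', port))  # second bind -> EADDRINUSE
except OSError as e:
    bind_errno = e.errno
finally:
    s2.close()
    s1.close()

print(bind_errno == errno.EADDRINUSE)
```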
Background
I've been developing my Triton backend based on https://github.com/triton-inference-server/tensorrtllm_backend.
I have already built two engines (tensor parallel, tp_size = 2) of the llama2-7b model.
It works to run something like

mpirun -np 2 python3.8 run.py

to load the two engines, run tensor-parallel inference, and get the correct results. My goal now is to serve the same two engines through the Triton server.
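For context, the per-rank pattern behind such an mpirun launch can be sketched without mpi4py: Open MPI exports OMPI_COMM_WORLD_RANK to each launched process, and each rank picks its own engine shard. The engine-path layout below is purely hypothetical, not taken from the original post:

```python
import os

# Hedged sketch: each process started by `mpirun -np 2 python3.8 run.py`
# reads its rank from OMPI_COMM_WORLD_RANK (set by Open MPI) and selects
# its own tensor-parallel engine shard. Paths are hypothetical.
rank = int(os.environ.get('OMPI_COMM_WORLD_RANK', '0'))  # 0 when run standalone
engine_path = f'/engines/llama2-7b-tp2/rank{rank}.engine'
print(f'rank {rank} loads {engine_path}')
```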
I have already implemented the run.py logic in model.py (the initialize() and execute() functions) of my Python backend. Following launch_triton_server.py, I tried the following command line:
Then I got the error as above.
Could you please tell me what I did wrong and how I can fix the error? Thanks a lot!