New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Interaction of timeouts, ensemble scheduler and oldest sequence scheduler causes state leakage #7117
Comments
Okay have managed to create a reproduction independent of our setup. Firstly you need to define a torch model like so
Then create two models with identical model configs
And join them together with this ensemble model
Launch a triton server and target it with this script: https://gist.github.com/jamied157/a047d6314d83499c9811a642a895ad2a This launches 3 clients against the server, two of them send 0's and one sends Then
In the original terminal you should eventually get a message like
In the logs you should also be able to see a line like
This shows the state for one of the zero sequences has been contaminated with nans. |
I also have logs from |
Quick update on this, we think we've found the issue. We're seeing multiple calls to This results in duplicate sequence slots being added to |
@Tabrizian I have found the root cause and it isn't just affecting ensemble models. triton-inference-server/core#341. but potentially all sequence models that use implicit state and are accessed with the gRPC client |
Thanks for proposing a fix @HennerM and filing a detailed a GitHub issue @jamied157. We'll take a look at this and get back to you. |
Just to add a bit more detail:
|
Hi @jamied157, Thanks for such detailed repro steps and investigation! I had a quick follow-up question from your description so far. You mentioned that you can reproduce this without request cancellation:
But it looks your repro steps involve triggering a request cancellation by closing the client connection from the original grpc client script:
Just to clarify - is the reproducible behavior the same prior to 23.10? Or are there different symptoms and triggers? |
Another question @jamied157 @HennerM - given Henner's proposed fix for the sequence batcher issue for a standalone sequence model (no ensemble), and:
Do you have any kind of ensemble repro for an ensemble of just 1 model, rather than a pipeline of 2 for the sake of something more minimal? |
I have repro Python script which should work if you have pytorch and tritonserver installed: https://gist.github.com/HennerM/6ae28bc0360aae96bc5feb422aa1375c or if you already have the model repository setup that Jamie mentioned above then you should be able to run a variation of this: import asyncio
import functools
import gc
import os
from typing import Optional
import numpy as np
from tritonclient.grpc import InferenceServerClient as SyncInferenceServerClient
from tritonclient.grpc.aio import InferenceServerClient, InferInput, InferResult
def create_input(seq_id: int):
input = [InferInput("INPUT__0", [1, FIXED_LAST_DIM], "INT64")]
input[0].set_data_from_numpy( np.full([1, FIXED_LAST_DIM], fill_value=seq_id, dtype=np.int64))
return input
def check_result(seq_id: int, res: asyncio.Future[Optional[InferResult]]):
try:
r = res.result()
assert r is not None
np_res = r.as_numpy("OUTPUT__0")
assert np_res is not None
if not np.all(np_res % seq_id == 0):
print(f"====CROSSOVER==== {np_res=} for {seq_id=}, triton_result", flush=True)
sys.exit(1)
except asyncio.CancelledError:
pass
async def continuous_inference(
client: InferenceServerClient,
model,
offset: int,
candidates=1,
max_requests=10,
time_between_requests=MAX_IDLE_SECONDS * 0.5,
end_on_cancel: bool = True,
):
for seq_id in range(offset, offset + candidates):
print(f"Starting stream {seq_id}")
t = asyncio.create_task(
client.infer(
model,
create_input(seq_id),
sequence_start=True,
sequence_id=seq_id,
request_id=f"{seq_id}_init",
)
)
t.add_done_callback(functools.partial(check_result, seq_id))
await asyncio.sleep(time_between_requests)
for req_id in range(max_requests):
for seq_id in range(offset, offset + candidates):
print(f"Sending request {req_id} for {seq_id}")
t = asyncio.create_task(
client.infer(
model,
create_input(seq_id),
sequence_id=seq_id,
request_id=f"{seq_id}_{req_id}",
sequence_end=req_id==max_requests-1,
)
)
t.add_done_callback(functools.partial(check_result, seq_id))
await asyncio.sleep(time_between_requests)
TARGET_MODEL = "simple_model"
async def simple_cancellation():
print("Wait for Triton to be ready")
await asyncio.sleep(2)
client = InferenceServerClient(TRITON_URL)
streams_1 = asyncio.create_task(
continuous_inference(client, TARGET_MODEL, 100, candidates=1, max_requests=5, time_between_requests=0.1)
)
await asyncio.sleep(0.3)
print("Cancelling stream")
streams_1.cancel()
await client.close()
del client
gc.collect()
print("Wait for timeout")
await asyncio.sleep(delay=MAX_IDLE_SECONDS + 2)
bob = InferenceServerClient(TRITON_URL)
print("Starting new streams")
await asyncio.create_task(
continuous_inference(bob, TARGET_MODEL, offset=200, candidates=4, max_requests=5, time_between_requests=1)
) |
We only have reproduced similar behaviour, that is multiple sequences getting assigned the same slot and state, in our internal production Triton setup, and didn't manage yet to get this into a min-repro. |
Thanks @HennerM, appreciate the diligence here! So the current status is:
Is that right? In the meantime, I can help take a look into (2) to see if I can find anything that looks logically related, but a reproducer would greatly help expedite that. |
Thanks a lot. Definitely appreciate the help here. on (2) I need to clarify that we actually haven’t noticed the wrong behaviour with the patch above yet, it’s just that the patch deals with a problem with cancellation so we have a belief that this isn’t quite fixing the problem. for us the sequence batcher code, potentially couple with ensembles is quite complex and there are a lot of edge cases that can happen in that combination. We just want to be double sure that no edge cases are missed. |
Hi @HennerM, we're looking into this single-model reproducer you shared above, but have a few questions.
< import sys
12,15d10
< MAX_IDLE_SECONDS=10
< TRITON_URL="localhost:8001"
< FIXED_LAST_DIM=5
<
95,97d89
<
< if __name__ == "__main__":
< asyncio.run(simple_cancellation())
print(f"====CROSSOVER==== {np_res=} for {seq_id=}, triton_result", flush=True) Having this simpler single-model reproducer that does cancellation in-line would be really helpful for further debugging, so any extra details you can share or step-by-step instructions with complete scripts to get it working would be very helpful! |
Sorry I am on UK hours, here is an updated script: https://gist.github.com/HennerM/cde5037442a2efa4238246cc5ccb7b51 I think I didn't leave enough time for the triton server to start, let me know if you have more questions |
Thanks @HennerM! I was able to reproduce the crossover with your updated gist. Will reach out if we have more questions. |
Description
We're still in progress in diagnosing this issue but we're seeing some strange behaviour and would like to get some feedback/ideas from others.
We're using a setup that relies on two separate models using the oldest sequence scheduler combined with the ensemble model. Essentially
Input -> Model_1 -> Model_2 -> Output
Where model 1 and model 2 use implicit state management with the PyTorch backend, they have an initial state but then use the output from the model to update the state.
Our issue tends to arise when the triton is under load, if we use up all the candidate sequence slots - then we don't send END flags on any of them and instead let triton time them out. We then see that Model_2 is in a bad state, in this state, if we run requests against model 2, it's possible to leak information across requests and corrupt the outputs of the model. It tends to look something like:
Here we're displaying the outputs of the model, we've corrupted one batch with
NaN
values and used random inputs for others - however the NaN values leak into the next request sent to the server (here the batches are run sequentially so leaking happens across executions rather than within one)Given all this, it feels like something isn't quite working correctly with the interaction between the ensemble, the sequence timeout and implicit state management.
Triton Information
What version of Triton are you using?
We've reproed on NGC version 22.12, 23.10, 24.02
Are you using the Triton container or did you build it yourself?
We've reproed on containers based on the triton container and on versions built by us.
To Reproduce
We're working on this, currently we only can reproduce with our own models and configuration which we can't share. Hopefully we can find a repro with some simpler models. The basic setup is two sequential models using implicit state management with the PyTorch backend - connected together with the ensemble scheduler. The you need to timeout some/all of the candidate sequences to get
Model_2
into a bad state. OnceModel_2
is in this state you can fairly reliably see the state corruption reported above.We still can't reliably reproduce this state however so we can't provide much more info.
Expected behavior
We shouldn't be able to affect the outputs of the model with the inputs from a different client.
The text was updated successfully, but these errors were encountered: