Missing logits in Executor API when using return_generation_logits #1569

Open · 2 of 4 tasks · 2 comments
AlessioNetti opened this issue May 10, 2024
Labels: bug (Something isn't working), triaged (Issue has been triaged by maintainers)

AlessioNetti commented May 10, 2024

System Info

  • NVIDIA A40
  • CUDA 12.2
  • TensorRT 10.0.1.6
  • TensorRT-LLM 0.10.0.dev2024050700

Who can help?

@byshiue

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

When requesting generation logits via the Executor API's Python bindings, in certain cases one entire generation step is missing from the logits tensor. As a result, all subsequent generation steps in the logits tensor are shifted by one and no longer match the generated tokens.

This appears to happen specifically when the generation loop terminates early because end_id is reached; if a custom stop sequence is encountered, or if the maximum number of new tokens is reached, the returned logits are correct.
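For reference, a minimal sketch of the two request configurations involved (parameter names as in the Executor bindings used in the full diff further below; token values match the report):

```python
import tensorrt_llm.bindings.executor as trtllm

cfg = trtllm.OutputConfig(return_generation_logits=True)

# Generation stops early once end_id (25) is produced: the returned
# generation logits are missing one step (the bug reported here).
req_early_stop = trtllm.Request(input_token_ids=[100, 20, 3, 18],
                                max_new_tokens=20,
                                end_id=25,
                                output_config=cfg)

# No explicit end_id (as in the unmodified example): generation runs for
# max_new_tokens steps and the returned logits are correct.
req_full_length = trtllm.Request(input_token_ids=[1, 2, 3, 4],
                                 max_new_tokens=10,
                                 output_config=cfg)
```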

The problem can be reproduced with the steps below:

```
python convert_checkpoint.py --model_dir ./falcon_7b_tp1_instruct/ --dtype bfloat16 --output_dir ./falcon_7b_tp1_instruct_trt_chkpt

trtllm-build --checkpoint_dir ./falcon_7b_tp1_instruct_trt_chkpt/ --gemm_plugin bfloat16 --remove_input_padding enable --gpt_attention_plugin bfloat16 --output_dir ./falcon_7b_tp1_instruct_p200_g200 --gather_all_token_logits --max_input_len 200 --max_output_len 200 --max_batch_size 64

python example_basic.py --model_path ./falcon_7b_tp1_instruct_p200_g200
```

The examples/bindings/executor/example_basic.py script was modified to issue a request that triggers the issue, and to print the argmax of the logits at each generation step. The modification is shown in the diff below:

```diff
diff --git a/examples/bindings/executor/example_basic.py b/examples/bindings/executor/example_basic.py
index 2c7a3fc..3f1991f 100644
--- a/examples/bindings/executor/example_basic.py
+++ b/examples/bindings/executor/example_basic.py
@@ -1,4 +1,5 @@
 import argparse
+import torch
 
 import tensorrt_llm.bindings.executor as trtllm
 
@@ -21,8 +22,10 @@ if __name__ == "__main__":
 
     if executor.can_enqueue_requests():
         # Create the request.
-        request = trtllm.Request(input_token_ids=[1, 2, 3, 4],
-                                 max_new_tokens=10)
+        request = trtllm.Request(input_token_ids=[100, 20, 3, 18],
+                                 max_new_tokens=20,
+                                 end_id=25,
+                                 output_config=trtllm.OutputConfig(return_generation_logits=True))
 
         # Enqueue the request.
         request_id = executor.enqueue_request(request)
@@ -30,6 +33,9 @@ if __name__ == "__main__":
         # Wait for the new tokens.
         responses = executor.await_responses(request_id)
         output_tokens = responses[0].result.output_token_ids
+        output_top_tokens = torch.argmax(responses[0].result.generation_logits[0], dim=1).tolist()
+
 
         # Print tokens.
-        print(output_tokens)
+        print(f"Output tokens:   {output_tokens[0][4:]}")
+        print(f"Logits arg-max:  {output_top_tokens}")
```

Expected behavior

Since we are using top_k=1 and are not sampling tokens, we expect the argmax of the logits at each generation step to match exactly the tokens returned for the request.
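A minimal sketch of this check, reusing responses and output_tokens from the modified script above (the prompt length of 4 and beam index 0 match that script; generation_logits[0] appears to be a [max_new_tokens, vocab_size] tensor for beam 0):

```python
import torch

# Tokens produced during generation (skip the 4 prompt tokens, beam 0).
generated = output_tokens[0][4:]

# Per-step argmax over the vocabulary of the generation logits for beam 0.
argmax = torch.argmax(responses[0].result.generation_logits[0], dim=1).tolist()

# With greedy decoding (top_k=1), the first len(generated) argmax entries
# should equal the generated tokens, step for step.
assert argmax[:len(generated)] == generated
```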

Actual behavior

The generated tokens and the argmax of the logits do not match, and the latter is missing one entire generation step:

```
Output tokens:   [94, 241, 914, 818, 271, 577, 402, 2862, 271, 1730, 544, 248, 1079, 1111, 612]
Logits arg-max:  [94, 241, 914, 818, 271, 577, 402, 2862, 1730, 544, 248, 1079, 1111, 612, 25, 0, 0, 0, 0, 0]
```

Notice how the second occurrence of token 271 (generation step 8, zero-indexed) is missing from the logits argmax sequence, and how all subsequent entries are shifted by one. The 25 near the end of the argmax sequence is the end_id that terminated generation, and the trailing zeros appear to be padding for the unused generation steps.
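For illustration, a small sketch that locates the divergence between the two sequences printed above:

```python
# The two sequences from the output above.
tokens = [94, 241, 914, 818, 271, 577, 402, 2862, 271,
          1730, 544, 248, 1079, 1111, 612]
argmax = [94, 241, 914, 818, 271, 577, 402, 2862, 1730,
          544, 248, 1079, 1111, 612, 25, 0, 0, 0, 0, 0]

# First generation step at which the two sequences disagree.
first_diff = next(i for i, (t, a) in enumerate(zip(tokens, argmax)) if t != a)
print(first_diff)  # 8 -> the second occurrence of 271 is absent from argmax

# From the missing step onward, the sequences line up again once shifted by one.
print(tokens[first_diff + 1:] == argmax[first_diff:len(tokens) - 1])  # True
```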

Additional notes

The issue was observed on all TensorRT-LLM 0.10 dev versions, up to 0.10.0.dev2024050700.

AlessioNetti added the bug label on May 10, 2024
byshiue added the triaged label on May 14, 2024
trevor-m (Collaborator) commented
Thanks for filing this issue @AlessioNetti, I was able to reproduce the bug. Taking a look now.

trevor-m (Collaborator) commented May 17, 2024

I think I found the issue. We should be able to get the fix in soon.
