Use first bad_words as extra parameters, and implement min-p #1536

Draft · wants to merge 1 commit into base: main
Conversation

@pathorn commented May 2, 2024

An approach for implementing #1154

The user-facing classes in BatchManager and tensorrt_llm::executor::SamplingConfig are not open source (the constructor implementation lives in .a files and will segfault if the class is modified), so we hacked around it by using the integers in the first bad_words entry as extra parameters.

In this case, we use the first integer, reinterpreted as a float, to represent min_p (the default value is 0.0, which matches the 0 padding in bad_words).
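For illustration, here is a minimal sketch of that bit-level round trip using Python's struct module (the variable names are mine, not from the PR):

    import struct

    min_p = 0.125  # example value; 0.125 is exactly representable as a float32
    # Client side: pack the float's raw 32 bits into an int32 so it can ride in bad_words.
    (min_p_int,) = struct.unpack('i', struct.pack('f', min_p))
    # Runtime side: reinterpret the same 32 bits back into a float.
    (recovered,) = struct.unpack('f', struct.pack('i', min_p_int))
    assert recovered == min_p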

I implemented min-p by piggybacking on the existing logprobs calculation in CUDA, so it should add no performance overhead beyond the logprobs calculation itself.
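For reference, min-p keeps only the tokens whose probability is at least min_p times the probability of the most likely token. A rough NumPy sketch of that filtering rule (just the standard min-p definition, not the CUDA kernel in this PR):

    import numpy as np

    def min_p_filter(logits: np.ndarray, min_p: float) -> np.ndarray:
        """Mask out tokens whose probability falls below min_p * max_prob."""
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        keep = probs >= min_p * probs.max()
        # Surviving tokens keep their logits; the rest are excluded from sampling.
        return np.where(keep, logits, -np.inf)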

@pathorn (Author) commented May 2, 2024

Here is some example BLS code for adding the min_p value into the bad_words list in the way this PR expects (an excerpt from a Triton BLS model's preprocessing loop; it assumes numpy as np, struct, pprint, and pb_utils are already imported):

    numpy_tensor = preproc_output_tensor.as_numpy()
    if trtllm_tensor_name == "bad_words_list":
        bad_words_data, bad_words_offsets = numpy_tensor[0]
        opt = np.get_printoptions()
        np.set_printoptions(threshold=np.inf)
        pprint(numpy_tensor)
        min_p = 0.0
        if "min_p" in bls_input_tensors_map:
            minptensor = bls_input_tensors_map["min_p"].as_numpy()
            pprint(minptensor)
            min_p = minptensor[0, 0]
        (min_p_int,) = struct.unpack('i', struct.pack('f', min_p))
        extra_data = np.array([min_p_int], dtype=np.int32)
        if bad_words_offsets[0] == -1:
            # Special case: if no bad_words are passed, numpy_tensor will be [[[0], [-1]]].
            # In this case, we don't want to prepend [0], because that would add a bad-word
            # offset where there otherwise was none.
            bad_words_data = extra_data
            bad_words_offsets = np.array([-1], dtype=np.int32)
        else:
            # Prepend the min_p "word".
            bad_words_data = np.concatenate((extra_data, bad_words_data), axis=0)
            # The offsets array is padded with -1, so we first add one to make the padding
            # all zeros, then trim_zeros and subtract one.
            bad_words_offsets = np.trim_zeros(bad_words_offsets + 1) - 1
            # Then, we prepend an extra 0 element to account for an extra bad_word being added.
            bad_words_offsets = np.concatenate((np.array([0], dtype=np.int32), bad_words_offsets), axis=0)
            # Then, we offset the indices by the length of the newly added data.
            bad_words_offsets = bad_words_offsets + len(extra_data)
            # Finally, we pad this array to make it the same length as bad_words_data.
            bad_words_offsets = np.concatenate(
                (bad_words_offsets,
                 np.array([-1] * (len(bad_words_data) - len(bad_words_offsets)), dtype=np.int32)),
                axis=0)
        numpy_tensor = np.array([[bad_words_data, bad_words_offsets]], dtype=np.int32)
        print("Final:")
        pprint(numpy_tensor)
        np.set_printoptions(**opt)

    trtllm_input_tensors.append(
        pb_utils.Tensor(trtllm_tensor_name, numpy_tensor))
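
To make the transformation concrete, here is a small worked example with made-up token ids (the values below are hypothetical, traced by hand through the snippet above):

    # Two bad-word sequences, [5] and [7, 9], so the preprocessor produces:
    #   numpy_tensor == [[[5, 7, 9], [1, 3, -1]]]
    # With min_p packed into min_p_int, the snippet prepends it as an extra "word":
    #   bad_words_data    -> [min_p_int, 5, 7, 9]
    #   bad_words_offsets -> [1, 2, 4, -1]
    # If no bad words were passed, the input is [[[0], [-1]]] and only the data
    # column is replaced:
    #   bad_words_data    -> [min_p_int]
    #   bad_words_offsets -> [-1]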

@juney-nvidia (Collaborator) commented
@pathorn

Hi Pathorn

Thanks for your interest in contributing this MR to TRT-LLM.

The current process for merging a community MR into TRT-LLM is:

  • After the contributor finishes the implementation and it passes local tests, TRT-LLM engineers can help review the MR and provide feedback; several iterations of code refinement and discussion may then be necessary :)
  • Once the MR is ready to land, a TRT-LLM engineer will cherry-pick it into our internal git repo.
  • Later, when the new TRT-LLM version is pushed to GitHub, we will acknowledge the contributor's name in the announcement notes.

Please let me know whether the above process makes sense to you.
Thanks

June
