Use first bad_words as extra parameters, and implement min-p #1536

Draft · wants to merge 1 commit into base: main
Conversation

@pathorn commented May 2, 2024

An approach for implementing #1154

The user-facing classes in BatchManager and tensorrt_llm::executor::SamplingConfig are not open source (the constructor implementation lives in .a files and will segfault if the class is modified), so we hacked around it by using the integers in the first bad_words entry as extra parameters.

In this case, we use the first integer, reinterpreted as a float, to represent min_p (the default value is 0.0, which matches the 0 padding in bad_words).
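For illustration, here is a minimal sketch of that bit-level round trip using Python's struct module (the variable names are mine, not from the PR):

    import struct

    min_p = 0.125  # example value; 0.125 is exactly representable as a float32
    # Client side: pack the float's raw 32 bits into an int32 so it can ride in bad_words.
    (min_p_int,) = struct.unpack('i', struct.pack('f', min_p))
    # Runtime side: reinterpret the same 32 bits back into a float.
    (recovered,) = struct.unpack('f', struct.pack('i', min_p_int))
    assert recovered == min_p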

I implemented min-p by piggybacking on the existing logprobs calculation in CUDA, so it should add no performance overhead beyond the logprobs calculation itself.
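For reference, min-p keeps only the tokens whose probability is at least min_p times the probability of the most likely token. A rough NumPy sketch of that filtering rule (just the standard min-p definition, not the CUDA kernel in this PR):

    import numpy as np

    def min_p_filter(logits: np.ndarray, min_p: float) -> np.ndarray:
        """Mask out tokens whose probability falls below min_p * max_prob."""
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        keep = probs >= min_p * probs.max()
        # Surviving tokens keep their logits; the rest are excluded from sampling.
        return np.where(keep, logits, -np.inf)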

@pathorn (Author) commented May 2, 2024

Here is some example BLS code for adding the min_p value into the bad_words list in the way this PR expects (an excerpt from a Triton BLS model's preprocessing loop; it assumes numpy as np, struct, pprint, and pb_utils are already imported):

    numpy_tensor = preproc_output_tensor.as_numpy()
    if trtllm_tensor_name == "bad_words_list":
        bad_words_data, bad_words_offsets = numpy_tensor[0]
        opt = np.get_printoptions()
        np.set_printoptions(threshold=np.inf)
        pprint(numpy_tensor)
        min_p = 0.0
        if "min_p" in bls_input_tensors_map:
            minptensor = bls_input_tensors_map["min_p"].as_numpy()
            pprint(minptensor)
            min_p = minptensor[0, 0]
        (min_p_int,) = struct.unpack('i', struct.pack('f', min_p))
        extra_data = np.array([min_p_int], dtype=np.int32)
        if bad_words_offsets[0] == -1:
            # Special case: if no bad_words are passed, numpy_tensor will be [[[0], [-1]]].
            # In this case, we don't want to prepend [0], because that would add a bad-word
            # offset where there otherwise was none.
            bad_words_data = extra_data
            bad_words_offsets = np.array([-1], dtype=np.int32)
        else:
            # Prepend the min_p "word".
            bad_words_data = np.concatenate((extra_data, bad_words_data), axis=0)
            # The offsets array is padded with -1, so we first add one to make the padding
            # all zeros, then trim_zeros and subtract one.
            bad_words_offsets = np.trim_zeros(bad_words_offsets + 1) - 1
            # Then, we prepend an extra 0 element to account for an extra bad_word being added.
            bad_words_offsets = np.concatenate((np.array([0], dtype=np.int32), bad_words_offsets), axis=0)
            # Then, we offset the indices by the length of the newly added data.
            bad_words_offsets = bad_words_offsets + len(extra_data)
            # Finally, we pad this array to make it the same length as bad_words_data.
            bad_words_offsets = np.concatenate(
                (bad_words_offsets,
                 np.array([-1] * (len(bad_words_data) - len(bad_words_offsets)), dtype=np.int32)),
                axis=0)
        numpy_tensor = np.array([[bad_words_data, bad_words_offsets]], dtype=np.int32)
        print("Final:")
        pprint(numpy_tensor)
        np.set_printoptions(**opt)

    trtllm_input_tensors.append(
        pb_utils.Tensor(trtllm_tensor_name, numpy_tensor))
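
To make the transformation concrete, here is a small worked example with made-up token ids (the values below are hypothetical, traced by hand through the snippet above):

    # Two bad-word sequences, [5] and [7, 9], so the preprocessor produces:
    #   numpy_tensor == [[[5, 7, 9], [1, 3, -1]]]
    # With min_p packed into min_p_int, the snippet prepends it as an extra "word":
    #   bad_words_data    -> [min_p_int, 5, 7, 9]
    #   bad_words_offsets -> [1, 2, 4, -1]
    # If no bad words were passed, the input is [[[0], [-1]]] and only the data
    # column is replaced:
    #   bad_words_data    -> [min_p_int]
    #   bad_words_offsets -> [-1]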

@juney-nvidia (Collaborator) commented
@pathorn

Hi Pathorn

Thanks for your interest in contributing this MR to TRT-LLM.

The current process for merging a community MR into TRT-LLM is:

  • After the contributor finishes the implementation and it passes local tests, TRT-LLM engineers can help review the MR and provide feedback; several iterations of code refinement and discussion may then be necessary :)
  • Once the MR is ready to land, a TRT-LLM engineer will cherry-pick it into our internal git repo.
  • Later, when the new TRT-LLM version is pushed to GitHub, we will acknowledge the contributor's name in the announcement notes.

Please let me know whether the above process makes sense to you.
Thanks

June
