target_prefix latency #1689

SimonBenhamou · 2024-04-30T16:41:14Z

Hello,

I noticed that when supplying a target_prefix to the translate_batch or generate_tokens method, the latency for generating the supplied tokens is the same as when they are not provided, whereas I would expect negligible latency because those tokens don't require any generation steps. I expected the first step to be the generation of the token that follows the prefix tokens.

Am I missing something, or is this due to an inefficiency in ctranslate2's generation logic?

Thanks,
Simon

minhthuc2502 · 2024-05-02T10:28:52Z

If you specify target_prefix, the prefix is decoded in a single step, and the remaining tokens are then generated one by one in the following steps. Without target_prefix, every token is generated one by one. In theory, it should run faster when target_prefix is used. Could you test with a long prefix?

SimonBenhamou · 2024-05-02T21:50:31Z

I did, and could reproduce the following:

- No matter how long the prefix is, the generation time is the same.
- When using the generate_tokens method and measuring the latency, the generation time is the same for prefix tokens as for the subsequent tokens.
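
One way to take this kind of per-token measurement is sketched below. `time_stream` is a hypothetical helper, not part of CTranslate2, and the model directory and token lists in the commented usage are placeholders; it only records the wall-clock gap between consecutive items yielded by an iterator.

```python
import time
from typing import Iterable, List, Tuple, TypeVar

T = TypeVar("T")


def time_stream(stream: Iterable[T]) -> List[Tuple[T, float]]:
    """Return (item, seconds elapsed since the previous item) for each item."""
    timings = []
    last = time.perf_counter()
    for item in stream:
        now = time.perf_counter()
        timings.append((item, now - last))
        last = now
    return timings


# Hypothetical usage with CTranslate2 (paths and tokens are placeholders):
#
# import ctranslate2
# translator = ctranslate2.Translator("model_dir")
# steps = translator.generate_tokens(source_tokens, target_prefix=prefix_tokens)
# for step, dt in time_stream(steps):
#     print(step.token, f"{dt * 1000:.1f} ms")
```

If the prefix were consumed in a single decoding step, the gaps reported for the prefix tokens would be expected to be much smaller than those for the tokens generated afterwards.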
