Qwen-14B-Chat inference repeat #1144

Open
Storm0921 opened this issue Jan 16, 2024 · 5 comments
@Storm0921

When I use the python_api_example or streaming_llm Python scripts to run inference with Qwen-14B-Chat, the first two questions are answered normally, but from the third question onward the output keeps repeating. I find it strange and can reproduce the error reliably. It also looks as though the prompt is being repeated throughout.

My RAG prompt length is 654.

[screenshots of the repeated output]

@a32543254
Contributor

Repetition is fairly common in LLMs. Here are some possible solutions (a sketch follows the list):

  1. Try do_sample=True in the generate API.
  2. Change the woq_config args: compute dtype from int8 to bf16.
  3. Increase the repetition_penalty value.
  4. Increase the top_k value.
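
A minimal sketch of suggestions 1 and 2, assuming the intel-extension-for-transformers LLM Runtime API linked later in this thread (`WeightOnlyQuantConfig` and `compute_dtype` follow its weight-only-quantization docs; the model name, prompt, and values are placeholders, not tuned settings):

```python
from transformers import AutoTokenizer, TextStreamer
# Assumption: imports follow the intel-extension-for-transformers
# LLM Runtime examples referenced below in this thread.
from intel_extension_for_transformers.transformers import (
    AutoModelForCausalLM,
    WeightOnlyQuantConfig,
)

model_name = "Qwen/Qwen-14B-Chat"  # placeholder checkpoint
# Suggestion 2: compute dtype bf16 instead of int8.
woq_config = WeightOnlyQuantConfig(compute_dtype="bf16")

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=woq_config, trust_remote_code=True
)

inputs = tokenizer("your RAG prompt here", return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)
outputs = model.generate(
    inputs,
    streamer=streamer,
    max_new_tokens=300,
    do_sample=True,  # suggestion 1: sampling instead of greedy decoding
)
```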

@Storm0921
Author

> Repetition is fairly common in LLMs. Here are some possible solutions:
>
> 1. Try do_sample=True in the generate API.
> 2. Change the woq_config args: compute dtype from int8 to bf16.
> 3. Increase the repetition_penalty value.
> 4. Increase the top_k value.

I don't think the repetition is caused by Qwen's prompt template format; I did not add duplicate questions. By the way, how should Baichuan's prompt template be written? I tried BAICHUAN_PROMPT_FORMAT = "{prompt}", but it failed.

@Storm0921
Author

> Repetition is fairly common in LLMs. Here are some possible solutions:
>
> 1. Try do_sample=True in the generate API.
> 2. Change the woq_config args: compute dtype from int8 to bf16.
> 3. Increase the repetition_penalty value.
> 4. Increase the top_k value.

Can you help me solve this problem?

@Zhenzhong1
Collaborator

Zhenzhong1 commented Jan 22, 2024

> I don't think the repetition is caused by Qwen's prompt template format; I did not add duplicate questions. By the way, how should Baichuan's prompt template be written? I tried BAICHUAN_PROMPT_FORMAT = "{prompt}", but it failed.

@fengenbao Hi, Baichuan does not need extra prompt templates.
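
For contrast, Qwen-Chat checkpoints expect ChatML-style wrapping around the user prompt, whereas Baichuan can take the raw prompt. A hedged sketch of that wrapping (the template is reproduced from Qwen's published chat format; verify it against your checkpoint):

```python
# Assumption: ChatML template as published for Qwen-Chat models.
QWEN_PROMPT_FORMAT = (
    "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\n{prompt}<|im_end|>\n"
    "<|im_start|>assistant\n"
)
BAICHUAN_PROMPT_FORMAT = "{prompt}"  # no extra wrapping needed
```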


> Try do_sample=True in the generate API.
> Change the woq_config args: compute dtype from int8 to bf16.
> Increase the repetition_penalty value.
> Increase the top_k value.
>
> Can you help me solve this problem?

These are all input args that you can modify.

do_sample=True is an argument of the generate API. For example:
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=30, do_sample=True)

Please check this README: https://github.com/intel/neural-speed/tree/main

For woq_config, please check: https://github.com/intel/intel-extension-for-transformers/blob/main/docs/weightonlyquant.md#llm-runtime-example-code

For repetition_penalty and top_k, please check: https://github.com/intel/neural-speed/blob/main/docs/advanced_usage.md
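
Building on that one-liner, a hedged example that also applies the repetition controls suggested above (the keyword names follow the standard Hugging Face generate API, which the LLM Runtime mirrors; the values are illustrative, not tuned):

```python
outputs = model.generate(
    inputs,
    streamer=streamer,
    max_new_tokens=300,
    do_sample=True,          # sample rather than always taking the argmax
    repetition_penalty=1.1,  # values > 1.0 penalize already-generated tokens
    top_k=40,                # restrict sampling to the 40 likeliest tokens
)
```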

@Storm8878

Storm8878 commented Jan 22, 2024


Thanks for your attention! I describe this question in detail in another issue, #1148; please help check whether the parameters are set correctly.
