
Batching for speed. #45

Open
wj210 opened this issue Apr 14, 2024 · 0 comments
wj210 commented Apr 14, 2024

Hi, I would like to ask whether batch inference has been tested with LLaMA.

I followed https://huggingface.co/docs/transformers/llm_tutorial#wrong-padding-side, where both the input_ids and the attention_mask are passed to the model, but I got 'nan' values. If I only pass in the input_ids it works, but I'm not sure whether omitting the attention mask affects the final output.
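For context, here is roughly the batched-generation setup I am using, following the linked tutorial. This is only a sketch: the model name, dtype, example prompts, and max_new_tokens are placeholders, not the repo's actual settings.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"   # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.padding_side = "left"            # decoder-only models should be left-padded
tokenizer.pad_token = tokenizer.eos_token  # llama has no pad token by default

model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

prompts = ["Paris is the capital of", "2 + 2 ="]
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)

# Passing both input_ids and attention_mask here is where I see 'nan' values;
# dropping attention_mask avoids the nans, but then the pad tokens are attended to.
outputs = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))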

Also, there seems to be a bug in

if prompt.endswith(" True or False?\nAnswer:"):

The suffix " True or False?\nAnswer:" is never detected, because the prompt is built as

prompt = "{}\n\nInput: {} True or False?\nOutput:".format(definition.strip(), atom.strip())

which ends with "\nOutput:" instead. As a result, the generated length is 128 tokens instead of 1. This wastes cost, and if GPT-3.5 is used, the 128 generated tokens may contain both 'True' and 'False', leading to wrong decisions.
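A possible fix, just as a sketch (definition and atom below are placeholder strings, not the repo's actual data): checking for the suffix the prompt actually ends with would let the code cap generation at a single token.

definition = "Answer the question about the subject."  # placeholder
atom = "He was born in 1990."                           # placeholder

prompt = "{}\n\nInput: {} True or False?\nOutput:".format(definition.strip(), atom.strip())

# The current check never fires: the prompt ends with "\nOutput:", not "\nAnswer:".
assert not prompt.endswith(" True or False?\nAnswer:")

# Checking the suffix that is actually used allows generation to stop after one token.
if prompt.endswith(" True or False?\nOutput:"):
    max_new_tokens = 1    # only "True" / "False" is needed
else:
    max_new_tokens = 128
print(max_new_tokens)     # -> 1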
