RFE: Expose llama_cpp.server --n_ctx option #1074

Open
jmontleon opened this issue May 9, 2024 · 0 comments

Is your enhancement related to a problem? Please describe

While trying to run against TheBloke/Mistral-7B-Instruct-v0.2-GGUF, I was receiving messages like:

Error code: 400 - {'error': {'message': "This model's maximum context length is 2048 tokens. However, you requested 2981 tokens (2981 in the messages, None in the completion). Please reduce the length of the messages or completion.", 'type': 'invalid_request_error', 'param': 'messages', 'code': 'context_length_exceeded'}}
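For reference, here is a minimal sketch of the kind of request that triggers this error, assuming llama_cpp.server is exposing its OpenAI-compatible API locally (the host, port, and model alias below are assumptions, not the exact values from my setup):

```python
# Minimal reproduction sketch; base_url and model alias are assumed.
from openai import OpenAI

# llama_cpp.server serves an OpenAI-compatible API; the api_key is not checked.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Any prompt long enough to exceed the server's 2048-token context window
# produces the 400 "context_length_exceeded" error shown above.
long_prompt = "some text " * 1000

response = client.chat.completions.create(
    model="mistral-7b-instruct-v0.2",  # assumed model alias
    messages=[{"role": "user", "content": long_prompt}],
)
print(response.choices[0].message.content)
```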

Describe the solution you'd like

llama_cpp.server has an --n_ctx option to adjust the context size:
https://llama-cpp-python.readthedocs.io/en/latest/server/#server-options

By running a custom image with this option added, I was able to run my queries without receiving this message.

It would probably be pretty easy to pass it as an env var, as is done for HOST, PORT, etc.
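As a rough sketch of what I mean, an entrypoint could read an environment variable and only forward it to the server when set (the N_CTX and MODEL_PATH variable names here are assumptions; --model, --host, --port, and --n_ctx are documented llama_cpp.server options):

```python
# Hypothetical entrypoint sketch: forward env vars to llama_cpp.server.
import os
import sys

cmd = [
    sys.executable, "-m", "llama_cpp.server",
    "--model", os.environ["MODEL_PATH"],
    "--host", os.environ.get("HOST", "0.0.0.0"),
    "--port", os.environ.get("PORT", "8000"),
]

# Only pass --n_ctx when the env var is set, so the server default applies otherwise.
n_ctx = os.environ.get("N_CTX")
if n_ctx:
    cmd += ["--n_ctx", n_ctx]

os.execv(sys.executable, cmd)
```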

Describe alternatives you've considered

No response

Additional context

I was trying to run the https://github.com/konveyor-ecosystem/kai/ demo against Podman AI Lab when I encountered these errors.
