RFE: Expose llama_cpp.server --n_ctx option #1074

Open
jmontleon opened this issue May 9, 2024 · 0 comments

Is your enhancement related to a problem? Please describe

While trying to run against TheBloke/Mistral-7B-Instruct-v0.2-GGUF, I was receiving messages like:

Error code: 400 - {'error': {'message': "This model's maximum context length is 2048 tokens. However, you requested 2981 tokens (2981 in the messages, None in the completion). Please reduce the length of the messages or completion.", 'type': 'invalid_request_error', 'param': 'messages', 'code': 'context_length_exceeded'}}
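For reference, here is a minimal sketch of the kind of request that triggers this error, assuming llama_cpp.server is exposing its OpenAI-compatible API locally (the host, port, and model alias below are assumptions, not the exact values from my setup):

```python
# Minimal reproduction sketch; base_url and model alias are assumed.
from openai import OpenAI

# llama_cpp.server serves an OpenAI-compatible API; the api_key is not checked.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Any prompt long enough to exceed the server's 2048-token context window
# produces the 400 "context_length_exceeded" error shown above.
long_prompt = "some text " * 1000

response = client.chat.completions.create(
    model="mistral-7b-instruct-v0.2",  # assumed model alias
    messages=[{"role": "user", "content": long_prompt}],
)
print(response.choices[0].message.content)
```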

Describe the solution you'd like

llama_cpp.server has an --n_ctx option to adjust the context size:
https://llama-cpp-python.readthedocs.io/en/latest/server/#server-options

By running a custom image with this option added, I was able to run my queries without receiving this message.

It would probably be pretty easy to pass it as an env var, as is done for HOST, PORT, etc.
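As a rough sketch of what I mean, an entrypoint could read an environment variable and only forward it to the server when set (the N_CTX and MODEL_PATH variable names here are assumptions; --model, --host, --port, and --n_ctx are documented llama_cpp.server options):

```python
# Hypothetical entrypoint sketch: forward env vars to llama_cpp.server.
import os
import sys

cmd = [
    sys.executable, "-m", "llama_cpp.server",
    "--model", os.environ["MODEL_PATH"],
    "--host", os.environ.get("HOST", "0.0.0.0"),
    "--port", os.environ.get("PORT", "8000"),
]

# Only pass --n_ctx when the env var is set, so the server default applies otherwise.
n_ctx = os.environ.get("N_CTX")
if n_ctx:
    cmd += ["--n_ctx", n_ctx]

os.execv(sys.executable, cmd)
```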

Describe alternatives you've considered

No response

Additional context

I was trying to run the https://github.com/konveyor-ecosystem/kai/ demo against Podman AI Lab when I encountered these errors.
