[Bug]: I have created a docker image of 0.2.0 and ran same model - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin, it returns NULL #185

Closed
sungkim11 opened this issue Apr 13, 2024 · 16 comments
Labels: bug (Something isn't working)

Comments

@sungkim11

Your current environment

  1. Created a Docker image.
  2. Ran the Docker image to run inference on the model.

🐛 Describe the bug

I created a Docker image of 0.2.0 and ran the same model (neuralmagic/OpenHermes-2.5-Mistral-7B-marlin); it returns a series of blank spaces, whereas 0.1.0 works fine.

sungkim11 added the bug label on Apr 13, 2024
@robertgshaw2-neuralmagic
Collaborator

robertgshaw2-neuralmagic commented Apr 13, 2024

Hey @sungkim11 , thanks for reporting the issue.

Can you share:

  • The Dockerfile you used to build the image
  • The inference example
  • Any info about the GPU setup you might have (e.g. GPU type, number of GPUs)

I just ran the following to install:

python3 -m venv env
source env/bin/activate
pip install nm-vllm

And then ran the following for inference, and it seemed to be okay:

from vllm import LLM
model = LLM("neuralmagic/OpenHermes-2.5-Mistral-7B-marlin", max_model_len=4096)
output = model.generate("Hello my name is")
print(output[0].outputs[0].text)
# >> Marissa Cariaga, I am 18 years old.  I

@robertgshaw2-neuralmagic
Collaborator

Note: we also have a pre-made Docker image which should have everything you need.

# --shm-size value below is arbitrary; adjust for your machine
docker run \
    --gpus all \
    --shm-size=1g \
    ghcr.io/neuralmagic/nm-vllm-openai:v0.2.0 --model neuralmagic/OpenHermes-2.5-Mistral-7B-marlin --max-model-len 4096

@sungkim11
Author

I did the following today:

  1. git clone repo
  2. docker build -t nm-vllm-openai:0.2.0 .

Inference:

completion = client.chat.completions.create(
    model = llm_model_name,            
    response_format = { 
        "type": "json_object" 
    },
    messages = messages,
    temperature = temperature,
    max_tokens = max_tokens,            
    top_p = top_p,
    frequency_penalty = frequency_penalty,
    presence_penalty = presence_penalty,
    logprobs = logprobs
)

It works fine with the 0.1.0 image, but returns a bunch of blank lines with 0.2.0.

@sungkim11
Author

sungkim11 commented Apr 13, 2024

You can pull the Docker images from Docker Hub: sungkimmw/nm-vllm-openai:0.2.0 and sungkimmw/nm-vllm:latest (for 0.1.0).

I need to delete the 0.1.0

@sungkim11
Author

I tried "ghcr.io/neuralmagic/nm-vllm-openai:v0.2.0" and I am getting a bunch of blank lines.

@robertgshaw2-neuralmagic
Collaborator

Just tried each of these:

  • sungkimmw/nm-vllm-openai:0.2.0
  • ghcr.io/neuralmagic/nm-vllm-openai:v0.2.0

With this client:

from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    # defaults to os.environ.get("OPENAI_API_KEY")
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id

chat_completion = client.chat.completions.create(
    messages=[{
        "role": "system",
        "content": "You are a helpful assistant."
    }, {
        "role": "user",
        "content": "Who won the world series in 2020?"
    }, {
        "role": "assistant",
        "content": "The Los Angeles Dodgers won the World Series in 2020."
    }, {
        "role": "user",
        "content": "Where was it played?"
    }],
    model=model,
)

print("Chat completion results:")
print(chat_completion)

And got:

ChatCompletion(id='cmpl-00e280aade9d47448768b2d903b9b04a', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='The 2020 World Series was played in Arlington, Texas. The games took place at Globe Life Field and Globe Life Park, both of which are located in Arlington. This was due to the COVID-19 pandemic, which led to the games being played at a neutral site in order to minimize travel and potential exposure to the virus.', role='assistant', function_call=None, tool_calls=None), stop_reason=None)], created=1712972481, model='neuralmagic/OpenHermes-2.5-Mistral-7B-marlin', object='chat.completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=75, prompt_tokens=67, total_tokens=142))

@robertgshaw2-neuralmagic
Collaborator

Could you provide the exact client code you are running?

@sungkim11
Author

sungkim11 commented Apr 13, 2024

import os

from openai import OpenAI


def prompt_json_completion(messages):
        
    base_url = os.getenv("BASE_URL", "http://localhost:8000/v1")
    api_key = os.getenv("API_KEY", "EMPTY")
    llm_model_name = os.getenv("LLM_MODEL_NAME", "neuralmagic/OpenHermes-2.5-Mistral-7B-marlin")
    temperature = os.getenv("TEMPERATURE", 0.0)
    max_tokens = os.getenv("MAX_TOKENS", 100)
    top_p = os.getenv("TOP_P", 1)
    frequency_penalty = os.getenv("FREQUENCY_PENALTY", 0)
    presence_penalty = os.getenv("PRESENCE_PENALTY", 0)
    logprobs = os.getenv("LOGPROBS", "true")

    client = OpenAI(api_key = api_key, base_url = base_url)

    completion = client.chat.completions.create(
        model = llm_model_name,            
        response_format = { 
            "type": "json_object" 
        },
        messages = messages,
        temperature = temperature,
        max_tokens = max_tokens,            
        top_p = top_p,
        frequency_penalty = frequency_penalty,
        presence_penalty = presence_penalty,
    )
    # print(completion)
    print(completion.choices[0].message.content)
    return completion.choices[0].message.content

if __name__ == "__main__":

    user_prompt = "Must I be entitled to claim a child as a dependent to claim the earned income credit based on the child being my qualifying adult?"

    messages = [
                {"role": "system", "content": "You are a helpful assistant designed to output JSON only. Please augment the user question with a context."},
                {"role": "user", "content": "DO NOT answer the query just augment the query and return the augmented user question without any explanations: " + user_prompt}
            ]

    aug_query = prompt_json_completion(messages=messages)
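
One side note on the script above, unrelated to the blank-output bug: os.getenv returns strings whenever the environment variables are actually set, so the numeric sampling parameters are only real numbers when the defaults are used. A minimal sketch of explicit coercion, purely illustrative, using the same variable names as above:

import os

# Coerce the values so they are numeric whether or not the
# environment variables are set (os.getenv returns strings when they are).
temperature = float(os.getenv("TEMPERATURE", 0.0))
max_tokens = int(os.getenv("MAX_TOKENS", 100))
top_p = float(os.getenv("TOP_P", 1))
frequency_penalty = float(os.getenv("FREQUENCY_PENALTY", 0))
presence_penalty = float(os.getenv("PRESENCE_PENALTY", 0))
logprobs = os.getenv("LOGPROBS", "true").lower() == "true"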

@robertgshaw2-neuralmagic
Collaborator

Thanks. Reproduced.

Everything works fine with the completions API, but not with the chat completions API. I believe I know what caused this issue and will work on resolving it.

@sungkim11
Author

Thank you for working on this.

@robertgshaw2-neuralmagic
Collaborator

No problem. Thank you for reporting it :)

@robertgshaw2-neuralmagic
Collaborator

Okay, I tried with:

  • HuggingFaceH4/zephyr-7b-beta
  • vllm/vllm-openai:latest

Whatever model and version of vLLM I used (upstream or downstream), I had issues whenever the following was included:

response_format = {
    "type": "json_object"
},

Whenever this was removed, things worked properly.

JSON guided decoding is a relatively new feature in vLLM. I am going to dive in tomorrow to see if I can debug and let other maintainers know about the issue.
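
For reference, a minimal sketch of that comparison, assuming the server from this thread is running at http://localhost:8000/v1 and the prompt text is just a placeholder:

from openai import OpenAI

# Client pointed at the locally served model (base URL taken from earlier in this thread).
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
model = client.models.list().data[0].id
messages = [{"role": "user", "content": "Return a JSON object with a single key 'answer'."}]

# Without response_format: output looks normal on both 0.1.0 and 0.2.0.
plain = client.chat.completions.create(model=model, messages=messages)
print("plain:", repr(plain.choices[0].message.content))

# With response_format json_object: this is the case that comes back blank on 0.2.0.
guided = client.chat.completions.create(
    model=model,
    messages=messages,
    response_format={"type": "json_object"},
)
print("json_object:", repr(guided.choices[0].message.content))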

@sungkim11
Author

I was wondering why it was not returning JSON as requested.

@robertgshaw2-neuralmagic
Collaborator

robertgshaw2-neuralmagic commented Apr 13, 2024

The response_format = {"type": "json_object"} option guides the model to generate JSON by modifying the predicted logits. This is a relatively new feature in vLLM and appears to have a bug. I'm looking into the cause.
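
As a rough illustration only (not the actual vLLM implementation), logit-guided decoding amounts to masking out, at each step, every token that would break the target grammar before sampling. A toy sketch with a made-up five-token vocabulary:

import math

def mask_logits(logits, allowed_token_ids):
    # Disallowed tokens get -inf so they can never be sampled.
    return [l if i in allowed_token_ids else -math.inf for i, l in enumerate(logits)]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Suppose only tokens 1 and 3 keep the partial output valid JSON at this step.
logits = [2.0, 0.5, 1.0, -0.3, 0.1]
print(softmax(mask_logits(logits, {1, 3})))  # all probability mass ends up on tokens 1 and 3

If a bug in that masking step rules out all of the useful tokens, the model can only emit whitespace or end-of-sequence tokens, which would line up with the blank output reported here.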

@robertgshaw2-neuralmagic
Collaborator

@sungkim11 I am working with the upstream maintainers to look into this.

@sungkim11
Author

Thank you! I was wondering why I was getting blanks from upstream vLLM as well. This bug may have originated there.
