[Bug]: I have created a docker image of 0.2.0 and ran same model - neuralmagic/OpenHermes-2.5-Mistral-7B-marlin, it returns NULL #185

Closed
sungkim11 opened this issue Apr 13, 2024 · 16 comments
Labels: bug (Something isn't working)

Comments

@sungkim11

Your current environment

  1. Created a Docker image.
  2. Ran the Docker image to run inference on the model.

🐛 Describe the bug

I created a Docker image of 0.2.0 and ran the same model (neuralmagic/OpenHermes-2.5-Mistral-7B-marlin); it returns a series of blank spaces, whereas 0.1.0 works fine.

sungkim11 added the bug label on Apr 13, 2024
@robertgshaw2-neuralmagic
Collaborator

robertgshaw2-neuralmagic commented Apr 13, 2024

Hey @sungkim11 , thanks for reporting the issue.

Can you share:

  • The Dockerfile you used to build the image
  • The inference example
  • Any info about the GPU setup you might have (e.g. GPU type, number of GPUs)

I just ran the following to install:

python3 -m venv env
source env/bin/activate
pip install nm-vllm

And then ran the following for inference, and it seemed to be okay:

from vllm import LLM
model = LLM("neuralmagic/OpenHermes-2.5-Mistral-7B-marlin", max_model_len=4096)
output = model.generate("Hello my name is")
print(output[0].outputs[0].text)
# >> Marissa Cariaga, I am 18 years old.  I

@robertgshaw2-neuralmagic
Collaborator

Note: we also have a pre-made Docker image which should have everything you need.

# --shm-size value below is arbitrary; adjust for your machine
docker run \
    --gpus all \
    --shm-size=1g \
    ghcr.io/neuralmagic/nm-vllm-openai:v0.2.0 --model neuralmagic/OpenHermes-2.5-Mistral-7B-marlin --max-model-len 4096

@sungkim11
Author

I did the following today:

  1. git clone repo
  2. docker build -t nm-vllm-openai:0.2.0 .

Inference:

completion = client.chat.completions.create(
    model = llm_model_name,            
    response_format = { 
        "type": "json_object" 
    },
    messages = messages,
    temperature = temperature,
    max_tokens = max_tokens,            
    top_p = top_p,
    frequency_penalty = frequency_penalty,
    presence_penalty = presence_penalty,
    logprobs = logprobs
)

It works fine with the 0.1.0 image, but returns a bunch of blank lines with 0.2.0.

@sungkim11
Author

sungkim11 commented Apr 13, 2024

You can pull the Docker images from Docker Hub: sungkimmw/nm-vllm-openai:0.2.0 and sungkimmw/nm-vllm:latest (for 0.1.0).

I need to delete the 0.1.0

@sungkim11
Author

I tried "ghcr.io/neuralmagic/nm-vllm-openai:v0.2.0" and I am getting a bunch of blank lines.

@robertgshaw2-neuralmagic
Collaborator

Just tried each of these:

  • sungkimmw/nm-vllm-openai:0.2.0
  • ghcr.io/neuralmagic/nm-vllm-openai:v0.2.0

With this client:

from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    # defaults to os.environ.get("OPENAI_API_KEY")
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id

chat_completion = client.chat.completions.create(
    messages=[{
        "role": "system",
        "content": "You are a helpful assistant."
    }, {
        "role": "user",
        "content": "Who won the world series in 2020?"
    }, {
        "role": "assistant",
        "content": "The Los Angeles Dodgers won the World Series in 2020."
    }, {
        "role": "user",
        "content": "Where was it played?"
    }],
    model=model,
)

print("Chat completion results:")
print(chat_completion)

And got:

ChatCompletion(id='cmpl-00e280aade9d47448768b2d903b9b04a', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='The 2020 World Series was played in Arlington, Texas. The games took place at Globe Life Field and Globe Life Park, both of which are located in Arlington. This was due to the COVID-19 pandemic, which led to the games being played at a neutral site in order to minimize travel and potential exposure to the virus.', role='assistant', function_call=None, tool_calls=None), stop_reason=None)], created=1712972481, model='neuralmagic/OpenHermes-2.5-Mistral-7B-marlin', object='chat.completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=75, prompt_tokens=67, total_tokens=142))

@robertgshaw2-neuralmagic
Collaborator

Could you provide the exact client code you are running?

@sungkim11
Author

sungkim11 commented Apr 13, 2024

import os

from openai import OpenAI


def prompt_json_completion(messages):
        
    base_url = os.getenv("BASE_URL", "http://localhost:8000/v1")
    api_key = os.getenv("API_KEY", "EMPTY")
    llm_model_name = os.getenv("LLM_MODEL_NAME", "neuralmagic/OpenHermes-2.5-Mistral-7B-marlin")
    temperature = os.getenv("TEMPERATURE", 0.0)
    max_tokens = os.getenv("MAX_TOKENS", 100)
    top_p = os.getenv("TOP_P", 1)
    frequency_penalty = os.getenv("FREQUENCY_PENALTY", 0)
    presence_penalty = os.getenv("PRESENCE_PENALTY", 0)
    logprobs = os.getenv("LOGPROBS", "true")

    client = OpenAI(api_key = api_key, base_url = base_url)

    completion = client.chat.completions.create(
        model = llm_model_name,            
        response_format = { 
            "type": "json_object" 
        },
        messages = messages,
        temperature = temperature,
        max_tokens = max_tokens,            
        top_p = top_p,
        frequency_penalty = frequency_penalty,
        presence_penalty = presence_penalty,
    )
    # print(completion)
    print(completion.choices[0].message.content)
    return completion.choices[0].message.content

if __name__ == "__main__":

    user_prompt = "Must I be entitled to claim a child as a dependent to claim the earned income credit based on the child being my qualifying adult?"

    messages = [
                {"role": "system", "content": "You are a helpful assistant designed to output JSON only. Please augment the user question with a context."},
                {"role": "user", "content": "DO NOT answer the query just augment the query and return the augmented user question without any explanations: " + user_prompt}
            ]

    aug_query = prompt_json_completion(messages=messages)
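
One side note on the script above, unrelated to the blank-output bug: os.getenv returns strings whenever the environment variables are actually set, so the numeric sampling parameters are only real numbers when the defaults are used. A minimal sketch of explicit coercion, purely illustrative, using the same variable names as above:

import os

# Coerce the values so they are numeric whether or not the
# environment variables are set (os.getenv returns strings when they are).
temperature = float(os.getenv("TEMPERATURE", 0.0))
max_tokens = int(os.getenv("MAX_TOKENS", 100))
top_p = float(os.getenv("TOP_P", 1))
frequency_penalty = float(os.getenv("FREQUENCY_PENALTY", 0))
presence_penalty = float(os.getenv("PRESENCE_PENALTY", 0))
logprobs = os.getenv("LOGPROBS", "true").lower() == "true"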

@robertgshaw2-neuralmagic
Collaborator

Thanks. Reproduced.

Everything works fine with the completions API, but not with the chat completions API. I believe I know what caused this issue and will work on resolving it.

@sungkim11
Author

Thank you for working on this.

@robertgshaw2-neuralmagic
Collaborator

No problem. Thank you for reporting it :)

@robertgshaw2-neuralmagic
Collaborator

Okay, I tried with:

  • HuggingFaceH4/zephyr-7b-beta
  • vllm/vllm-openai:latest

Whatever model and version of vLLM I used (upstream or downstream), I had issues whenever the following was included:

response_format = {
    "type": "json_object"
},

Whenever this was removed, things worked properly.

JSON guided decoding is a relatively new feature in vLLM. I am going to dive in tomorrow to see if I can debug and let other maintainers know about the issue.
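
For reference, a minimal sketch of that comparison, assuming the server from this thread is running at http://localhost:8000/v1 and the prompt text is just a placeholder:

from openai import OpenAI

# Client pointed at the locally served model (base URL taken from earlier in this thread).
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
model = client.models.list().data[0].id
messages = [{"role": "user", "content": "Return a JSON object with a single key 'answer'."}]

# Without response_format: output looks normal on both 0.1.0 and 0.2.0.
plain = client.chat.completions.create(model=model, messages=messages)
print("plain:", repr(plain.choices[0].message.content))

# With response_format json_object: this is the case that comes back blank on 0.2.0.
guided = client.chat.completions.create(
    model=model,
    messages=messages,
    response_format={"type": "json_object"},
)
print("json_object:", repr(guided.choices[0].message.content))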

@sungkim11
Author

I was wondering why it was not returning JSON as requested.

@robertgshaw2-neuralmagic
Collaborator

robertgshaw2-neuralmagic commented Apr 13, 2024

The response_format = {"type": "json_object"} option guides the model to generate JSON by modifying the predicted logits. This is a relatively new feature in vLLM and appears to have a bug. I'm looking into the cause.
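
As a rough illustration only (not the actual vLLM implementation), logit-guided decoding amounts to masking out, at each step, every token that would break the target grammar before sampling. A toy sketch with a made-up five-token vocabulary:

import math

def mask_logits(logits, allowed_token_ids):
    # Disallowed tokens get -inf so they can never be sampled.
    return [l if i in allowed_token_ids else -math.inf for i, l in enumerate(logits)]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Suppose only tokens 1 and 3 keep the partial output valid JSON at this step.
logits = [2.0, 0.5, 1.0, -0.3, 0.1]
print(softmax(mask_logits(logits, {1, 3})))  # all probability mass ends up on tokens 1 and 3

If a bug in that masking step rules out all of the useful tokens, the model can only emit whitespace or end-of-sequence tokens, which would line up with the blank output reported here.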

@robertgshaw2-neuralmagic
Collaborator

@sungkim11 I am working with the upstream maintainers to look into this.

@sungkim11
Author

Thank you! I was wondering why I was getting blanks from upstream vLLM as well. This bug may have originated there.
