[Bug] REST server doesn't work on V100 (SM70) - cudaErrorNoKernelImageForDevice (but chat works) #2296

bayley · 2024-05-08T01:18:25Z

🐛 Bug

Not sure if this is an issue with my compilation settings or with MLC-LLM. I have an 8x V100 16GB SXM2 system (HPE XL270D gen10) and can build a library and quantized weights that work fine with the mlc_llm chat command. The same files fail when passed to mlc_llm serve, as soon as the server receives a request on the completions endpoint:

Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/mlc_llm/serve/engine_base.py", line 484, in _background_loop
    self._ffi["run_background_loop"]()
  File "tvm/_ffi/_cython/./packed_func.pxi", line 332, in tvm._ffi._cy3.core.PackedFuncBase.__call__
  File "tvm/_ffi/_cython/./packed_func.pxi", line 263, in tvm._ffi._cy3.core.FuncCall
  File "tvm/_ffi/_cython/./packed_func.pxi", line 252, in tvm._ffi._cy3.core.FuncCall3
  File "tvm/_ffi/_cython/./base.pxi", line 182, in tvm._ffi._cy3.core.CHECK_CALL
  File "/home/user/.local/lib/python3.10/site-packages/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
    raise py_err
tvm._ffi.base.TVMError: Traceback (most recent call last):
  13: mlc::llm::serve::ThreadedEngineImpl::RunBackgroundLoop()
        at /workspace/mlc-llm/cpp/serve/threaded_engine.cc:168
  12: mlc::llm::serve::EngineImpl::Step()
        at /workspace/mlc-llm/cpp/serve/engine.cc:326
  11: mlc::llm::serve::NewRequestPrefillActionObj::Step(mlc::llm::serve::EngineState)
        at /workspace/mlc-llm/cpp/serve/engine_actions/new_request_prefill.cc:235
  10: mlc::llm::serve::GPUSampler::BatchSampleTokensWithProbAfterTopP(tvm::runtime::NDArray, std::vector<int, std::allocator<int> > const&, tvm::runtime::Array<tvm::runtime::String, void> const&, tvm::runtime::Array<mlc::llm::serve::GenerationConfig, void> const&, std::vector<mlc::llm::RandomGenerator*, std::allocator<mlc::llm::RandomGenerator*> > const&, std::vector<tvm::runtime::NDArray, std::allocator<tvm::runtime::NDArray> >*)
        at /workspace/mlc-llm/cpp/serve/sampler/gpu_sampler.cc:179
  9: mlc::llm::serve::GPUSampler::BatchSampleTokensImpl(tvm::runtime::NDArray, std::vector<int, std::allocator<int> > const&, tvm::runtime::Array<tvm::runtime::String, void> const&, tvm::runtime::Array<mlc::llm::serve::GenerationConfig, void> const&, std::vector<mlc::llm::RandomGenerator*, std::allocator<mlc::llm::RandomGenerator*> > const&, bool, std::vector<tvm::runtime::NDArray, std::allocator<tvm::runtime::NDArray> >*)
        at /workspace/mlc-llm/cpp/serve/sampler/gpu_sampler.cc:369
  8: mlc::llm::serve::GPUSampler::ChunkSampleTokensImpl(tvm::runtime::NDArray, std::vector<int, std::allocator<int> > const&, tvm::runtime::Array<mlc::llm::serve::GenerationConfig, void> const&, std::vector<mlc::llm::RandomGenerator*, std::allocator<mlc::llm::RandomGenerator*> > const&, bool)
        at /workspace/mlc-llm/cpp/serve/sampler/gpu_sampler.cc:450
  7: mlc::llm::serve::GPUSampler::SampleOnGPU(tvm::runtime::NDArray, tvm::runtime::NDArray, tvm::runtime::NDArray, bool, bool, int, std::vector<int, std::allocator<int> > const&)
        at /workspace/mlc-llm/cpp/serve/sampler/gpu_sampler.cc:567
  6: tvm::runtime::relax_vm::VirtualMachineImpl::InvokeClosurePacked(tvm::runtime::ObjectRef const&, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
  5: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::relax_vm::VirtualMachineImpl::GetClosureInternal(tvm::runtime::String const&, bool)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
  4: tvm::runtime::relax_vm::VirtualMachineImpl::InvokeBytecode(long, std::vector<tvm::runtime::TVMRetValue, std::allocator<tvm::runtime::TVMRetValue> > const&)
  3: tvm::runtime::relax_vm::VirtualMachineImpl::RunLoop()
  2: tvm::runtime::relax_vm::VirtualMachineImpl::RunInstrCall(tvm::runtime::relax_vm::VMFrame*, tvm::runtime::relax_vm::Instruction)
  1: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::WrapPackedFunc(int (*)(TVMValue*, int*, int, TVMValue*, int*, void*), tvm::runtime::ObjectPtr<tvm::runtime::Object> const&)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
  0: TVMThrowLastError.cold
TVMError: after determining tmp storage requirements for inclusive_scan: cudaErrorNoKernelImageForDevice: no kernel image is available for execution on the device

This is the code that sends the request:

import requests

models = requests.get("http://127.0.0.1:8000/v1/models", headers={"accept": "application/json"})
model_name = models.json()['data'][0]['id']
print(model_name)

# Get a response using a prompt without streaming
payload = {
   "model": model_name,
   "messages": [
      {"role": "user", "content": "Write a haiku about apples."},
   ],
   "stream": False,
   # "n": 1,
   "max_tokens": 8192,
}

r = requests.post("http://127.0.0.1:8000/v1/chat/completions", json=payload)
choices = r.json()["choices"]
for choice in choices:
   print(f"{choice['message']['content']}\n")

I tried a number of combinations of engine mode and max_tokens thinking it might be a kernel for a particular batch size that was missing, but it seemed to have no effect. Maybe the server is trying to run some variant of FlashAttention/FlashInfer and failing because there are no Flash kernels for SM70?


bayley · 2024-05-08T11:49:27Z

Looks like the problem is in MLCEngine - this is a minimal reproducer (using the latest nightlies):

from mlc_llm import MLCEngine

model = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"
engine = MLCEngine(model)

# Run chat completion in OpenAI API.
for response in engine.chat.completions.create(
    messages=[{"role": "user", "content": "What is the meaning of life?"}],
    model=model,
    stream=True,
):
    for choice in response.choices:
        print(choice.delta.content, end="", flush=True)
print("\n")

engine.terminate()

bayley · 2024-05-09T02:47:21Z

I did confirm the script works on an A4000 RunPod instance, so this is definitely a bug related to pre-SM80 GPUs. I'm happy to help fix (chat works and performs well so this is clearly possible...) if someone gives me some guidance on where to start.

I also tried rebuilding mlc-llm from source (but using the prebuilt mlc-ai nightly wheel) and it didn't help.

tqchen · 2024-05-09T13:53:20Z

Likely we need to build without FlashInfer (as it does not support pre-SM80).

@Hzfengsy has a follow-up replacement that might help.
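
To check whether a machine falls into the pre-SM80 group before choosing a build, a minimal sketch along these lines may help. It assumes a driver recent enough that `nvidia-smi` supports the `compute_cap` query field. The helper names are illustrative and not part of MLC-LLM, and the 8.0 threshold is simply taken from the comment above.

```python
import subprocess

# Threshold taken from the comment above: FlashInfer does not support pre-SM80.
MIN_FLASHINFER_CC = (8, 0)


def parse_compute_caps(text):
    """Parse nvidia-smi output with one 'major.minor' value per line."""
    caps = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        major, minor = line.split(".")
        caps.append((int(major), int(minor)))
    return caps


def all_support_flashinfer(caps):
    """True only if at least one GPU was found and every GPU is SM80 or newer."""
    return bool(caps) and all(cap >= MIN_FLASHINFER_CC for cap in caps)


def query_compute_caps():
    """Ask nvidia-smi for each GPU's compute capability.

    Assumption: the installed driver exposes the compute_cap query field.
    """
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return parse_compute_caps(out)
```

Calling `all_support_flashinfer(query_compute_caps())` on a V100 (compute capability 7.0) system like the one in this report should return `False`, which suggests a FlashInfer-free build is needed.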

bayley · 2024-05-09T14:19:30Z

Thanks. What exactly do I need to rebuild without flashinfer? I tried explicitly disabling flashinfer (and cutlass) during model lib compilation but it didn't help.

tqchen · 2024-05-09T15:11:53Z

For now, we might need to build mlc-llm from source by setting flashinfer OFF

https://llm.mlc.ai/docs/install/mlc_llm.html#option-2-build-from-source

bayley · 2024-05-09T15:15:50Z

I tried that and it didn't help, I can go back and double check my build settings to make sure though. I did use the prebuilt mlc-ai wheel, could that be a problem? (mlc-llm built from source, mlc-ai from prebuilt wheel)

tqchen · 2024-05-09T17:23:30Z

Ah yes, we also need to build TVM from source, since FlashInfer is packaged through the runtime there.
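
Since both builds have to agree, it can help to double check what each build configuration actually sets. The sketch below is only an illustration: it assumes each build directory has a `config.cmake` containing a `set(USE_FLASHINFER ...)` line, which is my reading of the build-from-source docs, and the function names and any paths passed in are hypothetical.

```python
import re
from pathlib import Path

# Matches simple lines like: set(USE_FLASHINFER OFF)
# Commented-out lines (starting with '#') are not matched.
_OPTION_RE = re.compile(r'^\s*set\(\s*(\w+)\s+"?([^\s")]+)"?\s*\)', re.MULTILINE)


def read_cmake_options(text):
    """Return {option: VALUE} for simple set(NAME VALUE) lines; later lines win."""
    options = {}
    for name, value in _OPTION_RE.findall(text):
        options[name] = value.upper()
    return options


def flashinfer_setting(text):
    """Return 'ON', 'OFF', or None if USE_FLASHINFER is not set in the text.

    Assumption: USE_FLASHINFER is the relevant option name in config.cmake.
    """
    return read_cmake_options(text).get("USE_FLASHINFER")


def report(paths):
    """Print the USE_FLASHINFER setting for each existing config file."""
    for path in paths:
        p = Path(path)
        if p.exists():
            print(path, flashinfer_setting(p.read_text()))
        else:
            print(path, "not found")
```

For example, `report(["mlc-llm/build/config.cmake", "tvm/build/config.cmake"])` (hypothetical paths) would show whether either build still has FlashInfer switched on.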

bayley · 2024-05-09T19:20:33Z

I see, thanks - I'll give it a try. Does it make sense to provide prebuilt wheels that are built without flashinfer? Seems like especially Pascal users could benefit (used P40 is a popular hobby GPU for LLMs)

tqchen · 2024-05-09T19:35:02Z

I think we could update the engine to have a runtime compatibility check, so we can still leverage the FlashInfer wheel. However, looking a bit more: supporting many compute capabilities increases binary size, so we have to cut older compute capabilities from a single wheel. So a separate build is likely needed. Building from source is likely the best course of action as of now. We can consider adding a runtime option to turn off FlashInfer.

bayley · 2024-05-10T02:48:57Z

Success, rebuilt TVM from source following the instructions in the docs (I had to install libzstd-dev through apt) and now MLCEngine works.

bayley · 2024-05-11T03:30:30Z

When FlashInfer is disabled, what prefill algorithm is used? I noticed a pretty long prompt processing time on Llama-70B and was wondering if it internally used memory-efficient attention (xformers/Pytorch SDPA) or the naive algorithm.

tqchen · 2024-05-11T11:42:25Z

We use a TensorIR variant of FlashInfer, which is normally at 80 to 90 percent of FlashInfer's efficiency. Note this is for decode; we still need to confirm prefill.

bayley · 2024-05-11T11:57:39Z

OK. Does the REST server store stats like the chat interface does? Would be useful to check prefill tokens per second, etc. for benchmarking.

tqchen · 2024-05-11T12:12:37Z

Stats are still a WIP, but that is indeed a great suggestion.
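
Until server-side stats exist, rough numbers can be collected on the client. The sketch below is a generic helper, not an MLC-LLM API: it times any iterable of streamed chunks, using time to first chunk as a stand-in for prefill latency and the chunk rate afterwards as a stand-in for decode speed. Chunks are not guaranteed to be single tokens, so treat the rate as approximate.

```python
import time


def measure_stream(chunks, clock=time.perf_counter):
    """Consume an iterable of streamed chunks and return rough timing stats.

    The clock parameter exists so a fake clock can be injected for testing.
    """
    start = clock()
    first = None
    last = None
    count = 0
    for _ in chunks:
        now = clock()
        if first is None:
            first = now  # arrival of the first chunk ends the "prefill" phase
        last = now
        count += 1
    if first is None:
        return {"chunks": 0, "time_to_first_chunk_s": None, "chunks_per_s": None}
    decode_time = last - first
    # The first chunk is excluded from the rate, since it marks the start of decode.
    rate = (count - 1) / decode_time if count > 1 and decode_time > 0 else None
    return {
        "chunks": count,
        "time_to_first_chunk_s": first - start,
        "chunks_per_s": rate,
    }
```

With the reproducer earlier in this thread, one could wrap the streaming call in a generator that yields each `choice.delta.content` and pass that generator to `measure_stream`. Converting time to first chunk into prefill tokens per second would additionally need the prompt's token count, which this helper does not measure.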

Nero10578 · 2024-05-14T09:28:23Z

> I see, thanks - I'll give it a try. Does it make sense to provide prebuilt wheels that are built without flashinfer? Seems like especially Pascal users could benefit (used P40 is a popular hobby GPU for LLMs)

Also interested in this for running on Tesla P40s.

bayley · 2024-05-14T09:33:57Z

> > I see, thanks - I'll give it a try. Does it make sense to provide prebuilt wheels that are built without flashinfer? Seems like especially Pascal users could benefit (used P40 is a popular hobby GPU for LLMs)

> Also interested in this for running on Tesla P40s.

It's pretty easy to run the build yourself, just make sure you have the right version of LLVM when building TVM or else you'll get confusing errors.

bayley added the bug Confirmed bugs label May 8, 2024
