[Bug]: torch._dynamo.exc.BackendCompilerFailed with command-r-plus #472
Comments
Also getting this error for turboderp/command-r-plus-103B-exl2 on 2x4090s on Runpod (EDIT: and also
I wonder if these are related?
But the latest official Docker image should have that change, so maybe they're not related. I tried setting
@AlpinDale Please ignore if this issue is a wontfix (and please forgive this ping in that case 🙏) -- just in case this slipped through the cracks: I can reproduce OP's issue. See my above comment for reproduction details and logs. The TL;DR is that
Edit: I can also reproduce with
I'll get to investigating this soon; I've been busy with other projects, so I haven't had much time to work on aphrodite lately. I have an inkling that this is related to torch.compile().
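If the torch.compile() hunch is right, the failure is in Inductor's cache-path lookup rather than in the model code: the traceback below bottoms out in `getpass.getuser()`, which only falls back to a `pwd` database lookup when none of its environment variables are set. A minimal sketch of that lookup order (the env-var names come from CPython's `getpass` module; the user name here is hypothetical):

```python
import getpass
import os

# getpass.getuser() tries these env vars in order and only falls back to
# pwd.getpwuid(os.getuid()) when none of them are set. In a container
# whose uid (e.g. 1000) has no /etc/passwd entry, that fallback raises
# KeyError: 'getpwuid(): uid not found: 1000'.
for var in ("LOGNAME", "USER", "LNAME", "USERNAME"):
    os.environ.pop(var, None)

os.environ["USER"] = "aphrodite"  # hypothetical name; any non-empty value works
print(getpass.getuser())  # -> aphrodite; the pwd database is never consulted
```

So a missing passwd entry only becomes fatal when the container also starts without any of those variables set.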
Your current environment
aphrodite docker container
Setting 1
GPUs: RTX8000 * 2
model: alpindale/c4ai-command-r-plus-GPTQ
Quantization: gptq
Setting 2
GPUs: A6000 ada * 4
model: CohereForAI/c4ai-command-r-plus
Quantization: load-in-smooth
🐛 Describe the bug
Starting Aphrodite Engine API server...
/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
WARNING: gptq quantization is not fully optimized yet. The speed can be slower
than non-quantized models.
2024-05-17 02:21:49,653 INFO worker.py:1749 -- Started a local Ray instance.
INFO: Initializing the Aphrodite Engine (v0.5.3) with the following config:
INFO: Model = 'alpindale/c4ai-command-r-plus-GPTQ'
INFO: Speculative Config = None
INFO: DataType = torch.float16
INFO: Model Load Format = auto
INFO: Number of GPUs = 2
INFO: Disable Custom All-Reduce = False
INFO: Quantization Format = gptq
INFO: Context Length = 29000
INFO: Enforce Eager Mode = True
INFO: KV Cache Data Type = auto
INFO: KV Cache Params Path = None
INFO: Device = cuda
INFO: Guided Decoding Backend =
DecodingConfig(guided_decoding_backend='outlines')
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
WARNING: The tokenizer's vocabulary size 255029 does not match the model's
vocabulary size 256000.
/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
INFO: Cannot use FlashAttention backend for Volta and Turing GPUs.
INFO: Using XFormers backend.
(RayWorkerAphrodite pid=1127) INFO: Cannot use FlashAttention backend for Volta and Turing GPUs.
(RayWorkerAphrodite pid=1127) INFO: Using XFormers backend.
INFO: Aphrodite is using nccl==2.20.5
(RayWorkerAphrodite pid=1127) INFO: Aphrodite is using nccl==2.20.5
INFO: generating GPU P2P access cache for in
/app/aphrodite-engine/.config/aphrodite/gpu_p2p_access_cache_for_0,1.json
INFO: reading GPU P2P access cache from
/app/aphrodite-engine/.config/aphrodite/gpu_p2p_access_cache_for_0,1.json
(RayWorkerAphrodite pid=1127) INFO: reading GPU P2P access cache from
(RayWorkerAphrodite pid=1127) /app/aphrodite-engine/.config/aphrodite/gpu_p2p_access_cache_for_0,1.json
(RayWorkerAphrodite pid=1127) INFO: Using model weights format ['.safetensors']
INFO: Using model weights format ['.safetensors']
INFO: Model weights loaded. Memory usage: 27.78 GiB x 2 = 55.55 GiB
[rank0]: Traceback (most recent call last):
[rank0]: File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
[rank0]: return _run_code(code, main_globals, None,
[rank0]: File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
[rank0]: exec(code, run_globals)
[rank0]: File "/app/aphrodite-engine/aphrodite/endpoints/openai/api_server.py", line 562, in <module>
[rank0]: run_server(args)
[rank0]: File "/app/aphrodite-engine/aphrodite/endpoints/openai/api_server.py", line 519, in run_server
[rank0]: engine = AsyncAphrodite.from_engine_args(engine_args)
[rank0]: File "/app/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 358, in from_engine_args
[rank0]: engine = cls(engine_config.parallel_config.worker_use_ray,
[rank0]: File "/app/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 323, in __init__
[rank0]: self.engine = self._init_engine(*args, **kwargs)
[rank0]: File "/app/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 429, in _init_engine
[rank0]: return engine_class(*args, **kwargs)
[rank0]: File "/app/aphrodite-engine/aphrodite/engine/aphrodite_engine.py", line 142, in __init__
[rank0]: self._initialize_kv_caches()
[rank0]: File "/app/aphrodite-engine/aphrodite/engine/aphrodite_engine.py", line 182, in _initialize_kv_caches
[rank0]: self.model_executor.determine_num_available_blocks())
[rank0]: File "/app/aphrodite-engine/aphrodite/executor/ray_gpu_executor.py", line 208, in determine_num_available_blocks
[rank0]: num_blocks = self._run_workers("determine_num_available_blocks", )
[rank0]: File "/app/aphrodite-engine/aphrodite/executor/ray_gpu_executor.py", line 309, in _run_workers
[rank0]: driver_worker_output = getattr(self.driver_worker,
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/app/aphrodite-engine/aphrodite/task_handler/worker.py", line 144, in determine_num_available_blocks
[rank0]: self.model_runner.profile_run()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/app/aphrodite-engine/aphrodite/task_handler/model_runner.py", line 948, in profile_run
[rank0]: self.execute_model(seqs, kv_caches)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/app/aphrodite-engine/aphrodite/task_handler/model_runner.py", line 868, in execute_model
[rank0]: hidden_states = model_executable(**execute_model_kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/app/aphrodite-engine/aphrodite/modeling/models/cohere.py", line 390, in forward
[rank0]: hidden_states = self.model(input_ids, positions, kv_caches,
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/app/aphrodite-engine/aphrodite/modeling/models/cohere.py", line 349, in forward
[rank0]: hidden_states, residual = layer(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/app/aphrodite-engine/aphrodite/modeling/models/cohere.py", line 305, in forward
[rank0]: hidden_states, residual = self.input_layernorm(hidden_states, residual)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/app/aphrodite-engine/aphrodite/modeling/models/cohere.py", line 82, in forward
[rank0]: hidden_states = layer_norm_func(hidden_states, self.weight,
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/eval_frame.py", line 451, in _fn
[rank0]: return fn(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/convert_frame.py", line 921, in catch_errors
[rank0]: return callback(frame, cache_entry, hooks, frame_state, skip=1)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/convert_frame.py", line 786, in _convert_frame
[rank0]: result = inner_convert(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/convert_frame.py", line 400, in _convert_frame_assert
[rank0]: return _compile(
[rank0]: File "/usr/lib/python3.10/contextlib.py", line 79, in inner
[rank0]: return func(*args, **kwds)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/convert_frame.py", line 676, in _compile
[rank0]: guarded_code = compile_inner(code, one_graph, hooks, transform)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/utils.py", line 262, in time_wrapper
[rank0]: r = func(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/convert_frame.py", line 535, in compile_inner
[rank0]: out_code = transform_code_object(code, transform)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/bytecode_transformation.py", line 1036, in transform_code_object
[rank0]: transformations(instructions, code_options)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/convert_frame.py", line 165, in _fn
[rank0]: return fn(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/convert_frame.py", line 500, in transform
[rank0]: tracer.run()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/symbolic_convert.py", line 2149, in run
[rank0]: super().run()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/symbolic_convert.py", line 810, in run
[rank0]: and self.step()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/symbolic_convert.py", line 773, in step
[rank0]: getattr(self, inst.opname)(inst)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/symbolic_convert.py", line 2268, in RETURN_VALUE
[rank0]: self.output.compile_subgraph(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/output_graph.py", line 971, in compile_subgraph
[rank0]: self.compile_and_call_fx_graph(tx, list(reversed(stack_values)), root)
[rank0]: File "/usr/lib/python3.10/contextlib.py", line 79, in inner
[rank0]: return func(*args, **kwds)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/output_graph.py", line 1168, in compile_and_call_fx_graph
[rank0]: compiled_fn = self.call_user_compiler(gm)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/utils.py", line 262, in time_wrapper
[rank0]: r = func(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/output_graph.py", line 1241, in call_user_compiler
[rank0]: raise BackendCompilerFailed(self.compiler_fn, e).with_traceback(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/output_graph.py", line 1222, in call_user_compiler
[rank0]: compiled_fn = compiler_fn(gm, self.example_inputs())
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/repro/after_dynamo.py", line 117, in debug_wrapper
[rank0]: compiled_gm = compiler_fn(gm, example_inputs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/__init__.py", line 1729, in __call__
[rank0]: return compile_fx(model, inputs, config_patches=self.config)
[rank0]: File "/usr/lib/python3.10/contextlib.py", line 79, in inner
[rank0]: return func(*args, **kwds)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/compile_fx.py", line 1330, in compile_fx
[rank0]: return aot_autograd(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/backends/common.py", line 58, in compiler_fn
[rank0]: cg = aot_module_simplified(gm, example_inputs, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/_functorch/aot_autograd.py", line 903, in aot_module_simplified
[rank0]: compiled_fn = create_aot_dispatcher_function(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/utils.py", line 262, in time_wrapper
[rank0]: r = func(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/_functorch/aot_autograd.py", line 628, in create_aot_dispatcher_function
[rank0]: compiled_fn = compiler_fn(flat_fn, fake_flat_args, aot_config, fw_metadata=fw_metadata)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 443, in aot_wrapper_dedupe
[rank0]: return compiler_fn(flat_fn, leaf_flat_args, aot_config, fw_metadata=fw_metadata)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 648, in aot_wrapper_synthetic_base
[rank0]: return compiler_fn(flat_fn, flat_args, aot_config, fw_metadata=fw_metadata)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py", line 119, in aot_dispatch_base
[rank0]: compiled_fw = compiler(fw_module, updated_flat_args)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/utils.py", line 262, in time_wrapper
[rank0]: r = func(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/compile_fx.py", line 1257, in fw_compiler_base
[rank0]: return inner_compile(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/repro/after_aot.py", line 83, in debug_wrapper
[rank0]: inner_compiled_fn = compiler_fn(gm, example_inputs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/debug.py", line 304, in inner
[rank0]: return fn(*args, **kwargs)
[rank0]: File "/usr/lib/python3.10/contextlib.py", line 79, in inner
[rank0]: return func(*args, **kwds)
[rank0]: File "/usr/lib/python3.10/contextlib.py", line 79, in inner
[rank0]: return func(*args, **kwds)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/utils.py", line 262, in time_wrapper
[rank0]: r = func(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/compile_fx.py", line 438, in compile_fx_inner
[rank0]: compiled_graph = fx_codegen_and_compile(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/compile_fx.py", line 714, in fx_codegen_and_compile
[rank0]: compiled_fn = graph.compile_to_fn()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/graph.py", line 1307, in compile_to_fn
[rank0]: return self.compile_to_module().call
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/utils.py", line 262, in time_wrapper
[rank0]: r = func(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/graph.py", line 1250, in compile_to_module
[rank0]: self.codegen_with_cpp_wrapper() if self.cpp_wrapper else self.codegen()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/graph.py", line 1208, in codegen
[rank0]: self.scheduler.codegen()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/utils.py", line 262, in time_wrapper
[rank0]: r = func(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/scheduler.py", line 2339, in codegen
[rank0]: self.get_backend(device).codegen_nodes(node.get_nodes()) # type: ignore[possibly-undefined]
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/codegen/cuda_combined_scheduling.py", line 63, in codegen_nodes
[rank0]: return self._triton_scheduling.codegen_nodes(nodes)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/codegen/triton.py", line 3255, in codegen_nodes
[rank0]: return self.codegen_node_schedule(node_schedule, buf_accesses, numel, rnumel)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/codegen/triton.py", line 3427, in codegen_node_schedule
[rank0]: kernel_name = self.define_kernel(src_code, node_schedule)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/codegen/triton.py", line 3537, in define_kernel
[rank0]: basename, _, kernel_path = get_path(code_hash(src_code.strip()), "py")
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/codecache.py", line 349, in get_path
[rank0]: subdir = os.path.join(cache_dir(), basename[1:3])
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/utils.py", line 739, in cache_dir
[rank0]: sanitized_username = re.sub(r'[\\/:*?"<>|]', "", getpass.getuser())
[rank0]: File "/usr/lib/python3.10/getpass.py", line 169, in getuser
[rank0]: return pwd.getpwuid(os.getuid())[0]
[rank0]: torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
[rank0]: KeyError: 'getpwuid(): uid not found: 1000'
[rank0]: Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
[rank0]: You can suppress this exception and fall back to eager by setting:
[rank0]: import torch._dynamo
[rank0]: torch._dynamo.config.suppress_errors = True
(RayWorkerAphrodite pid=1127) INFO: Model weights loaded. Memory usage: 27.78 GiB x 2 = 55.55 GiB
(RayWorkerAphrodite pid=1127) ERROR: Error executing method determine_num_available_blocks. This might
(RayWorkerAphrodite pid=1127) cause deadlock in distributed execution.
[W CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
This is the log generated with the GPTQ version. The same errors are raised when running the non-quantized version of the model. The GPTQ version works fine on vLLM.
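The final KeyError points at torch._inductor's cache-directory resolution rather than at aphrodite itself: `cache_dir()` calls `getpass.getuser()` to build a per-user path, and that call fails when the container uid has no passwd entry. Below is a sketch of that logic (approximated from the `torch/_inductor/utils.py` frame in the traceback, not the verbatim torch source) together with a possible container-side workaround: setting `TORCHINDUCTOR_CACHE_DIR` so `getuser()` is never called.

```python
import getpass
import os
import re
import tempfile

def inductor_cache_dir() -> str:
    """Approximation of cache_dir() from torch/_inductor/utils.py."""
    cache_dir = os.environ.get("TORCHINDUCTOR_CACHE_DIR")
    if cache_dir is None:
        # This getuser() call is what raises
        # KeyError: 'getpwuid(): uid not found: 1000'
        # when the container uid has no /etc/passwd entry.
        sanitized_username = re.sub(r'[\\/:*?"<>|]', "", getpass.getuser())
        cache_dir = os.path.join(
            tempfile.gettempdir(), "torchinductor_" + sanitized_username
        )
    os.makedirs(cache_dir, exist_ok=True)
    return cache_dir

# Workaround: point the cache at a writable path so the username lookup
# is skipped entirely (e.g. set this in the Docker run environment).
os.environ["TORCHINDUCTOR_CACHE_DIR"] = "/tmp/torchinductor_cache"
print(inductor_cache_dir())  # -> /tmp/torchinductor_cache
```

Setting `USER` (or `LOGNAME`) to any non-empty value in the container environment should have the same effect, since `getpass.getuser()` consults those variables before falling back to the passwd database.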