Hi, I'm launching the server with this command:

```bash
server_vllm.py \
    --model "/data/models_temp/functionary-small-v2.4/" \
    --served-model-name "functionary" \
    --dtype=bfloat16 \
    --max-model-len 2048 \
    --host 0.0.0.0 \
    --port 8000 \
    --enforce-eager \
    --gpu-memory-utilization 0.94
```

on an RTX 3090 (24 GB).
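For anyone reproducing this, a minimal request against the server looks like the following. This is only a sketch, assuming the standard OpenAI-compatible `/v1/chat/completions` route; the prompt and `max_tokens` are placeholders, and the model name matches `--served-model-name` above.

```bash
# Minimal reproduction sketch (assumed OpenAI-compatible route).
# "functionary" matches --served-model-name in the launch command.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "functionary",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 128
  }'
```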
Why am I getting such low generation speed?

```
Avg prompt throughput: 102.2 tokens/s, Avg generation throughput: 2.2 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.8%, CPU KV cache usage: 0.0%
```
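For scale: at 2.2 tokens/s, a 256-token completion takes about 256 / 2.2 ≈ 116 seconds. A 7B-class model in bfloat16 on a 3090 usually decodes a single request at tens of tokens per second, so this looks roughly an order of magnitude too slow.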
This is my config:
```
INFO 05-11 08:17:48 server_vllm.py:473] args: Namespace(host='0.0.0.0', port=8000, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], served_model_name='functionary', grammar_sampling=False, model='/data/models_temp/functionary-small-v2.4/', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='bfloat16', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=2048, guided_decoding_backend='outlines', worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, gpu_memory_utilization=0.94, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=5, disable_log_stats=False, quantization=None, enforce_eager=True, max_context_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', max_cpu_loras=None, device='auto', image_input_type=None, image_token_id=None, image_input_shape=None, image_feature_size=None, scheduler_delay_factor=0.0, enable_chunked_prefill=False, speculative_model=None, num_speculative_tokens=None, speculative_max_model_len=None, model_loader_extra_config=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 05-11 08:17:49 llm_engine.py:98] Initializing an LLM engine (v0.4.1) with config: model='/data/models_temp/functionary-small-v2.4/', speculative_config=None, tokenizer='/data/models_temp/functionary-small-v2.4/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 05-11 08:17:50 utils.py:608] Found nccl from library /root/.config/vllm/nccl/cu12/libnccl.so.2.18.1
INFO 05-11 08:17:50 selector.py:28] Using FlashAttention backend.
INFO 05-11 08:17:53 model_runner.py:173] Loading model weights took 13.4976 GB
INFO 05-11 08:17:53 gpu_executor.py:119] # GPU blocks: 4185, # CPU blocks: 2048
INFO:     Started server process [19]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
```
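Two things worth noting in this log. The KV cache looks healthy: with `block_size=16` and 4185 GPU blocks, it holds 4185 × 16 ≈ 67k tokens, far more than a single 2048-token sequence needs, which matches the 0.8% usage above. The other is `enforce_eager=True`: in vLLM this disables CUDA graph capture and typically costs some generation throughput, so it is worth ruling out first. A first experiment (just a sketch, all other flags unchanged) would be relaunching without that flag:

```bash
# Experiment sketch: same launch as above, minus --enforce-eager,
# so vLLM can capture CUDA graphs for decoding.
server_vllm.py \
    --model "/data/models_temp/functionary-small-v2.4/" \
    --served-model-name "functionary" \
    --dtype=bfloat16 \
    --max-model-len 2048 \
    --host 0.0.0.0 \
    --port 8000 \
    --gpu-memory-utilization 0.94
```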