Regarding llama3-70b-instruct #1864

Open
chintanshrinath opened this issue May 6, 2024 · 0 comments
Hi,

I am trying to load the full model on a node with 8× A100 80 GB GPUs using the command below:

docker run --gpus all --shm-size 1g -e HUGGING_FACE_HUB_TOKEN=$token -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.0 --model-id $model --max-input-length 8000 --max-total-tokens 8010

However, it is not using all of the GPUs. I also looked at the num_shard option, but did not understand how to set it.

Can you help me make the command above use all of the GPUs? The main concern is that we need to reduce inference time to a production-grade level.
Thanks
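
For reference, a sketch of one likely fix: the TGI launcher exposes a --num-shard option that sets how many GPUs the model is sharded across via tensor parallelism. On an 8-GPU node the command might look like the following (the value 8 and the other flags are carried over from the command above as assumptions, not a verified production config):

# Assumption: 8 GPUs are visible to the container; --num-shard 8
# shards the model across all of them via tensor parallelism.
docker run --gpus all --shm-size 1g \
  -e HUGGING_FACE_HUB_TOKEN=$token \
  -p 8080:80 -v $volume:/data \
  ghcr.io/huggingface/text-generation-inference:2.0 \
  --model-id $model \
  --num-shard 8 \
  --max-input-length 8000 \
  --max-total-tokens 8010

Setting --num-shard explicitly (rather than relying on any auto-detection) makes the intended GPU count unambiguous in the launcher logs.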
