MI300 compatibility #1764
Conversation
@fxmarty Can you please add one more `triton.Config` in text-generation-inference/server/text_generation_server/utils/flash_attn_triton.py, line 261 (at b7e98ba)?
This improves the prefill latency of Llama 3 70B TP8 by about 3.4% to 10% for batch sizes 1 to 32 at seqlen=2048.
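For reference, adding an entry to a Triton autotune configuration list generally looks like the sketch below; the block sizes, `num_warps`, and `num_stages` values are illustrative placeholders, not the tuned values being requested here.

```python
import triton

# Hypothetical extra autotune entry for flash_attn_triton.py.
# BLOCK_M/BLOCK_N, waves_per_eu, num_stages and num_warps are placeholders;
# the actual values would come from profiling the kernel on MI300.
extra_config = triton.Config(
    {"BLOCK_M": 128, "BLOCK_N": 64, "waves_per_eu": 2, "PRE_LOAD_V": False},
    num_stages=1,
    num_warps=4,
)
```

Appending such a config to the kernel's `@triton.autotune(configs=[...])` list lets the autotuner benchmark it alongside the existing entries and pick the fastest variant per input shape.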
@Narsil This PR is ready, could you take a look? We are just waiting for a patched / updated text-generation-inference/Dockerfile_amd, lines 117 to 122 (at afc7473).
We are expecting to get the updated Docker image by Monday next week. Do you think a TGI release next Tuesday/Wednesday with this PR in is feasible?
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Sure, releases are kind of trivial now.
Dockerfile_amd (Outdated)

# COPY ./tgi-entrypoint.sh /tgi-entrypoint.sh
# RUN chmod +x /tgi-entrypoint.sh

# ENTRYPOINT ["/tgi-entrypoint.sh"]
# CMD ["--json-output"]
To clean
Very nice.
Lots of nits and code structure suggestions, but overall everything looks good.
@@ -0,0 +1,816 @@
#!/usr/bin/env python
Shouldn't we put this in layers/flash_attn/triton.py maybe? (and flash_attn.py -> flash_attn/__init__.py for simplicity?)
Could we do that in another PR? Then e.g. paged_attention.py should be moved as well.
bias = None
return cls(weight, bias)

def forward(self, inp: torch.Tensor) -> torch.Tensor:
s/inp/input/
out = torch.empty(
    inp.shape[0], weight.shape[0], dtype=inp.dtype, device="cuda"
)
if (k == 8192 and (m == 1280 or m == 7168)) or (k == 3584 and m == 8192):
This feels way overspecified, no?
Is it really only implemented for these shapes?
The second condition looks way more general.
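To illustrate the concern, a shape-gated dispatch of this kind usually boils down to something like the sketch below, with a fallback to the generic linear path for unprofiled shapes; the `PROFILED_SHAPES` set and the `skinny_gemm` placeholder are hypothetical, not the code in this PR.

```python
import torch
import torch.nn.functional as F

# Hypothetical dispatch: only (k, m) shapes that were explicitly profiled take
# the custom GEMM path; every other shape falls back to F.linear.
PROFILED_SHAPES = {(8192, 1280), (8192, 7168), (3584, 8192)}

def linear_forward(inp: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    m, k = weight.shape
    if (k, m) in PROFILED_SHAPES:
        out = torch.empty(inp.shape[0], m, dtype=inp.dtype, device=inp.device)
        # skinny_gemm(inp, weight, out)  # placeholder for the custom kernel call
        return out
    return F.linear(inp, weight)
```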
As we got an updated …
Not all models were tested in #1764. Fixing some more issues (notably starcoder2) here; the full CI will come shortly once we split `build.yml` in two.
Adds support for AMD Instinct MI300 in TGI.

Most changes are:
- `PYTORCH_TUNABLEOP_ENABLED=1`.
- `ROCM_USE_FLASH_ATTN_V2_TRITON=1`.

By default, TunableOp tuning results are saved in `/data` (e.g. `/data/tunableop_meta-llama-Llama-2-70b-chat-hf_tp1_rank0.csv`) in order to avoid having to rerun the tuning at each `docker run`.

Example:
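The original example is not reproduced above; a minimal sketch of what such an invocation could look like is given below, assuming the usual TGI ROCm container setup. The image tag, model ID, and flag values are assumptions, not the exact command from the PR.

```bash
# Hypothetical docker run for TGI on MI300; image tag and model ID are illustrative.
# Mounting $PWD/data to /data persists the TunableOp tuning CSVs across runs.
docker run --rm -it \
    --device=/dev/kfd --device=/dev/dri \
    --group-add video \
    --ipc=host --shm-size 1g \
    -e PYTORCH_TUNABLEOP_ENABLED=1 \
    -e ROCM_USE_FLASH_ATTN_V2_TRITON=1 \
    -v $PWD/data:/data \
    ghcr.io/huggingface/text-generation-inference:latest-rocm \
    --model-id meta-llama/Llama-2-70b-chat-hf
```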