
MI300 compatibility #1764

Merged: 58 commits into main from mi300-compat, May 17, 2024
Conversation

fxmarty (Collaborator) commented Apr 18, 2024

Adds support for AMD Instinct MI300 in TGI.

The main changes are:

By default, TunableOp tuning results are saved in /data (e.g. /data/tunableop_meta-llama-Llama-2-70b-chat-hf_tp1_rank0.csv) to avoid having to rerun the tuning on each docker run (a minimal sketch of reusing such a file is shown after the example below).

Example:

Validator,PT_VERSION,2.3.0
Validator,ROCM_VERSION,6.1.0.0-82-5fabb4c
Validator,HIPBLASLT_VERSION,0.7.0-1549b021
Validator,GCN_ARCH_NAME,gfx942:sramecc+:xnack-
Validator,ROCBLAS_VERSION,4.1.0-cefa4a9b-dirty
GemmTunableOp_Half_TN,tn_8192_7_28672,Gemm_Rocblas_45475,0.132098
GemmTunableOp_Half_TN,tn_10240_4_8192,Gemm_Rocblas_45546,0.0484431
GemmTunableOp_Half_TN,tn_32000_6_8192,Default,0.149546
GemmTunableOp_Half_TN,tn_32000_3_8192,Gemm_Rocblas_45520,0.147119
GemmTunableOp_Half_TN,tn_8192_3_28672,Gemm_Rocblas_45475,0.132645
GemmTunableOp_Half_TN,tn_10240_3_8192,Gemm_Rocblas_45546,0.0482971
GemmTunableOp_Half_TN,tn_57344_5_8192,Gemm_Rocblas_45520,0.255694
GemmTunableOp_Half_TN,tn_10240_7_8192,Gemm_Rocblas_45517,0.0482522
GemmTunableOp_Half_TN,tn_8192_3_8192,Gemm_Rocblas_45546,0.0444671
GemmTunableOp_Half_TN,tn_8192_5_8192,Gemm_Rocblas_45546,0.0445834
GemmTunableOp_Half_TN,tn_57344_7_8192,Gemm_Rocblas_45520,0.25622
GemmTunableOp_Half_TN,tn_8192_2_28672,Gemm_Rocblas_45475,0.132122
GemmTunableOp_Half_TN,tn_8192_4_8192,Gemm_Rocblas_45517,0.0453191
GemmTunableOp_Half_TN,tn_10240_5_8192,Gemm_Rocblas_45517,0.0482514
GemmTunableOp_Half_TN,tn_8192_5_28672,Gemm_Rocblas_45542,0.133914
GemmTunableOp_Half_TN,tn_8192_2_8192,Gemm_Rocblas_45517,0.0446516
GemmTunableOp_Half_TN,tn_8192_1_28672,Gemm_Hipblaslt_TN_10814,0.131953
GemmTunableOp_Half_TN,tn_10240_2_8192,Gemm_Rocblas_45546,0.0481043
GemmTunableOp_Half_TN,tn_32000_4_8192,Gemm_Rocblas_45520,0.147497
GemmTunableOp_Half_TN,tn_8192_6_28672,Gemm_Rocblas_45529,0.134895
GemmTunableOp_Half_TN,tn_57344_2_8192,Gemm_Rocblas_45520,0.254716
GemmTunableOp_Half_TN,tn_57344_4_8192,Gemm_Rocblas_45520,0.255731
GemmTunableOp_Half_TN,tn_10240_6_8192,Gemm_Rocblas_45517,0.0484816
GemmTunableOp_Half_TN,tn_57344_3_8192,Gemm_Rocblas_45520,0.254701
GemmTunableOp_Half_TN,tn_8192_4_28672,Gemm_Rocblas_45475,0.132159
GemmTunableOp_Half_TN,tn_32000_2_8192,Default,0.147524
GemmTunableOp_Half_TN,tn_32000_5_8192,Default,0.147074
GemmTunableOp_Half_TN,tn_8192_6_8192,Gemm_Rocblas_45546,0.0454045
GemmTunableOp_Half_TN,tn_57344_6_8192,Gemm_Rocblas_45520,0.255582
GemmTunableOp_Half_TN,tn_32000_7_8192,Default,0.146705
GemmTunableOp_Half_TN,tn_8192_7_8192,Gemm_Rocblas_45546,0.0445489
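
As a side note, here is a minimal sketch (not TGI's actual wiring) of how such a tuning file can be produced and reused through PyTorch's TunableOp environment variables; the file path and matrix shapes below are assumptions for illustration only:

import os

# Point TunableOp at a persistent CSV so tuning results survive container restarts.
os.environ["PYTORCH_TUNABLEOP_ENABLED"] = "1"   # enable TunableOp GEMM selection
os.environ["PYTORCH_TUNABLEOP_TUNING"] = "1"    # allow tuning shapes not yet in the file
os.environ["PYTORCH_TUNABLEOP_FILENAME"] = "/data/tunableop_results.csv"  # hypothetical path

import torch  # imported after setting the variables so TunableOp picks them up

a = torch.randn(7, 8192, dtype=torch.float16, device="cuda")
b = torch.randn(8192, 28672, dtype=torch.float16, device="cuda")
c = a @ b  # the first call for a given shape triggers tuning and records the result

On a later run with the same file mounted under /data, the tuned solutions are read back instead of being re-searched.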

Dockerfile_amd (outdated, resolved)
launcher/src/main.rs (outdated, resolved)
seungrokj commented May 16, 2024

@fxmarty Can you please add one more triton.Config?

        triton.Config(
            {
                "BLOCK_M": 128,
                "BLOCK_N": 64,
                "waves_per_eu": 1,
                "PRE_LOAD_V": False,
            },
            num_stages=1,
            num_warps=4,
        ),

This improves the prefill latency of Llama 3 70B with TP8 by about 3.4 to 10% for batch sizes 1 to 32 at seqlen=2048.
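
For context, such a config is typically appended to the kernel's list of autotune candidates, roughly as in the hedged sketch below; the variable name and the other entries are placeholders, not the actual TGI kernel code:

import triton

# Hedged sketch: the extra config suggested above, appended to a list of autotune
# candidates for the Triton attention kernel. Only the last entry mirrors the
# suggestion; the first one is a placeholder for the existing configs.
autotune_configs = [
    triton.Config(
        {"BLOCK_M": 128, "BLOCK_N": 128, "waves_per_eu": 2, "PRE_LOAD_V": False},
        num_stages=1,
        num_warps=4,
    ),
    triton.Config(
        {"BLOCK_M": 128, "BLOCK_N": 64, "waves_per_eu": 1, "PRE_LOAD_V": False},
        num_stages=1,
        num_warps=4,
    ),
]
# The list would then be attached to the @triton.jit kernel via
# @triton.autotune(configs=autotune_configs, key=[...]).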

fxmarty (Collaborator, Author) commented May 16, 2024

@Narsil This PR is ready. Could you take a look?

We are just waiting for a patched/updated rocm/dev-ubuntu-22.04 base image that fixes an issue with libamdhip64.so on certain VMs, which will let us avoid the following workaround:

ARG GITHUB_TOKEN
RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends wget && \
    rm -rf /var/lib/apt/lists/* && \
    wget --header "Authorization: token ${GITHUB_TOKEN}" https://raw.githubusercontent.com/fxmarty/patched_hipruntime/main/libamdhip64.so.6
ENV LD_PRELOAD="/libamdhip64.so.6"

We expect to get the updated Docker image by Monday next week. Do you think a TGI release next Tuesday/Wednesday with this PR included is feasible?

HuggingFaceDocBuilderDev commented:

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Narsil (Collaborator) commented May 16, 2024

> We expect to get the updated Docker image by Monday next week. Do you think a TGI release next Tuesday/Wednesday with this PR included is feasible?

Sure, releases are kind of trivial now.

Dockerfile_amd (outdated), comment on lines 220 to 224:

# COPY ./tgi-entrypoint.sh /tgi-entrypoint.sh
# RUN chmod +x /tgi-entrypoint.sh

# ENTRYPOINT ["/tgi-entrypoint.sh"]
# CMD ["--json-output"]

fxmarty (Collaborator, Author): To clean

Narsil (Collaborator) left a review:

Very nice.

Lots of nits and code structure suggestions, but overall everything looks good.

server/text_generation_server/models/flash_causal_lm.py (outdated, resolved review comments)
@@ -0,0 +1,816 @@
#!/usr/bin/env python
Narsil (Collaborator):

Shouldn't we maybe put this in layers/flash_attn/triton.py? (and flash_attn.py -> flash_attn/__init__.py for simplicity?)

fxmarty (Collaborator, Author):

Could we do that in another PR? Then e.g. paged_attention.py should be moved as well.

server/text_generation_server/utils/flash_attn.py (outdated, resolved)
server/text_generation_server/models/t5.py (outdated, resolved)
bias = None
return cls(weight, bias)

def forward(self, inp: torch.Tensor) -> torch.Tensor:
Narsil (Collaborator):
s/inp/input/

fxmarty (Collaborator, Author):

out = torch.empty(
    inp.shape[0], weight.shape[0], dtype=inp.dtype, device="cuda"
)
if (k == 8192 and (m == 1280 or m == 7168)) or (k == 3584 and m == 8192):
Narsil (Collaborator):

This feels way overspecified, no? Is it really only implemented for these shapes?

The second condition looks way more general.

fxmarty (Collaborator, Author):

cc @gshtras @charlifu, have you tested different rows_per_block? Was it specifically tuned for these shapes?
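
To illustrate the dispatch pattern under discussion, here is a hedged sketch of gating a specialized "skinny GEMM" path on the tuned shapes and falling back to the default matmul otherwise; custom_skinny_gemm is a placeholder, not the actual kernel:

import torch

def custom_skinny_gemm(inp, weight, out):
    # Placeholder standing in for the tuned ROCm kernel; the default matmul keeps
    # this sketch runnable.
    torch.matmul(inp, weight.t(), out=out)

def linear_forward(inp: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    # Shape-gated dispatch mirroring the condition quoted above: use the specialized
    # path only for the shapes it was tuned for, otherwise use the default GEMM.
    m, k = weight.shape[0], inp.shape[1]
    out = torch.empty(inp.shape[0], weight.shape[0], dtype=inp.dtype, device=inp.device)
    if (k == 8192 and m in (1280, 7168)) or (k == 3584 and m == 8192):
        custom_skinny_gemm(inp, weight, out)
    else:
        torch.matmul(inp, weight.t(), out=out)
    return out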

fxmarty (Collaborator, Author) commented May 17, 2024

Now that we have the updated rocm/dev-ubuntu-22.04:6.1.1_hip_update image, this PR can be merged once the build is done and tests are passing.

Narsil merged commit 232e8d5 into main on May 17, 2024 (9 of 10 checks passed).
Narsil deleted the mi300-compat branch on May 17, 2024 at 13:30.
fxmarty added a commit that referenced this pull request on May 17, 2024:

Not all models were tested in #1764. Fixing some more issues (notably starcoder2) here; the full CI will come shortly once we split `build.yml` in two.