MI300 compatibility #1764
Conversation
@fxmarty Can you please add one more `triton.Config` in text-generation-inference/server/text_generation_server/utils/flash_attn_triton.py, line 261 (at b7e98ba)?
This improves the prefill latency of Llama 3 70B TP8 by about 3.4% to 10% for batch sizes 1 to 32 at seqlen=2048.
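For reference, adding an entry to a Triton autotune configuration list generally looks like the sketch below; the block sizes, `num_warps`, and `num_stages` values are illustrative placeholders, not the tuned values being requested here.

```python
import triton

# Hypothetical extra autotune entry for flash_attn_triton.py.
# BLOCK_M/BLOCK_N, waves_per_eu, num_stages and num_warps are placeholders;
# the actual values would come from profiling the kernel on MI300.
extra_config = triton.Config(
    {"BLOCK_M": 128, "BLOCK_N": 64, "waves_per_eu": 2, "PRE_LOAD_V": False},
    num_stages=1,
    num_warps=4,
)
```

Appending such a config to the kernel's `@triton.autotune(configs=[...])` list lets the autotuner benchmark it alongside the existing entries and pick the fastest variant per input shape.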
@Narsil This PR is ready, could you take a look? We are just waiting for a patched / updated text-generation-inference/Dockerfile_amd, lines 117 to 122 (at afc7473).
We are expecting to get the updated Docker image by Monday next week. Do you think a TGI release next Tuesday/Wednesday with this PR in is feasible?
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Sure, releases are kind of trivial now.
Dockerfile_amd (Outdated)

# COPY ./tgi-entrypoint.sh /tgi-entrypoint.sh
# RUN chmod +x /tgi-entrypoint.sh

# ENTRYPOINT ["/tgi-entrypoint.sh"]
# CMD ["--json-output"]
To clean
Very nice.
Lots of nits and code structure suggestions, but overall everything looks good.
@@ -0,0 +1,816 @@
#!/usr/bin/env python
Shouldn't we put this in layers/flash_attn/triton.py maybe? (and flash_attn.py -> flash_attn/__init__.py for simplicity?)
Could we do that in another PR? Then e.g. paged_attention.py should be moved as well.
bias = None
return cls(weight, bias)

def forward(self, inp: torch.Tensor) -> torch.Tensor:
s/inp/input/
out = torch.empty(
    inp.shape[0], weight.shape[0], dtype=inp.dtype, device="cuda"
)
if (k == 8192 and (m == 1280 or m == 7168)) or (k == 3584 and m == 8192):
This feels way overspecified, no?
Is it really only implemented for these shapes?
The second condition looks way more general.
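To illustrate the concern, a shape-gated dispatch of this kind usually boils down to something like the sketch below, with a fallback to the generic linear path for unprofiled shapes; the `PROFILED_SHAPES` set and the `skinny_gemm` placeholder are hypothetical, not the code in this PR.

```python
import torch
import torch.nn.functional as F

# Hypothetical dispatch: only (k, m) shapes that were explicitly profiled take
# the custom GEMM path; every other shape falls back to F.linear.
PROFILED_SHAPES = {(8192, 1280), (8192, 7168), (3584, 8192)}

def linear_forward(inp: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    m, k = weight.shape
    if (k, m) in PROFILED_SHAPES:
        out = torch.empty(inp.shape[0], m, dtype=inp.dtype, device=inp.device)
        # skinny_gemm(inp, weight, out)  # placeholder for the custom kernel call
        return out
    return F.linear(inp, weight)
```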
As we got an updated …
Not all models were tested in #1764. Fixing some more issues (notably starcoder2) here; the full CI will come shortly once we split `build.yml` in two.
Adds support for AMD Instinct MI300 in TGI.

Most changes are:
- `PYTORCH_TUNABLEOP_ENABLED=1`.
- `ROCM_USE_FLASH_ATTN_V2_TRITON=1`.

By default, TunableOp tuning results are saved in `/data` (e.g. `/data/tunableop_meta-llama-Llama-2-70b-chat-hf_tp1_rank0.csv`) in order to avoid having to rerun the tuning at each `docker run`.

Example:
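The original example is not reproduced above; a minimal sketch of what such an invocation could look like is given below, assuming the usual TGI ROCm container setup. The image tag, model ID, and flag values are assumptions, not the exact command from the PR.

```bash
# Hypothetical docker run for TGI on MI300; image tag and model ID are illustrative.
# Mounting $PWD/data to /data persists the TunableOp tuning CSVs across runs.
docker run --rm -it \
    --device=/dev/kfd --device=/dev/dri \
    --group-add video \
    --ipc=host --shm-size 1g \
    -e PYTORCH_TUNABLEOP_ENABLED=1 \
    -e ROCM_USE_FLASH_ATTN_V2_TRITON=1 \
    -v $PWD/data:/data \
    ghcr.io/huggingface/text-generation-inference:latest-rocm \
    --model-id meta-llama/Llama-2-70b-chat-hf
```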