Fused llama kernel #10266
base: master
Conversation
std::map<int, std::shared_ptr<OpExpr>> ops_;
std::map<int, std::shared_ptr<OpExpr>> ops_with_past_key_value_;
Using a hash map here would be a bit faster.
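A minimal sketch of that suggestion, assuming the two maps above are member caches keyed by layer index (the surrounding class here is hypothetical, not the PR's actual class): `std::unordered_map` gives average O(1) lookup instead of `std::map`'s O(log n).

```cpp
#include <memory>
#include <unordered_map>

class OpExpr;  // OneFlow op-expression type referenced in the diff above

// Hypothetical holder class for illustration only; the PR keeps these as
// members next to the rest of its state.
class FusedLlamaOpCache {
  std::unordered_map<int, std::shared_ptr<OpExpr>> ops_;
  std::unordered_map<int, std::shared_ptr<OpExpr>> ops_with_past_key_value_;
};
```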
const TensorTuple& past_values, const int64_t head_size) const {
  int64_t num_layers = input_norm_weights.size();
  auto& attrs = THREAD_CACHED_MUTABLE_ATTR_MAP("head_size", "num_layers", "parallel_conf");
  auto conf = PbMessage2TxtString(JUST(hidden_states->parallel_desc())->parallel_conf());
This could be cached; serializing the proto object on every call is fairly expensive.
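A minimal, generic sketch of the caching idea, under the assumption that the serialized text depends only on the parallel_desc (the class and key type here are illustrative, not the PR's API): serialize once per key and reuse the string instead of calling PbMessage2TxtString on every forward call.

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>

// Illustrative cache: key is some stable id of the parallel_desc, value is the
// serialized parallel_conf text. `serialize` stands in for
// PbMessage2TxtString(parallel_desc->parallel_conf()).
class ConfTextCache {
 public:
  const std::string& GetOrCompute(int64_t parallel_desc_id,
                                  const std::function<std::string()>& serialize) {
    auto it = cache_.find(parallel_desc_id);
    if (it == cache_.end()) { it = cache_.emplace(parallel_desc_id, serialize()).first; }
    return it->second;
  }

 private:
  std::unordered_map<int64_t, std::string> cache_;
};
```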
.Output("rms_norm_out") | ||
.Output("inv_rms") | ||
.Output("query") | ||
.Output("key") | ||
.Output("value") | ||
.Output("rotary_query") | ||
.Output("rotary_key") | ||
.Output("concat_keys", num_layers) | ||
.Output("concat_values", num_layers) | ||
.Output("attn_out") | ||
.Output("out") | ||
.Output("post_norm_out") | ||
.Output("gate_out") | ||
.Output("glu_out") | ||
.Output("decoder_out") |
Would it be enough to output only "concat_keys", "concat_values", and "decoder_out"? If the rest are intermediate variables, they could live in a temp buffer.
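A self-contained sketch of the temp-buffer idea (not OneFlow's API; whether each intermediate can really be demoted from an op output has to be checked against the kernel): only concat_keys, concat_values, and decoder_out stay as real outputs, while intermediates such as rms_norm_out, query, key, value, and attn_out are carved out of one scratch allocation sized ahead of time.

```cpp
#include <cstddef>
#include <vector>

// One scratch buffer sized up front; intermediate tensors are carved out of it
// with a bump pointer instead of being materialized as separate op outputs.
class ScratchBuffer {
 public:
  explicit ScratchBuffer(size_t total_bytes) : buf_(total_bytes) {}

  // Returns a chunk of `bytes` bytes; the caller must stay within the total
  // size computed ahead of time (e.g. by an infer-tmp-size step).
  void* Alloc(size_t bytes) {
    void* p = buf_.data() + offset_;
    offset_ += bytes;
    return p;
  }

 private:
  std::vector<char> buf_;
  size_t offset_ = 0;
};
```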
Is this how FasterTransformer does it? Does the llama Python implementation need manual changes, or is it achieved automatically through pattern matching?
FasterTransformer is a pure C++ implementation and can be regarded as a special-purpose implementation. The more mature implementations in the FasterTransformer main repo, such as GPT, basically follow the same approach; their PyTorch and TensorFlow versions simply wrap the C++ side.
Code has to be modified by hand to use the fused op.
About the dynamic-shape issue during inference mentioned earlier: does it take the maximum size and allocate memory for that?
Yes, the maximum required memory is allocated.
FT's kv_cache allocates GPU memory for the maximum required length max_cache_seq_len; see https://github.com/void-main/FasterTransformer/blob/main/src/fastertransformer/models/llama/Llama.cc#L102
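A minimal sketch of that worst-case allocation, assuming a half-precision K/V cache laid out as [2, num_layers, batch, heads, max_cache_seq_len, head_size] (the function and layout here are illustrative, not FasterTransformer's exact code): the buffer is sized once for max_cache_seq_len, so growing sequence lengths never trigger a reallocation.

```cpp
#include <cstddef>
#include <cuda_runtime.h>

// Allocate the K/V cache once for the longest supported sequence length.
void* AllocateKvCache(size_t num_layers, size_t batch_size, size_t num_heads,
                      size_t max_cache_seq_len, size_t head_size) {
  // 2x for keys and values; 2 bytes per element for half precision.
  const size_t bytes = 2 * num_layers * batch_size * num_heads *
                       max_cache_seq_len * head_size * 2;
  void* ptr = nullptr;
  cudaMalloc(&ptr, bytes);
  return ptr;
}
```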
Code got formatted by CI. Please request CI again if you still want to have this PR merged. If the PR is from a forked repo, please download the patch files from the GitHub Actions web page and apply them locally.
namespace oneflow {
namespace cuda {
namespace rms_norm_output_norm_arg {
A comment could be added here explaining that there are two outputs and what each of them is.
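One possible wording for the requested comment; what the two outputs actually hold is a guess from the namespace name and the nearby output list ("rms_norm_out", "inv_rms") and should be confirmed by the author.

```cpp
namespace rms_norm_output_norm_arg {
// RMSNorm variant with two outputs (meanings guessed, to be confirmed):
//   1. the weighted RMSNorm result consumed by the next step, and
//   2. the normalized argument / inverse-RMS factor, kept so later fused
//      steps can reuse it instead of recomputing the normalization.
}  // namespace rms_norm_output_norm_arg
```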
Parallel inference optimization for the llama model: put all the CUDA kernels of each LlamaDecoderLayer into one large op, to minimize the latency of dispatching instructions from the Python layer.