Fused llama kernel #10266
base: master
Conversation
std::map<int, std::shared_ptr<OpExpr>> ops_;
std::map<int, std::shared_ptr<OpExpr>> ops_with_past_key_value_;
Using a hash map here would be a bit faster.
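A minimal sketch of that suggestion, assuming the two maps above are member caches keyed by layer index (the surrounding class here is hypothetical, not the PR's actual class): `std::unordered_map` gives average O(1) lookup instead of `std::map`'s O(log n).

```cpp
#include <memory>
#include <unordered_map>

class OpExpr;  // OneFlow op-expression type referenced in the diff above

// Hypothetical holder class for illustration only; the PR keeps these as
// members next to the rest of its state.
class FusedLlamaOpCache {
  std::unordered_map<int, std::shared_ptr<OpExpr>> ops_;
  std::unordered_map<int, std::shared_ptr<OpExpr>> ops_with_past_key_value_;
};
```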
const TensorTuple& past_values, const int64_t head_size) const {
  int64_t num_layers = input_norm_weights.size();
  auto& attrs = THREAD_CACHED_MUTABLE_ATTR_MAP("head_size", "num_layers", "parallel_conf");
  auto conf = PbMessage2TxtString(JUST(hidden_states->parallel_desc())->parallel_conf());
This could be cached; serializing the proto object on every call is fairly expensive.
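A minimal, generic sketch of the caching idea, under the assumption that the serialized text depends only on the parallel_desc (the class and key type here are illustrative, not the PR's API): serialize once per key and reuse the string instead of calling PbMessage2TxtString on every forward call.

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>

// Illustrative cache: key is some stable id of the parallel_desc, value is the
// serialized parallel_conf text. `serialize` stands in for
// PbMessage2TxtString(parallel_desc->parallel_conf()).
class ConfTextCache {
 public:
  const std::string& GetOrCompute(int64_t parallel_desc_id,
                                  const std::function<std::string()>& serialize) {
    auto it = cache_.find(parallel_desc_id);
    if (it == cache_.end()) { it = cache_.emplace(parallel_desc_id, serialize()).first; }
    return it->second;
  }

 private:
  std::unordered_map<int64_t, std::string> cache_;
};
```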
.Output("rms_norm_out") | ||
.Output("inv_rms") | ||
.Output("query") | ||
.Output("key") | ||
.Output("value") | ||
.Output("rotary_query") | ||
.Output("rotary_key") | ||
.Output("concat_keys", num_layers) | ||
.Output("concat_values", num_layers) | ||
.Output("attn_out") | ||
.Output("out") | ||
.Output("post_norm_out") | ||
.Output("gate_out") | ||
.Output("glu_out") | ||
.Output("decoder_out") |
Would it be enough to output only "concat_keys", "concat_values", and "decoder_out"? If the rest are intermediate variables, they could live in a temp buffer.
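A self-contained sketch of the temp-buffer idea (not OneFlow's API; whether each intermediate can really be demoted from an op output has to be checked against the kernel): only concat_keys, concat_values, and decoder_out stay as real outputs, while intermediates such as rms_norm_out, query, key, value, and attn_out are carved out of one scratch allocation sized ahead of time.

```cpp
#include <cstddef>
#include <vector>

// One scratch buffer sized up front; intermediate tensors are carved out of it
// with a bump pointer instead of being materialized as separate op outputs.
class ScratchBuffer {
 public:
  explicit ScratchBuffer(size_t total_bytes) : buf_(total_bytes) {}

  // Returns a chunk of `bytes` bytes; the caller must stay within the total
  // size computed ahead of time (e.g. by an infer-tmp-size step).
  void* Alloc(size_t bytes) {
    void* p = buf_.data() + offset_;
    offset_ += bytes;
    return p;
  }

 private:
  std::vector<char> buf_;
  size_t offset_ = 0;
};
```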
Is this how FasterTransformer does it? Does the llama Python implementation need manual changes, or is it achieved automatically through pattern matching?
FasterTransformer is a pure C++ implementation and can be regarded as a special-purpose implementation. The more mature implementations in the FasterTransformer main repo, such as GPT, basically follow the same approach; their PyTorch and TensorFlow versions simply wrap the C++ side.
Code has to be modified by hand to use the fused op.
About the dynamic-shape issue during inference mentioned earlier: does it take the maximum size and allocate memory for that?
Yes, the maximum required memory is allocated.
FT's kv_cache allocates GPU memory for the maximum required length max_cache_seq_len; see https://github.com/void-main/FasterTransformer/blob/main/src/fastertransformer/models/llama/Llama.cc#L102
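A minimal sketch of that worst-case allocation, assuming a half-precision K/V cache laid out as [2, num_layers, batch, heads, max_cache_seq_len, head_size] (the function and layout here are illustrative, not FasterTransformer's exact code): the buffer is sized once for max_cache_seq_len, so growing sequence lengths never trigger a reallocation.

```cpp
#include <cstddef>
#include <cuda_runtime.h>

// Allocate the K/V cache once for the longest supported sequence length.
void* AllocateKvCache(size_t num_layers, size_t batch_size, size_t num_heads,
                      size_t max_cache_seq_len, size_t head_size) {
  // 2x for keys and values; 2 bytes per element for half precision.
  const size_t bytes = 2 * num_layers * batch_size * num_heads *
                       max_cache_seq_len * head_size * 2;
  void* ptr = nullptr;
  cudaMalloc(&ptr, bytes);
  return ptr;
}
```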
Code got formatted by CI. Please request CI again if you still want to have this PR merged. If the PR is from a forked repo, please download the patch files from the GitHub Actions web page and apply them locally.
namespace oneflow {
namespace cuda {
namespace rms_norm_output_norm_arg {
A comment could be added here explaining that there are two outputs and what each of them is.
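One possible wording for the requested comment; what the two outputs actually hold is a guess from the namespace name and the nearby output list ("rms_norm_out", "inv_rms") and should be confirmed by the author.

```cpp
namespace rms_norm_output_norm_arg {
// RMSNorm variant with two outputs (meanings guessed, to be confirmed):
//   1. the weighted RMSNorm result consumed by the next step, and
//   2. the normalized argument / inverse-RMS factor, kept so later fused
//      steps can reuse it instead of recomputing the normalization.
}  // namespace rms_norm_output_norm_arg
```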
Parallel inference optimization for the llama model: put all the CUDA kernels of each LlamaDecoderLayer into one large op, to minimize the latency of dispatching instructions from the Python layer.