[Inductor] support masked vectorization for the tail_loop #126526
base: gh/jiayisunx/10/base
Conversation
🔗 Helpful links: see artifacts and rendered test results at hud.pytorch.org/pr/126526. Note: links to docs will display an error until the doc builds have completed. ❗ There is 1 currently active SEV; if your PR is affected, please view it below.
✅ You can merge normally! (6 unrelated failures) as of commit b72e62a with merge base bf2909b:
- FLAKY: the following job failed, but likely due to flakiness present on trunk.
- BROKEN TRUNK: the following jobs failed, but were already failing on the merge base. 👉 Rebase onto the `viable/strict` branch to avoid these failures.
- UNSTABLE: the following jobs failed, likely due to flakiness present on trunk, and have been marked as unstable.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
```python
self.supported_dtypes_for_masked_vec: List[torch.dtype] = [
    torch.float,
    torch.bfloat16,
```
Why do we want to limit this? Can we treat it the same as the normal vec check? We are simplifying the vec checker and aiming to remove it entirely some day. It's not good to complicate it further.
Currently, some operations on integer types do not support masked vectorization well, so these data types cannot yet use masked vectorization for the tail_loop. I will try to solve these issues in the near future.
Please add a comment here that it will be removed in the near future after we support all data types.
Done.
BTW, do you mind stacking a PR now to support all data types?
torch/_inductor/codegen/cpp_prefix.h (outdated)
```cpp
template <typename T>
T reduce(const T& a, const T& b, const std::string& reduction_type) {
  if (reduction_type == "max") {
```
This essentially moves the compile-time checks to runtime. I don't think that is the right thing to do.
I have modified this part, please review it again, thanks!
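(For illustration only: one way to keep this dispatch at compile time is to select the reduction kind via a template parameter, so the branch is resolved when the kernel is compiled rather than by comparing strings at runtime. This is a hedged sketch, not the PR's actual fix; `ReductionType` and this `reduce` signature are assumed names.)

```cpp
#include <algorithm>

enum class ReductionType { Max, Min, Sum };

// The reduction kind is a template parameter, so `if constexpr` prunes the
// untaken branches at compile time instead of testing a runtime string.
template <typename T, ReductionType kind>
T reduce(const T& a, const T& b) {
  if constexpr (kind == ReductionType::Max) {
    return std::max(a, b);
  } else if constexpr (kind == ReductionType::Min) {
    return std::min(a, b);
  } else {
    return a + b;  // ReductionType::Sum
  }
}
```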
torch/_inductor/codegen/cpp_prefix.h (outdated)
```cpp
// Guard against division by zero
wb_over_w = T::blendv(wb_over_w, T(0), new_weight == T(0));
auto new_mean = a.mean + delta * wb_over_w;
auto new_m2 = a.m2 + b.m2 + delta * delta * a.weight * wb_over_w;
```
Can we avoid code duplication between the tail version and the main version?
Done.
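(For context, the excerpt above is the standard Welford/Chan combine of two partial accumulators. A scalar sketch of the same update follows, with assumed names; the vectorized version replaces the scalar zero check with the `T::blendv` guard shown above.)

```cpp
// Scalar sketch of the Welford combine in the excerpt (assumed names).
// Merges two partial {mean, m2, weight} accumulators into one.
struct Welford {
  double mean = 0.0, m2 = 0.0, weight = 0.0;
};

Welford welford_combine(const Welford& a, const Welford& b) {
  double delta = b.mean - a.mean;
  double new_weight = a.weight + b.weight;
  // Guard against division by zero (blendv in the vector version).
  double wb_over_w = (new_weight == 0.0) ? 0.0 : b.weight / new_weight;
  Welford out;
  out.mean = a.mean + delta * wb_over_w;
  out.m2 = a.m2 + b.m2 + delta * delta * a.weight * wb_over_w;
  out.weight = new_weight;
  return out;
}
```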
torch/_inductor/codegen/cpp_prefix.h (outdated)
```cpp
    : delta * w->weight_recps[new_index]);
}
auto new_delta = data - new_mean;
auto new_m2 = acc.m2 + delta * new_delta;
```
ditto
Done.
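(Likewise, a scalar sketch of the running per-element Welford update in this excerpt, with assumed names; `weight_recps` caches the reciprocal weights so the hot loop avoids a division.)

```cpp
#include <cstdint>
#include <vector>

// Scalar sketch of the per-element Welford update (assumed names).
struct WelfordAcc {
  float mean = 0.0f, m2 = 0.0f, weight = 0.0f;
};

WelfordAcc welford_update(WelfordAcc acc, float data,
                          const std::vector<float>& weight_recps,
                          int64_t new_index) {
  acc.weight += 1.0f;
  float delta = data - acc.mean;
  // Multiply by the cached reciprocal weight instead of dividing by weight.
  float new_mean = acc.mean + delta * weight_recps[new_index];
  float new_delta = data - new_mean;
  acc.m2 = acc.m2 + delta * new_delta;
  acc.mean = new_mean;
  return acc;
}
```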
torch/_inductor/codegen/cpp.py (outdated)
```python
if self.tiling_idx >= self.reduction_depth:
    # calculate the reduction size that will be vectorized
    reduction_inner_size = (
        self.ranges[-1]
```
`self.ranges[-1]` holds for either of the conditions, right?
Also, is the assumption that `self.tiling_idx == len(self.ranges) - 1`, so that the vectorization happens on the inner-most loop?
Yes, vectorization happens on the inner-most loop.
According to what has been observed so far, `self.ranges[-1]` holds for either of the conditions.
Then why can't we just use `self.ranges[-1]`?
> Yes, vectorization happens on the inner-most loop.

To be safe, can we add an assertion here?
Added, please review it again, thanks!
torch/_inductor/codegen/cpp.py (outdated)
```python
        else self.ranges[self.reduction_depth]
    )
    # calculate the loop size outside the vectorized loop
    self.reduction_outer_size = reduction_size / reduction_inner_size
```
Suggested change:
```diff
- self.reduction_outer_size = reduction_size / reduction_inner_size
+ self.reduction_outer_size = reduction_size // reduction_inner_size
```
Done, thanks!
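(To illustrate what these sizes describe in the emitted kernel: the reduction range splits into an outer loop of `reduction_outer_size` iterations around the vectorized inner-most loop over `reduction_inner_size` elements, which is also why the split must use integer division. A hedged C++ sketch with assumed names, not Inductor's actual output:)

```cpp
#include <cstdint>

// reduction_size == reduction_outer_size * reduction_inner_size.
// Only the inner-most loop is vectorized; its remainder becomes the
// masked tail this PR handles.
float sum_reduce(const float* in,
                 int64_t reduction_outer_size,
                 int64_t reduction_inner_size) {
  float acc = 0.0f;
  for (int64_t o = 0; o < reduction_outer_size; ++o) {
    // This inner loop is the one emitted as vector code.
    for (int64_t i = 0; i < reduction_inner_size; ++i) {
      acc += in[o * reduction_inner_size + i];
    }
  }
  return acc;
}
```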
- I think we need UTs to check the vec mask used for the tail loop.
- Would it be clearer to add a new subclass, maybe `CppVecMaskKernel(CppVecKernel)`, and lift the common code from `CppVecKernel` into methods to be overridden or reused?
```cpp
int64_t outer_size;
int64_t main_size;
int64_t tail_size;
std::vector<T> weight_recps;
```
Why do we change the type from `T::value_type` to `T`?
We use `T`, i.e., the vector type instead of the scalar type, because in the masked vectorized Welford reduction the element values of `weight` (a vector) may no longer all be identical, since `weight` may be masked.
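(A small sketch of why the lanes diverge, with an assumed vector width: after the masked tail, only the active lanes have been incremented, so a single scalar reciprocal no longer describes every lane and `weight_recps` must be stored per lane.)

```cpp
#include <array>
#include <cstdint>

constexpr int64_t kLanes = 8;  // assumed vector width
using Vec = std::array<float, kLanes>;

// After the main loop, weight = {n, n, ..., n} in every lane. In the masked
// tail only `tail_size` lanes are active, so weight becomes
// {n+1, ..., n+1, n, ..., n}: the lanes are no longer identical, hence
// std::vector<T> (per-lane vectors) rather than T::value_type (scalars).
Vec masked_increment(Vec weight, int64_t tail_size) {
  for (int64_t lane = 0; lane < kLanes; ++lane) {
    if (lane < tail_size) {
      weight[lane] += 1.0f;
    }
  }
  return weight;
}
```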
Added, thanks!
I don't have a strong opinion, but adding a new subclass might introduce some code duplication. @jgong5, do you have any opinion?
I guess what @leslie-fang-intel meant was to factor out some functions from `CppVecKernel`.
Stack from ghstack (oldest at bottom):
Currently, the tail_loop always uses the scalar kernel. This PR adds masked vectorization support for the tail_loop to improve performance.
Generated code:
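(The actual emitted kernel was attached in the original description and is omitted here. As a hedged illustration only, the shape of a main vectorized loop followed by a masked vectorized tail might look like the following; the kernel name and pointer names are assumptions, though `at::vec::Vectorized` does provide count-limited `loadu`/`store` overloads.)

```cpp
#include <ATen/cpu/vec/vec.h>
#include <cstdint>

// Before this PR the tail ran as a scalar loop; with masked vectorization
// the tail reuses the vector body with loads/stores limited to `tail` lanes.
void relu_kernel(const float* in_ptr0, float* out_ptr0, int64_t xnumel) {
  using Vec = at::vec::Vectorized<float>;
  const int64_t vec_size = Vec::size();
  int64_t x0 = 0;
  // Main loop: full vectors of vec_size elements.
  for (; x0 + vec_size <= xnumel; x0 += vec_size) {
    auto tmp0 = Vec::loadu(in_ptr0 + x0);
    auto tmp1 = at::vec::clamp_min(tmp0, Vec(0.0f));
    tmp1.store(out_ptr0 + x0);
  }
  // Masked tail: same vector body, count-limited load and store.
  if (x0 < xnumel) {
    const int64_t tail = xnumel - x0;
    auto tmp0 = Vec::loadu(in_ptr0 + x0, tail);
    auto tmp1 = at::vec::clamp_min(tmp0, Vec(0.0f));
    tmp1.store(out_ptr0 + x0, tail);
  }
}
```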
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang