
[Inductor] support masked vectorization for the tail_loop #126526

Open · wants to merge 24 commits into base: gh/jiayisunx/10/base

Conversation

@jiayisunx (Collaborator) commented May 17, 2024

Stack from ghstack (oldest at bottom):

Currently the tail_loop always uses the scalar kernel. This PR adds masked vectorization support for the tail_loop to improve performance: the tail iterations are handled with a partial (masked) vector load instead of falling back to scalar code.

Generated code:

  • Before:
    {
        #pragma GCC ivdep
        for(long x0=static_cast<long>(0L); x0<static_cast<long>(2L); x0+=static_cast<long>(1L))
        {
            #pragma GCC ivdep
            for(long x1=static_cast<long>(0L); x1<static_cast<long>(3L); x1+=static_cast<long>(1L))
            {
                {
                    Welford<float> tmp_acc0 = Welford<float>();
                    Welford<at::vec::Vectorized<float>> tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                    static WeightRecp<at::vec::Vectorized<float>> weight_recps(static_cast<long>(67L));
                    for(long x2=static_cast<long>(0L); x2<static_cast<long>(36L); x2+=static_cast<long>(1L))
                    {
                        for(long x3=static_cast<long>(0L); x3<static_cast<long>(16L); x3+=static_cast<long>(16L))
                        {
                            auto tmp0 = at::vec::Vectorized<bfloat16>::loadu(in_ptr0 + static_cast<long>(x3 + (30L*x1) + (90L*x2) + (3240L*x0)), 16);
                            auto tmp1 = at::vec::convert<float>(tmp0);
                            tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp1, &weight_recps);
                        }
                        #pragma omp simd simdlen(8)
                        for(long x3=static_cast<long>(16L); x3<static_cast<long>(30L); x3+=static_cast<long>(1L))
                        {
                            auto tmp0 = in_ptr0[static_cast<long>(x3 + (30L*x1) + (90L*x2) + (3240L*x0))];
                            auto tmp1 = c10::convert<float>(tmp0);
                            tmp_acc0 = welford_combine(tmp_acc0, tmp1);
                        }
                    }
                    tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(tmp_acc0_vec));
                    out_ptr0[static_cast<long>(x1 + (3L*x0))] = static_cast<float>(tmp_acc0.mean);
                    out_ptr1[static_cast<long>(x1 + (3L*x0))] = static_cast<float>(tmp_acc0.m2);
                }
            }
        }
    }
  • After:
    {
        #pragma GCC ivdep
        for(long x0=static_cast<long>(0L); x0<static_cast<long>(2L); x0+=static_cast<long>(1L))
        {
            #pragma GCC ivdep
            for(long x1=static_cast<long>(0L); x1<static_cast<long>(3L); x1+=static_cast<long>(1L))
            {
                {
                    Welford<float> tmp_acc0 = Welford<float>();
                    Welford<at::vec::Vectorized<float>> tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                    static WeightRecp<at::vec::Vectorized<float>> weight_recps(static_cast<long>(36L), static_cast<long>(16L), static_cast<long>(14L));
                    for(long x2=static_cast<long>(0L); x2<static_cast<long>(36L); x2+=static_cast<long>(1L))
                    {
                        for(long x3=static_cast<long>(0L); x3<static_cast<long>(16L); x3+=static_cast<long>(16L))
                        {
                            auto tmp0 = at::vec::Vectorized<bfloat16>::loadu(in_ptr0 + static_cast<long>(x3 + (30L*x1) + (90L*x2) + (3240L*x0)), 16);
                            auto tmp1 = at::vec::convert<float>(tmp0);
                            tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp1, &weight_recps);
                        }
                        for(long x3=static_cast<long>(16L); x3<static_cast<long>(30L); x3+=static_cast<long>(14L))
                        {
                            auto tmp0 = at::vec::Vectorized<bfloat16>::loadu(in_ptr0 + static_cast<long>(x3 + (30L*x1) + (90L*x2) + (3240L*x0)), 14);
                            auto tmp1 = at::vec::convert<float>(tmp0);
                            tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp1, 14, &weight_recps);
                        }
                    }
                    tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(tmp_acc0_vec));
                    out_ptr0[static_cast<long>(x1 + (3L*x0))] = static_cast<float>(tmp_acc0.mean);
                    out_ptr1[static_cast<long>(x1 + (3L*x0))] = static_cast<float>(tmp_acc0.m2);
                }
            }
        }
    }
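
The key change is in the tail loop over x3 = 16..29: before, the 14 trailing elements were reduced one scalar at a time under `#pragma omp simd`; after, a single partial vector load (`loadu` with an explicit count of 14) feeds a masked `welford_combine`, so the tail contributes to the vectorized accumulator `tmp_acc0_vec`, and the extended `WeightRecp` tracks the outer, main, and tail sizes separately. Below is a minimal standalone sketch of the same main-loop/masked-tail pattern, using a plain sum instead of Welford for brevity; the function name and sizes are illustrative, but `loadu(ptr, count)` is the same ATen call the generated code uses (it reads only `count` elements and zero-fills the remaining lanes):

    #include <ATen/cpu/vec/vec.h>

    // Sketch only: sum n floats with a full-width main loop and a single
    // masked (partial-load) tail step instead of a scalar tail loop.
    float sum_with_masked_tail(const float* in, long n) {
      using Vec = at::vec::Vectorized<float>;
      constexpr long VLEN = Vec::size();
      Vec acc(0.0f);
      const long main_end = n / VLEN * VLEN;
      for (long i = 0; i < main_end; i += VLEN) {
        acc = acc + Vec::loadu(in + i);  // full-width load, as in the main loop
      }
      if (n > main_end) {
        // Partial load: only n - main_end elements are read; the remaining
        // lanes are zero-filled, so they add nothing to the sum.
        acc = acc + Vec::loadu(in + main_end, n - main_end);
      }
      float buf[VLEN];
      acc.store(buf);
      float result = 0.0f;
      for (long i = 0; i < VLEN; ++i) result += buf[i];
      return result;
    }

For Welford, zero-filled lanes would skew the running mean, which is why the generated `welford_combine` also receives the tail size (14) and the reworked `WeightRecp` bookkeeping above.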

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang

pytorch-bot bot commented May 17, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/126526

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

✅ You can merge normally! (6 Unrelated Failures)

As of commit b72e62a with merge base bf2909b:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

UNSTABLE - The following jobs failed but were likely due to flakiness present on trunk and have been marked as unstable:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

jiayisunx added a commit that referenced this pull request May 17, 2024
@jiayisunx jiayisunx marked this pull request as draft May 17, 2024 09:17
jiayisunx added further commits that referenced this pull request May 20 – May 27, 2024
CaoE pushed a commit to CaoE/pytorch that referenced this pull request May 26, 2024
@jiayisunx jiayisunx marked this pull request as ready for review May 27, 2024 03:02
Comment on lines +2818 to +2820

self.supported_dtypes_for_masked_vec: List[torch.dtype] = [
    torch.float,
    torch.bfloat16,
Collaborator:

Why do we want to limit this? Can we treat it the same as the normal vec check? We are simplifying the vec checker and aim to remove it entirely some day; it's not good to complicate it further.

Collaborator Author (@jiayisunx), May 27, 2024:

Currently, some operations on integer types do not support masked vectorization well, so those data types cannot yet use masked vectorization for the tail_loop. I will try to solve these issues in the near future.

Collaborator:

Please add a comment here noting that this restriction will be removed in the near future, once all data types are supported.

Collaborator Author:

Done.

Collaborator:

BTW, do you mind stacking a PR now to support all data types?


template <typename T>
T reduce(const T& a, const T& b, const std::string& reduction_type) {
  if (reduction_type == "max") {
Collaborator:

This essentially moves the compile-time checks to runtime. I don't think it is the right thing to do.

Collaborator Author:

I have modified this part, please review it again, thanks!
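
For context on the concern above: dispatching on a runtime string means every call pays a string comparison that the compiler generally cannot fold away, whereas the original compile-time style resolves the choice during template instantiation. A minimal sketch of the compile-time alternative (the enum and function names are illustrative, not the PR's actual code):

    #include <algorithm>

    // Illustrative only: select the reduction at compile time via a template
    // parameter, so no runtime string comparison is needed.
    enum class ReductionType { Max, Min, Sum };

    template <ReductionType kind, typename T>
    T reduce(const T& a, const T& b) {
      if constexpr (kind == ReductionType::Max) {
        return std::max(a, b);
      } else if constexpr (kind == ReductionType::Min) {
        return std::min(a, b);
      } else {
        return a + b;
      }
    }

    // Usage: reduce<ReductionType::Max>(x, y) instantiates to a plain max.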

// Guard against division by zero
wb_over_w = T::blendv(wb_over_w, T(0), new_weight == T(0));
auto new_mean = a.mean + delta * wb_over_w;
auto new_m2 = a.m2 + b.m2 + delta * delta * a.weight * wb_over_w;
Collaborator:

Can we avoid code duplication between the tail version and the main version?

Collaborator Author:

Done.
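
For readers following the thread: the duplication in question is between the full-width combine and the masked-tail combine, which share the Welford update arithmetic shown in the excerpt. A scalar sketch of that shared update (struct and field names follow the excerpt; this is not the PR's final code):

    // Chan's parallel Welford update, written once so that both the main
    // loop and the masked tail loop can call it.
    template <typename T>
    struct Welford {
      T mean = T(0);
      T m2 = T(0);
      T weight = T(0);
    };

    template <typename T>
    Welford<T> welford_combine(const Welford<T>& a, const Welford<T>& b) {
      Welford<T> out;
      out.weight = a.weight + b.weight;
      T delta = b.mean - a.mean;
      // Guard against division by zero when both inputs are empty (the
      // vectorized version uses T::blendv for this, as in the excerpt).
      T wb_over_w = out.weight == T(0) ? T(0) : b.weight / out.weight;
      out.mean = a.mean + delta * wb_over_w;
      out.m2 = a.m2 + b.m2 + delta * delta * a.weight * wb_over_w;
      return out;
    }

In the masked tail path only the construction of b changes (masked lanes carry zero weight), so the update itself need only be written once.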

: delta * w->weight_recps[new_index]);
}
auto new_delta = data - new_mean;
auto new_m2 = acc.m2 + delta * new_delta;
Collaborator:

Ditto.

Collaborator Author:

Done.

if self.tiling_idx >= self.reduction_depth:
    # calculate the reduction size that will be vectorized
    reduction_inner_size = (
        self.ranges[-1]
Collaborator:

self.ranges[-1] holds for either of the conditions, right?

Collaborator:

Also, is the assumption that self.tiling_idx == len(self.ranges) - 1, so that the vectorization happens on the inner-most loop?

Collaborator:

Yes, vectorization happens on the inner-most loop.

Collaborator:

According to what has been observed so far, self.ranges[-1] holds for either of the conditions.

Collaborator:

Then why can't we just use self.ranges[-1]?

Collaborator:

> Yes, vectorization happens on the inner-most loop.

To be safe, can we add an assertion here?

Collaborator Author:

Added, please review it again, thanks!

        else self.ranges[self.reduction_depth]
    )
    # calculate loops size outside the vectorized loop
    self.reduction_outer_size = reduction_size / reduction_inner_size
Collaborator:

Suggested change (use floor division):
-    self.reduction_outer_size = reduction_size / reduction_inner_size
+    self.reduction_outer_size = reduction_size // reduction_inner_size

Collaborator Author:

Done, thanks!

@jiayisunx jiayisunx added the ciflow/trunk Trigger trunk jobs on your pull request label May 27, 2024
@CaoE CaoE added the ciflow/periodic Trigger jobs ran periodically on master (periodic.yml) on the PR label May 27, 2024
jiayisunx added a commit that referenced this pull request May 27, 2024
@jiayisunx jiayisunx requested a review from jgong5 May 28, 2024 01:07
jiayisunx added commits that referenced this pull request May 28 – May 29, 2024
Collaborator @leslie-fang-intel left a comment:

  • I think we need UTs to check the vec mask used for the tail loop.

  • Would it be clearer to add a new subclass, maybe CppVecMaskKernel(CppVecKernel), and lift the common code from CppVecKernel into methods to be overridden or reused?

int64_t outer_size;
int64_t main_size;
int64_t tail_size;
std::vector<T> weight_recps;
Collaborator:

Why do we change the type from T::value_type to T?

Collaborator:

We use T, i.e., the vec type, instead of the scalar type because the element values of the weight (a vec) may no longer all be identical in the masked-vec Welford reduction, since the weight may be masked.
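
To make that concrete, here is a small standalone illustration (lane values are made up for the example): once a masked tail step contributes, the per-lane weights diverge, so a single scalar reciprocal per step can no longer describe every lane.

    #include <array>
    #include <cstdio>

    int main() {
      // Pretend the vector width is 4 and the tail mask enables 2 lanes.
      std::array<float, 4> weight{};

      // Full-width step: every lane accumulates one element.
      for (auto& w : weight) w += 1.0f;

      // Masked tail step: only the first two lanes accumulate an element.
      for (int lane = 0; lane < 2; ++lane) weight[lane] += 1.0f;

      // weight is now {2, 2, 1, 1}: no single scalar 1/w fits all lanes,
      // hence precomputed reciprocals are stored per vector (std::vector<T>)
      // rather than per scalar element (std::vector<T::value_type>).
      for (float w : weight) std::printf("%g ", w);
      std::printf("\n");
      return 0;
    }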

jiayisunx added a commit that referenced this pull request May 30, 2024
Collaborator Author @jiayisunx:
  • I think we need UTs to check the vec mask used for the tail loop.

Added, thanks!

  • Would it be clearer to add a new subclass, maybe CppVecMaskKernel(CppVecKernel), and lift the common code from CppVecKernel into methods to be overridden or reused?

I don't have a strong opinion, but adding a new subclass might introduce some code duplication. @jgong5, do you have any opinion?

Collaborator @jgong5 commented May 30, 2024:

  • I don't have a strong opinion, but adding a new subclass might introduce some code duplication. @jgong5, do you have any opinion?

I guess what @leslie-fang-intel meant was to factor out some functions from CppVecKernel to be overridden by CppVecMaskKernel, e.g., how the "load", "store", and "reduction" lines are generated. This can avoid the code duplication you mentioned.
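
A sketch of the factoring @jgong5 describes, shown as a template-method pattern (the real CppVecKernel/CppVecMaskKernel are Python classes in Inductor's cpp backend; the C++ below is only illustrative, with hypothetical names, to match the other snippets in this thread):

    #include <string>

    class VecKernel {
     public:
      virtual ~VecKernel() = default;
      // Shared driver: the overall codegen structure is written once...
      std::string gen_body(const std::string& ptr) {
        return gen_load(ptr) + gen_reduction();
      }

     protected:
      // ...and only the lines that differ are overridable hooks.
      virtual std::string gen_load(const std::string& ptr) {
        return "auto tmp = at::vec::Vectorized<float>::loadu(" + ptr + ");\n";
      }
      virtual std::string gen_reduction() {
        return "acc = welford_combine(acc, tmp);\n";
      }
    };

    class VecMaskKernel : public VecKernel {
     protected:
      // The masked variant overrides just the load/combine emission,
      // threading the tail size through.
      std::string gen_load(const std::string& ptr) override {
        return "auto tmp = at::vec::Vectorized<float>::loadu(" + ptr +
               ", tail_size);\n";
      }
      std::string gen_reduction() override {
        return "acc = welford_combine(acc, tmp, tail_size);\n";
      }
    };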

Labels: ciflow/inductor, ciflow/periodic, ciflow/trunk, module: inductor, open source, release notes: fx
5 participants