xpu: provide a way to debug explicit CPU fallback #126488

Open
dvrogozh opened this issue May 17, 2024 · 7 comments
Labels
module: xpu (Intel XPU related issues); triaged (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

@dvrogozh
Contributor

dvrogozh commented May 17, 2024

@fengyuan14 - The commit intel/torch-xpu-ops@5bf9e0c muted the debug logs for "explicit" CPU fallbacks. This complicates debugging for third-party contributors trying to evaluate XPU backend capabilities: I now have to revert that commit to find out which operations are not yet implemented by XPU. Please:

  1. Explain what "explicit CPU fallback" means. This seems to be a classification internal to the XPU team, and it is unclear and confusing.
  2. Extend PYTORCH_DEBUG_XPU_FALLBACK=1 to track any CPU fallback happening in the XPU backend. Note: I am fine if "explicit" fallback will be muted by default, but I really need a way to be able to track it.
commit 5bf9e0cc768f7a3b13d829118683275f324399f1 (origin/meng_max_2d)
Author: Feng Yuan <feng1.yuan@intel.com>
Date:   Mon Apr 29 13:05:51 2024 +0800

    Register operator's implementation lazily. (#177)

    1. Avoid dangling operator's implementation (m.impl(torchvision::nms) is
    ahead of `import torchvision` sometime)
    2. Mute debug log of explicit CPU fallback.
    3. Add torchvision.roi_align/_roi_align_backward example case
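
For anyone reproducing this, here is a minimal sketch of running a workload with the existing switch turned on. It only assumes that the variable is read from the process environment; setting it on a child process sidesteps the question of when exactly the backend reads it. The helper name `run_with_fallback_debug` and the `train.py` placeholder are hypothetical, and per this issue the variable currently does not report "explicit" fallbacks.

```python
import os
import subprocess
import sys


def run_with_fallback_debug(python_args):
    """Run a Python child process with PYTORCH_DEBUG_XPU_FALLBACK=1 set.

    Hypothetical helper: it only prepares the environment and does not
    depend on torch being installed.
    """
    env = dict(os.environ)  # copy, so the parent environment is left untouched
    env["PYTORCH_DEBUG_XPU_FALLBACK"] = "1"
    return subprocess.run(
        [sys.executable, *python_args],
        env=env,
        capture_output=True,
        text=True,
        check=False,
    )


if __name__ == "__main__":
    # Stand-in workload that echoes the variable back.
    # Replace the arguments with a real script, e.g. ["train.py"].
    result = run_with_fallback_debug(
        ["-c", "import os; print(os.environ['PYTORCH_DEBUG_XPU_FALLBACK'])"]
    )
    print(result.stdout.strip())
```

Any fallback messages the backend emits would then show up in `result.stdout` or `result.stderr` of the child process.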

CC: @jgong5 @mingfeima @XiaobingSuper @ashokei @jingxu10 @gujinghui @EikanWang @fengyuan14 @guangyey

cc @gujinghui @EikanWang @fengyuan14 @guangyey

@dvrogozh
Contributor Author

Also filed intel/torch-xpu-ops#262

@dvrogozh
Contributor Author

> Note: I am fine if "explicit" fallback will be muted by default, but I really need a way to be able to track it.

I still want to comment on that. Personally, I am fine with fallback logs being muted by default, because I know that a number of operations are not yet implemented in XPU. However, for people who have just discovered the XPU backend, want to try it, and have limited knowledge of it, muted logs might be a problem. They will notice immediately that the XPU backend significantly underperforms, sometimes even compared to CPU, and they won't have any obvious explanation at hand. The warnings that a CPU fallback is happening were quite handy here: they set the correct expectation that the XPU backend might currently underperform.

My recommendation is to always print a debug message when a CPU fallback happens, regardless of whether it is explicit (whatever that means) or implicit.

@fengyuan14
Collaborator

Got your requirement. In my understanding, the log is not informative for DL workload customers. It should be a debugging requirement.

As to release build, we would keep existing implementation. I think, we could add the feature in debug build.

@guangyey added the "module: xpu" and "triaged" labels May 17, 2024
@fengyuan14
Collaborator

@EikanWang Please comment.

@dvrogozh
Contributor Author

> As to release build, we would keep existing implementation. I think, we could add the feature in debug build.

Could you please have this feature controlled by an environment variable, say the same one as before, PYTORCH_DEBUG_XPU_FALLBACK=1? You could then have it disabled by default for Release builds and enabled by default for Debug builds. The end user can then decide, via the environment variable, whether to enable it for a Release build or disable it for a Debug build.
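
The policy proposed above could be sketched roughly as follows. This is only an illustration, not the torch-xpu-ops implementation: the function name `fallback_log_enabled` and the `is_debug_build` flag are hypothetical, and treating an empty value or `0` as "off" and anything else as "on" is an assumption.

```python
import os

ENV_VAR = "PYTORCH_DEBUG_XPU_FALLBACK"


def fallback_log_enabled(is_debug_build, environ=None):
    """Decide whether CPU fallback warnings should be printed.

    Hypothetical policy: the build type picks the default, and the
    environment variable, when set, overrides that default either way.
    """
    environ = os.environ if environ is None else environ
    value = environ.get(ENV_VAR)
    if value is None:
        # Variable not set: Debug builds log, Release builds stay quiet.
        return is_debug_build
    # Variable set: "0" or an empty string disables, anything else enables.
    return value.strip() not in ("", "0")


if __name__ == "__main__":
    for debug in (False, True):
        for env in ({}, {ENV_VAR: "1"}, {ENV_VAR: "0"}):
            print(debug, env, fallback_log_enabled(debug, env))
```

With this shape, a Release build stays quiet unless the user opts in, and a Debug build can still be silenced explicitly.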

@dvrogozh
Contributor Author

I opened intel/torch-xpu-ops#318 with the implementation I propose (which is: always warn on CPU fallback :) ). Let's continue the discussion in the PR.

@EikanWang
Collaborator

We will close the issue once the PR has landed.
