Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix crash on Linux Nvidia 550 driver #12542

Merged
merged 5 commits into from Mar 30, 2024

Conversation

eero-lehtinen
Copy link
Contributor

@eero-lehtinen eero-lehtinen commented Mar 17, 2024

Objective

Fix crashing on Linux with latest stable Nvidia 550 driver when resizing. The crash happens at startup with some setups.

Fixes #12199

I think this would be nice to get into 0.13.1

Solution

Ignore wgpu::SurfaceError::Outdated always on this platform+driver.

It looks like Nvidia considered the previous behaviour of not returning this error a bug:
"Fixed a bug where vkAcquireNextImageKHR() was not returning VK_ERROR_OUT_OF_DATE_KHR when it should with WSI X11 swapchains" (https://www.nvidia.com/Download/driverResults.aspx/218826/en-us/)

What I gather from this is that the surface was outdated on previous drivers too, but they just didn't report it as an error. So behaviour shouldn't change.

In the issue conversation we experimented with calling continue when this error happens, but I found that it results in some small issues like bevy_egui scale not updating with the window sometimes. Just doing nothing seems to work better.

Changelog

  • Fixed crashing on Linux with Nvidia 550 driver when resizing the window

Migration Guide

Copy link
Contributor

Welcome, new contributor!

Please make sure you've read our contributing guide and we look forward to reviewing your pull request shortly ✨

@eero-lehtinen
Copy link
Contributor Author

eero-lehtinen commented Mar 17, 2024

Looks like I'm also undoing a change from #12055, where configure_surface was removed from being called on resize. I don't have an iPhone to test this change with.

@SolarLiner SolarLiner added A-Rendering Drawing game state to the screen O-Linux Specific to the Linux desktop operating system labels Mar 17, 2024
@eero-lehtinen
Copy link
Contributor Author

CI timed out but logs say that test was successful. Is this a real error?

@WarrenHood
Copy link

WarrenHood commented Mar 18, 2024

Why not just always ignore surface outdated errors? Rather than only for linux with nvidia 550 drivers

@eero-lehtinen
Copy link
Contributor Author

Why not just always ignore surface outdated errors? Rather than only for linux with nvidia 550 drivers

The error is not expected anywhere else. But yeah always ignoring it would reduce the amount of code. I'm not opposed to doing it that way.

@eero-lehtinen
Copy link
Contributor Author

CI timed out but logs say that test was successful. Is this a real error?

The error was real but now fixed.

@notverymoe
Copy link
Contributor

notverymoe commented Mar 19, 2024

I'm concerned about specifically checking for the 550 driver. This seems to be a fix by Nvidia, so 560, 570 etc. will contain this behaviour. And given that the error was effectively ignored in versions prior to 550, I think this should be safe to just ignore on Nvidia Linux as a whole?

@eero-lehtinen
Copy link
Contributor Author

I'm concerned about specifically checked for the 550 driver. This seems to be a fix by nvidia, so 560, 570 etc. will contain this behaviour. And given that the error was effectively ignored in versions prior to 550, I think this should be safe to just ignore on Nvidia Linux as a whole?

Yeah it was a panic before so it shouldn't be too bad to ignore it. I'll do that.

Copy link
Contributor

@torsteingrindvik torsteingrindvik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a windowing expert, but cherry-picking these commits solved various crashes I get since I'm on linux + new Nvidia drivers so I'm very happy with this.

@IceSentry
Copy link
Contributor

Why did you remove the not_already_configured related code? It seems unrelated to me?

@eero-lehtinen
Copy link
Contributor Author

eero-lehtinen commented Mar 23, 2024

Why did you remove the not_already_configured related code? It seems unrelated to me?

After #12055 both branches do the same thing because configure_surface was removed. I didn't want to copy paste my check for both branches, so I removed the first. But I guess it's not this PRs job to fix, so I can also make my changes as minimal as possible.

@eero-lehtinen eero-lehtinen force-pushed the fix-nvidia-550-crash branch 2 times, most recently from 584f230 to 3eef57d Compare March 23, 2024 01:45
@james7132 james7132 requested a review from mockersf March 23, 2024 22:20
@alice-i-cecile alice-i-cecile added this to the 0.13.2 milestone Mar 23, 2024
Copy link
Contributor

@IQuick143 IQuick143 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't say anything about the changes themselves as I don't understand windowing code, but testing on my machine I can confirm this does resolve the issue. (Void Linux, X.Org, nvidia-550.54.14_1 driver)

@alice-i-cecile alice-i-cecile added the S-Ready-For-Final-Review This PR has been approved by the community. It's ready for a maintainer to consider merging it label Mar 23, 2024
Copy link
Member

@james7132 james7132 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Silently absorbing the error seems like a future issue waiting to happen. Can we downgrade this to an error or warning log?

@alice-i-cecile alice-i-cecile removed the S-Ready-For-Final-Review This PR has been approved by the community. It's ready for a maintainer to consider merging it label Mar 23, 2024
@eero-lehtinen
Copy link
Contributor Author

eero-lehtinen commented Mar 24, 2024

Silently absorbing the error seems like a future issue waiting to happen. Can we downgrade this to an error or warning log?

I added a debug level log because it happens pretty often and there is no way to fix it.

@james7132
Copy link
Member

I added a debug level log because it happens pretty often and there is no way to fix it.

An error_once or warn_once should log it once and then black hole the rest.

render_instance
.enumerate_adapters(wgpu::Backends::VULKAN)
.iter()
.any(|adapter| adapter.get_info().name.starts_with("NVIDIA"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if starts_with() is good enough here. We've had a bug somewhere else because the vendor name wasn't at the start.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition to the name field there seems to be driver that just says "NVIDIA" and nothing else. Would that be better? I don't have other cards to test with.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure how quirky these AdapterInfo strings can get but maybe just changing it to name.to_uppercase().contains("NVIDIA") could work. You could also add a fallback check of the driver string to account for cases in which the name would not match on an NVIDIA device, no idea if this would be necessary though. This is out of the scope of this PR but the same starts_with() method is currently used in defining may_erroneously_timeout in the lines above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't add random extra checks if we don't know what edge case we are handling. This has worked fine for everyone who tested it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checking for vendor id would probably be the best for reliability.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can confirm with a little test system that this spams true:
(EDIT: I have an nvidia card of course..)

fn vendor_info_hex(adapter: Res<RenderAdapter>) {
    let nvidia = 0x10de;
    let vendor = adapter.get_info().vendor;

    info!("Is nvidia? {}", vendor == nvidia);
}

found the ID value here: https://envytools.readthedocs.io/en/latest/hw/pciid.html#introduction

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to wgpu there might be other bytes non-zero: https://docs.rs/wgpu/latest/wgpu/struct.AdapterInfo.html#structfield.vendor

in the ID, so might need to mask off only the 16 bits we care about

let vendor = adapter.get_info().vendor & 0xFFFF;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for investigating. I made this change. I just compare to a constant, but an enum based abstraction for the vendor check would be better in the future and should be used everywhere instead of strings.

@matiqo15 matiqo15 added the S-Ready-For-Final-Review This PR has been approved by the community. It's ready for a maintainer to consider merging it label Mar 28, 2024
@james7132 james7132 added this pull request to the merge queue Mar 30, 2024
Merged via the queue into bevyengine:main with commit 70c69cd Mar 30, 2024
27 checks passed
mockersf pushed a commit that referenced this pull request Apr 1, 2024
# Objective

Fix crashing on Linux with latest stable Nvidia 550 driver when
resizing. The crash happens at startup with some setups.

Fixes #12199

I think this would be nice to get into 0.13.1

## Solution

Ignore `wgpu::SurfaceError::Outdated` always on this platform+driver.

It looks like Nvidia considered the previous behaviour of not returning
this error a bug:
"Fixed a bug where vkAcquireNextImageKHR() was not returning
VK_ERROR_OUT_OF_DATE_KHR when it should with WSI X11 swapchains"
(https://www.nvidia.com/Download/driverResults.aspx/218826/en-us/)

What I gather from this is that the surface was outdated on previous
drivers too, but they just didn't report it as an error. So behaviour
shouldn't change.

In the issue conversation we experimented with calling `continue` when
this error happens, but I found that it results in some small issues
like bevy_egui scale not updating with the window sometimes. Just doing
nothing seems to work better.

## Changelog

- Fixed crashing on Linux with Nvidia 550 driver when resizing the
window

## Migration Guide

---------

Co-authored-by: James Liu <contact@jamessliu.com>
mockersf pushed a commit that referenced this pull request Apr 1, 2024
# Objective

Fix crashing on Linux with latest stable Nvidia 550 driver when
resizing. The crash happens at startup with some setups.

Fixes #12199

I think this would be nice to get into 0.13.1

## Solution

Ignore `wgpu::SurfaceError::Outdated` always on this platform+driver.

It looks like Nvidia considered the previous behaviour of not returning
this error a bug:
"Fixed a bug where vkAcquireNextImageKHR() was not returning
VK_ERROR_OUT_OF_DATE_KHR when it should with WSI X11 swapchains"
(https://www.nvidia.com/Download/driverResults.aspx/218826/en-us/)

What I gather from this is that the surface was outdated on previous
drivers too, but they just didn't report it as an error. So behaviour
shouldn't change.

In the issue conversation we experimented with calling `continue` when
this error happens, but I found that it results in some small issues
like bevy_egui scale not updating with the window sometimes. Just doing
nothing seems to work better.

## Changelog

- Fixed crashing on Linux with Nvidia 550 driver when resizing the
window

## Migration Guide

---------

Co-authored-by: James Liu <contact@jamessliu.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-Rendering Drawing game state to the screen O-Linux Specific to the Linux desktop operating system S-Ready-For-Final-Review This PR has been approved by the community. It's ready for a maintainer to consider merging it
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Resizing causes a crash on latest stable Nvidia driver 550