Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ACS SCT crash in SetWatchdogTimer_Func RaiseTPL #216

Open
eric-ch opened this issue Apr 26, 2022 · 5 comments
Open

ACS SCT crash in SetWatchdogTimer_Func RaiseTPL #216

eric-ch opened this issue Apr 26, 2022 · 5 comments

Comments

@eric-ch
Copy link

eric-ch commented Apr 26, 2022

Hi,

I'm seeing an assert trigger when running the ACS SCT BBR test SetWatchdogTimer_Func:

FATAL ERROR - RaiseTpl with OldTpl(0x10) > NewTpl(0x8)
ASSERT [DxeCore] ./edk2/MdeModulePkg/Core/Dxe/Event/Tpl.c(66): ((BOOLEAN)(0==1))

Looking at the backtrace, DisconnectAll calls DisconnectController callbacks and ends up invoking UsbBusControllerDriverStop. IIUC the WatchdogTimerEvent runs at TPL_NOTIFY which corroborates the assert message.

#0  CoreRaiseTpl (NewTpl=8) at ./edk2/MdeModulePkg/Core/Dxe/Event/Tpl.c:65
#1  0x0000000033c4932c in UsbBusControllerDriverStop (This=0x33c4e0c0, Controller=0x35f65e18, NumberOfChildren=4, ChildHandleBuffer=0x35f65d18) at ./edk2/MdeModulePkg/Bus/Usb/UsbBusDxe/UsbBus.c:1426
#2  0x000000003a4405c8 in CoreDisconnectController (ControllerHandle=ControllerHandle@entry=0x35f65e18, DriverImageHandle=0x35f6a318, ChildHandle=<optimized out>, ChildHandle@entry=0x0) at ./edk2/MdeModulePkg/Core/Dxe/Hand/DriverSupport.c:957
#3  0x000000003a4406f0 in CoreDisconnectControllersUsingProtocolInterface (UserHandle=UserHandle@entry=0x35f65e18, Prot=0x363f0918) at ./edk2/MdeModulePkg/Core/Dxe/Hand/Handle.c:631
#4  0x000000003a44090c in CoreUninstallProtocolInterface (UserHandle=0x35f65e18, Protocol=0x36d60078, Interface=0x35f5d598) at ./edk2/MdeModulePkg/Core/Dxe/Hand/Handle.c:730
#5  0x000000003a4421f0 in CoreUninstallMultipleProtocolInterfaces (Handle=<optimized out>) at ./edk2/MdeModulePkg/Core/Dxe/Hand/Handle.c:835
#6  0x0000000036d4f3b4 in DeRegisterPciDevice (Controller=Controller@entry=0x35f97c18, Handle=0x35f65e18) at ./edk2/MdeModulePkg/Bus/Pci/PciBusDxe/PciDeviceSupport.c:524
#7  0x0000000036d4f420 in DeRegisterPciDevice (Controller=Controller@entry=0x35f97c18, Handle=0x35f66c98) at ./edk2/MdeModulePkg/Bus/Pci/PciBusDxe/PciDeviceSupport.c:488
#8  0x0000000036d4f630 in PciBusDriverBindingStop (This=<optimized out>, Controller=0x35f97c18, NumberOfChildren=2, ChildHandleBuffer=0x353aeb98) at ./edk2/MdeModulePkg/Bus/Pci/PciBusDxe/PciBus.c:447
#9  0x000000003a4405c8 in CoreDisconnectController (ControllerHandle=0x35f97c18, DriverImageHandle=0x35f6ea18, ChildHandle=<optimized out>) at ./edk2/MdeModulePkg/Core/Dxe/Hand/DriverSupport.c:957
#10 0x00000000371314c8 in DisconnectAll () at ./edk2-platforms/Platform/RaspberryPi/Library/ResetLib/ResetLib.c:60
#11 LibResetSystem (ResetType=EfiResetCold, ResetData=<optimized out>, DataSize=<optimized out>, ResetStatus=<optimized out>) at ./edk2-platforms/Platform/RaspberryPi/Library/ResetLib/ResetLib.c:96
#12 0x000000003a44b7b4 in CoreDispatchEventNotifies (Priority=16) at ./edk2/MdeModulePkg/Core/Dxe/Event/Event.c:186
#13 CoreRestoreTpl (NewTpl=4) at ./edk2/MdeModulePkg/Core/Dxe/Event/Tpl.c:133
#14 0x0000000037088bf0 in ExceptionHandlersEnd ()

There is a mention UsbBusControllerDriverStop may not be at the right priority level. Trying to use USB_BUS_TPL ( == TPL_NOTIFY) just moves the issue in the next driver (EfiIp6ConfigGetData, Ip4ServiceBindingDestroyChild, ...) to reach the same assert for what appears to be the same reason.

I tried a naive work-around to confirm this. It ran the ACS SCT BBR tests without triggering the assert, which might as well run the WatchdogTimerEvent at TPL_CALLBACK since DisconnectAll is also reached through other paths with nothing to do with the WatchdogTimerEvent?

How should this be approached for a proper fix?

@TheMindVirus
Copy link

From past experience, if a Watchdog Timer is throwing more errors than it solves, it should be switched off and a redundant always-on machine or dedicated power management circuit should be used instead for when the machine stops responding.

If you absolutely still need the Watchdog to work, it is expected that on slower machines the drivers are going to take longer to initialise, so maybe an Increased Timeout is an alternative solution (then seeing which other driver brings up the assert if any).

There could also be a better approach that hasn't come to mind yet. Your workaround seems to effectively perform part of option 1 and the variable for option 2 is in a different place every time called something different every time (proprietary).

@eric-ch
Copy link
Author

eric-ch commented Apr 28, 2022

@TheMindVirus Thank you for your response!

The Watchdog above is used as part of a test unit in the ACS EBBR or SBBR to test for compliance.

I am not sure I understand the timeout reference, my apologies. AFAIU, the assert will always trigger on the Pi 4 when running any BBR test sequence since the WatchdogTimerEvent will always run at TPL_NOTIFY and as long as we still have Boot Services, DisconnectAll() always end up calling a DisconnectController for a device that will try to raise the TPL to a lower level on the way to stop some driver. For example:

These ones reliably trigger (Ipv6ConfigGetData after I change UsbBusControllerDriverStop to run at USB_BUS_TPL, Ip4ServiceBindingDestroyChild after I do the same change for Ipv6ConfigGetData). At the same time, DisconnectAll() under these conditions looks legitimate to me, which tells me I am missing something here.

@TheMindVirus
Copy link

The Watchdog Timeout can be considered like setting an Alarm Clock, of which there are Raspberry Pi EEPROM/Firmware settings to set the values for: https://www.raspberrypi.com/documentation/computers/raspberry-pi.html#DHCP_TIMEOUT

As for the priority levels, if they are anything like Windows Driver's IRQL's and IRQL_NOT_LESS_OR_EQUAL, they also cause more issues than they solve. A routine can and should check that the priority level it wants to set isn't less than or equal to the current priority level before trying to raise it.

On Multiprocessing systems, the routine can also wait in a loop (with a timeout counter) for another thread to complete its higher priority task before continuing with the lower priority task. The ever increasing priority level is not a solution for this and it has to be reset at some point anyway for it to be useful again. Task completion flags can bloat the system, so instead a simple Busy status flag (or sometimes spinlock) is preferred.

@jlinton
Copy link
Member

jlinton commented Jun 17, 2022

Yah, I think the core edk2 TPL raise/lower code isn't correct. I have a fix around here somewhere, because PhatFree pointed out a whole bunch of errors trip in ACS when run against a debug build. Most of which were core edk2 problems (like this tpl code, and there are issues around page zeroing/etc).

I'm sure the edk2 people are open to patches, but at this point, if your trying to run ACS to test the platform its probably better to start with a RELEASE build.

@eric-ch
Copy link
Author

eric-ch commented Jul 6, 2022

@jlinton Thank you for the answer! I have not looked at this in some time. The error was with a RELEASE build, assuming that's the same error since the hang was silent without the ASSERT output, which I then rebuilt to DEBUG to get the output provided, unless you are referring to pre-built ACS binaries?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants