
Using FreeRTOS-Kernel-Heap3 library leads to deadlock across cores, race condition on core spinlock #1646

Open
gemarcano opened this issue Feb 23, 2024 · 7 comments

@gemarcano

I describe the full issue here: https://forums.freertos.org/t/multicore-smp-rp2040-port-deadlock-when-using-heap-3/19255/

In summary, FreeRTOS heap_3 calls vTaskSuspendAll, which grabs the task lock, before calling malloc/free. If another task on the other core is using good ol' malloc/free at the same time, a race takes place. Under FreeRTOS, the rp2040 port uses some macro magic to inject FreeRTOS support into the pico-sdk mutex functions, and that implementation also tries to grab the task lock. So heap_3 takes the FreeRTOS task lock before grabbing malloc_mutex, while a normal call to malloc takes the core synchronization spinlock first and only then tries to take the FreeRTOS task lock. The two opposite acquisition orders lead to a deadlock between the two cores (see the sketches below).
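For reference, this is roughly what heap_3 does (a simplified sketch of FreeRTOS heap_3.c; the trace macros and the malloc-failed hook are left out): it just wraps the standard allocator in vTaskSuspendAll()/xTaskResumeAll().

#include <stdlib.h>
#include "FreeRTOS.h"
#include "task.h"

/* Simplified sketch of FreeRTOS heap_3.c: the scheduler is suspended (taking
 * the task lock on the SMP kernel) around the libc calls, which on the
 * pico-sdk are themselves wrapped with malloc_mutex. */
void *pvPortMalloc(size_t xWantedSize)
{
    void *pvReturn;

    vTaskSuspendAll();                   /* task lock taken here */
    {
        pvReturn = malloc(xWantedSize);  /* -> pico-sdk __wrap_malloc -> malloc_mutex */
    }
    (void) xTaskResumeAll();

    return pvReturn;
}

void vPortFree(void *pv)
{
    if (pv != NULL)
    {
        vTaskSuspendAll();               /* task lock taken here */
        {
            free(pv);                    /* -> pico-sdk __wrap_free -> malloc_mutex */
        }
        (void) xTaskResumeAll();
    }
}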

For clarity, let's say core 0 is calling into FreeRTOS heap_3, and core 1 is calling malloc (which is wrapped by pico-sdk). This is the sequence of events:

  1. core 0: vTaskSuspendAll in vPortFree (taking the task lock) before calling free
  2. core 1: calls malloc
  3. core 1: in mutex_enter_blocking with malloc_mutex, grabs the core synchronization spinlock
  4. core 0: calls free
  5. core 0: in mutex_exit with malloc_mutex, blocks waiting for the core synchronization spinlock
  6. core 1: calls lock_internal_spin_unlock_with_wait, which is a macro provided by the FreeRTOS rp2040 port
  7. core 1: tries to grab the task lock with an eventual call to vTaskEnterCritical after macro hell
  8. deadlock: core 1 is holding the core synchronization spinlock while waiting for the task lock, and core 0 is holding the task lock while waiting for the spinlock.
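To make the lock-order inversion explicit, here is a sketch of the two call paths condensed to just their locking steps. The function names core0_path/core1_path are made up for illustration; the FreeRTOS and pico-sdk identifiers are the ones from the sequence above, and the step numbers in the comments refer to that sequence.

#include <stdlib.h>
#include "FreeRTOS.h"
#include "task.h"

/* Sketch only: what each core ends up waiting on, assuming the pico-sdk
 * malloc/free wrappers and configSUPPORT_PICO_SYNC_INTEROP enabled. */

/* Core 0: FreeRTOS heap_3 freeing memory. */
static void core0_path(void *pv)
{
    vTaskSuspendAll();       /* step 1: task lock acquired                    */
    free(pv);                /* steps 4-5: the pico-sdk wrapper ends up       */
                             /* blocked on malloc_mutex's core synchronization*/
                             /* spinlock, which core 1 is currently holding   */
    (void) xTaskResumeAll(); /* never reached                                 */
}

/* Core 1: plain malloc from application code. */
static void *core1_path(size_t n)
{
    /* steps 2-3: the pico-sdk wrapper grabs the core synchronization spinlock
     * for malloc_mutex, finds the mutex contended, and calls
     * lock_internal_spin_unlock_with_wait()                                  */
    /* steps 6-7: that macro is overridden by the rp2040 FreeRTOS port and
     * eventually calls vTaskEnterCritical(), which needs the task lock that
     * core 0 already holds -> step 8, deadlock                               */
    return malloc(n);
}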

This happens because FreeRTOS heap_3 is trying to be multi-task safe, but the pico-sdk's malloc and free wrappers already provide that protection. I don't know of a good way to fix or clean this up, as it will likely need better cooperation between the rp2040 port and FreeRTOS itself.

My current workaround is to waste a bit more SRAM and just use heap_4, letting FreeRTOS manage its own heap for its own objects while tasks use the "regular" heap that goes through the pico-sdk locking.
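For anyone else working around it the same way: it amounts to linking the heap_4 variant of the kernel library instead of the heap_3 one and giving the kernel its own heap in FreeRTOSConfig.h. A minimal sketch, with a placeholder size:

/* FreeRTOSConfig.h (sketch): with heap_4 the kernel allocates from its own
 * static ucHeap[] and never calls newlib malloc/free, so pvPortMalloc and
 * vPortFree no longer interact with the pico-sdk malloc_mutex. The size below
 * is only a placeholder, not a recommendation. */
#define configTOTAL_HEAP_SIZE    ( 64 * 1024 )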

@gemarcano
Author

I didn't make it clear, but I am seeing this on an actual project, so it's not just a hypothetical.

gemarcano added a commit to gemarcano/snes_controllers_to_usb that referenced this issue Feb 23, 2024
 - There's a race condition between FreeRTOS heap_3 and the pico-sdk
   mutex implementation. See raspberrypi/pico-sdk#1646 for more
   information.
@peterharperuk
Contributor

Can you post what version of FreeRTOS you're using?
And is there any chance of an example that demonstrates the problem?

@gemarcano
Author

gemarcano commented Feb 24, 2024

This is the last commit of a personal project before I switched to heap_4 to work around the issue:
gemarcano/snes_controllers_to_usb@a12eee5

EDIT: use this commit instead, and switch from heap_4 to heap_3 in the CMakeLists.txt: gemarcano/snes_controllers_to_usb@aa4e8ff. The original commit didn't have the new/malloc calls in place yet.

I'm using this branch of FreeRTOS (from this PR that fixes a bug I found: FreeRTOS/FreeRTOS-Kernel#991) which should be pretty close to FreeRTOS master.

I'm also using a patch from #1530 applied to the pico-sdk master branch so it can work with FreeRTOS SMP support (that PR really needs to get merged). I also found a bug in tinyusb, and I have a (crappy) PR pending here that I'm also using.

Seeing that the bug is a race condition, I'm not sure how to reduce my project to something sane that would still trigger it. It triggers pretty reliably at home (the watchdog fires all the time unless I disable it), but it's still a race.

As I can reproduce it pretty reliably with that commit, I can dump any data structures using GDB from both cores. In the FreeRTOS forum post, I already posted the backtrace from both cores, showing the deadlock. If you'd like anything else, I can probably provide it.

@kilograham
Contributor

possibly the same issue as #1453

@gemarcano
Author

From reading that issue, it does sound like the underlying issue is at least similar (deadlock on a spinlock due to interactions between FreeRTOS and pico-sdk locking).

The only thing that is different is that this is not being triggered solely by pico-sdk code. Part of the cause in my case is that the heap_3 implementation grabs the FreeRTOS task lock (via vTaskSuspendAll) before calling free and malloc, while the rp2040 FreeRTOS port locking code tries to grab that same lock from the other core while apparently still holding the core synchronization spinlock. It could be that both #1453 and this issue are manifestations of the same underlying problem with the synchronization spinlock and the acquisition order of the FreeRTOS locks.

@NewmanIsTheStar

Could this bug also cause the following assertion?
assertion "pxYieldSpinLock[get_core_num()] == ((void *)0)" failed: file "port.c", line 363

@kilograham
Contributor

If you #define configSUPPORT_PICO_SYNC_INTEROP 0 in your FreeRTOSConfig.h and it starts working, then it's a pretty good sign it is the same bug.
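That is, adding something like this to FreeRTOSConfig.h (as a diagnostic only; as I understand it, it turns off the rp2040 port's overrides of the pico-sdk lock_* hooks rather than fixing the lock ordering):

/* FreeRTOSConfig.h: disable the rp2040 port's pico-sdk sync interop to test
 * whether this is the same deadlock as in #1453. */
#define configSUPPORT_PICO_SYNC_INTEROP    0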

kilograham added this to the 1.6.0 milestone May 18, 2024