New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix SMP issues #6092
Fix SMP issues #6092
Conversation
&g_cpu_irqlock); | ||
} | ||
} | ||
#endif /* CONFIG_SMP */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are moving this code piece out to form a new API --> restore_critical_section(). But here you are not invoking that API.
As I checked, there are quite some other places invoking sched_resume_scheduler(). After removing this code piece, the remaining operation left in this API is wrapped by macros, which mean we might be doing nothing here if all macros are disabled.
Could you please check whether a restore_critical_section() is needed here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have few observations on this.
First of all, I feel sched_resume_scheduler() is supposed to do only scheduler related work and the irqcount related code was placed wrongly in it. When we took the latest code from nuttx, it seems they have corrected this.
Secondly, I see that resume_scheduler() API is called in following places.
os/arch/arm/src/armv7-a/arm_releasepending.c
os/arch/arm/src/armv7-a/arm_cpustart.c
os/arch/arm/src/armv7-a/arm_exit.c
os/arch/arm/src/armv7-a/arm_unblocktask.c
os/arch/arm/src/armv7-a/arm_reprioritizertr.c
os/arch/arm/src/armv7-a/arm_cpupause.c
os/arch/arm/src/armv7-a/arm_blocktask.c
In each of the above cases, the API is called from IRQ context. But now we have moved the irqcount logic to doirq() and so, we would be still performing this check when we are returning from IRQ context. So, I think this is fine.
The only exception is arm_exit(). However, since this code is working well in nuttx and also all our tests seem to be fine, I think we can keep this change for now. If we face any issues in future, we can recheck this part.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for detailed explanation, it make sense to me.
But, could you please add a remark or comment in the code piece here, since it might be a potential checkpoint?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added a note in the commit msg.
58d76c8
to
6855af4
Compare
Hi @kishore-sn , current loadable_apps enable CONFIG_PM by default. Thus, CONFIG_SMP shouldn't be enabled now, because hotplug mode implementation is pending. The logic for dual core + PM is also not verified and it requires some fix and changes I think. |
Ok. I will revert it for now. Let us enable it later. |
@kishore-sn could you check Edward comment? |
…adable apps smp When running smp test app using loadable apps config, there is an assert failure due to improper ordering of system calls in syscall.h due to which call to getcpu() api is redirected to setaffinity() api. To fix this issue, we modify the ordering of syscalls. Signed-off-by: Neelkumar Patel <neel.patel@samsung.com>
When running kernel tc there is an assert fail in sched_unlock. In this case, sched_unlock is called when the lockcount is already zero. As a result, checking for the g_cpuschedlock and g_cpulockset is causing an assert. But, since the lockcount is already zero, this check is wrong. The check needs to be done only when sched_unlock is called with a non-zero lockcount value. Signed-off-by: Kishore S N <kishore.sn@samsung.com>
…ritical section Signed-off-by: Abhishek Akkabathula <a.akkabathul@samsung.com>
calling sched_unlock without sched_lock, hence removed it Signed-off-by: Abhishek Akkabathula <a.akkabathul@samsung.com>
correct the case according to CONFIG_SMP enable and disable
When we enable SMP config, but set NCPUS as 1, then there is a compile error. Fix it by adding check for NCPUS. Signed-off-by: Kishore S N <kishore.sn@samsung.com>
When printing assert information the stack info for CPU1 idle task is printed wrong. This is because the adj_stack_ptr value for cpu1 idle task is not initialized properly. Signed-off-by: Kishore S N <kishore.sn@samsung.com>
… printing TCB list in assert logs when printing the assert information, the size of stack used for each TCB is printed wrong and showing stack overflow. This is because when we are checking the TCB stack, we check from stack_alloc_ptr instead of stack_base_ptr. So, stack_alloc_ptr points to TLS Data, Arguments and then available stack whereas stack_base_ptr directly points to available stack. Signed-off-by: Neelkumar Patel <neel.patel@samsung.com>
Hello @kishore-sn @sunghan-chang , when we are merging this PR, please also review and merge #6079 together. It contains fixes and changes required for SMP in BSP level. |
Some random crash issues (data abort, prefetch abort) were noticed when running various commands like ble_rmc, kernel_tc. Our analysis showed that the issue is happening since the stack pointer is getting updated with wrong values after multiple context switches. We found similar issues reported and fixed in nuttx. So, we pick the below commits from nuttx to fix our issues. Note: We deleted sched_resume_scheduler() in this commit. sched_resume_scheduler() is supposed to do only scheduler related work and the irqcount related code was placed wrongly in it. When we took the latest code from nuttx, it seems they have corrected this. Also, I see that resume_scheduler() API is called in following places. os/arch/arm/src/armv7-a/arm_releasepending.c os/arch/arm/src/armv7-a/arm_cpustart.c os/arch/arm/src/armv7-a/arm_exit.c os/arch/arm/src/armv7-a/arm_unblocktask.c os/arch/arm/src/armv7-a/arm_reprioritizertr.c os/arch/arm/src/armv7-a/arm_cpupause.c os/arch/arm/src/armv7-a/arm_blocktask.c In each of the above cases, the API is called from IRQ context. But now we have moved the irqcount logic to doirq() and so, we would be still performing this check when we are returning from IRQ context. The only exception is arm_exit(). However, since this code is working well in nuttx and also all our tests seem to be fine, I think we can keep this change for now. If we face any issues in future, we can recheck this part. ====================================================================== commit e2df52390ace192e8fe632ecb83c2fa44373259a SMP: fix crash when switch to new task which is still running Situation: Assume we have 2 cpus, and busy run task0. CPU0 CPU1 task0 -> task1 task2 -> task0 1. remove task0 form runninglist 2. take task1 as new tcb 3. add task0 to blocklist 4. clear spinlock 4.1 remove task2 form runninglist 4.2 take task0 as new tcb 4.3 add task2 to blocklist 4.4 use svc ISR swith to task0 4.5 crash 5. use svc ISR swith to task1 Fix: Move clear spinlock to the end of svc ISR commit 2241969e5ada2eaafee38ab6de6457aa15c8402c SMP: fix crash when switch to new task which is still running cpu0 thread0: cpu1: sched_yield() nxsched_set_priority() nxsched_running_setpriority() nxsched_reprioritize_rtr() nxsched_add_readytorun() up_cpu_pause() IRQ enter arm64_pause_handler() enter_critical_section() begin up_cpu_paused() pick thread0 arm64_restorestate() set thread0 tcb->xcp.regs to CURRENT_REGS up_switch_context() thread0 -> thread1 arm64_syscall() case SYS_switch_context change thread0 tcb->xcp.regs restore_critical_section() enter_critical_section() done leave_critical_section() IRQ leave with restore CURRENT_REGS ERROR !!! Reason: As descript above, cpu0 swith task: thread0 -> thread1, and the syscall() execute slowly, this time cpu1 pick thread0 to run at up_cpu_paused(). Then cpu0 syscall execute, cpu1 IRQ leave error. Resolve: Move arm64_restorestate() after enter_critical_section() done ====================================================================== Signed-off-by: Kishore S N <kishore.sn@samsung.com>
When running smp test app using flat apps config, there is an assert fail due to improper values of lockcount and g_cpu_lockset. This is because we are incrementing lockcount in prepare_exit API without the corresponding setting of g_cpu_lockset. To fix this issue we modify prepare_exit API to increment and decrement lockcount along with set and clear of g_cpu_lockset. Signed-off-by: Kishore S N <kishore.sn@samsung.com>
…test When we run smp test app repeatedly on flat_apps config, there is an assert fail in irq_csection sometimes. Signed-off-by: Kishore S N <kishore.sn@samsung.com>
When we run wm_test scan command, there is an assert failure in sched_unlock. This is because the current tcb has a non-zero lock count, whereas the g_cpu_lockset variable is zero. The fix contains 2 parts: 1. In prepare_exit API, perform set and clear of the lockset only if it is not already set. If lockset is already set by another process, then the exiting process must not clear it. 2. In task_exit API, when we try to get the new task at the head of the readytorun list, we must use this_cpu() to get the current cpu. This is because during task_exit, the task might get blocked and switched to a new cpu. So, while trying to get the new task, we need to make sure that we are using the correct cpu. Signed-off-by: Kishore S N <kishore.sn@samsung.com>
Some kernel test cases are written for single core. So, add additional control logic to control the flow of tasks when SMP is enabled Signed-off-by: Abhishek Akkabathula <a.akkabathul@samsung.com>
Hi @sunghan-chang , as discussed before, we might need some time until we finish integration for PM+SMP. In the meantime, will we create another defconfig for SMP first? So that we can verify it with onboarding and OTN process. |
Remove redundant code related to SMP Signed-off-by: Kishore S N <kishore.sn@samsung.com>
Yes. I agree. I have created a temporary defconfig with loadable apps SMP. However, currently there is a build error with this config which is related to PM. If we comment out "arm_gic_raise_softirq(1, 0);", then the build is successful. |
Hi Kishore, currently you are enabling PM in the SMP config. It will have no problem because the state is locked in PM_NORMAL. the arm_gic_raise_softirq(1, 0) is rtk realization, in TizenRT, we should use arm_cpu_sgi(), both are serving the same purpose, which is to send signal (ie. SGI) to target cores under same cluster.
|
This is a temp defconfig in which SMP is enabled with loadable apps. It will be used for development and testing of PM with SMP. Once, it is finalised, then SMP will be enabled by default in flat and loadable apps configs. This commit has also temporarily commented out the call to PM api arm_gic_raise_softirq() to fix a build error. This also needs to be reverted once PM implementation is completed and verified. Signed-off-by: Kishore S N <kishore.sn@samsung.com>
@edwakuwaku |
Hi @kishore-sn , okay. Btw, don't you need to update .circleci/config.yml? We should include for server build as well. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good.
Fix crash in sched_unlock for flat apps SMP
Fix assertion fail in irq_csection during smp repeat test
Fix crash in sched_setaffinity api for loadable apps smp