Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix SMP issues #6092

Merged
merged 15 commits into from May 15, 2024
Merged

Fix SMP issues #6092

merged 15 commits into from May 15, 2024

Conversation

kishore-sn
Copy link
Contributor

Fix crash in sched_unlock for flat apps SMP
Fix assertion fail in irq_csection during smp repeat test
Fix crash in sched_setaffinity api for loadable apps smp

&g_cpu_irqlock);
}
}
#endif /* CONFIG_SMP */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are moving this code piece out to form a new API --> restore_critical_section(). But here you are not invoking that API.
As I checked, there are quite some other places invoking sched_resume_scheduler(). After removing this code piece, the remaining operation left in this API is wrapped by macros, which mean we might be doing nothing here if all macros are disabled.
Could you please check whether a restore_critical_section() is needed here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have few observations on this.
First of all, I feel sched_resume_scheduler() is supposed to do only scheduler related work and the irqcount related code was placed wrongly in it. When we took the latest code from nuttx, it seems they have corrected this.
Secondly, I see that resume_scheduler() API is called in following places.

os/arch/arm/src/armv7-a/arm_releasepending.c
os/arch/arm/src/armv7-a/arm_cpustart.c
os/arch/arm/src/armv7-a/arm_exit.c
os/arch/arm/src/armv7-a/arm_unblocktask.c
os/arch/arm/src/armv7-a/arm_reprioritizertr.c
os/arch/arm/src/armv7-a/arm_cpupause.c
os/arch/arm/src/armv7-a/arm_blocktask.c

In each of the above cases, the API is called from IRQ context. But now we have moved the irqcount logic to doirq() and so, we would be still performing this check when we are returning from IRQ context. So, I think this is fine.

The only exception is arm_exit(). However, since this code is working well in nuttx and also all our tests seem to be fine, I think we can keep this change for now. If we face any issues in future, we can recheck this part.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for detailed explanation, it make sense to me.
But, could you please add a remark or comment in the code piece here, since it might be a potential checkpoint?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added a note in the commit msg.

apps/examples/testcase/le_tc/kernel/itc_pthread.c Outdated Show resolved Hide resolved
apps/examples/testcase/le_tc/kernel/tc_mqueue.c Outdated Show resolved Hide resolved
apps/examples/testcase/le_tc/kernel/tc_task.c Outdated Show resolved Hide resolved
os/arch/arm/src/common/up_checkstack.c Show resolved Hide resolved
@kishore-sn kishore-sn changed the title [Under Verification] Fix SMP issues Fix SMP issues Apr 16, 2024
@edwakuwaku
Copy link
Contributor

Hi @kishore-sn , current loadable_apps enable CONFIG_PM by default. Thus, CONFIG_SMP shouldn't be enabled now, because hotplug mode implementation is pending. The logic for dual core + PM is also not verified and it requires some fix and changes I think.

@kishore-sn
Copy link
Contributor Author

Hi @kishore-sn , current loadable_apps enable CONFIG_PM by default. Thus, CONFIG_SMP shouldn't be enabled now, because hotplug mode implementation is pending. The logic for dual core + PM is also not verified and it requires some fix and changes I think.

Ok. I will revert it for now. Let us enable it later.

@sunghan-chang
Copy link
Contributor

@kishore-sn could you check Edward comment?
@r-prabu Once review is done, please merge this.

neel-samsung and others added 6 commits May 14, 2024 11:41
…adable apps smp

When running smp test app using loadable apps config, there is an assert failure
due to improper ordering of system calls in syscall.h due to which call to
getcpu() api is redirected to setaffinity() api. To fix this issue, we modify
the ordering of syscalls.

Signed-off-by: Neelkumar Patel <neel.patel@samsung.com>
When running kernel tc there is an assert fail in sched_unlock.
In this case, sched_unlock is called when the lockcount is already
zero. As a result, checking for the g_cpuschedlock and g_cpulockset
is causing an assert.

But, since the lockcount is already zero, this check is wrong. The
check needs to be done only when sched_unlock is called with a
non-zero lockcount value.

Signed-off-by: Kishore S N <kishore.sn@samsung.com>
…ritical section

Signed-off-by: Abhishek Akkabathula <a.akkabathul@samsung.com>
calling sched_unlock without sched_lock, hence removed it

Signed-off-by: Abhishek Akkabathula <a.akkabathul@samsung.com>
correct the case according to CONFIG_SMP enable and disable
When we enable SMP config, but set NCPUS as 1, then there is a compile
error. Fix it by adding check for NCPUS.

Signed-off-by: Kishore S N <kishore.sn@samsung.com>
kishore-sn and others added 2 commits May 14, 2024 11:41
When printing assert information the stack info for CPU1 idle task
is printed wrong. This is because the adj_stack_ptr value for cpu1
idle task is not initialized properly.

Signed-off-by: Kishore S N <kishore.sn@samsung.com>
… printing TCB list in assert logs

when printing the assert information, the size of stack used for each
TCB is printed wrong and showing stack overflow. This is because when
we are checking the TCB stack, we check from stack_alloc_ptr instead of
stack_base_ptr. So, stack_alloc_ptr points to TLS Data, Arguments and
then available stack whereas stack_base_ptr directly points to available
stack.

Signed-off-by: Neelkumar Patel <neel.patel@samsung.com>
@edwakuwaku
Copy link
Contributor

Hello @kishore-sn @sunghan-chang , when we are merging this PR, please also review and merge #6079 together. It contains fixes and changes required for SMP in BSP level.

kishore-sn and others added 5 commits May 14, 2024 11:58
Some random crash issues (data abort, prefetch abort) were noticed
when running various commands like ble_rmc, kernel_tc. Our analysis
showed that the issue is happening since the stack pointer is getting
updated with wrong values after multiple context switches. We found
similar issues reported and fixed in nuttx. So, we pick the below
commits from nuttx to fix our issues.

Note: We deleted sched_resume_scheduler() in this commit. sched_resume_scheduler()
is supposed to do only scheduler related work and the irqcount related code
was placed wrongly in it. When we took the latest code from nuttx, it seems they have corrected this.

Also, I see that resume_scheduler() API is called in following places.

os/arch/arm/src/armv7-a/arm_releasepending.c
os/arch/arm/src/armv7-a/arm_cpustart.c
os/arch/arm/src/armv7-a/arm_exit.c
os/arch/arm/src/armv7-a/arm_unblocktask.c
os/arch/arm/src/armv7-a/arm_reprioritizertr.c
os/arch/arm/src/armv7-a/arm_cpupause.c
os/arch/arm/src/armv7-a/arm_blocktask.c

In each of the above cases, the API is called from IRQ context. But now we have moved the irqcount
logic to doirq() and so, we would be still performing this check when we are returning from IRQ context.

The only exception is arm_exit(). However, since this code is working well in nuttx and also all our
tests seem to be fine, I think we can keep this change for now. If we face any issues in future, we can recheck this part.

======================================================================
commit e2df52390ace192e8fe632ecb83c2fa44373259a

    SMP: fix crash when switch to new task which is still running

    Situation:

    Assume we have 2 cpus, and busy run task0.

    CPU0                                CPU1
    task0 -> task1                      task2 -> task0
    1. remove task0 form runninglist
    2. take task1 as new tcb
    3. add task0 to blocklist
    4. clear spinlock
                                        4.1 remove task2 form runninglist
                                        4.2 take task0 as new tcb
                                        4.3 add task2 to blocklist
                                        4.4 use svc ISR swith to task0
                                        4.5 crash
    5. use svc ISR swith to task1

    Fix:
    Move clear spinlock to the end of svc ISR

commit 2241969e5ada2eaafee38ab6de6457aa15c8402c

    SMP: fix crash when switch to new task which is still running

    cpu0 thread0:                        cpu1:
    sched_yield()
    nxsched_set_priority()
    nxsched_running_setpriority()
    nxsched_reprioritize_rtr()
    nxsched_add_readytorun()
    up_cpu_pause()
                                         IRQ enter
                                         arm64_pause_handler()
                                         enter_critical_section() begin
                                         up_cpu_paused() pick thread0
                                         arm64_restorestate() set thread0 tcb->xcp.regs to CURRENT_REGS
    up_switch_context()
      thread0 -> thread1
    arm64_syscall()
        case SYS_switch_context
         change thread0 tcb->xcp.regs
        restore_critical_section()
                                         enter_critical_section() done
                                         leave_critical_section()
                                         IRQ leave with restore CURRENT_REGS
                                         ERROR !!!

    Reason:
    As descript above, cpu0 swith task: thread0 -> thread1, and the
    syscall() execute slowly, this time cpu1 pick thread0 to run at
    up_cpu_paused(). Then cpu0 syscall execute, cpu1 IRQ leave error.

    Resolve:
    Move arm64_restorestate() after enter_critical_section() done

======================================================================

Signed-off-by: Kishore S N <kishore.sn@samsung.com>
When running smp test app using flat apps config, there is an
assert fail due to improper values of lockcount and g_cpu_lockset.
This is because we are incrementing lockcount in prepare_exit API
without the corresponding setting of g_cpu_lockset. To fix this issue
we modify prepare_exit API to increment and decrement lockcount along
with set and clear of g_cpu_lockset.

Signed-off-by: Kishore S N <kishore.sn@samsung.com>
…test

When we run smp test app repeatedly on flat_apps config, there is an assert
fail in irq_csection sometimes.

Signed-off-by: Kishore S N <kishore.sn@samsung.com>
When we run wm_test scan command, there is an assert failure in
sched_unlock. This is because the current tcb has a non-zero lock count,
whereas the g_cpu_lockset variable is zero.

The fix contains 2 parts:
1. In prepare_exit API, perform set and clear of the lockset only if it
is not already set. If lockset is already set by another process, then the
exiting process must not clear it.

2. In task_exit API, when we try to get the new task at the head of the
readytorun list, we must use this_cpu() to get the current cpu. This is
because during task_exit, the task might get blocked and switched to a
new cpu. So, while trying to get the new task, we need to make sure that
we are using the correct cpu.

Signed-off-by: Kishore S N <kishore.sn@samsung.com>
Some kernel test cases are written for single core. So, add additional
control logic to control the flow of tasks when SMP is enabled

Signed-off-by: Abhishek Akkabathula <a.akkabathul@samsung.com>
@edwakuwaku
Copy link
Contributor

Hi @sunghan-chang , as discussed before, we might need some time until we finish integration for PM+SMP. In the meantime, will we create another defconfig for SMP first? So that we can verify it with onboarding and OTN process.
Hello @kishore-sn , could you please share your thoughts as well~?

Remove redundant code related to SMP

Signed-off-by: Kishore S N <kishore.sn@samsung.com>
@kishore-sn
Copy link
Contributor Author

Hi @sunghan-chang , as discussed before, we might need some time until we finish integration for PM+SMP. In the meantime, will we create another defconfig for SMP first? So that we can verify it with onboarding and OTN process. Hello @kishore-sn , could you please share your thoughts as well~?

Yes. I agree. I have created a temporary defconfig with loadable apps SMP. However, currently there is a build error with this config which is related to PM. If we comment out "arm_gic_raise_softirq(1, 0);", then the build is successful.

@edwakuwaku
Copy link
Contributor

Hi @sunghan-chang , as discussed before, we might need some time until we finish integration for PM+SMP. In the meantime, will we create another defconfig for SMP first? So that we can verify it with onboarding and OTN process. Hello @kishore-sn , could you please share your thoughts as well~?

Yes. I agree. I have created a temporary defconfig with loadable apps SMP. However, currently there is a build error with this config which is related to PM. If we comment out "arm_gic_raise_softirq(1, 0);", then the build is successful.

Hi Kishore, currently you are enabling PM in the SMP config. It will have no problem because the state is locked in PM_NORMAL. the arm_gic_raise_softirq(1, 0) is rtk realization, in TizenRT, we should use arm_cpu_sgi(), both are serving the same purpose, which is to send signal (ie. SGI) to target cores under same cluster.
There will be 2 option:

  1. Keep CONFIG_PM, but please don't allow state drop as of now
  2. Disable CONFIG_PM
    If we go for first option, you may comment out the code first, and I will make necessary changes in my PR. But still, the PM under SMP condition should not work as expected~

This is a temp defconfig in which SMP is enabled with loadable apps.
It will be used for development and testing of PM with SMP. Once, it
is finalised, then SMP will be enabled by default in flat and loadable
apps configs.

This commit has also temporarily commented out the call to PM api
arm_gic_raise_softirq() to fix a build error. This also needs to be
reverted once PM implementation is completed and verified.

Signed-off-by: Kishore S N <kishore.sn@samsung.com>
@kishore-sn
Copy link
Contributor Author

Hi @sunghan-chang , as discussed before, we might need some time until we finish integration for PM+SMP. In the meantime, will we create another defconfig for SMP first? So that we can verify it with onboarding and OTN process. Hello @kishore-sn , could you please share your thoughts as well~?

Yes. I agree. I have created a temporary defconfig with loadable apps SMP. However, currently there is a build error with this config which is related to PM. If we comment out "arm_gic_raise_softirq(1, 0);", then the build is successful.

Hi Kishore, currently you are enabling PM in the SMP config. It will have no problem because the state is locked in PM_NORMAL. the arm_gic_raise_softirq(1, 0) is rtk realization, in TizenRT, we should use arm_cpu_sgi(), both are serving the same purpose, which is to send signal (ie. SGI) to target cores under same cluster. There will be 2 option:

  1. Keep CONFIG_PM, but please don't allow state drop as of now
  2. Disable CONFIG_PM
    If we go for first option, you may comment out the code first, and I will make necessary changes in my PR. But still, the PM under SMP condition should not work as expected~

@edwakuwaku
We will use option 1. I have commented out the API call and created a temporary commit.

@edwakuwaku
Copy link
Contributor

Hi @kishore-sn , okay. Btw, don't you need to update .circleci/config.yml? We should include for server build as well.

Copy link
Contributor

@edwakuwaku edwakuwaku left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good.

@sunghan-chang sunghan-chang merged commit 5f38275 into Samsung:master May 15, 2024
11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants