[QCA7000] SPI missed interrupt #5766

Open
samsamfire opened this issue Dec 2, 2023 · 13 comments

Comments

@samsamfire

samsamfire commented Dec 2, 2023

Describe the bug

Dear all,

I have a Compute Module 3 board connected to two MCP2515 chips and one QCA7000 chip (EV charging applications).
2-3 years ago we noticed a problem in the interrupt handling of the QCA7000. After a variable amount of time (a couple of minutes to a couple of hours) we would miss an interrupt on the QCA7000, which would effectively freeze communication.
We found that if the QCA7000 and MCP2515 interrupts arrived at exactly the same time (at the microsecond level), the interrupt would be missed on the QCA7000 interface. We spent some time trying to understand why this happened, but gave up in the end, thinking that it was a hardware issue. We implemented a "dirty" fix in the qca_spi.c driver that additionally polls the QCA7000 every 50 ms.
Several years have passed, and I am digging up this old problem. We have seen in the field that the driver sometimes seems to stop receiving messages. I have also seen that some packets can get dropped. After reading the QCA7000 datasheet extensively, and knowing that the RX buffer is pretty small, I think that this 50 ms polling has only reduced the occurrence of the problem and has not resolved the root cause. I have done some tests on an STM32MP1 with the QCA7000 as the only SPI device (CAN is integrated) and never got dropped packets or a frozen interface.

Anyway, after reading the very interesting thread #1490 about two MCP2515s missing interrupts, I think that we are in the same situation. The QCA7000 interrupt is indeed level-triggered (high), just like the MCP2515. We never saw this problem with the MCP2515 because we were running a 5.4.y kernel with this patch.

In #2267 @pelwell said "If this is effective we should reconsider all usage of edge-triggered interrupts". I don't know what the status of this is. Also, the "ONESHOT" flag mentioned there is not present in the qcaspi driver. I don't understand exactly what it is supposed to do, but if we were to use a level-triggered interrupt, shouldn't it also be added?

What are your thoughts on this? Has anybody experienced similar issues with other SPI devices?

Thanks !

Steps to reproduce the behaviour

Initiate repeated SLAC with the QCA7000. After a certain time the interface freezes (no more messages are read).
CAN0 and CAN1 are also running.

Device (s)

Raspberry Pi CM3+

System

We are using raspbian buster with kernel v5.4.77.
I'll get back to you for the "firmware version".

Logs

No response

Additional context

No response

@lategoodbye
Contributor

lategoodbye commented Dec 2, 2023

Hi, as the guy who mainlined this qca_spi driver back in 2014, it's nice to see some feedback.

In order to help you here, it would be nice if you could please provide more specific information:

  • what are your config.txt settings to see how these devices are connected, especially the SPI interrupts
  • in case this freeze happens what is the output of /sys/kernel/debug/ethX/info (replace X with the number of your qca7000 network interface)

Looking at qca7000-overlay.dts shows that the interrupt is wrongly specified as 1 ( = IRQ_TYPE_EDGE_RISING ).
Does changing to 4 ( = IRQ_TYPE_LEVEL_HIGH ) avoid this issue?
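
For reference, the numeric cell values quoted above come from the kernel's `include/dt-bindings/interrupt-controller/irq.h`. The short Python sketch below only decodes those numbers; the helper names `describe_irq_cell` and `is_level_triggered` are made up for illustration and are not part of any kernel or device tree tooling.

```python
# Trigger-type values as defined in
# include/dt-bindings/interrupt-controller/irq.h
IRQ_TYPES = {
    0: "IRQ_TYPE_NONE",
    1: "IRQ_TYPE_EDGE_RISING",
    2: "IRQ_TYPE_EDGE_FALLING",
    3: "IRQ_TYPE_EDGE_BOTH",
    4: "IRQ_TYPE_LEVEL_HIGH",
    8: "IRQ_TYPE_LEVEL_LOW",
}


def describe_irq_cell(value):
    """Hypothetical helper: name the trigger type found in an 'interrupts' cell."""
    return IRQ_TYPES.get(value, "unknown ({})".format(value))


def is_level_triggered(value):
    """Hypothetical helper: True for the two level-triggered types."""
    return value in (4, 8)


if __name__ == "__main__":
    for cell in (1, 4):
        print(cell, describe_irq_cell(cell), "level" if is_level_triggered(cell) else "edge")
```

So the overlay's `1` asks for a rising-edge interrupt, while `4` would ask for an active-high level interrupt.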

Btw please update your kernel, the driver received a lot of fixes and improvements (especially 429205d )

pelwell added a commit to pelwell/linux that referenced this issue Dec 5, 2023
The QCA7000 interrupt should be level-triggered.

See: raspberrypi#5766

Signed-off-by: Phil Elwell <phil@raspberrypi.com>
@pelwell
Contributor

pelwell commented Dec 5, 2023

The "ONESHOT" flag tells the kernel that when that interrupt occurs, further interrupts from that source should be disabled until the current interrupt has finished being processed, at which point it should be re-enabled (I think "one-at-a-time" is a better description, but not as concise). This is particularly important when the interrupt handler is going to be run in a thread, otherwise the thread may never get a chance to run.

Notice that the qca7000 driver doesn't use request_threaded_irq - it uses the normal request_irq. However, the Pi kernel config files have CONFIG_IRQ_FORCED_THREADING=y, which coerces most interrupt handlers into threads but also specifies IRQF_ONESHOT, so you don't need to be concerned about that.
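
As a rough illustration of why the masking matters for a level-triggered line handled in a thread, here is a toy Python model. It is not kernel code: the function `run` and its one-action-per-step scheduling are simplifying assumptions, chosen only to show the hard interrupt re-firing when the line is left unmasked.

```python
def run(oneshot, max_steps=1000):
    """Toy model of a level-triggered IRQ whose real work is done in a thread.

    Returns (serviced, hard_irq_count). Each loop iteration is one CPU
    "slot"; a pending, unmasked interrupt always wins over the thread.
    """
    line_high = True       # the device is requesting service
    masked = False         # interrupt masked at the controller
    thread_woken = False   # threaded handler is runnable
    hard_irqs = 0

    for _ in range(max_steps):
        if line_high and not masked:
            # Hard-IRQ context: it only wakes the thread, so the device
            # has not been touched and the line is still high.
            hard_irqs += 1
            thread_woken = True
            if oneshot:
                masked = True  # ONESHOT: stay masked until the thread finishes
        elif thread_woken:
            # Threaded handler services the device, which drops the line;
            # the interrupt is then unmasked again.
            line_high = False
            thread_woken = False
            masked = False
            return True, hard_irqs
    return False, hard_irqs


if __name__ == "__main__":
    print("with ONESHOT:   ", run(oneshot=True))
    print("without ONESHOT:", run(oneshot=False))
```

In this model the ONESHOT case is serviced after a single hard interrupt, while the unmasked case spends every slot re-entering the hard handler and the thread never runs.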

Pull request #5771 has a patched overlay that requests a level-triggered interrupt. You can install it (on your 5.4 kernel) with:

$ wget https://github.com/raspberrypi/linux/raw/1faec7b5f55c5245977267371539b35b6000896a/arch/arm/boot/dts/overlays/qca7000-overlay.dts
$ dtc -@ -Hepapr -I dts -O dtb -o qca7000.dtbo qca7000-overlay.dts
$ sudo cp qca7000.dtbo /boot/overlays

@samsamwatt

Hi,

@lategoodbye, thanks for the feedback. Before spending more time debugging, I've upgraded to a more recent kernel version (5.10), especially because I have seen some commits since the end of 2020 that fix some major bugs in SPI handling.
I'd like to start with a clean slate.
For your information, the config.txt looks like this:

dtoverlay=spi1-1cs,cs0_spidev=off
dtoverlay=spi2-1cs,cs0_spidev=off
dtoverlay=mcp2515-can0-overlay,oscillator=20000000,interrupt=1
dtoverlay=mcp2515-can1-overlay,oscillator=20000000,interrupt=2
dtoverlay=spi-bcm2835-overlay
dtoverlay=qca7000

@pelwell thanks for the details!

Yes, I also thought about changing to IRQ_TYPE_LEVEL_HIGH. However, after some testing with versions 5.4 and 5.10 with IRQ_TYPE_LEVEL_HIGH, I get a kernel panic (I lose the SSH connection straight away and can't reconnect) after communicating on the interface.

I am a bit confused: either my documentation/understanding is wrong (but it's under NDA, so it is difficult to share more info) or there is another bug.

This issue is not easy to reproduce, and I believe that there may be other issues that are linked to the QCA firmware rather than the driver itself. Our setup is pretty simple: PEV on one side, EVSE on the other, plus a sniffer on the PLC line, and we try to SLAC indefinitely. At some point we stop receiving SLAC messages on the EVSE side (our side).

I'll keep you posted if I get more information.

On a different note, is it completely "unexpected" to get dropped messages from this interface? For the MCP2515 devices we do get some dropped CAN frames, but my understanding is that with the tiny RX buffer (I believe 2 or 3 frames) the interrupt is not guaranteed to be handled in time. I believe that at 500 kbit/s the maximum is around 4k frames per second, so the kernel has to guarantee handling in less than 500 µs.
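
The figures above can be checked with some back-of-the-envelope arithmetic. The Python sketch below assumes classic CAN frames with an 11-bit identifier, ignores bit stuffing (which only makes frames longer), and assumes two receive buffers; the function names are made up for illustration.

```python
def frame_bits(data_bytes):
    """Bits on the wire for a standard (11-bit ID) CAN data frame, no stuffing.

    SOF 1 + ID 11 + RTR 1 + IDE 1 + r0 1 + DLC 4 + CRC 15 + CRC delimiter 1
    + ACK 1 + ACK delimiter 1 + EOF 7 + interframe space 3 = 47 overhead bits.
    """
    return 47 + 8 * data_bytes


def frames_per_second(bitrate, data_bytes):
    """Upper bound on the frame rate for back-to-back frames."""
    return bitrate / frame_bits(data_bytes)


def service_deadline_us(bitrate, data_bytes, rx_buffers):
    """Time (in microseconds) for back-to-back frames to fill rx_buffers buffers."""
    return rx_buffers * frame_bits(data_bytes) / bitrate * 1e6


if __name__ == "__main__":
    bitrate = 500_000  # 500 kbit/s, as in the comment above
    print("bits per 8-byte frame:", frame_bits(8))
    print("max frames/s: %.0f" % frames_per_second(bitrate, 8))
    print("deadline with 2 buffers: %.0f us" % service_deadline_us(bitrate, 8, 2))
```

Under these assumptions an 8-byte frame takes 111 bit times, which gives roughly 4500 frames per second and about 444 µs before two buffers are full, in line with the "around 4k" and "less than 500 µs" estimates. Shorter frames would tighten the deadline further.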

@pelwell
Contributor

pelwell commented Dec 5, 2023

5.10 is still ancient - we're shipping 6.1 now, and preparing for a move to 6.6 next year - but it might be OK for your needs.

dtoverlay=mcp2515-can0-overlay,oscillator=20000000,interrupt=1
dtoverlay=mcp2515-can1-overlay,oscillator=20000000,interrupt=2

The correct names are mcp2515-can0 and mcp2515-can1 - don't include the -overlay.

Are you really using GPIOs 1 & 2 as interrupt GPIO pins? It's not necessarily wrong, but it is unusual since GPIOs 0-3 are often used for I2C, with external pull-ups on most platforms.

dtoverlay=spi-bcm2835-overlay

This overlay doesn't exist, and wouldn't be useful on rpi-5.10.y because that's the default SPI driver.

@lategoodbye
Contributor

@samsamfire I tested IRQ_TYPE_LEVEL_HIGH on our reference platform and the whole system freezes during boot. Currently I don't know when I will find the time to dig deeper.

@samsamfire
Author

I have tested without our "polling patch" on a newer kernel version (latest 5.10.y), and after some testing in the field we get exactly the same phenomenon: missed interrupts, so we don't reply to the EV in time, which causes a timeout. The occurrence is relatively low, but still unacceptable.
I am not sure I will get the time to investigate the issue further; I think I will have to reapply our polling patch for the time being.
@lategoodbye does your information on the QCA7000 say that it should be a level-triggered interrupt? Or are we missing something?

@lategoodbye
Contributor

@samsamfire My problem is that I don't have the setup to reproduce your problem. So a dump of the info file after the error has occurred (without the polling patch), as requested before, would still be nice.

To your question: I only want to say, please stick with the current level-triggered interrupt configuration.

@pelwell
Contributor

pelwell commented Dec 7, 2023

Do you mean "current edge-triggered interrupt configuration"?

@lategoodbye
Contributor

Sorry, you are right. I meant "edge-triggered interrupt configuration".

@samsamfire
Author

Hi,

Typical output of /sys/kernel/debug/ when the problem occurs:
[image]
I am certain there is a problem with the way the interrupt is defined:
if the interrupt gets missed, I can see the interrupt pin stay high indefinitely. Reading device attributes etc. using plctool does not work either; we get no response.

@pelwell
Contributor

pelwell commented Dec 11, 2023

If a hardware device is designed to generate a level-triggered interrupt but the interrupt is configured to be edge-triggered, everything will be fine until the first occasion an interrupt is missed; from that point, no further interrupts will be received, even though the device is doing its best to request attention, because there will be no more edges on the interrupt line.
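
This failure mode can be reproduced in a toy Python model. It is only a sketch: the function `simulate`, the discrete time steps and the single deliberately "missed" event are assumptions made for illustration, not a description of the actual interrupt controller.

```python
def simulate(trigger, arrivals, missed_at, steps):
    """Toy model of a device that holds its IRQ line high until serviced.

    trigger   -- "edge" (rising edge only) or "level" (fires while high)
    arrivals  -- time steps at which the device receives new data
    missed_at -- the one time step at which the CPU fails to notice the IRQ
    Returns (service_passes, frames_still_pending).
    """
    arrivals = set(arrivals)
    line = False    # state of the interrupt line
    pending = 0     # frames waiting inside the device
    serviced = 0    # number of times the handler ran

    for t in range(steps):
        prev = line
        if t in arrivals:
            pending += 1
        line = pending > 0

        if trigger == "edge":
            fire = line and not prev   # only a low-to-high transition counts
        else:
            fire = line                # a high line is always noticed

        if t == missed_at:
            fire = False               # the event is lost on this step

        if fire:
            serviced += 1
            pending = 0                # handler drains the device ...
            line = False               # ... and the device drops the line

    return serviced, pending


if __name__ == "__main__":
    for mode in ("edge", "level"):
        print(mode, simulate(mode, arrivals=[1, 5, 9], missed_at=5, steps=12))
```

With the edge-triggered setting, the missed event at step 5 leaves the line stuck high and the later arrival at step 9 is never serviced; with the level-triggered setting, the handler simply runs one step late and catches up.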

@samsamfire
Author

Yes, agreed. That's why I opened this issue in the first place.
If I get the time I will try to investigate.

I am not 100% sure, but there could be different types of interrupts (level and edge) depending on the state. I don't know why that would freeze the system, though.

@pelwell
Contributor

pelwell commented Dec 11, 2023

One explanation for the freeze would be if the disabling of interrupts required by ONESHOT operation didn't actually work.
