LLC responses to backend with the same ID are not in correct order #643
Comments
Thanks for the detailed description; your understanding is correct -- this is a pretty serious timing bug. An immediate workaround is to limit ID reuse to 1 in the target and increase the ID width of the interface to support the same number of in-flight transactions. Do you happen to know which agent is generating the requests with the same ID?
I think in the short term the workaround is more useful, as it is what we used to do in FireChip before the Chipyard migration. Losing that modification was an oversight resulting from my haste and our desire to use upstream code. That said, the easiest fix is to block on requests to the same ID. Unfortunately, the front end of the cache schedules transactions in FIFO order, so that would lead to head-of-line (HOL) blocking. If you wanted to implement that, you could mirror this but instead look for an ID collision, and then L189 would read:
You'd have to do the same with writes. A higher-performance, more complex, but more accurate solution is to improve the cache frontend to let it schedule across transactions in the read and write buffers.
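As a behavioral illustration of the blocking workaround (plain Scala, not the actual FireSim/Chisel code; `Req` and the class name are invented for this sketch), a scheduler could stall any request whose ID matches an outstanding one:

```scala
// Hypothetical behavioral model of the proposed workaround: stall any
// request whose AXI ID collides with an outstanding (unanswered) request.
case class Req(id: Int, addr: Long)

class SameIdBlockingScheduler {
  // Count of in-flight requests per ID.
  private val inflight =
    scala.collection.mutable.Map.empty[Int, Int].withDefaultValue(0)

  // True if the request may issue; false means it must stall. Since the
  // cache frontend schedules in FIFO order, a stall here blocks everything
  // behind it (the HOL blocking mentioned above).
  def canIssue(r: Req): Boolean = inflight(r.id) == 0

  def issue(r: Req): Unit = { require(canIssue(r)); inflight(r.id) += 1 }
  def retire(id: Int): Unit = inflight(id) -= 1
}
```

With this policy, at most one transaction per ID is ever outstanding, so the memory model can never return same-ID responses out of order.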
Thank you, David. Those requests were generated by different agents (core 3, core 1, and the serial adaptor). I think the ID reuse is caused by the AXI ID width being smaller than the TL source width at the mbus. I am actually using FireSim 1.5, and my target is a quad-core BOOM with 4 MSHRs in each data cache, which results in a 9-bit TL source at the mbus input. I had previously increased the AXI ID width from the default 4 bits to 5, but that obviously wasn't enough. So I set the mbus ID width to 9, and it seems that this limits ID reuse to 1 and fixes the original problem. (I also had to increase the ID width in the memory model here.) I still want to know whether setting the ID width equal to the TL source width is the most efficient way to limit ID reuse to 1. Most combinations of those 9 bits are unused, so this does not seem very efficient.
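On the efficiency question, a rough back-of-the-envelope sketch (the function name is illustrative, and it assumes an ID-remapping layer that reassigns IDs rather than passing TL source bits through directly): with remapping, the ID width only needs to cover the maximum number of concurrent transactions, not the whole source space.

```scala
// Minimum ID bits needed if IDs are remapped so that at most
// `maxConcurrent` transactions are in flight at once (ceil(log2(n))).
// Without remapping, ID width must equal the full TL source width
// even though most encodings go unused.
def idBitsFor(maxConcurrent: Int): Int =
  math.max(1, 32 - Integer.numberOfLeadingZeros(maxConcurrent - 1))
```

For example, 16 concurrent transactions fit in a 4-bit ID space, while passing a 9-bit source field through unchanged reserves 512 encodings regardless of actual concurrency.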
In the UserYanker you can limit maxFlight to some number (specifically, setting it to 1 would limit ID reuse to 1). 1.5 is a fairly old version; I'd highly encourage you to bump to a later one with the SiFive L2. In that system only the cache masters the mbus, so it would be trivial to get more uniform use of the ID space. Plus, we have a bunch of new features.
Thanks, David. Not to get off topic, but as for why I am still using 1.5: I tried multi-core BOOM with the SiFive L2 on 1.7 and 1.8, and it hangs very often.
I think all of those BOOM issues have been resolved by 1.10. |
@jerryz123, I understand that BOOM has changed a lot since FireSim 1.5, but can you pinpoint what exactly has been fixed that lets BOOM work with the SiFive L2? I am thinking of using the SiFive L2 with 1.5 so I don't have to go through the pain of porting my custom hardware and software to the latest FireSim version.
I believe this PR had some fixes. chipsalliance/rocket-chip#2053 |
For 1.13.0, add an elaboration-time error if ID reuse is enabled under a model that may reorder across IDs. Longer-term fix: limit in-flight requests in the transaction scheduler.
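A minimal sketch of such an elaboration-time guard (the parameter names are hypothetical, not FireSim's actual config fields):

```scala
// Hypothetical elaboration-time parameters for the memory model.
case class MemModelParams(maxIdReuse: Int, reordersAcrossIds: Boolean)

// Fail at elaboration rather than silently permit the unsafe combination:
// multiple in-flight transactions per ID under a reordering model.
def checkIdReuse(p: MemModelParams): Unit =
  require(
    p.maxIdReuse <= 1 || !p.reordersAcrossIds,
    s"ID reuse of ${p.maxIdReuse} is unsafe under a memory model that may reorder responses"
  )
```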
My understanding of the AXI spec is that transactions with the same ID must be responded to in the order they were received on the request channel. However, it seems there are corner cases in which the LLC does not comply with this rule. I will try to explain what I have found using the attached waveform.
At marker C, LLC receives a request on address 8000B1C0 with ID = 01. This request is a miss and is sent to DRAM at marker D.
At B and F, LLC receives requests on addresses 8002B1C0 and 80001000, respectively, with ID = 01.
The request on 80001000 is a hit, and the LLC responds to it at marker E without following the correct ordering. So the backend perceives this as the response to 8000B1C0 and releases the response on the memory model's R channel earlier than it should (I have not included the waveform for this channel).
Please let me know if my understanding is correct.
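The scenario above can be checked mechanically. A minimal sketch of a same-ID ordering monitor (plain Scala, class name invented; request order follows the narrative above), fed with the requests from markers C, B, and F:

```scala
import scala.collection.mutable

// Checks the AXI rule described above: read responses sharing an ID must
// return in the order their requests were accepted on the request channel.
class SameIdOrderMonitor {
  // Per-ID queue of outstanding request addresses, oldest first.
  private val pending = mutable.Map.empty[Int, mutable.Queue[Long]]

  def request(id: Int, addr: Long): Unit =
    pending.getOrElseUpdate(id, mutable.Queue.empty[Long]).enqueue(addr)

  // True iff this response matches the oldest outstanding request with
  // the same ID (i.e., ordering is legal); only then is it retired.
  def response(id: Int, addr: Long): Boolean = {
    val q = pending.getOrElse(id, mutable.Queue.empty[Long])
    q.nonEmpty && q.head == addr && { q.dequeue(); true }
  }
}
```

Replaying the waveform events through this monitor flags the response at marker E as a violation, since 8000B1C0 (the miss sent to DRAM) is still the oldest outstanding request for ID 01.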