
LLC responses to backend with the same ID are not in correct order #643

Open
farzadfch opened this issue Oct 8, 2020 · 9 comments · May be fixed by #914
Comments

@farzadfch
Contributor

My understanding of the AXI spec is that transactions with the same ID must be responded to in the order they were received on the request channel. However, it seems there are corner cases in which the LLC does not comply with this rule. I will try to explain what I have found using the attached waveform.

At marker C, LLC receives a request on address 8000B1C0 with ID = 01. This request is a miss and is sent to DRAM at marker D.
At B and F, LLC receives requests on addresses 8002B1C0 and 80001000, respectively, with ID = 01.
The request on 80001000 is a hit, and the LLC responds to it at marker E without following the correct ordering. So the backend perceives this as the response to 8000B1C0 and releases the response on the memory model R channel earlier than it should (I have not included the waveform for this channel).
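For illustration, the ordering rule and the violation described above can be modeled with a small checker (a hypothetical sketch, not FireSim code): per AXI ID, responses must come back in the order the requests were accepted.

```python
from collections import defaultdict, deque

class SameIdOrderChecker:
    """Models the AXI4 rule that responses sharing an ID must return
    in the order their requests were accepted on the request channel."""
    def __init__(self):
        # axi_id -> addresses, in request-acceptance order
        self.pending = defaultdict(deque)

    def request(self, axi_id, addr):
        self.pending[axi_id].append(addr)

    def response(self, axi_id, addr):
        expected = self.pending[axi_id].popleft()
        assert addr == expected, (
            f"ID {axi_id:#x}: got response for {addr:#x}, "
            f"but {expected:#x} was requested first")

# Replay the scenario from the waveform:
chk = SameIdOrderChecker()
chk.request(0x01, 0x8000B1C0)  # marker C: miss, forwarded to DRAM at D
chk.request(0x01, 0x8002B1C0)  # marker B
chk.request(0x01, 0x80001000)  # marker F: hit
try:
    chk.response(0x01, 0x80001000)  # hit responds first at marker E
except AssertionError as e:
    print("ordering violation:", e)
```

The backend in this model would have attributed the early response to 8000B1C0, which is exactly the misbehavior observed.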

Please let me know if my understanding is correct.

[Waveform screenshot: Capture3]

@davidbiancolin
Contributor

Thanks for the detailed description, your understanding is correct -- this is a pretty serious timing bug.

An immediate workaround is to limit ID reuse to 1 in the target and increase the ID width of the interface to support the same number of transactions. Do you happen to know what agent is generating the requests with the same ID?

@davidbiancolin
Contributor

I think in the short term the workaround is more useful, as it used to be what we did in FireChip before the Chipyard migration. The fact that we lost that modification was an oversight resulting from my haste and our desire to use upstream code. That said, the easiest fix is to block on requests to the same ID. Unfortunately, the front end of the cache schedules transactions in FIFO order, so that would lead to HOL blocking. If you wanted to implement that, you could mirror this but instead look for an ID collision, and then L189 would read:

```scala
val can_deq_read = reads.valid && !read_set_collision && !miss_resource_hazard && io.rResp.ready && !read_id_inflight
```

You'd have to do the same with writes.

A higher performance, more complex but more accurate solution is to improve the cache frontend to let it schedule across transactions in the read and write buffers.
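To make the HOL-blocking concern concrete, here is a small Python model of the blocking approach under the FIFO constraint (names like `read_id_inflight` are borrowed from the discussion; the function itself is a hypothetical sketch, not the cache frontend code):

```python
from collections import deque

def schedule_fifo_block_same_id(read_queue, inflight_ids):
    """Sketch of a FIFO read scheduler with the same-ID blocking check.
    The head of the queue is held back while another transaction with
    the same ID is in flight (the read_id_inflight condition). Because
    dequeue is strictly FIFO, a blocked head also stalls every younger
    read behind it -- head-of-line (HOL) blocking."""
    if not read_queue:
        return None
    head_id, head_addr = read_queue[0]
    if head_id in inflight_ids:   # same-ID collision: cannot dequeue head
        return None               # ...and nothing younger can go either
    return read_queue.popleft()

q = deque([(0x01, 0x8000B1C0),    # same ID as the outstanding miss
           (0x02, 0x80001000)])   # different ID, could legally proceed
inflight = {0x01}                 # an ID-0x01 miss is outstanding in DRAM
print(schedule_fifo_block_same_id(q, inflight))  # None: HOL blocking
```

The higher-performance alternative mentioned above would instead scan past the blocked head and issue the ID-0x02 read.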

@farzadfch
Contributor Author

Thank you, David. Those requests were generated by different agents (core 3, core 1, and the serial adaptor). I think the ID reuse is caused by the AXI ID width being smaller than the TL source width at the mbus.

I actually use FireSim 1.5 and my target is a Quad-core BOOM with 4 MSHRs in each data cache. That results in a 9-bit TL source at mbus input. I have previously increased the AXI ID width to 5 bits from the default 4, but that obviously wasn't enough.

So I set the mbus ID width to 9 and it seems that this limits the ID reuse to 1 and fixes the original problem. (I also had to increase the ID width in the memory model here).

I still want to know whether setting the ID width equal to the TL source width is the most efficient way to limit ID reuse to 1. Most combinations of those 9 bits are unused, so this does not seem very efficient.
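For illustration, one way to limit ID reuse to 1 without a full 9-bit ID space would be to allocate AXI IDs dynamically from a small dense pool and route responses back via the recorded mapping (a hypothetical sketch; these names are not FireSim or rocket-chip code):

```python
class DenseIdAllocator:
    """Sketch: instead of passing the 9-bit TL source through as the
    AXI ID (512 IDs, mostly unused), hand out IDs from a small dense
    pool. Each AXI ID has at most one outstanding transaction, so the
    same-ID ordering problem cannot arise."""
    def __init__(self, num_ids):
        self.free = list(range(num_ids))
        self.owner = {}               # axi_id -> originating TL source

    def alloc(self, tl_source):
        if not self.free:
            return None               # pool exhausted: back-pressure
        axi_id = self.free.pop()
        self.owner[axi_id] = tl_source
        return axi_id

    def release(self, axi_id):        # called when the response returns
        tl_source = self.owner.pop(axi_id)
        self.free.append(axi_id)
        return tl_source

alloc = DenseIdAllocator(num_ids=16)  # 4-bit AXI ID space
a = alloc.alloc(tl_source=0x1A3)      # 9-bit TL source
print(a, hex(alloc.release(a)))
```

The trade-off is extra mapping state and stalls when the pool is exhausted, versus the wide-but-sparse direct mapping.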

@davidbiancolin
Contributor

In the UserYanker you can limit maxFlight to some number (specifically by setting it to Some(1)). To achieve the best system performance, you really need to finesse how TL sources are folded onto AXI4 IDs. I'd have to RTFC for that -- it would likely require some diplomacy hacking.
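As an illustrative model of what a per-ID maxFlight of 1 buys you (a hypothetical sketch, not the actual diplomacy code): each AXI ID is gated so at most one transaction is outstanding, which makes same-ID reordering downstream impossible.

```python
from collections import Counter

class MaxFlightGate:
    """Sketch of a per-ID in-flight limiter: a request may issue only
    if fewer than max_flight transactions with its ID are outstanding.
    With max_flight = 1, each ID has at most one transaction in flight,
    so responses with the same ID can never be reordered downstream."""
    def __init__(self, max_flight=1):
        self.max_flight = max_flight
        self.inflight = Counter()     # axi_id -> outstanding count

    def can_issue(self, axi_id):
        return self.inflight[axi_id] < self.max_flight

    def issue(self, axi_id):
        assert self.can_issue(axi_id)
        self.inflight[axi_id] += 1

    def retire(self, axi_id):         # response observed
        self.inflight[axi_id] -= 1

g = MaxFlightGate(max_flight=1)
g.issue(0x01)
print(g.can_issue(0x01))  # False: second ID-0x01 request must wait
```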

1.5 is a fairly old version; I'd highly encourage you to bump to a later one with the SiFive L2. In that system only the cache masters the mbus, so it would be trivial to get more uniform use of the ID space. Plus, we have a bunch of new features.

@farzadfch
Contributor Author

Thanks, David. Not trying to get off topic, but as for why I am still using 1.5: I tried multi-core BOOM with the SiFive L2 on 1.7 and 1.8, and it hangs very often.

@jerryz123
Contributor

I think all of those BOOM issues have been resolved by 1.10.

@davidbiancolin davidbiancolin added this to the 1.11.0 milestone Nov 9, 2020
@davidbiancolin davidbiancolin self-assigned this Nov 9, 2020
@farzadfch
Contributor Author

@jerryz123, I understand that BOOM has changed a lot since FireSim 1.5, but can you pinpoint what exactly was fixed that allows BOOM to work with the SiFive L2? I am thinking of using the SiFive L2 with 1.5 so I don't have to go through the pain of porting my custom hardware and software to the latest FireSim version.

@jerryz123
Contributor

I believe this PR had some fixes. chipsalliance/rocket-chip#2053

@davidbiancolin davidbiancolin modified the milestones: 1.11.0, 1.12.0 Jan 11, 2021
@davidbiancolin davidbiancolin modified the milestones: 1.12.0, 1.13.0 May 25, 2021
@davidbiancolin
Contributor

For 1.13.0, add an elaboration-time error if ID reuse is enabled under a model that may reorder across IDs.

Longer-term fix: limit inflight requests in the transaction scheduler.

@davidbiancolin davidbiancolin linked a pull request Jan 27, 2022 that will close this issue
@davidbiancolin davidbiancolin modified the milestones: 1.13.0, 1.14.0 Feb 17, 2022
@davidbiancolin davidbiancolin modified the milestones: 1.14.0, 1.15.0 Jun 7, 2022