
arch-riscv: remove branching from vset{i}vl{i} instruction and allow registers vl and vtype to be renamed #885

Open
saul44203 opened this issue Feb 21, 2024 · 6 comments
Labels: arch-riscv (The RISC-V ISA)

@saul44203 commented Feb 21, 2024

Currently the vset{i}vl{i} family of instructions is branch-like. This works well for some applications, but others, especially those where a vset{i}vl{i} instruction sits inside a loop and changes the vl/vtype state on every iteration, perform poorly due to constant squashing.
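To make the problematic pattern concrete, here is a sketch of a segmented kernel written with the RVV C intrinsics. It is only an illustration (not code from gem5 or from a specific workload), and the intrinsic names are assumed to follow the v1.0 intrinsics API; the point is that the vsetvli at the head of the inner loop writes a different vl on nearly every iteration:

```cpp
#include <riscv_vector.h>  // RVV intrinsics (compile with e.g. -march=rv64gcv)
#include <stddef.h>

// Scale each variable-length segment of a packed array. Because segment
// lengths differ, the vsetvli generated for __riscv_vsetvl_e32m8 writes a
// different vl on nearly every inner-loop iteration, which keeps squashing
// if vset{i}vl{i} is modelled as a branch.
void scale_segments(size_t nseg, const size_t *seg_len, float a, float *data)
{
    for (size_t s = 0; s < nseg; ++s) {
        for (size_t j = 0; j < seg_len[s];) {
            size_t vl = __riscv_vsetvl_e32m8(seg_len[s] - j);   // vl changes per chunk
            vfloat32m8_t v = __riscv_vle32_v_f32m8(data, vl);   // load with current vl
            v = __riscv_vfmul_vf_f32m8(v, a, vl);               // v *= a
            __riscv_vse32_v_f32m8(data, v, vl);                 // store with current vl
            data += vl;
            j += vl;
        }
    }
}
```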

The proposal aims to eliminate the branch-like behavior of the vset{i}vl{i} instructions and instead make the vl and vtype registers renameable, allowing normal data dependencies to be created with subsequent vector instructions in the execution flow.

@adriaarmejach and I discussed two approaches:

  • One where both registers are treated as integer registers.
  • One where both registers belong to a new class of renameable misc registers.

The latter approach seems cleaner (the new register class addition would be similar to the changes in commit fed81f3), but it involves more code overall than the first; a rough sketch of what it might look like follows after this paragraph.
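The sketch below is purely hypothetical: it loosely mirrors how gem5 enumerates register classes in src/cpu/reg_class.hh, and the new names (VecConfRegClass, VECCONF_VL, and so on) are made up for illustration rather than taken from commit fed81f3.

```cpp
// Hypothetical sketch only: where a new renameable register class for the
// vector-configuration state could slot in. The surrounding enumerators
// roughly follow gem5's RegClassType; the new names below are assumptions.
enum RegClassType
{
    IntRegClass,
    FloatRegClass,
    VecRegClass,
    VecElemClass,
    VecPredRegClass,
    VecConfRegClass,   // hypothetical: renameable vl/vtype would live here
    CCRegClass,
    MiscRegClass,      // remaining, non-renameable misc/CSR state
    InvalidRegClass
};

// The RISC-V ISA description would then declare vl and vtype as members of
// the new class, so vset{i}vl{i} writes them as ordinary renamed destination
// registers and later vector instructions read them as ordinary sources.
enum VecConfRegIndex { VECCONF_VL, VECCONF_VTYPE, NUM_VECCONF_REGS };
```

With something along these lines, the dependence from a vset{i}vl{i} to a following vector instruction becomes a normal RAW dependence through the renamed vl/vtype entries.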

Open to discussion.

@powerjg (Contributor) commented Feb 21, 2024

I'm going to push back on this solution because the spec says that vset{i}vl{i} is a control dependency, not a data dependency. See https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc#9-vector-memory-consistency-model

Correct me if I'm not interpreting the spec correctly :)

@aarmejach (Contributor)

Yes, the spec says vl dependencies are treated as control dependencies, which allows speculation as long as vl remains unchanged. However, this is not the case in many applications, especially sparse kernels that typically change vl often and end up performing worse than their scalar versions.

Having vl as a data dependency is more restrictive and does not break the spec: subsequent vector instructions simply always wait on the vl write instead of being speculated. Some commercial proposals are actually implementing vl dependencies as data dependencies and allow both the vtype and vl CSRs to be renamed, which is bound to perform well both when vl stays constant and in sparse applications.

I understand that gem5 should stay as close to the spec as possible, so if this is not something we want to have upstream, we will keep it internal.

@powerjg (Contributor) commented Feb 23, 2024

It's an intriguing option. I also think looking into speculating on vl would be an interesting direction.

I think I know the answer to this question, but I'll ask anyway: do you think it's feasible to make it a configuration-time parameter whether vl is treated as a control or a data dependence? It would be quite interesting if it were parameterized, similar to how the LSQs can be configured to enforce TSO.

The other direction we can discuss is how to make the ISA spec implementation of these instructions flexible enough to enable both "in core" vector operations (i.e., the vector ops are renamed/executed/etc. in the same pipeline as all other instructions) and a vector co-processor design. That way, we could have different implementations that hit different points in the design space.

I think this needs to be a longer conversation. To have the conversation we could go two ways:

  1. We could discuss it at our next developer meeting.
  2. We could go through a proposal/design document phase.

In other words, I would like to discuss the overall goals/requirements for this model, the possible implementations, and the tradeoffs of those implementations. We could do this either synchronously on Zoom or asynchronously with a lot more writing.

@ivanaamit added the arch-riscv (The RISC-V ISA) label on Mar 5, 2024
@aarmejach (Contributor)

Sorry it took a while to get back to you on this one.

I wanted to understand how many changes would be needed in order to see whether the parametric solution is feasible. As we both anticipated, changes on the ISA definition side are necessary, and making those parametric would not be easy. But I will take another look after we have tested the prototype implementation we have.

The second topic is an interesting one. We are currently focused on scaling our simulations and adding other features (that are being upstreamed), but in the mid term I think this discussion will be of interest. I will try to join a developer meeting (schedule permitting), and we can kick-start this discussion and move on to focused Zoom sessions if necessary.

@ivanaamit (Contributor)

Hi @aarmejach, we have a developer meeting scheduled for next week on April 11th at 4pm UTC. Would you be available to join? If so, how would you prefer to discuss this topic? Are you interested in preparing a presentation with a couple of slides? Thanks.

@aarmejach (Contributor)

Hi @ivanaamit, sure, I'll join the meeting. To be honest, I have not thought much about the "in core" vs. "co-processor" execution model in gem5, as it is not something we are considering at the moment.

Regarding this issue, we have implemented a solution in which the vset{i}vl{i} instructions set the IsQuiesceFetch flag, which stalls the fetching/decoding of subsequent instructions until the vset{i}vl{i} has completed execution (it does not need to be committed, just to finish executing), at which point the pipeline can continue fetching/decoding with the updated vl and vtype in the PC state.

We went down this path as it required few changes and sidesteps the issue that vtype is needed to decode subsequent instructions correctly. So I think it would solve #860.
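As a minimal illustration of where that flag would be set, assuming the flag name from the comment above: the class and constructor below are simplified stand-ins, since in gem5 the vset{i}vl{i} instructions are generated from the RISC-V ISA description rather than hand-written, so this is not a drop-in patch.

```cpp
// Simplified stand-in for the generated vset{i}vl{i} StaticInst. The only
// point illustrated is the flag assignment; IsQuiesceFetch is the flag
// named in the comment above and is not assumed to exist upstream.
class VSetVLI : public RiscvStaticInst
{
  public:
    VSetVLI(ExtMachInst machInst)
        : RiscvStaticInst("vsetvli", machInst, IntAluOp)
    {
        // Stall fetch/decode of younger instructions until this instruction
        // has executed (not necessarily committed), so that they are decoded
        // with the updated vl/vtype carried in the PC state.
        flags[IsQuiesceFetch] = true;
    }
};
```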
