Jiahan CS 6120 Final Project Blog #423

Open
wants to merge 19 commits into
base: 2023fa

jiahanxie353
Contributor

Closes #410

In this final project, I worked with @michaelmaitland on supporting LLVM GlobalISel for the RISC-V vector extension, covering a subset of the vector ALU operations.


# Introduction

The open [RISC-V instruction set architecture (ISA)](https://riscv.org/technical/specifications/) has an interesting extension, [the RISC-V "V" Vector Extension](https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc). The unique part of the RISC-V Vector Extension is that its vector registers have flexible widths, VLEN, which makes programming in the RISC-V Vector Extension agnostic to the vector register sizes. This feature really distinguishes the RISC-V Vector Extension from traditional SIMD extensions, such as [x86 SSE](https://en.wikipedia.org/wiki/Streaming_SIMD_Extensions) (with a fixed 128-bit vector length), [AVX](https://en.wikipedia.org/wiki/Advanced_Vector_Extensions) (256-bit), and [Arm NEON](https://developer.arm.com/Architectures/Neon) (128-bit). Increasing vector lengths pose a challenge to traditional SIMD extensions, as they have to address compatibility and support all existing fixed-size vector lengths in their ISAs. In contrast, under the vector-length-agnostic principle, binary code generated from RISC-V assembly is automatically portable between different CPUs.
Contributor

RISC-V folks more commonly refer to VL as the unique part.

> The vl register holds an unsigned integer specifying the number of elements to be updated with results from a vector instruction

This is in contrast to many SIMD approaches, where the instruction mnemonic and the register size impose a fixed number of elements processed. For example, the Arm NEON VADD.

On a given hardware implementation, the widths of registers in RISC-V are flexible due to LMUL. This is also unique to RISC-V.

The VLEN, however, is fixed for that hardware implementation.


Contributor

If a processor implements the RISC-V V extension, the vector code is portable to other CPUs that support the V extension. Similarly, if a processor implements, say, the Arm NEON extension, then the NEON code is portable to other Arm processors that support the NEON extension.

The real power of vector length agnostic is when it comes to loops. Take the following example:

for (i = 0; i < N; i++)
  A[i] = B[i] + C[i]

On RISC-V, we can generate the following (pseudo) code

ph:
  i = 0
body:
  vsetvli (N-i) // set the VL as N-i, or VLMAX if N-i is too large for the hardware to process
  vadd.vv a_ptr, b_ptr, c_ptr
  i += VL // increment i by the number of elements processed in this iteration
  branch_if_done exit, body // decide whether to exit loop
exit:
  ret

On a SIMD architecture, it is more complicated. You need to have two versions of the loop: the vector loop, and a scalar remainder loop. If N-i is smaller than the number of elements processed in the vector loop, then the scalar remainder loop needs to be executed. What is the best size for the number of elements to be processed in the vector loop? It depends on whether N will often be large or small. You may not know this ahead of time.
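
To make the contrast above concrete, here is a minimal Python sketch that simulates both loop shapes on plain lists. It is not real RVV or SIMD code: the helper names (`vsetvli`, `vla_add`, `fixed_simd_add`) are hypothetical, and `vsetvli` is modeled with the simplified rule `vl = min(AVL, VLMAX)`.

```python
def vsetvli(avl, vlmax):
    """Simplified model of vsetvli (assumption): grant min(avl, vlmax) elements."""
    return min(avl, vlmax)


def vla_add(b, c, vlmax):
    """Vector-length-agnostic loop: a single loop body, no scalar remainder."""
    n = len(b)
    a = [0] * n
    i = 0
    trips = 0
    while i < n:
        vl = vsetvli(n - i, vlmax)  # elements handled in this iteration
        # Stands in for vadd.vv on vl elements.
        a[i:i + vl] = [x + y for x, y in zip(b[i:i + vl], c[i:i + vl])]
        i += vl
        trips += 1
    return a, trips


def fixed_simd_add(b, c, width):
    """Fixed-width SIMD: a full-width vector loop plus a scalar remainder loop."""
    n = len(b)
    a = [0] * n
    i = 0
    while i + width <= n:  # full-width vector iterations only
        a[i:i + width] = [x + y for x, y in zip(b[i:i + width], c[i:i + width])]
        i += width
    remainder = n - i  # elements left for the scalar tail
    while i < n:
        a[i] = b[i] + c[i]
        i += 1
    return a, remainder
```

With 10 elements and a hardware maximum of 4 elements per iteration, the vector-length-agnostic loop runs three times (4, 4, then 2 elements), while the fixed-width version leaves 2 elements to the scalar tail.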

- `SEW`: Selected Element Width (in bits), set dynamically by the programmers. It sets the width/length of a single element in a vector element. Each vector element can compose multiple single element.
- `VLEN`: Vector register LENgth (in bits). The number of bits in a single vector register. It is hardware dependent.
- `VL`: Vector element Length (in bits) that the programmers actually deal with, which can be treated as the vector operation building blocks. It defines how many elements the vector operations will execute.
- `LMUL`: The vector Length MULtiplier. It is used for grouping vector registers. It is a power of 2 and it ranges from 1/8 to 8. For instance, when `LMUL=8`, only the indices `v0`, `v8`, `v16`, and `v24` are allowed to be used; for example, group `v8` comprises the 8 vector registers `v8`, `v9`, ..., `v15`. Note that it can also be fractional because sometimes we want to use only part of a vector register.
Contributor

Technically the ABI imposes the fact that v0, v8, v16 can be used, not the RISC-V Vector spec. You could have used v1, v9, v17. The reason the ABI does it the way it does is so that v17-v31 can be used in the caller/callee save scheme. The important part here is that on LMUL 8, the register vN acts as a grouping of vN, vN+1, ..., vN+7 registers. That forces us to partition what registers can be passed and preserve the expected semantics of the instruction.

Owner

@michaelmaitland, just to clarify: one important constraint here (aside from the ABI) is that, when LMUL=8, you are not allowed to use v1, v2, v9 at all, right? That is, the only "allowed" registers are at indices $\mathit{LMUL} \cdot k$ for natural numbers $k$? (Otherwise accessing these unaligned registers is an error?)

Contributor

That is correct. According to the spec:

> When LMUL=2, the vector register group contains vector register v*n* and vector register v*n*+1, providing twice the vector length in bits. Instructions specifying an LMUL=2 vector register group with an odd-numbered vector register are reserved.
>
> When LMUL=4, the vector register group contains four vector registers, and instructions specifying an LMUL=4 vector register group using vector register numbers that are not multiples of four are reserved.
>
> When LMUL=8, the vector register group contains eight vector registers, and instructions specifying an LMUL=8 vector register group using register numbers that are not multiples of eight are reserved.

I think my statement above that "You could have used v1, v9, v17" is incorrect by this part of the spec.
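
The alignment rule quoted from the spec can be sketched as a small check. This is an illustration only: `register_group` is a hypothetical helper, not part of LLVM or any RISC-V tool, and it covers only the integer `LMUL` values discussed here.

```python
def register_group(base, lmul):
    """Return the registers in the group that starts at v<base> for an integer LMUL.

    Raises ValueError when the base register number is not a multiple of LMUL,
    mirroring the "reserved" encodings in the spec text quoted above.
    """
    if lmul not in (1, 2, 4, 8):
        raise ValueError(f"unsupported integer LMUL: {lmul}")
    if not 0 <= base < 32:
        raise ValueError(f"no such vector register: v{base}")
    if base % lmul != 0:
        raise ValueError(f"v{base} is not aligned for LMUL={lmul} (reserved encoding)")
    return [f"v{base + k}" for k in range(lmul)]
```

For example, `register_group(8, 8)` yields `v8` through `v15`, while `register_group(9, 8)` is rejected.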

Owner

Got it; thanks!

The complete chart can be found in [this `RISCV/RISCVRegisterInfo.td` file](https://github.com/llvm/llvm-project/blob/75d6795e420274346b14aca8b6bd49bfe6030eeb/llvm/lib/Target/RISCV/RISCVRegisterInfo.td). And note that `MF` stands for fractional `LMUL` and `M`s are integer `LMUL`s.

Some values are `None` because currently RISC-V vectors assume `VLEN=64`. Take the combination (`MF8`, `i16`) as an example. If we were to write it in terms of LLVM scalable vectors, it would be `nxv1/2i16` ((64 x 1/8) / 16 = 1/2), which is illegal. Now consider a legal (`LMUL`, `SEW`) combination: (`M4`, `i32`). Since `VLEN` = 64 and `SEW` = 32, there are 2 basic block elements in a single vector element. And since the grouping factor is 4, there are 2*4 = 8 elements in total, hence `nxv8i32 == <vscale x 8 x i32>`.
Contributor

ditto about basic block being confusing here.
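
The arithmetic in the quoted paragraph can be written out as a short sketch. It assumes the 64-bit block size stated in the text; the names `RVV_BITS_PER_BLOCK`, `LMULS`, and `scalable_type` are chosen for illustration and are not taken from the LLVM sources. The authoritative chart is the linked `RISCVRegisterInfo.td` file.

```python
from fractions import Fraction

RVV_BITS_PER_BLOCK = 64  # the VLEN=64 assumption stated in the text

# MF* are fractional LMULs, M* are integer LMULs.
LMULS = {
    "MF8": Fraction(1, 8),
    "MF4": Fraction(1, 4),
    "MF2": Fraction(1, 2),
    "M1": Fraction(1),
    "M2": Fraction(2),
    "M4": Fraction(4),
    "M8": Fraction(8),
}


def scalable_type(lmul_name, sew):
    """Map an (LMUL, SEW) pair to an LLVM scalable vector type string, or None."""
    count = RVV_BITS_PER_BLOCK * LMULS[lmul_name] / sew
    if count.denominator != 1:
        return None  # a fractional element count has no legal scalable type
    return f"<vscale x {count.numerator} x i{sew}>"
```

Under these assumptions, (`MF8`, `i16`) has no legal type because the element count would be 1/2, and (`M4`, `i32`) gives `<vscale x 8 x i32>`, matching the two cases worked through in the text.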


Let's recall the result produced by the Register Bank Select pass: `%vc:vrb(<vscale x 1 x i8>) = G_ADD %va:vrb(<vscale x 1 x i8>), %vb:vrb(<vscale x 1 x i8>)`. We'd like to replace the generic `G_ADD` instruction with the corresponding MIR of the RISC-V vector add instruction. Please note that I said "the corresponding MIR" because we will not be generating the actual RISC-V `vadd` or `vsetvli` in the current pass: instruction selection transforms code into target-specific MIR/machine instructions. Later down the pipeline, the `RISCVInsertVSETVLI` pass, for example, will be executed. Additionally, the `RISCVAsmPrinter` will translate MIR into MCInst at a later stage, representing the final assembly language form. With that being said, what we actually want to get out of the instruction selection pass has this form: `%vc:vr = PseudoVADD_VV_MF8 %va, %vb, -1, 3 /* e8 */, 3 /* ta, ma */`, where `PseudoVADD_VV_MF8` is the vector instruction pseudo for vector-vector add with `LMUL` = 1/8; the -1 is the `VL` operand and means `VLMAX`; the first 3 stands for `SEW`, as log2(8) = 3; and the second 3 is the encoding for the tail-agnostic, mask-agnostic policy. RISC-V vector instruction pseudos in LLVM are essentially used for efficiently handling the complex, `vtype`-dependent behavior of vector instructions, such as in register allocation.
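
The operand layout described above can be reproduced with a small formatter. This is only a sketch: the helper `vadd_vv_pseudo` is hypothetical, and the policy bit values (tail agnostic = 1, mask agnostic = 2) are an assumption chosen to be consistent with the `3 /* ta, ma */` encoding quoted in the text.

```python
TAIL_AGNOSTIC = 1  # assumed bit value, consistent with "3 /* ta, ma */"
MASK_AGNOSTIC = 2  # assumed bit value, consistent with "3 /* ta, ma */"


def vadd_vv_pseudo(dst, lhs, rhs, sew, lmul_suffix, vl=-1, ta=True, ma=True):
    """Format a PseudoVADD_VV_<LMUL> line in the shape shown in the text.

    vl=-1 stands for VLMAX; SEW is encoded as log2(SEW).
    """
    if sew not in (8, 16, 32, 64):
        raise ValueError(f"unsupported SEW: {sew}")
    log2_sew = sew.bit_length() - 1  # exact log2 for powers of two
    policy = (TAIL_AGNOSTIC if ta else 0) | (MASK_AGNOSTIC if ma else 0)
    policy_note = f"{'ta' if ta else 'tu'}, {'ma' if ma else 'mu'}"
    return (
        f"%{dst}:vr = PseudoVADD_VV_{lmul_suffix} %{lhs}, %{rhs}, "
        f"{vl}, {log2_sew} /* e{sew} */, {policy} /* {policy_note} */"
    )
```

Calling `vadd_vv_pseudo("vc", "va", "vb", 8, "MF8")` reproduces the line from the text, which makes it easy to see which operand carries `VL`, which carries `SEW`, and which carries the policy.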

Implementation-wise, the [`select` function](https://llvm.org/doxygen/classllvm_1_1InstructionSelector.html#a50058a922d4f75ed765c34742c5066c6) is invoked, which in turn calls [the corresponding RISC-V `selectImpl` function](https://github.com/llvm/llvm-project/blob/d96f46dd20157be9c11e16d8bdd3ebf900df41fc/llvm/lib/Target/RISCV/GISel/RISCVInstructionSelector.cpp#L56). To achieve this final phase, there are essentially four steps to take. First, we identify the vectorized opcode/gMIR; then we create the lowered version of that gMIR using the vector instruction pseudos; next, we erase the old instruction once the lowered version has been picked; finally, we choose a register from the register bank. For this final phase in GlobalISel, LLVM TableGen might generate `selectImpl` and we can use it out-of-the-box; otherwise, we need to implement extra logic to customize the selection pass outlined above. Luckily, TableGen does pick it up and we only need to implement some helper functions to mesh everything together.
Contributor

We don't actually choose a register from the register bank. We already have the register. In the case above, it's the virtual register %vc. What we're really doing is constraining the type of this virtual register so that register allocation has enough information to convert the virtual register into a physical register.


Contributor

LLVM TableGen always generates selectImpl. The "might" part has to do with whether there are TableGen patterns that exist to help us go from MachineInstr with GlobalISel GenericOpcode -> SelectionDAG SDNode with SelectionDAG ISD opcode -> MachineInstr with RISC-V opcode.

SelectionDAG implements the ISD -> MachineInstr with RISC-V opcode transformation. In this case G_ADD maps pretty well onto ISD::ADD. So GlobalISel defines an equivalence.

As we discussed before, it would be better if we didn't go through SelectionDAG at all, but this approach makes it easy for architectures to onboard onto GISel, and hopefully in the future we can remove going through SelectionDAG entirely.


# Were You Successful?

This project is a success, and we are among the first developers to support GlobalISel for the RISC-V vector extension.
Contributor

And the first developers to support scalable vectors in GISEL for any target.

Owner

@sampsyo left a comment

Thanks for the very detailed and technical writeup! This sounds like a tremendous amount of work, and I'm glad you were able to make some progress. I have marked a few places where the post could be a tiny bit clearer, especially for outsiders who do not know that much about RVV.

One high-level ingredient I'd be interested in, if you feel like adding it (not a requirement): what fundamentally made this project interesting/difficult? What I mean is that you have identified several challenges that come from two sources:

  1. RVV stuff is hard to think about
  2. GlobalISel is somewhat complicated to extend

…but is there anything specific to the intersection between those two things? Like, is there anything about RVV in particular that makes global instruction selection harder? Or is this project approximately equal to the sum of its parts?



# What Was the Goal?
Owner

You don't need to include the bullet points from the syllabus as section headers in your blog post. This would flow a bit more naturally if you just transitioned directly into a recap of the goal as part of the intro. "In this project, our goal was to…"

The goal was to support LLVM Global Instruction Selection for RISC-V Vector Extensions on some ALU operations, such as `vadd`, `vsub`, `vand`, `vor`, and `vxor`.
Owner

Maybe it would be helpful to include a few words (not even a full sentence) here on what global isel is? I know you get into detail below, but a tiny overview here might help people understand the post's context.

The most important parameters are undoubtedly `SEW`, `VLEN`, `VL`, and `LMUL`; and one of the most interesting and powerful instructions is `vset{i}vl{i}`.

Let's begin with the crucial parameters:
- `SEW`: Selected Element Width (in bits), set dynamically by the programmers. It sets the width/length of a single element in a vector element. Each vector element can compose multiple single element.
Owner

I'm not sure what this means: "Each vector element can compose multiple single element." These sound like the same thing?


- `VLEN`: Vector register LENgth (in bits). The number of bits in a single vector register. It is hardware dependent.
Owner

You might want to clarify "it is hardware dependent" here… earlier, you have said that this is flexible ("its vector registers have flexible widths, VLEN"), but here it seems to imply this is a hardware parameter. Maybe clarify which one is right?

- `VL`: Vector element Length (in bits) that the programmers actually deal with, which can be treated as the vector operation building blocks. It defines how many elements the vector operations will execute.
Owner

Can you clarify whether this must be less than or equal to VLEN?

Also, this bullet point seems to say that it is both in bits and in elements ("how many elements"). Can you clarify which is the actual quantity and which is implied?




# What Were the Hardest Parts?
Owner

One more chance to retitle your sections to be more meaningful to the outside world.


Definitely learning the whole LLVM and its GlobalISel infrastructure, and it was also hard to understand the vector length agnostic features/instructions in the RISC-V vector extension.
Owner

Can you make this a complete sentence? Again, imagine that the reader is someone who cares about the topic but does not care that this was a 6120 course project.


Learning the RISC-V vector extension was also a headache at the beginning because I had to figure out the difference between the RISC-V vector extension and standard SIMD vector instructions. Learning the semantics of `vsetvli`, differentiating the concepts of `ELEN` and `VLEN`, and understanding how `SEW`, `LMUL`, and `VLMAX` come into play was also confusing.

# Were You Successful?
Owner

Maybe this doesn't need a separate section heading from the above discussion?

@jiahanxie353
Contributor Author

Thanks for all the detailed and constructive feedback @sampsyo @michaelmaitland !

I tried my best to answer the questions. Please let me know if anything needs further support or clarification!

Contributor

@michaelmaitland left a comment

A small round with minor verbiage changes to be more specific. I think this is the last round of changes I have before approval.


# Introduction

The open [RISC-V instruction set architecture (ISA)](https://riscv.org/technical/specifications/) has an interesting extension, [the RISC-V "V" Vector Extension](https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc). The unique part of the RISC-V Vector Extension is that its vector instructions can deal with flexible vector lengths, VL, which makes programming in the RISC-V Vector Extension agnostic to the vector register sizes. This feature really distinguishes the RISC-V Vector Extension from traditional SIMD extensions, such as [x86 SSE](https://en.wikipedia.org/wiki/Streaming_SIMD_Extensions) (with a fixed 128-bit vector length), [AVX](https://en.wikipedia.org/wiki/Advanced_Vector_Extensions) (256-bit), and [Arm NEON](https://developer.arm.com/Architectures/Neon) (128-bit). Traditional SIMD extensions with fixed vector lengths face challenges when dealing with changing data sizes. They must maintain compatibility and support all existing fixed-size vector lengths in their instruction set architectures. This often leads to inefficiencies, especially in loop operations where the data size/loop stride may not align perfectly with the fixed vector size, necessitating additional scalar processing for the remaining elements. Moreover, the most suitable number of elements to process in the vector loop is hard to decide ahead of time. In contrast, the RISC-V Vector Extension eliminates this concern with its vector-length-agnostic principle. Particularly in loop scenarios, the RISC-V's ability to adaptively handle varying data sizes stands out. For instance, in a simple loop adding two arrays, the RISC-V can adjust the vector length for each iteration by setting it dynamically. This means it can process as many elements as possible in each pass, depending on the hardware capabilities and the remaining data.
This adaptive approach really simplifies the code by eliminating the need for separate scalar loops for the leftover elements.
Contributor

nit: often -> may


Contributor

the RISC-V -> the RISC-V program



In this project, our goal was to support LLVM Global Instruction Selection (GlobalISel), a framework that operates on whole function for instruction selection, for the RISC-V Vector Extension on some ALU operations, such as `vadd`, `vsub`, `vand`, `vor`, and `vxor`. Apart from adding support for RISC-V vector types and operations for GlobalISel by going down GlobalISel's pipeline, it's a challenge to bridge the LLVM world (concepts like scalable vector) and the RISC-V world (concepts like vector length and vector register grouping factor) together. And we will showcase how we address the challenge in the following sections.
Contributor

function -> functions

The most important parameters are undoubtedly `SEW`, `VLEN`, `VL`, and `LMUL`; and one of the most interesting and powerful instructions is `vset{i}vl{i}`.

Let's begin with the crucial parameters:
- `SEW`: Selected Element Width (in bits), set dynamically by the programmers. It sets the width/length of a single element in a vector element/register. Each vector element can compose `VLEN`/`SEW` single elements.
Contributor

Each vector element can compose `VLEN`/`SEW` single elements. -> Each vector can contain `VLEN`/`SEW` elements.


- `ELEN`: The maximum size in bits of a vector element that any operation can produce or consume.
Contributor

ELEN is missing from list on line 25 above

- `LMUL`: The vector Length MULtiplier. It is used for grouping vector registers. It is a power of 2 ranging from 1/8 to 8. For instance, when `LMUL=8`, the ISA requires that only the register indices `v0`, `v8`, `v16`, and `v24` are used, because, for example, the group `v8` comprises the 8 vector registers `v8`, `v9`, ..., `v15`. Note that `LMUL` can also be fractional, because sometimes we want to use only part of a vector register.
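
To make the interaction of these parameters concrete, here is a minimal Python sketch. The function names are hypothetical and chosen only for illustration. It computes how many elements a register group holds (`VLEN`/`SEW` per register, scaled by `LMUL`) and which register indices may start a group, assuming the 32 vector registers `v0` to `v31`.

```python
from fractions import Fraction


def elements_per_group(vlen, sew, lmul):
    """Elements held by one register group: (VLEN / SEW) * LMUL."""
    return Fraction(vlen) * Fraction(lmul) / sew


def group_base_registers(lmul):
    """Register indices that may start a group; fractional LMUL uses single registers."""
    step = max(1, int(Fraction(lmul)))
    return list(range(0, 32, step))


print(elements_per_group(128, 32, 8))               # 32
print(elements_per_group(128, 32, Fraction(1, 2)))  # 2
print(group_base_registers(8))                      # [0, 8, 16, 24]
```

Using `Fraction` keeps fractional `LMUL` values such as 1/8 exact instead of relying on floating point.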

After introducing these basic and most important parameters, there are still two more parameters we will be dealing with, `AVL` and `VLMAX`:
- `AVL`: Application Vector Length. The application specifies the total number of elements to be processed as a candidate for `VL`.
Contributor

You introduce AVL above when you talk about VL. Maybe add a note above to (see below about AVL) so the reader knows you will explain AVL.

# A Primer on LLVM Global Instruction Selection
LLVM Global Instruction Selection ([GlobalISel](https://llvm.org/docs/GlobalISel/index.html)) is a framework that provides a set of reusable passes and utilities for instruction selection — translation from LLVM IR to target-specific Machine IR (MIR). It is "global" in the sense that it operates on the whole function rather than a single basic block.

GlobalISel is intended to be a replacement for [SelectionDAG](https://llvm.org/docs/CodeGenerator.html#introduction-to-selectiondags) and [FastISel](https://llvm.org/doxygen/classllvm_1_1FastISel.html), to solve performance, granularity, and modularity problems. GlobalISel does not need to introduce a new dedicated IR as in SelectionDAG so GlobalISel can provide faster code generation; GlobalISel operates on a function, whereas SelectionDAG only considers a basic block, losing some global optimization opportunities; in addition, GlobalISel enables more code reuse for instruction selection for different targets.
Contributor

GlobalISel does not need to introduce a new dedicated IR as -> GlobalISel does not introduce a new dedicated IR since it works on the already existing and documented MIR, compared to SelectionDAG which has its own ISD representation.

Also, are you sure that not having a separate IR is a reason for the faster code generation?

Contributor

in addition, GlobalISel enables more code reuse for instruction selection for different targets.

Can you cite your source here?

```
The complete chart can be found in [this `RISCV/RISCVRegisterInfo.td` file](https://github.com/llvm/llvm-project/blob/75d6795e420274346b14aca8b6bd49bfe6030eeb/llvm/lib/Target/RISCV/RISCVRegisterInfo.td). And note that `MF` stands for fractional `LMUL` and `M`s are integer `LMUL`s.

Some values are `None` because the LLVM community currently assumes RISC-V vectors to have `VLEN=64`. Take the combination (`1/8`, `i16`) as an example. If we were to write it in terms of LLVM scalable vectors, it would be `nxv1/2i16` ((64 x 1/8) / 16 = 1/2), which is illegal. Now consider a legal (`LMUL`, `SEW`) combination: (`4`, `i32`). Since `VLEN` = 64 and `SEW` = 32, 64/32 = 2 elements fit in a single vector register. And since the grouping factor is 4, the register group holds 2*4 = 8 elements, hence `nxv8i32 == <vscale x 8 x i32>`.
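
The arithmetic in this paragraph can be checked with a short Python sketch. It is a hand-written illustration, not code taken from LLVM: `BASE_VLEN` and `scalable_type` are hypothetical names, and the 64-bit baseline is the assumption stated above.

```python
from fractions import Fraction

BASE_VLEN = 64  # assumption from the text: types are derived for VLEN = 64


def scalable_type(lmul, sew):
    """Map an (LMUL, SEW) pair to a scalable vector type name, or None if illegal."""
    count = Fraction(BASE_VLEN) * Fraction(lmul) / sew
    if count.denominator != 1:
        return None  # fewer than one whole element per vscale
    return f"nxv{count.numerator}i{sew}"


print(scalable_type(Fraction(1, 8), 16))  # None
print(scalable_type(4, 32))               # nxv8i32
```

The first call reproduces the illegal (`1/8`, `i16`) case, where the element count would be 1/2, and the second reproduces the `nxv8i32` example.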
Contributor

I think some values are N/A, not None

Development

Successfully merging this pull request may close these issues.

Project Proposal: Support LLVM GlobalISel for RISC-V Vector Extension
3 participants