Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Zcb: Code Size Reduction Extension #3362

Open
wants to merge 8 commits into
base: dev
Choose a base branch
from

Conversation

Abdulwadoodd
Copy link

@Abdulwadoodd Abdulwadoodd commented May 15, 2023

Type of change: other enhancement

Impact: API addition (no impact on existing code)

Development Phase: implementation

Background
RISC-V code size reduction extension supersedes the existing standard C (compressed) extension. It divides instructions in the C extension into three sub-extensions (Zca, Zcf, and Zcd) and exploits the reserved encodings in C to add more instructions in three sub-extensions (Zcb, Zcmp and Zcmt).
RocketCore already supports C-extension i.e. Zca, Zcf, and Zcd. The un-privilege spec classifies compressed instructions into three different parts: Quadrant 0, Quadrant 1, and Quadrant 2. This classification is done based on the Opcode of compressed instruction: bits[1:0].

Implementation
Zcb instructions fall into Quadrant 0 and Quadrant 1 having opcodes 2'b00 and 2'b01 respectively. Hence, Implementation is done in two parts:

Part 1:
In part 1, the Opcodes of the instructions belonging to Quadrant 0 are added, and the corresponding 16-bit->32-bit decode logic is hooked up with Q0 logic for the existing instructions. The instructions in this part are the compressed version of I-type load-store instructions: lbu, lh, lhu, sb, sh.

Part 2:
In part 2, the Opcodes of the instructions belonging to Quadrant 1 are added, and the corresponding 16-bit->32-bit decode logic is hooked up with Q1 logic for the existing instructions. The instructions in this part are the has dependency on I-type, M-type, and Bit-manip (Zba, Zbb) instructions: c.zext.b, c.sext.b, c.zext.h, c.sext.h, c.zext.w, c.mul, c.not.

Verification
A simple assembly program is written including all the Zcb instructions to test their functionality. It is available here. It is a bit crude way of testing in the early stage. A plan to develop more mature tests covering the corner cases is in the pipeline which will come as part of riscv-arch tests.

Release Notes
This is one of three PR as part of the GSoC'23 project under the mentorship Hongren Zheng, to add Zc* support in Rocket. This PR adds hardware implementation to support the Zcb extension which is one of the variants of the recently ratified code size reduction extension. RocketCore has the infrastructure for the uncompressed version of Zcb instructions. So, mainly there are changes implemented in RVCDecoder to decode the compressed Zcb instruction into its 32-bit equivalent instructions. The implementation works for both 32-bit and 64-bit architectures.

Acknowledgement
Thanks to @ZenithalHourlyRate for his guidance and help.

@linux-foundation-easycla
Copy link

linux-foundation-easycla bot commented May 15, 2023

CLA Signed

The committers listed above are authorized under a signed CLA.

Copy link
Contributor

@ZenithalHourlyRate ZenithalHourlyRate left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally LGTM. Have not checked the detail of the RTL implementation against the spec. Some other small changes are needed.

First, this should be made against the dev branch, it will then be backported the to master branch. @sequencer should we add a contribution guide in the PR template as there are multiple contributors that have fallen into this pitfall?

As for the misc changes, they should be introduced in another PR.

You should also add zcb to the dts. zca should also be included. In the future when you have implemented all Zc*, a zce should be added.

def sd = inst(Cat(ldImm >> 5, rs2p, rs1p, 3.U(3.W), ldImm(4,0), 0x23.U(7.W)), rs2p, rs1p, rs2p)
def sw = inst(Cat(lwImm >> 5, rs2p, rs1p, 2.U(3.W), lwImm(4,0), 0x23.U(7.W)), rs2p, rs1p, rs2p)
def fsd = inst(Cat(ldImm >> 5, rs2p, rs1p, 3.U(3.W), ldImm(4,0), 0x27.U(7.W)), rs2p, rs1p, rs2p)
def fsw = {
if (xLen == 32) inst(Cat(lwImm >> 5, rs2p, rs1p, 2.U(3.W), lwImm(4,0), 0x27.U(7.W)), rs2p, rs1p, rs2p)
else sd
}
Seq(addi4spn, fld, lw, flw, unimp, fsd, sw, fsw)
Seq(addi4spn, fld, lw, flw, zcb_q0, fsd, sw, fsw)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Zcb should be populated with conditions

For example, c.zext.b should be added only when usingBitManip, and c.mul should be populated only when usingMulDiv, zext.w should be added only when RV64. This could save a tiny bit of area.

Other insts like c.lbu and c.not will always be added as I or E are always present. This default behavior will add area overhead for end users as they may not use it. I think we provide an option like usingCompressedBitManip to control this behavior. Asking @jerryz123 for advice.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A mistake. Should @jerryz123.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Parameters to the RVCDecode object should be added that disable the Zcb/Zcmp/Zcmt extensions if not needed.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @jerryz123 , @ZenithalHourlyRate
In order to enable/disable Zcb, usingCompressedSuiteB parameter is added and routed (along with usingMulDiv and usingBitManip) to RVCDecoder to apply the condition on the corresponding instructions.

src/main/scala/rocket/Instructions.scala Show resolved Hide resolved
@Abdulwadoodd Abdulwadoodd changed the base branch from master to dev May 17, 2023 22:14
@Abdulwadoodd Abdulwadoodd changed the base branch from dev to master May 17, 2023 22:15
@Abdulwadoodd Abdulwadoodd force-pushed the Zcb_dev branch 2 times, most recently from 4fd9882 to 245c770 Compare May 21, 2023 03:11
@Abdulwadoodd Abdulwadoodd changed the base branch from master to dev May 21, 2023 03:11
- Zcb Instructions with `OP[1:0] = 2'b0` added.
- These are Compressed version of following I-type instructins.
	- lbu
	- lhu
	- lh
	- sb
	- sh
- These instructions lies in Quadrant 0 based on compressed Opcode.

Signed-off-by: Abdul Wadood <abdulwadood.afzal88@gmail.com>
- Implement the 16 to 32 bit Decoder for `Zcb` Instructions having OP[1:0] == 00
- This decodes the following compressed instructions into its 32-bit equivalent instructions.
	c.lbu, c.lhu, c.lh, c.cb, c.sh

Signed-off-by: Abdul Wadood <abdulwadood.afzal88@gmail.com>
- Zcb Instructions with `OP[1:0] = 2'b01` added.
- These instructions lies in Q1 of Compressed Instructions.

Signed-off-by: Abdul Wadood <abdulwadood.afzal88@gmail.com>
- Implement the 16 to 32 bit Decoder for `Zcb` Instructions having OP[1:0] = 2'b01
- The 16 bit -> 32 bit instruction mapping is given as follows:
	c.zext.b -> andi
	c.sext.b -> sext.b   (require Zbb)
	c.zext.h -> zext.h   (require Zbb)
	c.sext.h -> sext.h   (require Zbb)
	c.zext.w -> add.uw   (require Zba, only for RV64)
	c.not    -> xori
	c.mul    -> mul      (Require M standard extension)

Signed-off-by: Abdul Wadood <abdulwadood.afzal88@gmail.com>
@Abdulwadoodd
Copy link
Author

Hi @jerryz123 @ZenithalHourlyRate

  • Comments are added in RVC.scala describing the purpose of added parameters in RVCDecoder.
  • zca, zcb, zcd, zcf are updated in DTS.
  • After rebase, updates the base branch from master to dev.

Please let me know if additional changes are required.

Copy link
Contributor

@ZenithalHourlyRate ZenithalHourlyRate left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy with the non-circuit part now. Need more time to check the circuit part.

src/main/scala/rocket/RVC.scala Outdated Show resolved Hide resolved
if(usingCompressedSuiteB){
def zextb = inst(Cat(0xff.U, rs1p, 7.U(3.W), rs1p, 0x13.U(7.W)), rs1p, rs1p, rs2p)
def not = inst(Cat(0xFFF.U, rs1p, 4.U(3.W), rs1p, 0x13.U(7.W)), rs1p, rs1p, rs2p)
def zexth64 = inst(Cat(0x80.U, rs1p, 4.U(3.W), rs1p, 0x3B.U(7.W)), rs1p, rs1p, rs2p)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: I believe this should defined in zexth?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I followed the existing convention as a selection between fs and sd based on the xlen.
Anyways, it is updated as you suggested

Option.when(tileParams.core.useCompressed)(Seq("zca")) ++
Option.when(tileParams.core.useCompressedSuiteB)(Seq("zcb")) ++
Option.when(tileParams.core.useCompressed && tileParams.core.fpu.nonEmpty && tileParams.core.fpu.get.fLen > 32)(Seq("zcd")) ++
Option.when(tileParams.core.useCompressed && tileParams.core.fpu.nonEmpty)(Seq("zcf")) ++
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes currently for rocket when fpu.nonEmpty F-ext will always be enabled

val insns = (minFLen, fLen) match {
case (32, 32) => f
case (16, 32) => h ++ f
case (32, 64) => f ++ d
case (16, 64) => h ++ f ++ d ++ fcvt_hd

But the consistency is not kept everywhere, e.g. misa has a different way

val f = if (tileParams.core.fpu.nonEmpty) "f" else ""

(if (fLen >= 32) "F" else "") +

To reduce such inconsistency, I believe we should finish this TODO? Are you interested in this TODO? If so, things should be done in another PR.

// TODO merge with isaString in CSR.scala

src/main/scala/rocket/RVC.scala Outdated Show resolved Hide resolved
}
inst(Seq(srli, srai, andi, rtype)(x(11,10)), rs1p, rs1p, rs2p)
Seq(srli, srai, andi, rtype)(x(11,10))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Such change is costly. You now have a much wider Mux and the user will have a larger area even the user does not enable usingCompressedSuiteB

Is there any special reason for it? The x0 in zextw?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any special reason for it? The x0 in zextw?

Yes, this is the reason.

the user will have a larger area even the user does not enable usingCompressedSuiteB

I'm a little skeptical about it, and I have following two reasons to back my argument:

  1. As per my (vague) understanding of hardware (Verilog) generation using CHISEL, computation is done in software by keeping into account the parameters, and then the Verilog is generated including (only) that hardware part that was allowed by the parameter.
  2. The above can be verified by observing the generated Verilog (by looking at the reference at the end of each line of generated Verilog that points to the corresponding scala file and its line number). For example, when I build Rocket with usingCompressedSuiteB = false, the generated Verilog simply excludes the corresponding definition of 16->32 instructions. But when I build using usingCompressedSuiteB = true, generated Verilog does include the references of the definition of those instructions which are defined under the if clause of this parameter.

So, this should not increase the area if usingCompressedSuiteB is disabled, unless I'm missing something.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Further details should be checked. There are two steps of selection, static selection like if/elses in scala and dynamic selection like x(11,10) in the circuit.

The DefaultConfig in dev branch

// the final mux
  assign io_out_bits = _GEN_19[_io_out_T_2];
  assign io_out_rd = _GEN_20[_io_out_T_2];
  assign io_out_rs1 = _GEN_21[_io_out_T_2];
  assign io_out_rs2 = _GEN_22[_io_out_T_2];
  assign io_out_rs3 = _GEN_23[_io_out_T_2];
  
// lets check rs1
  wire [31:0][4:0]  _GEN_21 =
  ...
    {io_out_s_12_rs1}, // this coresponds to arith in scala
  ...
  
  assign io_out_s_12_rs1 = {2'h1, io_in[9:7]};

DefaultConfig in this branch

// the final mux
  assign io_out_bits = _GEN_23[_io_out_T_2];
  assign io_out_rd = _GEN_24[_io_out_T_2];
  assign io_out_rs1 = _GEN_25[_io_out_T_2];
  assign io_out_rs2 = _GEN_26[_io_out_T_2];
  assign io_out_rs3 = _GEN_27[_io_out_T_2];
  
// lets check rs1
  wire [31:0][4:0]  _GEN_25 =
  ...
    {io_out_s_12_rd}, // this coresponds to arith in scala
  ...
  
  wire [4:0]        io_out_s_12_rs1 = _GEN_8[io_in[11:10]]; // extra mux
  
  wire [3:0][4:0]   _GEN_8 =
    {{(&_io_out_s_T_235) ? io_out_s_res_4_rs1 : io_out_s_res_5_rs1},
     {io_out_s_res_3_rs1},
     {io_out_s_res_2_rs1},
     {io_out_s_res_1_rs1}};
     
  assign io_out_s_res_4_rs1 = {2'h1, io_in[9:7]};
  assign io_out_s_res_5_rs1 = {2'h1, io_in[9:7]};
  assign io_out_s_res_3_rs1 = {2'h1, io_in[9:7]};
  assign io_out_s_res_2_rs1 = {2'h1, io_in[9:7]};
  assign io_out_s_res_1_rs1 = {2'h1, io_in[9:7]};

Chisel is faithful of generating what you write. It might be argued that the synthesizer will optimize this out, but I think avoiding this from source is a better option.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the detailed explanation, I understand now. Pushed fixes in the latest commit.

}
else unimp
}
Seq(zextb, sextb, zexth, sexth, zextw, not)(x(4,2))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When usingCompressedSuiteB but not usingBitManip, this will become Mux between a lot of unimp. Is there better way to do it?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, not all of these (zextb, sextb, zexth, sexth, zextw, not) become Mux. The problem is that only sextb, zexth, sexth, zextw depends upon the usingBitManip parameter and zextb and not are decoded in I-type instructions. So, if we keep the Seq statement constant in the end, then EITHER sextb, zexth, sexth, zextw needs to be decoded if usingBitManip is enabled OR each one of these must corresponds tounimp if usingBitManip is disabled. Hence, comes up these bad if/else statements.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I meant is something like this

          if(usingCompressedSuiteB){
            def sexth = inst(Cat(0x605.U, rs1p, 1.U(3.W), rs1p, 0x13.U(7.W)), rs1p, rs1p, rs2p)
            }
            def zextw = inst(Cat(4.U, x0, rs1p, 0.U(3.W), rs1p, 0x3B.U(7.W)), rs1p, rs1p, x0)
            }
            def zexth = {
                val zexth_common = Cat(0x80.U, rs1p, 4.U(3.W), rs1p)
                if (xLen == 32) inst(Cat(zexth_common, 0x33.U(7.W)), rs1p, rs1p, rs2p)
                else inst(Cat(zexth_common, 0x3B.U(7.W)), rs1p, rs1p, rs2p)
              }
            }
        
            def zextb = inst(Cat(0xFF.U, rs1p, 7.U(3.W), rs1p, 0x13.U(7.W)), rs1p, rs1p, rs2p)
            def not   = inst(Cat(0xFFF.U, rs1p, 4.U(3.W), rs1p, 0x13.U(7.W)), rs1p, rs1p, rs2p)
            def sextb = inst(Cat(0x604.U, rs1p, 1.U(3.W), rs1p, 0x13.U(7.W)), rs1p, rs1p, rs2p)
            if (usingBitManip) {
              Seq(zextb, sextb, zexth, sexth, zextw, not)(x(4,2))
            } else {
              Mux(x(4,2) === 5.U, not, Mux(x(4,2) === 0.U, zextb, unimp))
            }
          }

And like the comment in arith, try to avoid selecting between a lot of rs1p/rs2p

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the changes as suggested. Thanks

- Added `usingCompressedSuiteB` to enable/disable `Zcb` extension.
- This parameter depends upon existing `usingCompressed` parameter.
- By default `Zcb` is disable in RocketCore.
- Passed this parameter along with `usingBitManip` and `usingMulDiv` to RVCDecoder enable/disable the decoding of corresponding `Zcb` instructions.

Signed-off-by: Abdul Wadood <abdulwadood.afzal88@gmail.com>
Signed-off-by: Abdul Wadood <abdulwadood.afzal88@gmail.com>
Signed-off-by: Abdul Wadood <abdulwadood.afzal88@gmail.com>
Signed-off-by: Abdul Wadood <abdulwadood.afzal88@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants