

# **Learn RISC-V CPU Implementation and BSV**

## **(BSV: a High-Level Hardware Design Language)**

Rishiyur S. Nikhil

Bluespec, Inc.



© 2023-2024 Rishiyur S. Nikhil

**DRAFT:** April 8, 2024  
Please do not circulate without the author's permission.



## Acknowledgements: BSV

The original ideas for a rule-based computation model for hardware design were developed at MIT by James Hoe (currently Professor, Carnegie Mellon University) and Arvind (Professor, MIT). The ideas behind CRegs (concurrent registers) were developed later, also at MIT, by Daniel Rosenband and Arvind.

The idea of embedding this computation model in a Haskell-like language (including types, typeclasses, static elaboration computation model, higher-order programming, and monadic static elaboration) was due to Lennart Augustsson.

The embedding in SystemVerilog-like syntax, the technical refinement of the formal semantics of BSV, the extension to multiple clock domains, and the implementation and improvement of the *bsc* compiler are due to several employees (including the author) of Sandburst, Inc. and Bluespec, Inc. since 2003.

Thanks to all the employees of Bluespec, Inc. and to all the users of BSV over the years—commercial, academic and independent—for their feedback and insights on BSV and how to teach it.

Thanks to Bluespec, Inc., for agreeing to open-source BSV with its *bsc* compiler and libraries in 2020.

## Acknowledgements: RISC-V

Thanks to the team from the University of California, Berkeley, led by Professor Krste Asanović, for their tremendous gift to the world: the free and open RISC-V specification, a sophisticated, industrial-strength ISA.

Thanks to Bluespec, Inc., for supporting the author in creating previous RISC-V CPU BSV designs (Flute, Piccolo, and Magritte), C simulators for RISC-V (Cissr V1 and V2) and FPGA-based RISC-V Systems (including Catamaran), all of which have greatly informed the new designs Fife and Drum in this book.



# Short Table of Contents

|                                                                                                                     |       |
|---------------------------------------------------------------------------------------------------------------------|-------|
| <b>Detailed Table of Contents</b>                                                                                   | v     |
| <b>List of Figures</b>                                                                                              | xiii  |
| <b>1 Introduction</b>                                                                                               | 1-1   |
| <b>2 Overview of the RISC-V ISA</b>                                                                                 | 2-1   |
| <b>3 RISC-V interpreters: the Design Space<br/>from Software Functional Simulators to High-Performance Hardware</b> | 3-1   |
| <b>4 BSV: Top-level view of a BSV program</b>                                                                       | 4-1   |
| <b>5 BSV: Combinational circuits for the RISC-V step functions</b>                                                  | 5-1   |
| <b>6 BSV: Struct types, tuples, and<br/>RISC-V: Memory requests and responses</b>                                   | 6-1   |
| <b>7 RISC-V: Core functions for RISC-V ISA execution (used in Drum and Fife)</b>                                    | 7-1   |
| <b>8 BSV: Modules and Interfaces: Registers, Register Files and FIFOs</b>                                           | 8-1   |
| <b>9 RISC-V: Modules for GPRs and CSRs</b>                                                                          | 9-1   |
| <b>10 BSV: FSMs</b>                                                                                                 | 10-1  |
| <b>11 RISC-V: the Drum unpipelined CPU (an FSM)</b>                                                                 | 11-1  |
| <b>12 BSV: Verifying BSV designs</b>                                                                                | 12-1  |
| <b>13 RISC-V: Functional verification of CPUs</b>                                                                   | 13-1  |
| <b>14 BSV: Rules and their Semantics</b>                                                                            | 14-1  |
| <b>15 RISC-V: the Drum unpipelined CPU (using Rules instead of StmtFSM)</b>                                         | 15-1  |
| <b>16 RISC-V: the Fife pipelined CPU: Principles</b>                                                                | 16-1  |
| <b>17 RISC-V: the Fife pipelined CPU code</b>                                                                       | 17-1  |
| <b>18 BSV: Rules and Methods II: Improved performance with CRegs (Concurrent Registers)</b>                         | 18-1  |
| <b>19 RISC-V: Optimizing Drum and Fife</b>                                                                          | 19-1  |
| <b>20 BSV: Suggested further study</b>                                                                              | 20-1  |
| <b>21 RISC-V: Suggested further study</b>                                                                           | 21-1  |
| <b>A Resources: Documents and Tools</b>                                                                             | A-1   |
| <b>B Why BSV?</b>                                                                                                   | B-1   |
| <b>C Glossary</b>                                                                                                   | C-1   |
| <b>D BSV: Importing C/C++ functions into BSV simulations</b>                                                        | D-1   |
| <b>BSV Index</b>                                                                                                    | BIB-1 |
| <b>RISC-V Index</b>                                                                                                 | BIB-1 |
| <b>Bibliography</b>                                                                                                 | BIB-1 |



# Detailed Table Of Contents

| | | |
|---|---|---|
| | <b>Detailed Table of Contents</b> | <b>v</b> |
| | <b>List of Figures</b> | <b>xiii</b> |
| <b>1</b> | <b>Introduction</b> | <b>1-1</b> |
| 1.0.1 | Drum and Fife, the two RISC-V implementations designed in this book | 1-3 |
| 1.0.2 | Drum and Fife source codes | 1-4 |
| 1.0.3 | Additional Resources | 1-4 |
| <b>2</b> | <b>Overview of the RISC-V ISA</b> | <b>2-1</b> |
| 2.1 | Introduction | 2-1 |
| 2.2 | What is an ISA? | 2-1 |
| 2.3 | Why choose RISC-V? | 2-3 |
| 2.4 | Overview of the RISC-V ISA | 2-4 |
| 2.5 | Instruction encodings | 2-6 |
| 2.6 | Unprivileged ISA RV32I | 2-7 |
| 2.6.1 | “Upper Immediate” instructions LUI and AUIPC | 2-7 |
| 2.6.2 | Conditional BRANCH instructions | 2-9 |
| 2.6.3 | LOAD and STORE memory-access instructions | 2-10 |
| 2.6.4 | Register-Register Arithmetic and Logic instructions | 2-10 |
| 2.6.5 | Register-Immediate Arithmetic and Logic instructions | 2-11 |
| 2.6.6 | Unconditional Jump instructions | 2-11 |
| 2.6.7 | FENCE | 2-12 |
| 2.7 | Traps due to illegal instructions and other exceptions, and CSRs | 2-12 |
| 2.7.1 | ECALL and EBREAK instructions, and Interrupts | 2-14 |
| 2.7.2 | CSRRxx instructions | 2-14 |
| 2.7.2.1 | CSRs INSTRET and INSTRETH | 2-15 |
| 2.7.2.2 | CSRs CYCLE and CYCLEH | 2-16 |
| 2.7.2.3 | CSRs TIME and TIMEH, and memory-mapped location MTIME | 2-16 |
| 2.7.2.4 | Measuring CPU performance | 2-16 |
| 2.8 | RV64I differences from RV32I | 2-17 |
| 2.9 | Continued Evolution of the RISC-V ISA (with your contribution?) | 2-18 |
| <b>3</b> | <b>RISC-V interpreters: the Design Space<br/>from Software Functional Simulators to High-Performance Hardware</b> | <b>3-1</b> |
| 3.1 | The RISC-V designs in this book | 3-2 |
| 3.2 | Abstract algorithm for interpreting an ISA | 3-3 |
| 3.2.1 | Memory latency and split-phase memory transactions | 3-4 |
| 3.3 | Plan for the order in which we tackle topics | 3-5 |
| <b>4</b> | <b>BSV: Top-level view of a BSV program</b> | <b>4-1</b> |
| 4.1 | Introduction | 4-1 |
| 4.2 | Packages and files | 4-1 |
| 4.2.1 | What’s in a Package? | 4-2 |
| 4.2.2 | Visibility of names, exports and imports | 4-3 |
| 4.2.3 | Resolving ambiguous imports | 4-4 |
| 4.2.4 | Exporting types abstractly | 4-4 |
| 4.3 | Interface and Module Declarations | 4-4 |
| 4.3.1 | What’s in an interface declaration? | 4-4 |
| 4.3.2 | What's in a module declaration? | 4-5 |
| 4.4 | Rules and Interface Definitions | 4-6 |
| 4.4.1 | What's in a rule? | 4-6 |
| 4.4.2 | What's in an interface definition? | 4-6 |
| 4.5 | Static Elaboration and Hardware Module Structure | 4-7 |
| 4.5.1 | Module interaction <i>via</i> methods | 4-8 |
| 4.6 | Conclusion | 4-9 |
| <b>5</b> | <b>BSV: Combinational circuits for the RISC-V step functions</b> | <b>5-1</b> |
| 5.1 | Introduction | 5-1 |
| 5.2 | Bit Vectors | 5-2 |
| 5.2.1 | Built-in Operators on Bit Vectors | 5-2 |
| 5.3 | Integer types | 5-3 |
| 5.4 | Hexadecimal and Binary Notation for literal integers | 5-4 |
| 5.5 | Boolean values | 5-4 |
| 5.5.1 | Caution: <code>Bool</code> and <code>Bit#(1)</code> are different types | 5-5 |
| 5.5.2 | Example: recognizing legal RISC-V BRANCH instructions | 5-5 |
| 5.5.3 | Combinational circuits and primitives | 5-6 |
| 5.5.3.1 | Combinational circuits have no side-effects (are “pure”) | 5-6 |
| 5.6 | Functions | 5-7 |
| 5.6.1 | Pure functions vs. functions with side-effects (<code>Action</code>, <code>ActionValue</code>) | 5-7 |
| 5.6.2 | Combinational circuits = “doesn’t have <code>Action</code> or <code>ActionValue</code> type” | 5-8 |
| 5.6.3 | Using <code>ActionValue</code> on pure functions for <code>\$display</code> debugging | 5-9 |
| 5.7 | A small testbench to test our code | 5-9 |
| 5.8 | <code>enum</code> types | 5-11 |
| 5.8.1 | <code>deriving (Bits)</code> | 5-12 |
| 5.8.2 | <code>deriving (Eq)</code> | 5-12 |
| 5.8.3 | <code>deriving (FShow)</code> | 5-12 |
| 5.9 | Syntax of Identifiers | 5-12 |
| 5.10 | Syntax of comments | 5-13 |
| 5.11 | if-then-else statements and hardware multiplexers | 5-13 |
| 5.11.1 | Parallel multiplexers and MUX synthesis | 5-15 |
| 5.12 | Case-expressions | 5-17 |
| 5.13 | Sharing code for RV32 and RV64 <i>via</i> parameterization | 5-17 |
| 5.13.1 | Numeric types | 5-18 |
| 5.13.2 | Type synonyms | 5-18 |
| 5.13.3 | The numeric value corresponding to a numeric type | 5-18 |
| 5.13.4 | Conditional compilation | 5-19 |
| <b>6</b> | <b>BSV: Struct types, tuples, and RISC-V: Memory requests and responses</b> | <b>6-1</b> |
| 6.1 | RISC-V: structs communicated between steps | 6-1 |
| 6.2 | BSV: struct types | 6-2 |
| 6.2.1 | Creating struct values | 6-4 |
| 6.2.2 | Don’t-care values | 6-4 |
| 6.2.3 | Selecting struct fields | 6-5 |
| 6.2.4 | Updating struct fields using assignment | 6-5 |
| 6.3 | BSV: Tuples and the <code>match</code> statement | 6-5 |
| 6.4 | RISC-V: Memory Requests and Responses; IMem and DMem | 6-6 |
| 6.4.1 | Separation of IMem and DMem (Harvard Architecture) | 6-7 |
| 6.4.2 | Memory Requests | 6-7 |
| 6.4.3 | Address Alignment | 6-8 |
| 6.4.4 | Memory Responses | 6-9 |
| <b>7</b> | <b>RISC-V: Core functions for RISC-V ISA execution (used in Drum and Fife)</b> | <b>7-1</b> |
| 7.1 | Introduction | 7-1 |
| 7.2 | The function <code>fn_Fetch</code> | 7-1 |
| 7.3 | The <code>fn_Decode</code> function | 7-3 |
| 7.4 | The <code>fn_Dispatch</code> function after reading input registers | 7-7 |
| 7.5 | The <code>fn_EX_Control</code> function | 7-11 |
| 7.6 | The <code>fn_EX_Int</code> function | 7-13 |
| 7.7 | No separate functions for Execute DMem and Retire | 7-17 |
| <b>8</b> | <b>BSV: Modules and Interfaces: Registers, Register Files and FIFOs</b> | <b>8-1</b> |
| 8.1 | Introduction | 8-1 |
| 8.2 | Modules: state, interfaces and behavior | 8-1 |
| 8.2.1 | Internal behavior (<i>rules</i>) | 8-2 |
| 8.2.2 | Interface declarations | 8-2 |
| 8.2.2.1 | Hardware for an interface | 8-3 |
| 8.2.3 | Module definitions | 8-4 |
| 8.2.4 | Module instantiation and method invocation | 8-4 |
| 8.3 | BSV Library Modules: Registers | 8-5 |
| 8.3.1 | <code>Reg#(t)</code>, the register interface from the BSV library | 8-5 |
| 8.3.2 | Registers are strongly typed | 8-5 |
| 8.3.3 | <code>mkReg(v)</code>, a register module (constructor) from the BSV library | 8-5 |
| 8.3.4 | Syntactic shorthands for register access | 8-6 |
| 8.4 | BSV Library Modules: Register files | 8-7 |
| 8.4.1 | The register file interface <code>RegFile#(index_t,data_t)</code> from the BSV library | 8-7 |
| 8.4.2 | <code>mkRegFileFull</code>, a register file module (constructor) from the BSV library | 8-7 |
| 8.5 | BSV Library Modules: FIFOs | 8-8 |
| 8.5.1 | <code>FIFO#(t)</code>, the FIFO interface from the BSV library | 8-8 |
| 8.5.1.1 | <code>pop</code>: a useful function combining <code>first</code> and <code>deq</code> | 8-8 |
| 8.5.2 | <code>mkFIFO</code>, a FIFO module (constructor) from the BSV library | 8-9 |
| 8.5.3 | FIFOs are strongly typed | 8-10 |
| 8.5.4 | Semi-FIFO interfaces for each end of a FIFO | 8-10 |
| 8.5.5 | Interface-transformer functions | 8-11 |
| 8.5.6 | Connecting FIFOs | 8-11 |
| 8.5.7 | <code>mkPipelineFIFO</code> and <code>mkBypassFIFO</code>: constructors from the BSV library | 8-13 |
| 8.6 | Polymorphic and Monomorphic Types | 8-13 |
| 8.6.1 | Polymorphic Modules and Synthesizability into Verilog | 8-14 |
| <b>9</b> | <b>RISC-V: Modules for GPRs and CSRs</b> | <b>9-1</b> |
| 9.1 | Introduction | 9-1 |
| 9.2 | A register file for GPRs, with special treatment of <code>x0</code> | 9-1 |
| 9.2.1 | Inlined <i>vs.</i> separate module <code>mkGPRs</code> | 9-2 |
| 9.3 | A register file for RISC-V CSRs | 9-3 |
| <b>10</b> | <b>BSV: FSMs</b> | <b>10-1</b> |
| 10.1 | Introduction | 10-1 |
| 10.1.1 | Sequential FSMs, Concurrent FSMs, and Digital Hardware | 10-2 |
| 10.2 | Rules and StmtFSM in BSV | 10-3 |
| 10.3 | Actions and the <code>Action</code> type | 10-4 |
| 10.3.1 | <code>Action</code> blocks: composing actions into larger actions | 10-4 |
| 10.3.2 | Binding names in <code>Action</code> blocks | 10-5 |
| 10.4 | <code>StmtFSM</code>: sequences of actions | 10-6 |
| 10.5 | <code>StmtFSM</code>: conditionals (if-then-else) | 10-6 |
| 10.6 | <code>StmtFSM</code>: while-loops | 10-6 |
| 10.7 | <code>StmtFSM</code>: pausing until some condition holds | 10-7 |
| 10.8 | <code>StmtFSM</code>: <code>mkAutoFSM</code>: a simple FSM module constructor | 10-7 |
| 10.9 | <code>StmtFSM</code> in testbenches | 10-7 |
| 10.10 | <code>StmtFSM</code>: many more features | 10-8 |
| <b>11</b> | <b>RISC-V: the Drum unpipelined CPU (an FSM)</b> | <b>11-1</b> |
| 11.1 | Introduction | 11-1 |
| 11.2 | The Drum CPU module interface | 11-1 |
| 11.3 | The Drum CPU module | 11-2 |
| 11.4 | Help-functions for the Drum CPU module behavior | 11-4 |
| 11.5 | The main behavior actions in the Drum CPU module | 11-5 |
| 11.5.1 | FSM action for Fetch | 11-5 |
| 11.5.2 | FSM action for Decode | 11-6 |
| 11.5.3 | FSM action for Dispatch | 11-6 |
| 11.5.4 | FSM actions for Execute and Retire | 11-6 |
| 11.5.4.1 | FSM actions in Direct flow of Execute and Retire | 11-7 |
| 11.5.4.2 | Counting retired instructions in CSR <code>minstret</code> | 11-9 |
| 11.5.4.3 | FSM actions in Execute and Retire Control flow | 11-9 |
| 11.5.4.4 | FSM actions in Execute and Retire Integer flow | 11-10 |
| 11.5.4.5 | FSM actions in Execute and Retire DMem flow | 11-11 |
| 11.5.5 | FSM actions for exceptions | 11-12 |
| 11.6 | The Drum CPU module behavior | 11-12 |
| 11.7 | Conclusion | 11-14 |
| 11.7.1 | But Drum code looks just like C!? Why not code it in C? | 11-14 |
| <b>12</b> | <b>BSV: Verifying BSV designs</b> | <b>12-1</b> |
| 12.1 | Introduction | 12-1 |
| 12.2 | BSV: Testbenches and DUTs | 12-1 |
| 12.3 | BSV: “printf”-style Debugging | 12-2 |
| 12.3.1 | FShow for “pretty-printing” enums and structs | 12-3 |
| 12.3.2 | Fmt formatted values | 12-3 |
| 12.4 | BSV: Dynamic assertions | 12-4 |
| 12.5 | BSV: Waveform-style debugging | 12-5 |
| <b>13</b> | <b>RISC-V: Functional verification of CPUs</b> | <b>13-1</b> |
| 13.1 | Introduction | 13-1 |
| 13.2 | Trusted functional simulators (“golden reference models”) | 13-2 |
| 13.3 | RISC-V test programs for verification | 13-3 |
| 13.3.1 | ISA tests | 13-3 |
| 13.3.2 | ACTs and other test suites | 13-4 |
| 13.3.3 | What does “verified” mean? Levels of assurance; coverage | 13-4 |
| 13.4 | A testbench for Drum and Fife | 13-5 |
| 13.5 | Symmetric Tandem Verification of CPU implementations | 13-7 |
| 13.5.1 | Configuration | 13-8 |
| 13.5.2 | Level of detail in traces | 13-8 |
| 13.5.3 | Online <i>vs.</i> offline tandem verification | 13-9 |
| 13.6 | Asymmetric tandem verification and “full-system” verification | 13-9 |
| 13.6.1 | Instructions with non-deterministic results | 13-9 |
| 13.6.2 | Reading uninitialized memory | 13-9 |
| 13.6.3 | Devices and interrupts | 13-10 |
| 13.6.4 | Asymmetric mode | 13-10 |
| 13.7 | Tandem verification with real hardware (FPGA or ASIC) | 13-11 |
| <b>14</b> | <b>BSV: Rules and their Semantics</b> | <b>14-1</b> |
| 14.1 | Introduction | 14-1 |
| 14.2 | Syntax of a rule and the data types of its components | 14-1 |
| 14.3 | Semantics of a rule in isolation | 14-2 |
| 14.3.1 | Hardware representation of a rule in isolation | 14-3 |
| 14.3.2 | A rule firing cannot perform the same action more than once | 14-5 |
| 14.4 | Semantics of a collection of rules | 14-6 |
| 14.5 | Mapping rules to a clock, for real-time behavior | 14-7 |
| 14.5.1 | Constraints on mapping rules to a clock | 14-7 |
| 14.5.2 | The Rule Controller produced by the <code>bsc</code> compiler, and reasoning about performance | 14-9 |
| 14.5.3 | Explicit control of rule ordering, and controller optimizations | 14-9 |

|                                                                                                     |             |
|-----------------------------------------------------------------------------------------------------|-------------|
| 14.6 StmtFSM can always be translated into rules . . . . .                                          | 14-10       |
| 14.7 When to use StmtFSM and when to use rules . . . . .                                            | 14-11       |
| 14.7.1 Use rules for unstructured processes . . . . .                                               | 14-11       |
| <b>15 RISC-V: the Drum unpipelined CPU (using Rules instead of StmtFSM)</b>                         | <b>15-1</b> |
| 15.1 Introduction . . . . .                                                                         | 15-1        |
| 15.2 The Drum CPU module behavior with Rules . . . . .                                              | 15-1        |
| 15.2.1 Optimizing the Drum rules . . . . .                                                          | 15-3        |
| 15.3 Conclusion . . . . .                                                                           | 15-4        |
| <b>16 RISC-V: the Fife pipelined CPU: Principles</b>                                                | <b>16-1</b> |
| 16.1 Introduction . . . . .                                                                         | 16-1        |
| 16.2 Keeping the Fetch Stage Working with PC Prediction and Epochs . . . . .                        | 16-2        |
| 16.2.1 PC Prediction in the Fetch Stage . . . . .                                                   | 16-2        |
| 16.2.2 Identifying and Flushing Wrong-path Instructions . . . . .                                   | 16-3        |
| 16.2.3 Terminology: Speculative Instructions and Commits . . . . .                                  | 16-4        |
| 16.2.4 Speculative instructions should not have any side-effects . . . . .                          | 16-4        |
| 16.3 Managing Register Read/Write Hazards with a Scoreboard . . . . .                               | 16-4        |
| 16.3.1 Releasing Scoreboard Reservations for Uncommitted Instructions . . . . .                     | 16-6        |
| 16.3.2 Bypassing . . . . .                                                                          | 16-6        |
| 16.4 Retiring outputs of the Execute Stages in Order with Tags . . . . .                            | 16-7        |
| 16.5 Allowing Memory Ops to be Pipelined, with a Store Buffer . . . . .                             | 16-7        |
| 16.5.1 What about LOADs and STOREs to non-memory-like devices (MMIO)? . . . . .                     | 16-8        |
| 16.6 The Retire Stage . . . . .                                                                     | 16-9        |
| <b>17 RISC-V: the Fife pipelined CPU code</b>                                                       | <b>17-1</b> |
| 17.1 Introduction . . . . .                                                                         | 17-1        |
| 17.2 The Fife top-level CPU module . . . . .                                                        | 17-1        |
| 17.3 How we connect stages . . . . .                                                                | 17-3        |
| 17.4 The Fetch stage . . . . .                                                                      | 17-4        |
| 17.4.1 Prioritizing rule <code>r1_Fetch_from_Retire</code> over <code>r1_Fetch_req</code> . . . . . | 17-7        |
| 17.5 The Decode stage . . . . .                                                                     | 17-7        |
| 17.5.1 Balancing concurrent paths in a pipeline . . . . .                                           | 17-9        |
| 17.6 The Register-Read and Dispatch (and Register-Write) stage . . . . .                            | 17-9        |
| 17.6.1 BSV: Vectors for the Scoreboard . . . . .                                                    | 17-10       |
| 17.6.2 The Register-Read and Dispatch (and Register-Write) module . . . . .                         | 17-11       |
| 17.7 The Execute Control stage . . . . .                                                            | 17-15       |
| 17.8 The Execute Integer Ops stage . . . . .                                                        | 17-16       |
| 17.9 The Execute Memory Ops stage (speculative DMem) . . . . .                                      | 17-17       |
| 17.10 Fife: the Retire stage . . . . .                                                              | 17-17       |
| 17.10.1 Common facilities used by many rules . . . . .                                              | 17-19       |
| 17.10.2 Rule to retire wrong-path instructions (all paths; discard) . . . . .                       | 17-21       |
| 17.10.3 Rules to retire from direct path . . . . .                                                  | 17-21       |
| 17.10.3.1 Rule to retire CSRRxx instructions (direct path) . . . . .                                | 17-21       |
| 17.10.3.2 Rule to retire MRET instructions (direct path) . . . . .                                  | 17-23       |
| 17.10.3.3 Rule to retire ECALL and EBREAK instructions (direct path) . . . . .                      | 17-23       |
| 17.10.3.4 Rule to retire exceptions from the direct path . . . . .                                  | 17-24       |
| 17.10.4 Rule to retire from the Execute Control path . . . . .                                      | 17-24       |
| 17.10.5 Rule to retire from the Execute Integer path . . . . .                                      | 17-25       |
| 17.10.6 Rules to retire from the Execute DMem path, or perform deferred DMem request . . . . .      | 17-26       |
| 17.10.6.1 Retire speculative and exception from DMem . . . . .                                      | 17-27       |
| 17.10.6.2 Rule to handle deferred DMem requests (from Execute DMem path) . . . . .                  | 17-28       |
| 17.10.6.3 Rules to retire responses for deferred DMem requests . . . . .                            | 17-29       |
| 17.10.7 Common Rule to handle exceptions . . . . .                                                  | 17-30       |
| 17.10.8 Fife module interface definition . . . . .                                                  | 17-30       |
| 17.11 Conclusion . . . . .                                                                          | 17-31       |
| <b>18 BSV: Rules and Methods II: Improved performance with CRegs (Concurrent Registers)</b>                            | <b>18-1</b> |
| 18.1 Introduction . . . . .                                                                                            | 18-1        |
| 18.2 Example: Counter with <code>.incr</code> and <code>.decr</code> methods . . . . .                                 | 18-1        |
| 18.2.1 Semantic and Performance Analysis of <code>mkUp_Down_Counter_I</code> when mapped to clocked hardware . . . . . | 18-2        |
| 18.3 Concurrent Registers <code>CRegs</code>, and a Faster Up-Down Counter . . . . .                                   | 18-3        |
| 18.3.1 A possible hardware implementation of a <code>CReg</code> . . . . .                                             | 18-3        |
| 18.3.2 A Faster Up-Down Counter, using a <code>CReg</code> . . . . .                                                   | 18-4        |
| 18.4 Example: Using a <code>CReg</code> for the RISC-V CSR <code>mcycle</code> . . . . .                               | 18-5        |
| 18.5 PipelineFIFOs and BypassFIFOs . . . . .                                                                           | 18-5        |
| 18.5.1 <code>PipelineFIFOF</code> . . . . .                                                                            | 18-6        |
| 18.5.2 <code>BypassFIFOF</code> . . . . .                                                                              | 18-7        |
| 18.5.3 Back-to-back compositions of <code>BypassFIFOF</code> and <code>PipelineFIFOF</code> . . . . .                  | 18-9        |
| 18.6 Alternatives to <code>CRegs</code>: RWires and their variants (deprecated) . . . . .                              | 18-10       |
| <b>19 RISC-V: Optimizing Drum and Fife</b>                                                                             | <b>19-1</b> |
| 19.1 Introduction . . . . .                                                                                            | 19-1        |
| 19.2 Pipeline traces and visualization to analyze performance . . . . .                                                | 19-2        |
| 19.3 Optimization opportunities in Drum and Fife . . . . .                                                             | 19-4        |
| 19.3.1 Drum and Fife: Fusing Decode and RR-Dispatch . . . . .                                                          | 19-4        |
| 19.3.2 Drum and Fife: Fusing some Retire actions . . . . .                                                             | 19-6        |
| 19.3.3 Drum (rules version): short-circuiting steps . . . . .                                                          | 19-6        |
| 19.3.4 Drum: using narrower inter-step/stage buffers . . . . .                                                         | 19-6        |
| 19.3.5 Fife: saving FIFO resources . . . . .                                                                           | 19-8        |
| 19.3.6 Fife: Reducing the misprediction penalty . . . . .                                                              | 19-8        |
| 19.3.6.1 Save a tick for Fetch redirection using <code>CRegs</code> . . . . .                                          | 19-8        |
| 19.3.6.2 Save a tick for Fetch redirection by eliminating backward FIFO . . . . .                                      | 19-9        |
| 19.3.6.3 Quicker reaction to redirection by Register-Read and Dispatch . . . . .                                       | 19-9        |
| 19.3.6.4 Better next-PC prediction . . . . .                                                                           | 19-9        |
| 19.3.7 Fife: Reducing the register-hazard penalty . . . . .                                                            | 19-12       |
| 19.3.7.1 Fife Bypassing: Save a tick in GPRs and Scoreboard using <code>CRegs</code> . . . . .                         | 19-12       |
| 19.3.7.2 Fife: Dispatching multiple instructions that write to the same Rd . . . . .                                   | 19-13       |
| 19.3.7.3 Save a tick for register update by eliminating backward FIFO . . . . .                                        | 19-13       |
| 19.3.8 Drum and Fife: Reducing memory system delays . . . . .                                                          | 19-13       |
| 19.3.8.1 TCMs (Tightly Coupled Memories) . . . . .                                                                     | 19-14       |
| 19.3.8.2 Caches . . . . .                                                                                              | 19-14       |
| 19.3.8.3 Virtual Memory . . . . .                                                                                      | 19-14       |
| 19.3.8.4 TLBs (Translation Lookaside Buffers) . . . . .                                                                | 19-14       |
| 19.3.8.5 TLBs and Caches . . . . .                                                                                     | 19-14       |
| <b>20 BSV: Suggested further study</b>                                                                                 | <b>20-1</b> |
| 20.1 Introduction . . . . .                                                                                            | 20-1        |
| 20.2 Alternative simulators: Bluesim, other Verilog sims . . . . .                                                     | 20-1        |
| 20.3 First-class modules . . . . .                                                                                     | 20-1        |
| 20.4 Polymorphism . . . . .                                                                                            | 20-1        |
| 20.5 Typeclasses. Conversion of Integer literals. Bits/pack/unpack . . . . .                                           | 20-1        |
| 20.6 Bluecheck . . . . .                                                                                               | 20-1        |
| 20.7 Tagged unions, pattern-matching . . . . .                                                                         | 20-1        |
| 20.8 Multiple Clock Domains . . . . .                                                                                  | 20-1        |
| 20.9 Importing RTL . . . . .                                                                                           | 20-1        |
| 20.10 BH alternative syntax . . . . .                                                                                  | 20-1        |
| <b>21 RISC-V: Suggested further study</b>                                                                              | <b>21-1</b> |
| 21.1 Introduction . . . . .                                                                                            | 21-1        |
| 21.2 Implementing RV64I instead of RV32I . . . . .                                                                     | 21-1        |
| 21.3 M Extension . . . . .                                                                                             | 21-1        |
| 21.4 F and D Extensions . . . . .                                                                                      | 21-1        |
| 21.5 C Extension . . . . .                                                                 | 21-1         |
| 21.6 A Extension . . . . .                                                                 | 21-2         |
| 21.7 Advanced branch prediction . . . . .                                                  | 21-2         |
| 21.8 Register renaming: towards out of order processing . . . . .                          | 21-2         |
| 21.9 Advanced bypassing: towards dataflow and out-of-order processing . . . . .            | 21-2         |
| 21.10 Memory systems: TCMs, Caches, PMPs . . . . .                                         | 21-2         |
| 21.11 Memory Systems: Virtual Memory . . . . .                                             | 21-3         |
| 21.12 Performance measurement . . . . .                                                    | 21-3         |
| 21.13 Testing . . . . .                                                                    | 21-3         |
| 21.14 Interrupts . . . . .                                                                 | 21-3         |
| 21.15 Linux and server-class capability . . . . .                                          | 21-3         |
| 21.15.1 Hypervisor support . . . . .                                                       | 21-3         |
| 21.15.2 RISC-V ISA Formal Specification . . . . .                                          | 21-3         |
| <b>A Resources: Documents and Tools</b>                                                    | <b>A-1</b>   |
| A.1 GitHub . . . . .                                                                       | A-1          |
| A.2 RISC-V ISA (Instruction Set Architecture) Specifications . . . . .                     | A-1          |
| A.3 RISC-V Trusted Simulators and Reference Programs for Testing Implementations . . . . . | A-2          |
| A.4 RISC-V Assembly Language Manuals . . . . .                                             | A-2          |
| A.5 RISC-V GNU tools, including <code>riscv-gcc</code> compiler . . . . .                  | A-2          |
| A.6 BSV . . . . .                                                                          | A-3          |
| A.6.1 “BSV By Example” book (free downloadable PDF) . . . . .                              | A-3          |
| A.6.2 BSV Tutorial . . . . .                                                               | A-3          |
| A.6.3 MIT Course Material . . . . .                                                        | A-4          |
| A.6.4 University of Cambridge Examples . . . . .                                           | A-4          |
| A.6.5 <i>bsc</i> download and installation; <i>bsc</i> and <b>BSV</b> manuals . . . . .    | A-4          |
| A.7 Verilator (or other Verilog simulator) . . . . .                                       | A-5          |
| A.8 Amazon AWS . . . . .                                                                   | A-5          |
| A.9 Xilinx Vivado . . . . .                                                                | A-6          |
| A.10 RISC-V textbooks . . . . .                                                            | A-6          |
| <b>B Why BSV?</b>                                                                          | <b>B-1</b>   |
| B.1 Why BSV instead of some other Hardware Design Language? . . . . .                      | B-2          |
| B.1.1 A better computational model . . . . .                                               | B-2          |
| B.1.2 Modern language features . . . . .                                                   | B-3          |
| B.1.3 Comparison with C++-based High Level Synthesis . . . . .                             | B-4          |
| B.1.3.1 C++ codes need significant rewriting . . . . .                                     | B-4          |
| B.1.3.2 Narrow range of applicability due to automatic parallelization . . . . .           | B-4          |
| B.1.3.3 Lack of “Algotecture”: Architectural transparency and predictability . . . . .     | B-5          |
| B.1.3.4 Summary . . . . .                                                                  | B-5          |
| <b>C Glossary</b>                                                                          | <b>C-1</b>   |
| <b>D BSV: Importing C/C++ functions into BSV simulations</b>                               | <b>D-1</b>   |
| D.1 Introduction . . . . .                                                                 | D-1          |
| D.2 In BSV code, declare a BSV version of the C function . . . . .                         | D-2          |
| D.3 Compile the BSV code with the <i>bsc</i> compiler . . . . .                            | D-2          |
| D.4 Linking . . . . .                                                                      | D-2          |
| D.5 Recommendations for arguments and results of imported C/C++ functions . . . . .        | D-3          |
| D.5.1 Only use BSV types corresponding to C types . . . . .                                | D-3          |
| D.5.2 Use <code>ActionValue#(t)</code> for imported C function’s result . . . . .          | D-3          |
| D.6 Example: Memory Model for Drum and Fife . . . . .                                      | D-4          |
| <b>BSV Index</b>                                                                           | <b>BIB-1</b> |
| <b>RISC-V Index</b>                                                                        | <b>BIB-1</b> |
| <b>Bibliography</b>                                                                        | <b>BIB-1</b> |



# List of Figures

|      |                                                                                                                        |      |
|------|------------------------------------------------------------------------------------------------------------------------|------|
| 1.1  | Topics covered in this book (in red text in red box)                                                                   | 1-2  |
| 2.1  | What is an ISA?                                                                                                        | 2-1  |
| 2.2  | Modularity of the RISC-V ISA                                                                                           | 2-5  |
| 2.3  | RISC-V Instruction Encodings                                                                                           | 2-6  |
| 2.4  | Construction of 21-bit immediate from 20 “imm” bits in J-type instructions                                             | 2-7  |
| 2.5  | Construction of 13-bit immediate from 12 “imm” bits in B-type instructions                                             | 2-7  |
| 2.6  | RISC-V RV32I Instructions                                                                                              | 2-8  |
| 2.7  | RISC-V LUI and AUIPC Instruction semantics                                                                             | 2-8  |
| 2.8  | Execution semantics for LUI instructions                                                                               | 2-9  |
| 2.9  | Execution semantics for AUIPC instructions                                                                             | 2-9  |
| 2.10 | JAL and JALR for subroutine call and return                                                                            | 2-11 |
| 2.11 | CSRs for handling traps                                                                                                | 2-12 |
| 2.12 | Trap and return flow                                                                                                   | 2-13 |
| 2.13 | CSRxx instructions (from Unprivileged Spec)                                                                            | 2-15 |
| 2.14 | CSRxx instruction semantics                                                                                            | 2-15 |
| 2.15 | RV64I instructions (in addition to RV32I)                                                                              | 2-17 |
| 3.1  | Simple interpretation of RISC-V instructions                                                                           | 3-4  |
| 4.1  | File-level view of a BSV program                                                                                       | 4-2  |
| 4.2  | What’s in a BSV package/file?                                                                                          | 4-2  |
| 4.3  | Namespace control with imports and exports                                                                             | 4-3  |
| 4.4  | Typical components in an interface declaration                                                                         | 4-5  |
| 4.5  | Typical components in a module declaration                                                                             | 4-5  |
| 4.6  | Typical components of a rule                                                                                           | 4-6  |
| 4.7  | Typical components of an interface definition                                                                          | 4-7  |
| 4.8  | Static elaboration                                                                                                     | 4-8  |
| 4.9  | Module interaction <i>via</i> methods                                                                                  | 4-9  |
| 5.1  | RISC-V conditional BRANCH instructions                                                                                 | 5-5  |
| 5.2  | Testing for a legal BRANCH instruction                                                                                 | 5-6  |
| 5.3  | If-then-else is a multiplexer                                                                                          | 5-14 |
| 5.4  | Nested if-then-elses become cascaded multiplexers                                                                      | 5-15 |
| 5.5  | Nested if-then-elses using an AND-OR MUX (for mutually exclusive conditions)                                           | 5-16 |
| 6.1  | Simple interpretation of RISC-V instructions (Fig. 3.1 with arrows annotated with <code>struct</code> types)           | 6-1  |
| 7.1  | Simple interpretation of RISC-V instructions                                                                           | 7-1  |
| 8.1  | Hardware wires/buses for interface methods of <code>ActionValue#(t)</code>, <code>Action</code> and value result types | 8-3  |
| 10.1 | Simple interpretation of RISC-V instructions (same as Fig. 6.1)                                                        | 10-1 |
| 11.1 | Simple interpretation of RISC-V instructions (same as Fig. 6.1) | 11-1 |
| 11.2 | Execute and Retire actions in Drum | 11-7 |
| 12.1 | A testbench connected to a DUT | 12-1 |
| 13.1 | Top-level simulation setup for the Drum and Fife CPUs | 13-5 |
| 13.2 | Tandem verification | 13-7 |
| 13.3 | Asymmetric tandem verification: dealing with “minor” differences, interrupts (asynchronous events), devices, <i>etc.</i> | 13-10 |
| 14.1 | Syntactic structure of a rule | 14-1 |
| 14.2 | Semantics of a rule in isolation | 14-3 |
| 14.3 | Hardware representation of a rule in isolation | 14-4 |
| 14.4 | A clock signal | 14-7 |
| 14.5 | Controlling rule execution (CAN_FIRE excerpt from Figure 14.3) | 14-8 |
| 16.1 | Pipelined interpretation of RISC-V instructions (Fig. 3.1 with some annotations) | 16-1 |
| 16.2 | Retire actions in Fife | 16-9 |
| 17.1 | Pipelined interpretation of RISC-V instructions (Fig. 3.1 with some annotations) | 17-1 |
| 17.2 | How we connect Fife stages | 17-4 |
| 17.3 | Actions in the “Retire” stage of Fife (same as Fig. 16.2) | 17-17 |
| 18.1 | A possible hardware implementation of a CReg | 18-4 |
| 18.2 | A producer and consumer connected with a FIFO | 18-6 |
| 18.3 | Combinational path through <code>mkPipelineFIFO</code> | 18-7 |
| 18.4 | Combinational paths through <code>mkBypassFIFO</code> | 18-8 |
| 18.5 | A producer and consumer connected with a composed <code>BypassFIFO-PipelineFIFO</code> | 18-9 |
| 18.6 | Pipeline-stage modularity enabled by composing a <code>BypassFIFO</code> and a <code>PipelineFIFO</code> | 18-10 |
| 19.1 | Visualization of the per-instruction step/stage events in Drum and Fife | 19-4 |

# Chapter 1

## Introduction

“Digital Design” and “CPU Design” (or “Computer Architecture”) are traditionally taught separately, usually in that order, with separate textbooks. Digital Design is usually taught using one of the traditional hardware design languages (Verilog, SystemVerilog or VHDL), and typically relies on small, often artificial examples. CPU Design is often taught without actually designing hardware, relying instead on textbooks, abstract schematics, and simulators implemented in software.

This book takes a different approach: we learn about simple CPU architectures by designing them with a modern Hardware Design Language (HDL) called BSV, learning Digital Design along the way as an intertwined topic. Each Digital Design example will be taken directly from the CPU Design, so that the reader always has a clear sense of the example’s context and purpose.

The CPUs we design here will execute instructions from the RISC-V Instruction Set Architecture (ISA), which is an industrial-strength ISA (with many commercial implementations). Our designs will be simple (typical of small, embedded systems and micro-controllers, not laptops/workstations or servers).

Figure 1.1 shows the broad range of topics in which a CPU designer needs to engage. Some of the early topics are not RISC-V implementation-specific:

- The first step is to understand the RISC-V ISA itself. What are RISC-V instructions, how are they coded in bits, and what do they mean? This topic is not a focus of this book (plenty of textbooks and other educational materials cover it), but understanding the ISA is of course a prerequisite for our design, so we provide a brief overview in Chapter 2.

  The RISC-V ISA has many options; our focus in this book will be on a small “standard” subset that is adequate for small embedded systems:

  - The so-called “RV32I” subset from the RISC-V Unprivileged ISA spec: basic integer arithmetic and logic operations; branch and jump; load and store.
  - A few elements from the RISC-V Privileged ISA spec for handling exceptions: Control and Status Registers (CSRs), traps and trap-handling.
- In order to run actual RISC-V programs on our implementations, we need to understand how to use the *riscv-gcc* compiler to compile C and RISC-V Assembly Language programs into RISC-V binaries (so-called “ELF” files). Another useful tool is *riscv-objdump*, which can disassemble the binary back into assembly-language text. This is useful for debugging our implementation, so that we can understand execution instruction-by-instruction, and diagnose anything that goes wrong.

  How to install and use these tools is not a focus of this book, but of course we need them to produce programs to run on our implementations.

Figure 1.1: Topics covered in this book (in red text in red box)

So, far, all this is not implementation-specific, *i.e.*, it is generic information about RISC-V. The following topics dive into implementations.

- A RISC-V CPU and system can be *modeled* in a simulator coded in C (say). Such a C-based simulator is compiled (with *gcc*, say) and run like any other C program. We will not be discussing this much in this book.
- We will code our hardware design in the BSV HL-HDL. We will use BSV not just for the CPU itself, but also for the “system” components around it: an interconnect, Memory, UART and GPIO.
- We will use the *bsc* compiler to translate our BSV code into Verilog RTL.
- We will learn how our Verilog RTL can directly be simulated in a Verilog simulator. We will use the free, open-source “Verilator” simulator, but you can also run it on any other Verilog simulator, available from a number of providers.

This will provide an exact, cycle-by-cycle accurate simulation of the very same design that we’ll run later on an FPGA. This is invaluable for debugging the hardware design, because the turnaround time to fix a problem and run a new simulation is very short (minutes) compared to creating a new version for an FPGA (several hours).

Of course, Verilog simulation will run much more slowly (10,000x or more slower) compared to an FPGA, and so is useful primarily for early debugging and analysis of the design, running on small RISC-V programs.

- When we execute our Verilog RTL hardware design in Verilog simulation (where the hardware design itself is executing a RISC-V binary program), it will produce a trace file describing events during the simulation. We will learn how to analyze these traces to identify bugs and bottlenecks in our design, from which we can correct design errors and possibly improve performance.
- We will discuss how to process our Verilog RTL through an FPGA synthesis tool to create an FPGA bitfile which can then be loaded into an FPGA and executed. Although it can be synthesized and run on a number of FPGAs from different vendors, in this book we'll discuss how to build and run it for an FPGA on the Amazon AWS cloud.
- Our Verilog RTL can also be processed through ASIC synthesis tools targeting ASIC fabrication. We will not be discussing this much in this book.

### 1.0.1 Drum and Fife, the two RISC-V implementations designed in this book

We will create two hardware RISC-V designs, called Drum and Fife:

- Drum is a *non-pipelined* implementation which will familiarize us with all the basic concepts and flows (the RISC-V ISA, preparing and running a RISC-V binary to run on the design, analysing traces), without being distracted by the complexities of pipelining for high performance.
- Fife is a 5-6 stage pipelined design, which is a microarchitectural change focused on higher performance (speed) than Drum. The design includes simple branch-prediction, register read/write hazard management, in-order retirement of instructions, and speculative stores to memory using a store-buffer.

The two designs will share a large part of the BSV code that implements the essential semantic functionality of the RISC-V ISA. By discussing this shared code in the simpler Drum, the Fife chapters can focus purely on the new issues raised by pipelining (and which are, in fact, not RISC-V specific, but common to all pipelined CPU designs).

Drum and Fife execute exactly the same binaries; the only difference will be in Fife's superior performance (speed).

As we work through the two designs, we will concurrently learn how to code in BSV, the HL-HDL for our designs. BSV is a modern, high-level HDL taking inspiration from modern software programming languages, in particular the Haskell functional programming language and a class of formal specification languages for concurrent programming (including Term Rewriting Systems, Unity, TLA+, and Event-B). BSV is not just for CPU design; like Verilog and SystemVerilog, it is a “universal” language for any digital design, whether related to CPUs or not. Appendix B has a more detailed justification of our choice of BSV over other hardware design languages like Verilog, SystemVerilog, VHDL or Chisel.

### 1.0.2 Drum and Fife source codes

The full source codes (BSV) for Drum and Fife are included with this book. Excerpts of that code are taken, as-is, for inclusion in this book. Excerpts look like this:

```
src_Common/CPU_IFC.bsv: line 27 ...
interface CPU_IFC;
    method Action init (Initial_Params initial_params);

    // IMem
    interface FIFOF_O #(Mem_Req) fo_IMem_req;
    interface FIFOF_I #(Mem_Rsp) fi_IMem_rsp;
    ...
    // DMem, non-speculative
    interface FIFOF_O #(Mem_Req) fo_DMem_req;
    interface FIFOF_I #(Mem_Rsp) fi_DMem_rsp;

    // Set TIME
    (* always_ready, always_enabled *)
    method Action set_TIME (Bit #(64) t);
endinterface
```

The label on the top border of the box indicates the file (`src_Common/CPU_IFC.bsv`) from which the excerpt has been extracted, and the starting line number (27) in that file. The “...” indicates that we have *elided* some lines from the source file which are not directly relevant to the current discussion.

We recommend, as you read the book, that you also keep open a text-editor, in which you can simultaneously view the actual sources.

### 1.0.3 Additional Resources

Chapter 20 has suggestions for further study of BSV.

Chapter 21 has suggestions for further study of RISC-V and CPU design.

Appendix A has a detailed listing of resources (documents and software tools) needed for this book, and for further reading.

This book also contains a Glossary of terms and abbreviations, and a detailed index.

# Chapter 2

## Overview of the RISC-V ISA

### 2.1 Introduction

This entire chapter can safely be skipped by those already familiar with RISC-V concepts, or who learn it elsewhere (there are many alternate resources on the web). This chapter contains only generic information about ISAs and the RISC-V ISA in particular. It can be read as a general introduction to the RISC-V universe; none of this is specific to this book.

### 2.2 What is an ISA?

The acronym “ISA” stands for “Instruction Set Architecture”. Figure 2.1 illustrates the role of an ISA as an intermediary between hardware (CPU implementations) and software (which runs on the CPU hardware).



Figure 2.1: What is an ISA?

An ISA, *per se*, is neither software nor hardware. It is merely a *specification*, a document defining three things:

- *Architectural State*: the registers visible to the instructions, such as the program counter (PC) and a register file containing, say, 32 registers named x0 through x31. Typically (nowadays), the architectural state includes a *byte-addressed* memory (a memory where an *address* identifies a single byte, and where individual bytes may be read and written).
- An *Instruction Set*: a collection of instructions. For each instruction, the ISA specifies how it is encoded in bits. The ISA may also specify a way of writing an instruction in symbolic text, which we call Assembly Language. Instructions are typically grouped in classes, such as Immediate, LOAD/STORE, Arithmetic and Logic, Conditional Branches, Jumps, System Instructions, and so on.
- *Semantics*: a description, for each instruction, of *how* it executes: what architectural state it observes, what architectural state it updates, and how. For example, a Conditional Branch instruction typically observes two registers in the register file, compares them, and updates the PC depending on the comparison result.

The overall semantics of a program being executed is just a sequential composition of individual instruction semantics.

The full ISA may be described in ordinary text prose and diagrams, or in a semi-formal or formal language. For example, the RISC-V ISA is described both in text prose and diagrams, and in the formal language “Sail”.

Beneath the ISA level are actual *implementations* of the ISA. These range from software simulators of the ISA to silicon implementations. Silicon implementations typically vary widely in their *micro-architecture* features, such as pipelining, in-order or out-of-order, speculation, superscalarity, multi-threading, multi-core. The choice of micro-architecture is typically based on a particular target market, trading-off cost, performance (speed), energy consumption, capabilities (embedded software to full server OS with network and storage stacks), silicon technology (FPGA *vs.* ASIC, ASICs with various silicon feature sizes and techniques), and so on. Variations in micro-architectures provide product differentiation.

Each implementation runs (interprets) machine code, *i.e.*, an encoding of instructions in memory. The machine codes are typically produced by compilers, which are themselves machine code programs that transform source codes into machine codes. Some machine codes are themselves software interpreters for so-called interpreted languages such as Python, Javascript, Java, Scheme, Lisp, and so on.

A crucially important point is that *all the implementations of an ISA should respect the semantics of the ISA as specified in the ISA definition*. They can (and do) play all kinds of micro-architectural tricks under the covers, but the result of executing any program should be explainable purely based on the ISA specification.

Thus, an ISA is a *contract* or *API* between software and hardware implementations. People writing software, people writing compilers for software, people writing interpreters for interpreted languages, *etc.* need only to understand and refer to the ISA specification to do their job. They do not need to know about the specific hardware implementation on which it will eventually run. This enables *software portability*, *i.e.*, a machine code program for an ISA should be able to run, without change, on *any* implementation of the ISA, including future next-generation laptops/servers for the ISA.

**NOTE:** In the rarer cases where the “bleeding edge” of software performance really matters, human coders and compilers may produce different machine codes for specific implementations of the same ISA, in order to exploit particular quirks of each implementation’s micro-architecture such as branch-prediction or register-hazard penalties.

Examples of ISAs include:

- *RISC-V*: Open ISA (not proprietary). Silicon implementations from various vendors, worldwide.
- *x86*: Proprietary. Silicon Implementations from Intel and AMD, primarily, with names like Xeon, Core i9, 13th Generation Core, Alder Lake, Raptor Lake, and so on.
- *ARMv8, ARMv9*: Proprietary. Implementations from ARM licensees like Apple, Samsung, Broadcom and others.
- *Power*: Proprietary. Implementations from IBM and other licensees.
- *Sparc*: Proprietary. Implementations originally from Sun Microsystems, nowadays from Oracle, and also from licensees such as Fujitsu.
- *MIPS*: Proprietary. Implementations from MIPS and other licensees.
- Other famous, but now defunct ISAs: *Alpha* (from Digital Equipment Corp./Compaq/HP), *Itanium* (from Intel, HP), *68000* and *88000* (from Motorola), ...

## 2.3 Why choose RISC-V?

In the list of ISAs above, except for RISC-V, all the other ISAs are *proprietary*, *i.e.*, in order to produce and sell a silicon implementation it is necessary to obtain a license (permission) from the company that “owns” the ISA. These licenses can be very expensive, in the thousands of dollars or more.

The RISC-V ISA is owned by RISC-V International (“RVI”), a non-profit corporation based in Switzerland (<https://riscv.org>). As of Fall 2023, RVI claimed a growing membership, including over three thousand corporate, government, university, and individual members from over 70 countries.

The ISA is “open” in that no prior permission or license from RVI is needed in order to produce and sell implementations of the ISA. If a *commercial* product is claimed publicly to implement the RISC-V ISA, the vendor needs to get official certification from RVI that it indeed does so (it must pass a battery of certification tests), but this is a one-time certification, quite different from a production license. Non-commercial products (from universities, research labs, hobbyists, *etc.*) do not need any such certification.

Separate from these commercial concerns, there are also concerns about quality of an ISA and the richness of the ecosystem supporting an ISA. The RISC-V ISA is attractive on these dimensions as well.

The RISC-V ISA was originally designed at University of California, Berkeley, by a team of researchers who have half a century of deep knowledge and experience on ISAs, computer architectures, computer systems, and ecosystem software (the Sparc ISA also came out of Berkeley in the 1980s). As such, the RISC-V ISA can be seen as a new, clean-slate design that incorporates all the lessons learned from all previous ISAs dating all the way back to the 1950s.

An ISA is useless without a strong *ecosystem* supporting the ISA. This includes artefacts such as compilers, programming language implementations, debuggers, test and verification infrastructure, embedded operating systems, real-time operating systems, workstation and server-class operating systems, boot loaders, device drivers, and so on. It also includes services, such as tutorials, books, courses, and training materials (this book can be seen as a contribution). The RISC-V ecosystem is already quite rich, and it grows daily precisely because of the open nature of the ISA, enabling thousands of contributors worldwide to participate in the effort.

The openness of the RISC-V ISA also enables entrepreneurs and researchers to attempt *innovations* in micro-architecture and design and production techniques, something which is not practically feasible with a proprietary ISA.

The RISC-V ISA is unusually *modular* (more details in Section 2.4). It consists of a *very* small “base” Integer ISA and a number of optional standard extensions (such as Integer Multiply/Divide, floating point, Atomics, and Compressed instructions), and a systematic way of substituting or adding new, non-standard, proprietary extensions. This makes it easier for an entrepreneur to “tune” the ISA for their proprietary implementation aimed at a target market with specific requirements.

There is also a security dimension to RISC-V’s attractiveness. When a customer buys a silicon implementation of an ISA, they have to trust that neither the vendor nor the supply chain inserted any secret “back-doors” by which others can monitor what the processor is doing or, worse, disable it or program it remotely. The RISC-V ISA being open, it is more possible for a customer to develop their own trusted supply chains for silicon implementations.

Silicon implementations of RISC-V are already available from dozens of vendors located in several countries worldwide. These supply chains are only likely to grow more diverse. Many production-ready, competitive RISC-V designs are available in free and open-source form.

## 2.4 Overview of the RISC-V ISA

**NOTE:** More detailed information on the topics of this section can be found in:

- RISC-V ISA specification documents. Please see Appendix A.2 for links.
- Formal specification of the RISC-V ISA, written in the Sail language. Please see Appendix A.2 for links.
- RISC-V Assembly Language manuals. Please see Appendix A.4 for links.

The RISC-V ISA is designed to be highly modular. Figure 2.2 shows the major components. The foundations (*base* integer ISA) are RV32I and RV64I, for implementations with 32-bit and 64-bit register widths, respectively. Technically these are two separate ISAs, but all but two of the RV32I instructions have identical counterparts in RV64I, so one can think of RV64I as a superset of RV32I, as suggested in the diagram. RV32I has just 40 instructions. RV64I slightly modifies two of them and adds a few more instructions.



Figure 2.2: Modularity of the RISC-V ISA

To the base ISA one can add several standard optional ISA extensions:

- M: a few instructions for integer multiplication and division.
- A: a few instructions for atomic memory operations (atomic read-modify-write of locations in memory).
- F,D: several instructions for IEEE single- and double-precision floating point operations.
- C: several so-called “compressed” instructions, which are only 16-bits wide, for applications where it is important for the code size to be small.
- Vector extension for vector arithmetic for scientific, AI and high-performance computing.
- Other extensions like SIMD (Single Instruction, Multiple Data), Bit Manipulation, useful in image processing, cryptography, *etc.*

The standard Privileged ISA is shown in the bottom half of the diagram. This consists of three “privilege” levels U (User, low), S (Supervisor) and M (Machine, high), with increasing capabilities with respect to what aspects of the architectural state are visible and updatable.

The transition into the Privileged ISA only happens through a few carefully managed gateways. Thus, the whole standard Privileged ISA can easily be substituted with some other, non-standard, Privileged ISA, should the implementor find that beneficial.

Simple RISC-V implementations for very small embedded systems may implement only one privilege level (M). Since there is only one privilege level, there are no relative protections—all code can access all architectural state.

Slightly more secure RISC-V implementations for embedded systems may implement two privilege levels (M and U). Now, code running at U privilege can be prevented from accessing (and damaging) certain parts of architectural state such as devices, regions of memory, *etc.*

Most medium- to large-size systems implement all three privilege levels, M, S and U. Typically, user code runs at U privilege; operating systems such as Linux run at S privilege, and low-level device-access codes run at the highest (M) privilege. When S is implemented, code running at both U and S privilege levels can also use *virtual memory*.

## 2.5 Instruction encodings

All RISC-V instructions are encoded in 32 bits (except for the C extension, where they are encoded in 16 bits). Although there are hundreds of instructions if we count RV32I, RV64I, M, A, F, D, C, and Privileged ISA, they are all encoded in just a few 32-bit formats, shown in the Unprivileged spec, page 130, reproduced here in Figure 2.3.

| instr[31:25]   | instr[24:20]  | instr[19:15] | instr[14:12] | instr[11:7]   | instr[6:0] | Format |
|----------------|---------------|--------------|--------------|---------------|------------|--------|
| funct7         | rs2           | rs1          | funct3       | rd            | opcode     | R-type |
| imm[11:5]      | imm[4:0]      | rs1          | funct3       | rd            | opcode     | I-type |
| imm[11:5]      | rs2           | rs1          | funct3       | imm[4:0]      | opcode     | S-type |
| imm[12\|10:5]  | rs2           | rs1          | funct3       | imm[4:1\|11]  | opcode     | B-type |
| imm[31:25]     | imm[24:20]    | imm[19:15]   | imm[14:12]   | rd            | opcode     | U-type |
| imm[20\|10:5]  | imm[4:1\|11]  | imm[19:15]   | imm[14:12]   | rd            | opcode     | J-type |

Figure 2.3: RISC-V Instruction Encodings (from Volume I: RISC-V Unprivileged ISA V20191213, page 130). Immediate fields that span several columns in the original figure (imm[11:0] in I-type, imm[31:12] in U-type, imm[20|10:1|11|19:12] in J-type) are shown here split at the column boundaries.

The labels on the right-hand side are suggestive of the class of instructions that use that coding:

- R: “register class”: typically two register inputs
- I: “immediate class”: typically one register and one immediate input
- S: “store class”: for the STORE instructions SB, SH and SW
- B: “branch class”: for the conditional branch instructions
- U: “Upper Immediate class”: for LUI (Load Upper Immediate) and AUIPC (Add Upper Immediate to PC)
- J: “Jump class”: for the jump instruction JAL (the other jump instruction, JALR, uses the I-type format)

We use standard Verilog “bit-slice” notation to refer to bit-fields of the 32-bit instruction. Thus, instr[6:0] refers to the lower 7 bits (bits 0 through 6) of the 32-bit instruction.

The overall “operation code” (opcode) of an instruction is a combination of the 7-bit opcode field in instr[6:0] and the funct3 and funct7 fields at other positions in the instruction.

If the instruction reads at least one register, the first one is specified in the rs1 field (“register source 1”); the 5 bits in instr[19:15] specify one of the 32 registers. If the instruction reads two registers, the second one is specified in the rs2 field; the 5 bits in instr[24:20] specify one of the 32 registers. If the instruction writes a register, it is specified in the rd field (“register destination”); the 5 bits in instr[11:7] specify one of the 32 registers.
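The bit-slice notation above maps directly onto shift-and-mask operations. The following is a minimal Python sketch (a software model for illustration only, not the BSV code used for Drum and Fife); the helper names `bits` and `decode_fields` are our own and do not come from the book's sources.

```python
def bits(value, hi, lo):
    """Return value[hi:lo] (Verilog-style slice, both ends inclusive) as an int."""
    width = hi - lo + 1
    return (value >> lo) & ((1 << width) - 1)

def decode_fields(instr):
    """Split a 32-bit instruction into the fixed-position fields of Figure 2.3."""
    return {
        "opcode": bits(instr, 6, 0),     # 7 bits
        "rd":     bits(instr, 11, 7),    # 5 bits: one of x0..x31
        "funct3": bits(instr, 14, 12),   # 3 bits
        "rs1":    bits(instr, 19, 15),   # 5 bits
        "rs2":    bits(instr, 24, 20),   # 5 bits
        "funct7": bits(instr, 31, 25),   # 7 bits
    }

# 0x002081B3 is the R-type encoding of "add x3, x1, x2"
fields = decode_fields(0x002081B3)
```

Not every format uses every field (for example, an I-type instruction has immediate bits where `rs2` and `funct7` would be), so a real decoder first inspects `opcode` to decide which fields are meaningful.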

In RISC-V, reading register x0 always yields the value 0, and writes to x0 are ignored. This is a useful convenience built into the ISA. For example, there is an instruction to test if the rs1-value is less than the rs2-value. If we specify x0 for rs2, then we effectively test if the rs1 value is negative.

Some instructions take input data from bits in the instruction itself; these are called “immediate” fields. In Figure 2.3 we can see that there are various encodings for the immediate fields. Consider the J-type instruction. Figure 2.4 shows how the 20 “imm” bits in the instruction are permuted, with a 0 bit appended to the right, to produce a 21-bit immediate value to be used in the computation.



Figure 2.4: Construction of 21-bit immediate from 20 “imm” bits in J-type instructions

Similarly, for B-type instructions, Figure 2.5 shows how the 12 “imm” bits in the instruction are permuted, with a 0 bit appended to the right, to produce a 13-bit immediate value to be used in the computation.



Figure 2.5: Construction of 13-bit immediate from 12 “imm” bits in B-type instructions

In both of the above, the encoding takes advantage of the fact that Jump and Branch target PCs are always at least 2-byte aligned, and therefore we do not have to use up an instruction “imm” bit to represent the “0” least-significant bit. The extra bit in the immediate value effectively doubles the “distance” that a jump or branch can span.<sup>1</sup>
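The permutations in Figures 2.4 and 2.5 can be written out as shifts and masks. The following Python sketch is a software model for illustration (the function names `sign_extend`, `imm_j` and `imm_b` are our own); it also performs the sign-extension that the jump and branch instructions apply to these immediates, so the result is an ordinary, possibly negative, integer.

```python
def sign_extend(value, width):
    """Interpret the low `width` bits of value as a two's-complement integer."""
    value &= (1 << width) - 1
    return value - (1 << width) if value & (1 << (width - 1)) else value

def imm_j(instr):
    """21-bit J-type immediate: instr[31], instr[19:12], instr[20], instr[30:21], then a 0 bit."""
    imm = ((((instr >> 31) & 0x1)   << 20) |   # imm[20]
           (((instr >> 12) & 0xFF)  << 12) |   # imm[19:12]
           (((instr >> 20) & 0x1)   << 11) |   # imm[11]
           (((instr >> 21) & 0x3FF) << 1))     # imm[10:1]; imm[0] is always 0
    return sign_extend(imm, 21)

def imm_b(instr):
    """13-bit B-type immediate: instr[31], instr[7], instr[30:25], instr[11:8], then a 0 bit."""
    imm = ((((instr >> 31) & 0x1)  << 12) |    # imm[12]
           (((instr >> 7)  & 0x1)  << 11) |    # imm[11]
           (((instr >> 25) & 0x3F) << 5)  |    # imm[10:5]
           (((instr >> 8)  & 0xF)  << 1))      # imm[4:1]; imm[0] is always 0
    return sign_extend(imm, 13)
```

For example, `0xFFDFF06F` is the encoding of `jal x0, -4`, and `imm_j(0xFFDFF06F)` evaluates to `-4`; `0x00209463` is `bne x1, x2, 8`, and `imm_b(0x00209463)` evaluates to `8`.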

The fact that there are so few encoding formats justifies the “RISC” in the ISA name: Reduced Instruction Set Computer. Having fewer formats, with few or zero special cases, simplifies the hardware needed to decode an instruction.

## 2.6 Unprivileged ISA RV32I

Figure 2.6 reproduces the table on page 130 of the RISC-V Unprivileged ISA specification, which shows all RV32I instructions. On the right-side of the diagram we have labeled different classes of instructions.

Note that there are just 40 instructions (again justifying the “RISC” name).

### 2.6.1 “Upper Immediate” instructions LUI and AUIPC

Figure 2.7 reproduces the top of page 19 of the RISC-V Unprivileged ISA specification, which describes the semantics of the LUI and AUIPC instructions in prose text.

<sup>1</sup>Actually in RV32I and RV64I PC targets are 4-byte aligned. The 2-byte alignment here accommodates the optional C (Compressed) extension, where instructions may be only 2-byte aligned.

RV32I Base Instruction Set (the “Format” column refers to the formats of Figure 2.3; fields are listed from instr[31] down to instr[0]):

| Fields (most-significant first)                                | Format | Instruction |
|----------------------------------------------------------------|--------|-------------|
| imm[31:12] · rd · 0110111                                      | U      | LUI         |
| imm[31:12] · rd · 0010111                                      | U      | AUIPC       |
| imm[20\|10:1\|11\|19:12] · rd · 1101111                        | J      | JAL         |
| imm[11:0] · rs1 · 000 · rd · 1100111                           | I      | JALR        |
| imm[12\|10:5] · rs2 · rs1 · 000 · imm[4:1\|11] · 1100011       | B      | BEQ         |
| imm[12\|10:5] · rs2 · rs1 · 001 · imm[4:1\|11] · 1100011       | B      | BNE         |
| imm[12\|10:5] · rs2 · rs1 · 100 · imm[4:1\|11] · 1100011       | B      | BLT         |
| imm[12\|10:5] · rs2 · rs1 · 101 · imm[4:1\|11] · 1100011       | B      | BGE         |
| imm[12\|10:5] · rs2 · rs1 · 110 · imm[4:1\|11] · 1100011       | B      | BLTU        |
| imm[12\|10:5] · rs2 · rs1 · 111 · imm[4:1\|11] · 1100011       | B      | BGEU        |
| imm[11:0] · rs1 · 000 · rd · 0000011                           | I      | LB          |
| imm[11:0] · rs1 · 001 · rd · 0000011                           | I      | LH          |
| imm[11:0] · rs1 · 010 · rd · 0000011                           | I      | LW          |
| imm[11:0] · rs1 · 100 · rd · 0000011                           | I      | LBU         |
| imm[11:0] · rs1 · 101 · rd · 0000011                           | I      | LHU         |
| imm[11:5] · rs2 · rs1 · 000 · imm[4:0] · 0100011               | S      | SB          |
| imm[11:5] · rs2 · rs1 · 001 · imm[4:0] · 0100011               | S      | SH          |
| imm[11:5] · rs2 · rs1 · 010 · imm[4:0] · 0100011               | S      | SW          |
| imm[11:0] · rs1 · 000 · rd · 0010011                           | I      | ADDI        |
| imm[11:0] · rs1 · 010 · rd · 0010011                           | I      | SLTI        |
| imm[11:0] · rs1 · 011 · rd · 0010011                           | I      | SLTIU       |
| imm[11:0] · rs1 · 100 · rd · 0010011                           | I      | XORI        |
| imm[11:0] · rs1 · 110 · rd · 0010011                           | I      | ORI         |
| imm[11:0] · rs1 · 111 · rd · 0010011                           | I      | ANDI        |
| 0000000 · shamt · rs1 · 001 · rd · 0010011                     | I      | SLLI        |
| 0000000 · shamt · rs1 · 101 · rd · 0010011                     | I      | SRLI        |
| 0100000 · shamt · rs1 · 101 · rd · 0010011                     | I      | SRAI        |
| 0000000 · rs2 · rs1 · 000 · rd · 0110011                       | R      | ADD         |
| 0100000 · rs2 · rs1 · 000 · rd · 0110011                       | R      | SUB         |
| 0000000 · rs2 · rs1 · 001 · rd · 0110011                       | R      | SLL         |
| 0000000 · rs2 · rs1 · 010 · rd · 0110011                       | R      | SLT         |
| 0000000 · rs2 · rs1 · 011 · rd · 0110011                       | R      | SLTU        |
| 0000000 · rs2 · rs1 · 100 · rd · 0110011                       | R      | XOR         |
| 0000000 · rs2 · rs1 · 101 · rd · 0110011                       | R      | SRL         |
| 0100000 · rs2 · rs1 · 101 · rd · 0110011                       | R      | SRA         |
| 0000000 · rs2 · rs1 · 110 · rd · 0110011                       | R      | OR          |
| 0000000 · rs2 · rs1 · 111 · rd · 0110011                       | R      | AND         |
| fm · pred · succ · rs1 · 000 · rd · 0001111                    | I      | FENCE       |
| 000000000000 · 00000 · 000 · 00000 · 1110011                   | I      | ECALL       |
| 000000000001 · 00000 · 000 · 00000 · 1110011                   | I      | EBREAK      |

Annotations from top to bottom:

- LUI, AUIPC: "load immediate"-kind: to load constant values into a register
- JAL, JALR: "jump-and-link"-kind: subroutine calls and returns; distant jumps
- BEQ, BNE, BLT, BGE, BLTU, BGEU: "conditional branch"-kind: test and possibly jump up to ~0x1000 distance
- LB, LH, LW, LBU, LHU: "load data from memory into register" (rs1 and imm specify address)
- SB, SH, SW: "store data from register rs2 to memory" (rs1 and imm specify address)
- ADDI, SLTI, SLTIU, XORI, ORI, ANDI, SLLI, SRLI, SRAI: "integer arithmetic operations (register-immediate)"
- ADD, SUB, SLL, SLT, SLTU, XOR, SRL, SRA, OR, AND: "integer arithmetic operations (register-register)"
- FENCE, ECALL, EBREAK: "system" operations (ignore FENCE for now)

Figure 2.6: RISC-V RV32I Instructions

> *Volume I: RISC-V Unprivileged ISA V20191213, page 19*
>
> SRLI is a logical right shift (zeros are shifted into the upper bits); and SRAI is an arithmetic right shift (the original sign bit is copied into the vacated upper bits).
>
> LUI (load upper immediate) is used to build 32-bit constants and uses the U-type format. LUI places the U-immediate value in the top 20 bits of the destination register *rd*, filling in the lowest 12 bits with zeros.
>
> AUIPC (add upper immediate to pc) is used to build pc-relative addresses and uses the U-type format. AUIPC forms a 32-bit offset from the 20-bit U-immediate, filling in the lowest 12 bits with zeros, adds this offset to the address of the AUIPC instruction, then places the result in register *rd*.

Figure 2.7: RISC-V LUI and AUIPC Instruction semantics

Figure 2.8 is a pictorial depiction of the semantics of an LUI instruction. At the left of the figure is a depiction of the general behavior of a CPU, namely to loop forever, at each iteration fetching the instruction that is in memory at the address specified by the PC register, and then executing that instruction. For the LUI instruction, to the 20 bits [31:12] of the instruction (the “immediate” value) we append twelve bits of constant 0 as least-significant bits to construct a 32-bit value. This value is then written into register rd.

Figure 2.8: Execution semantics for LUI instructions

Figure 2.9 is a pictorial depiction of the semantics of an AUIPC instruction. As with LUI, to the 20 bits [31:12] of the instruction (the “immediate” value) we append twelve bits of constant 0 as least-significant bits to construct a 32-bit value. For AUIPC we add this value to the 32-bit PC value, and the result is then written into register rd.

Figure 2.9: Execution semantics for AUIPC instructions
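The LUI and AUIPC semantics described above are small enough to state as two one-line functions. This is a Python sketch of a software model, with our own function names `exec_lui` and `exec_auipc`; it assumes a 32-bit (RV32I) register width, so the AUIPC addition wraps around at 32 bits.

```python
MASK32 = 0xFFFFFFFF

def exec_lui(instr):
    """Value written to rd by LUI: instr[31:12] followed by twelve 0 bits."""
    return instr & 0xFFFFF000

def exec_auipc(instr, pc):
    """Value written to rd by AUIPC: the same 32-bit value, added to the PC
    of the AUIPC instruction itself, wrapping at 32 bits."""
    return (pc + (instr & 0xFFFFF000)) & MASK32

# 0x123452B7 encodes "lui x5, 0x12345"; 0x00001297 encodes "auipc x5, 0x1"
lui_result = exec_lui(0x123452B7)                  # 0x12345000
auipc_result = exec_auipc(0x00001297, 0x80000000)  # 0x80001000
```

Because the twelve low bits are already 0 in the right position, masking the instruction with `0xFFFFF000` is equivalent to extracting instr[31:12] and shifting it left by 12.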

### 2.6.2 Conditional BRANCH instructions

Conditional branch instructions compare the values in registers rs1 and rs2 for some condition: BEQ (equal), BNE (not-equal), BLT (less-than), BGE (greater-than-or-equal). If the condition is:

- false (“branch not taken”):
  - The instruction is a no-op; it just falls through to the next instruction ($PC := PC + 4$).

- true (“branch taken”):
  - The 12 “imm” bits form a 13-bit value (see Figure 2.5).
  - This 13-bit value is sign-extended to 32 bits and added to the current PC to form a target address (“branch target”).
  - The PC is updated to contain the target address.

Because of sign-extension (+/-), the branch may be forwards or backwards relative to the current PC.

For BLT and BGE, the comparison treats the two input 32-bit values as *signed* values. For BLTU and BGEU, they are treated as *unsigned* values.
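The branch rules above, including the signed/unsigned distinction, can be sketched in Python as follows. This is an illustrative software model: the function names are our own, register values are passed in as unsigned 32-bit integers, and `imm` is assumed to be the already sign-extended 13-bit offset. The `funct3` codes are those listed in Figure 2.6.

```python
MASK32 = 0xFFFFFFFF

def to_signed32(x):
    """Reinterpret a 32-bit pattern as a two's-complement signed integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x

def branch_taken(funct3, rs1_val, rs2_val):
    """Evaluate the branch condition selected by funct3."""
    a_u, b_u = rs1_val & MASK32, rs2_val & MASK32
    a_s, b_s = to_signed32(a_u), to_signed32(b_u)
    conditions = {
        0b000: a_u == b_u,   # BEQ
        0b001: a_u != b_u,   # BNE
        0b100: a_s <  b_s,   # BLT  (signed)
        0b101: a_s >= b_s,   # BGE  (signed)
        0b110: a_u <  b_u,   # BLTU (unsigned)
        0b111: a_u >= b_u,   # BGEU (unsigned)
    }
    return conditions[funct3]   # KeyError for funct3 values that are not branches

def next_pc_for_branch(pc, funct3, rs1_val, rs2_val, imm):
    """Return the PC after a conditional branch located at address pc."""
    if branch_taken(funct3, rs1_val, rs2_val):
        return (pc + imm) & MASK32   # taken: PC-relative target
    return (pc + 4) & MASK32         # not taken: fall through
```

The bit pattern `0xFFFFFFFF` shows why two variants are needed: it is -1 when treated as signed, so BLT against 1 is taken, but it is the largest 32-bit value when treated as unsigned, so BLTU against 1 is not taken.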

### 2.6.3 LOAD and STORE memory-access instructions

LOAD (LB, LH, LW, LBU, LHU) instructions move data from memory into a register. STORE (SB, SH, SW) instructions move data from a register to memory. In both cases, the address is formed by sign-extending the 12-bit immediate value to 32 bits and adding it to the value from register rs1. For LOAD instructions, the value loaded is placed into register rd. For STORE instructions, the value in register rs2 is stored to memory, and there is no result value written into any register.

As with most modern ISAs, RISC-V memory is byte-addressed, *i.e.*, each address refers to a specific byte in memory. The size of the data to be loaded/stored is given by “B” (1 Byte), “H” (Halfword = 2 bytes), or “W” (Word = 4 bytes) in the instruction name.

The difference between LB and LBU is whether the 8 bits loaded (1 byte) are sign-extended or zero-extended, respectively, to 32 bits when placed in the 32-bit rd. (And similarly for LH vs. LHU.)
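The following Python sketch models the address calculation and the sign- versus zero-extension of loaded values. It is an illustration only: memory is modeled as a `bytearray`, the function names are our own, and it assumes the little-endian byte order that RISC-V uses by default (a detail not discussed in the text above).

```python
MASK32 = 0xFFFFFFFF

def sign_extend(value, width):
    """Interpret the low `width` bits of value as a two's-complement integer."""
    value &= (1 << width) - 1
    return value - (1 << width) if value & (1 << (width - 1)) else value

def effective_address(rs1_val, imm12):
    """Address = rs1 value + sign-extended 12-bit immediate, wrapping at 32 bits."""
    return (rs1_val + sign_extend(imm12, 12)) & MASK32

def load(mem, addr, size, signed):
    """size is 1 (LB/LBU), 2 (LH/LHU) or 4 (LW); returns the 32-bit value for rd."""
    raw = int.from_bytes(mem[addr:addr + size], "little")
    if signed:
        raw = sign_extend(raw, 8 * size)   # LB, LH: copy the sign bit upwards
    return raw & MASK32                    # LBU, LHU: upper bits stay 0

def store(mem, addr, size, rs2_val):
    """size is 1 (SB), 2 (SH) or 4 (SW); writes the low bytes of the rs2 value."""
    low = rs2_val & ((1 << (8 * size)) - 1)
    mem[addr:addr + size] = low.to_bytes(size, "little")

mem = bytearray(16)
store(mem, 4, 4, 0x123480FF)      # SW: memory bytes 4..7 become FF 80 34 12
lb  = load(mem, 4, 1, True)       # LB  -> 0xFFFFFFFF (byte 0xFF sign-extended)
lbu = load(mem, 4, 1, False)      # LBU -> 0x000000FF (byte 0xFF zero-extended)
```

A real implementation must also decide what to do with misaligned or out-of-range addresses; this sketch ignores both.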

How to read the ISA spec: examples LOAD/STORE, Register-Register Arithmetic and Logic, Register-Immediate Arithmetic and Logic, Unconditional Jump.

### 2.6.4 Register-Register Arithmetic and Logic instructions

These instructions read registers rs1 and rs2, perform the specified arithmetic/logic operation on the two values, and store the result into register rd.

SLT (“set if less than”) tests if register[rs1] < register[rs2], and writes 0 (false) or 1 (true) into the destination register, treating the inputs as 32-bit signed values. SLTU treats inputs as unsigned values.

SRL (“shift right logical”) shifts the value from register[rs1] to the right (towards the least-significant bits) by the amount in register[rs2]. Zeroes are shifted in from the left.

In SRA (“shift right arithmetic”), copies of bit register[rs1][31] are shifted in from the left, *i.e.*, the 32-bit value is treated as a signed integer (bit 31 is the sign bit).

In SLL (“shift left logical”), zeroes are shifted in from the right.

In all three shifts, since there are only 32 bits in register[rs1], only the lower 5 bits of register[rs2] are relevant (*i.e.*, bits [4:0]); the rest are ignored.
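As a sketch of these semantics, the following Python functions model SLT, SLTU and the three shifts on 32-bit values held as non-negative Python integers. The function names and the `to_signed` helper are our own, chosen for illustration.

```python
MASK32 = 0xFFFFFFFF

def to_signed(x):
    """Reinterpret a 32-bit pattern as a signed integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x

def slt(a, b):
    return 1 if to_signed(a) < to_signed(b) else 0

def sltu(a, b):
    return 1 if (a & MASK32) < (b & MASK32) else 0

def sll(a, b):
    return (a << (b & 0x1F)) & MASK32          # zeroes enter from the right

def srl(a, b):
    return (a & MASK32) >> (b & 0x1F)          # zeroes enter from the left

def sra(a, b):
    return (to_signed(a) >> (b & 0x1F)) & MASK32  # sign bit enters from the left
```

For example, `srl(0x80000000, 4)` gives `0x08000000` whereas `sra(0x80000000, 4)` gives `0xF8000000`, and `sll(1, 33)` gives 2 because only bits [4:0] of the shift amount are used.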

### 2.6.5 Register-Immediate Arithmetic and Logic instructions

These instructions read register rs1, form a signed 32-bit value by sign-extending the 12-bit “imm” field, perform the specified arithmetic/logic operation on the two values, and store the result into register rd.

In SRLI (“shift right logical immediate”), SRAI (“shift right arithmetic immediate”), and SLLI (“shift left logical immediate”), the 5-bit **shamt** field provides the “shift amount”.
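The field extraction for these instructions can be sketched as follows. The function name `decode_i_type` and the dictionary it returns are illustrative choices, not part of any specification; the bit positions are those of the standard I-type format.

```python
def decode_i_type(instr):
    """Split a 32-bit I-type instruction into its fields."""
    imm12 = (instr >> 20) & 0xFFF
    return {
        "opcode": instr & 0x7F,
        "rd":     (instr >> 7) & 0x1F,
        "funct3": (instr >> 12) & 0x7,
        "rs1":    (instr >> 15) & 0x1F,
        # sign-extend the 12-bit immediate
        "imm":    imm12 - 0x1000 if imm12 & 0x800 else imm12,
        # for SLLI/SRLI/SRAI, the shift amount is the low 5 bits of the same field
        "shamt":  imm12 & 0x1F,
    }

addi = decode_i_type(0xFFF00093)   # addi x1, x0, -1
srai = decode_i_type(0x4030D113)   # srai x2, x1, 3
```

In the first example the 12-bit field `0xFFF` decodes to the immediate -1. In the second, the `shamt` field is 3 and the upper bits of the immediate field distinguish SRAI from SRLI.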

### 2.6.6 Unconditional Jump instructions

JAL (“jump and link”) forms a 21-bit value from the 20 “imm” bits (see Figure 2.4), sign-extends it to 32 bits, adds it to the current PC value and updates the PC with the result.

JALR (“jump and link register”) forms a 12-bit value from the 12 “imm” bits, sign-extends it to 32 bits, adds that to the value in register[rs1], forces bit [0] to be 0, and updates the PC to that value.

Both JAL and JALR store the current PC + 4 (“return address”) into register[rd].

These are most often used for subroutine calls and returns. Figure 2.10 illustrates the protocol.



Figure 2.10: JAL and JALR for subroutine call and return

Suppose, inside function f1(), at address x, we have a **JAL ra,f2** instruction representing a subroutine call to function f2(). The compiler/linker will place “delta” in the JAL instruction’s “immediate” field, where delta is the difference between x and the entry point of f2(). In RISC-V Assembly Language, “ra” (“return address”) is another name for register[1], which is used to hold return addresses as part of the standard software calling convention. The JAL instruction saves x+4 (the return address) in register ra, and sets the PC to x+delta, so that the next instruction fetched will be the entry point of f2().

Inside f2(), at address y, suppose we have a **JALR 0,ra,0** instruction representing the return to the caller. It saves y+4 in register 0, but recall that writes to register 0 are always ignored. It reads the value in register ra, adds 0 to it and sets the PC to the result value, *i.e.*, the next instruction fetched will be from x+4.

JAL is used for most subroutine calls, which are to manifestly known subroutines (so the compiler/linker can compute the “delta” for the immediate value). JALR is used for subroutine calls where the “delta” is not known to the compiler/linker, for example when calling through a table of function pointers.

JALR is also used for subroutine calls and returns and conditional branches to “distant” target addresses. Remember that BRANCH instructions only have a 13-bit signed offset from the current PC, and JAL only has a 21-bit signed offset. For more distant jumps, we can construct a full 32-bit value in a register (using one or more LUI and AUIPC instructions, for example) and then use that register as rs1 in a JALR instruction.
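The call/return protocol described above can be traced with a small Python model. The function names, the addresses chosen for x, y and f2(), and the register-file list are illustrative assumptions.

```python
MASK32 = 0xFFFFFFFF

def sext(value, bits):
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value

def jal(pc, imm21):
    """Return (new_pc, link) for JAL; imm21 is the raw 21-bit offset."""
    return (pc + sext(imm21, 21)) & MASK32, (pc + 4) & MASK32

def jalr(pc, rs1_val, imm12):
    """Return (new_pc, link) for JALR; bit 0 of the target is forced to 0."""
    target = (rs1_val + sext(imm12, 12)) & MASK32 & ~1
    return target, (pc + 4) & MASK32

regs = [0] * 32
RA = 1                       # "ra" is another name for register 1

x, f2_entry = 0x1000, 0x2400
pc_after_call, link = jal(x, f2_entry - x)   # JAL ra,f2 ("delta" = f2_entry - x)
regs[RA] = link

y = 0x2410
pc_after_return, link = jalr(y, regs[RA], 0) # JALR 0,ra,0
# rd is register 0, so `link` (y+4) is discarded rather than written.
```

After the call, the PC is at the entry point of f2() and `ra` holds x+4; after the return, the PC is x+4 again.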

### 2.6.7 FENCE

The FENCE instruction is intended for RISC-V implementations that contain caches in the memory system. In such systems, data may not reach memory quickly or at all (it is in the cache and has not been evicted), and two items of data $x_1$ and $x_2$ may reach memory in a different order from the STORE instructions that wrote them (because the order in which their cache lines are written back to memory has no relationship to the program STORE order).

The FENCE instruction is often used to “push” data from the cache to memory, although technically the FENCE instruction only guarantees an “ordering”, *i.e.*, that if there are two STOREs  $x_1$  and  $x_2$  before and after a FENCE, respectively, then  $x_1$  will be visible before  $x_2$  to any other agent in the system (another CPU, a DMA engine, an I/O device, *etc.*).

## 2.7 Traps due to illegal instructions and other exceptions, and CSRs

Most processors cannot afford to get “stuck” on an unrecognized instruction (illegal instruction). An illegal instruction raises one of several possible “exceptions”. Other events that raise exceptions are misaligned memory access (in FETCH or LOAD/STORE), or a memory-access to an unimplemented address (in FETCH or LOAD/STORE), *etc.*

NOTE: “Illegal” just means: outside the currently implemented subset. For example (see Figure 2.2), a legal RISC-V instruction in, say, the M extension is considered illegal in an implementation that only implements the RV32I subset.

Exceptions are treated like an “unexpected call” to a special subroutine called a “trap handler”. The architectural state is extended with a few extra registers belonging to a class of registers called CSRs (Control and Status Registers), shown in Figure 2.11.



Figure 2.11: CSRs for handling traps

To “return” from a trap handler we need one more instruction: MRET, which is in the “RISC-V Privilege M ISA” (see Figure 2.2). Figure 2.12 illustrates the flow into the trap handler and back on an exception (an ILLEGAL instruction is one possible cause of the exception).



Figure 2.12: Trap and return flow

When the exception is detected, the hardware saves the faulting PC (PCx in the figure) in CSR MEPC, saves a cause-code in CSR MCAUSE and possibly one more piece of data (depending on the type of exception) in CSR MTVAL. Then, it sets the PC to the value in MTVEC, so that we start executing the trap-handler code.

When the trap-handler has finished, it executes an MRET instruction which copies the value in MEPC into the PC, continuing execution at that location.

Exception-cause codes relevant to this book (RV32I + exception handling) are shown in the table below (excerpted from Table 3.6 in the Privileged ISA Specification document).

| Exception-Cause code | Description                    |
|----------------------|--------------------------------|
| 0                    | Instruction address misaligned |
| 1                    | Instruction access fault       |
| 2                    | Illegal instruction            |
| 3                    | Breakpoint                     |
| 4                    | Load address misaligned        |
| 5                    | Load access fault              |
| 6                    | Store/AMO address misaligned   |
| 7                    | Store/AMO access fault         |
| ...                  | ...                            |
| 11                   | Environment call M-mode        |
| ...                  | ...                            |

The trap handler code can examine MCAUSE, MEPC and MTVAL to determine what it should do to handle the trap. Regarding MEPC, it may:

- Leave MEPC untouched, so that, on MRET, we retry the exceptional instruction.  
*Example:* the exception was a page-fault due to a FETCH on an unmapped page. The trap handler may map the page, and MRET to the faulting instruction to be retried (which, hopefully, should not page-fault again).
- Increment MEPC by 4, so that, on MRET, we resume at the next instruction after the faulting instruction.  
*Example:* the faulting instruction is a legal RISC-V instruction, but has deliberately been left unimplemented (e.g., to save hardware cost). The trap handler “emulates” the instruction (*i.e.*, performs the required computation using implemented instructions), and resumes the normal flow at the next instruction.  
*Example:* the trap handler just records that this instruction is illegal so that it can avoid it in subsequent execution.
- Change it to something else entirely, to abandon the current “normal” flow and do something else.  
*Example:* in an Operating System, we save MEPC for future resumption, and change it to give another process a chance to execute.
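The trap entry, the handler's treatment of MEPC, and the MRET can be modeled in a few lines of Python. This is a deliberately simplified sketch: the dictionary-based `hart` state and all function names are hypothetical, it ignores other CSRs that real hardware also updates, and it treats MTVEC as a plain address. The cause code 12 (instruction page fault) comes from the Privileged ISA Specification and lies beyond the excerpt in the table above.

```python
ILLEGAL_INSTRUCTION = 2       # from the table above
INSTRUCTION_PAGE_FAULT = 12   # from the Privileged spec, beyond the excerpt

def new_hart(pc, mtvec=0x100):
    return {"pc": pc, "csrs": {"mtvec": mtvec, "mepc": 0, "mcause": 0, "mtval": 0}}

def take_trap(hart, cause, tval=0):
    """What the hardware does when an exception is detected."""
    csrs = hart["csrs"]
    csrs["mepc"], csrs["mcause"], csrs["mtval"] = hart["pc"], cause, tval
    hart["pc"] = csrs["mtvec"]

def mret(hart):
    """What the hardware does on MRET."""
    hart["pc"] = hart["csrs"]["mepc"]

def trap_handler(hart):
    """Software: decide what MEPC should be, then return."""
    csrs = hart["csrs"]
    if csrs["mcause"] == ILLEGAL_INSTRUCTION:
        # (emulation of the instruction omitted) ... then skip past it
        csrs["mepc"] += 4
    # for any other cause, this sketch leaves MEPC untouched, so we retry
    mret(hart)

a = new_hart(0x2000)
take_trap(a, ILLEGAL_INSTRUCTION)
pc_in_handler = a["pc"]
trap_handler(a)               # resumes at 0x2004

b = new_hart(0x3000)
take_trap(b, INSTRUCTION_PAGE_FAULT, tval=0x3000)
trap_handler(b)               # retries at 0x3000
```

The third option in the list (changing MEPC to something else entirely) would simply be another assignment to `csrs["mepc"]` before the `mret`.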

### 2.7.1 ECALL and EBREAK instructions, and Interrupts

RV32I instructions ECALL and EBREAK are handled just like other exceptions, except that they are not “unexpected” conditions but exceptions deliberately induced by the program. The only notable feature is that they place certain exception-cause codes in the MCAUSE CSR (see “Environment call M-mode” and “Breakpoint”, respectively, in the table above).

In systems with operating systems, ECALL is used to move in a disciplined way from lower to higher privilege levels (from User mode to Supervisor mode, and from Supervisor Mode to Machine mode).

Some RISC-V implementations contain a hardware “Debug Module”, in which case EBREAK is treated differently (please see the RISC-V Debug Module specification document). It is used as part of the implementation of the “break” command in debuggers like GDB, LLDB and OpenOCD.

An external interrupt is also dispatched into the trap-handler just like exceptions, the only difference being the code placed in the MCAUSE register (see Table 3.6 in the Privileged ISA Specification document for all the defined interrupt cause codes).

### 2.7.2 CSRRxx instructions

In Section 2.7 we described how various CSRs are read and written during trap-handling. These reads and writes are performed *by the hardware* as part of taking a trap or performing an MRET.

But we also need a way to read and write CSRs *programmatically*, *i.e.*, from instructions in the program. For example, before any trap is taken, some early part of the execution needs to write the trap-vector’s PC into MTVEC. During trap-handling, the trap-handler needs to read (and possibly write) MEPC, MCAUSE and MTVAL.

Programmatic access to CSRs is provided by a family of six “CSRRxx” instructions. The following table is excerpted from the RISC-V Unprivileged ISA spec document (Chapter 9, on the “Zicsr” extension). Each CSRRxx instruction:

- reads a value $y$ from the CSR (whose 12-bit address is given in `instr[31:20]`);
- writes $y$ into a general-purpose register (`rd`);
- takes a value $x$ either by reading from a general-purpose register (`rs1`), or by using the `rs1` field itself as a literal 5-bit value and zero-extending it to XLEN bits;
- and updates the value in the CSR using $x$.

**9.1 CSR Instructions**

All CSR instructions atomically read-modify-write a single CSR, whose CSR specifier is encoded in the 12-bit *csr* field of the instruction held in bits 31–20. The immediate forms use a 5-bit zero-extended immediate encoded in the *rs1* field.

| Bits 31–20    | Bits 19–15   | Bits 14–12      | Bits 11–7   | Bits 6–0        |
|---------------|--------------|-----------------|-------------|-----------------|
| csr (12 bits) | rs1 (5 bits) | funct3 (3 bits) | rd (5 bits) | opcode (7 bits) |
| source/dest   | source       | CSRRW           | dest        | SYSTEM          |
| source/dest   | source       | CSRRS           | dest        | SYSTEM          |
| source/dest   | source       | CSRRC           | dest        | SYSTEM          |
| source/dest   | uimm[4:0]    | CSRRWI          | dest        | SYSTEM          |
| source/dest   | uimm[4:0]    | CSRRSI          | dest        | SYSTEM          |
| source/dest   | uimm[4:0]    | CSRRCI          | dest        | SYSTEM          |

Figure 2.13: CSRRxx instructions (from Unprivileged Spec)

Figure 2.14 illustrates the semantics of CSRRxx instructions. Some important points:



Figure 2.14: CSRRxx instruction semantics

- CSRRW and CSRRWI simply write the value  $x$  into the CSR.
- CSRRW and CSRRWI do not read the CSR if  $rd=0$ . Specifically, even *reading* a CSR can have side-effects for certain CSRs, so avoiding this read prevents those side-effects.
- In CSRRS and CSRRSI, for each “1” bit in $x$, the corresponding bit in the CSR is set to 1.
- In CSRRC and CSRRCI, for each “1” bit in $x$, the corresponding bit in the CSR is cleared to 0.
- CSRRS, CSRRSI, CSRRC and CSRRCI update the CSR only if the *rs1* field is not zero. Specifically, *writing* a CSR bit can have side-effects for certain CSRs (beyond just writing that bit), so avoiding this write prevents those side-effects.

For more details, please read Chapter 9 of the RISC-V Unprivileged ISA specification.
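The points above can be summarized in a small Python model of one CSRRxx instruction on RV32. The function name `csr_access`, its argument conventions, and the use of `None` to indicate “the CSR was not read” are choices made for this sketch only.

```python
MASK32 = 0xFFFFFFFF

def csr_access(op, immediate, csr_old, regs, rd, rs1_field):
    """Model CSRRW/S/C (immediate=False) and CSRRWI/SI/CI (immediate=True).

    op: "W", "S" or "C". Returns (value for rd or None, new CSR value).
    None means the CSR is not read at all (CSRRW/CSRRWI with rd = 0).
    """
    # x is either a register value or the zero-extended 5-bit rs1 field
    x = rs1_field if immediate else regs[rs1_field]
    if op == "W":
        rd_val = None if rd == 0 else csr_old
        csr_new = x
    else:
        rd_val = csr_old
        if rs1_field == 0:
            csr_new = csr_old            # no write at all
        elif op == "S":
            csr_new = csr_old | x        # set bits
        else:
            csr_new = csr_old & ~x       # clear bits
    return rd_val, csr_new & MASK32

regs = [0] * 32
regs[5] = 0b1010
set_result   = csr_access("S", False, 0b0101, regs, rd=1, rs1_field=5)
clear_result = csr_access("C", True,  0b1111, regs, rd=1, rs1_field=0b00011)
write_result = csr_access("W", False, 0x123,  regs, rd=0, rs1_field=5)
```

A fuller model would also report *whether* the CSR was written, since for CSRs with side-effects that matters independently of the value; this sketch only returns the resulting value.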

Three CSRs are often used to measure performance of a RISC-V CPU: INSTRET, CYCLE and TIME, and these are described in the following sections.

#### 2.7.2.1 CSRs INSTRET and INSTRETH

A RISC-V CPU contains a CSR to count the number of instructions that have executed completely (the technical term for which is “retired”). The count is maintained by the hardware, incrementing on each retired instruction, and can be read by software using a CSRRxx instruction.

The CSR is 64 bits wide, so on RV32 systems, reading the INSTRET CSR returns the lower 32 bits and reading the INSTRETH CSR (H for “high”) returns the upper 32 bits.

#### 2.7.2.2 CSRs CYCLE and CYCLEH

A RISC-V CPU contains a CSR to count the number of clock cycles that have elapsed. The count is maintained by the hardware, incrementing on each clock cycle, and can be read by software using a CSRRxx instruction.

The CSR is 64 bits wide, so on RV32 systems, reading the CYCLE CSR returns the lower 32 bits and reading the CYCLEH CSR (H for “high”) returns the upper 32 bits.

#### 2.7.2.3 CSRs TIME and TIMEH, and memory-mapped location MTIME

Most modern computers have a so-called “real-time clock” which is a circuit that maintains precise actual elapsed time. Unlike clock cycles, which may vary dynamically based on instantaneous performance and power-saving demands, a real-time clock advances at a fixed rate. Because real-time clock circuits can be complex and expensive, they are not required to be part of a RISC-V CPU itself. Instead, they are memory-mapped devices—the CPU can issue a LOAD instruction to a particular address called the “MTIME” address to read a number representing “ticks” of a real-time clock. By making it a memory-mapped device, a real-time clock can also be shared by multiple CPUs.

RISC-V specifies the real-time clock to be a 64-bit memory-mapped register. The exact rate at which it advances is not specified, as long as it is a constant rate. Typically it “ticks” every few nanoseconds.

In addition to accessing the real-time clock via LOAD instructions at the MTIME address, the CPU can also access it via CSRRxx instructions using the CSR address TIME. In RV64 this returns the entire 64-bit value. In RV32 it returns the lower 32 bits, and a CSRRxx access to the CSR address TIMEH returns the upper 32 bits.
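On RV32, software that wants the full 64-bit value of INSTRET, CYCLE or TIME must combine two 32-bit reads, and the counter may advance between them. A common idiom (read high, read low, re-read high, and retry if the high half changed) is sketched below in Python. The `read_csr` callback and the `TickingCounter` class are hypothetical stand-ins for the hardware, introduced only to make the sketch runnable.

```python
def read_counter64(read_csr, lo_name, hi_name):
    """Combine two 32-bit CSR reads into a consistent 64-bit value."""
    while True:
        hi1 = read_csr(hi_name)
        lo = read_csr(lo_name)
        hi2 = read_csr(hi_name)
        if hi1 == hi2:               # low half did not roll over in between
            return (hi1 << 32) | lo

class TickingCounter:
    """Hypothetical hardware model: the counter advances by one on every read."""
    def __init__(self, start):
        self.value = start

    def read_csr(self, name):
        self.value += 1
        if name.endswith("h"):       # e.g. "cycleh": upper 32 bits
            return self.value >> 32
        return self.value & 0xFFFFFFFF

hw = TickingCounter(0x0_FFFF_FFFE)   # about to roll over into the upper half
cycles = read_counter64(hw.read_csr, "cycle", "cycleh")
```

In this run the first attempt straddles the rollover (the high half reads 0, then 1), so the loop retries and returns `0x1_0000_0003`. Naively pairing the first high read with the first low read would have produced 0.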

#### 2.7.2.4 Measuring CPU performance

Using the INSTRET and CYCLE CSRs, software can measure a standard metric for CPU implementations, CPI (Cycles per Instruction) or its inverse IPC (Instructions per Cycle). These measures can vary depending on the current mix of instructions, because different instructions may take a different number of clock cycles to complete. For example, simple integer-op instructions finish in a few cycles, whereas the CPU may need more cycles to execute integer multiplication, division and square root, memory operations, floating-point operations, and system instructions. Memory operations may take fewer cycles on a cache-hit and TLB-hit and more cycles on a cache-miss or TLB-miss. Further, modern advanced CPUs can dynamically vary their clock speeds depending on the instantaneous demand for performance and/or power-saving.

By reading the TIME CSR before and after an application, software can measure the actual elapsed time to run a particular application on a particular input data-set on a particular RISC-V CPU implementation.
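As a small worked example of the CPI/IPC calculation, suppose software samples CYCLE and INSTRET before and after a region of code. The function name and the sample values below are invented for illustration.

```python
def cpi_and_ipc(cycle_start, instret_start, cycle_end, instret_end):
    """Compute (CPI, IPC) from two samples of the 64-bit counters."""
    # Subtract modulo 2**64 so that a counter wrap-around is handled.
    cycles = (cycle_end - cycle_start) % (1 << 64)
    instrs = (instret_end - instret_start) % (1 << 64)
    return cycles / instrs, instrs / cycles

cpi, ipc = cpi_and_ipc(cycle_start=10_000, instret_start=4_000,
                       cycle_end=11_500, instret_end=5_000)
```

Here 1500 cycles elapsed while 1000 instructions retired, so the CPI is 1.5 and the IPC is about 0.67.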

## 2.8 RV64I differences from RV32I

The Architectural State for RV64I is just like that for RV32I, except that the PC and the 32 registers are now 64 bits wide instead of 32 bits wide. Figure 2.15 reproduces the table of RV64I instructions from page 131 of the Unprivileged Spec document (version 20191213).

**RV64I Base Instruction Set (in addition to RV32I)**

| funct7 / imm | rs2 / shamt | rs1 | funct3 | rd / imm | opcode  | Instruction |
|--------------|-------------|-----|--------|----------|---------|-------------|
| imm[11:0]    |             | rs1 | 110    | rd       | 0000011 | LWU         |
| imm[11:0]    |             | rs1 | 011    | rd       | 0000011 | LD          |
| imm[11:5]    | rs2         | rs1 | 011    | imm[4:0] | 0100011 | SD          |
| 000000       | shamt       | rs1 | 001    | rd       | 0010011 | SLLI        |
| 000000       | shamt       | rs1 | 101    | rd       | 0010011 | SRLI        |
| 010000       | shamt       | rs1 | 101    | rd       | 0010011 | SRAI        |
| imm[11:0]    |             | rs1 | 000    | rd       | 0011011 | ADDIW       |
| 0000000      | shamt       | rs1 | 001    | rd       | 0011011 | SLLIW       |
| 0000000      | shamt       | rs1 | 101    | rd       | 0011011 | SRLIW       |
| 0100000      | shamt       | rs1 | 101    | rd       | 0011011 | SRAIW       |
| 0000000      | rs2         | rs1 | 000    | rd       | 0111011 | ADDW        |
| 0100000      | rs2         | rs1 | 000    | rd       | 0111011 | SUBW        |
| 0000000      | rs2         | rs1 | 001    | rd       | 0111011 | SLLW        |
| 0000000      | rs2         | rs1 | 101    | rd       | 0111011 | SRLW        |
| 0100000      | rs2         | rs1 | 101    | rd       | 0111011 | SRAW        |

Figure 2.15: RV64I instructions (in addition to RV32I)

RV64I starts with the same forty instructions as in RV32I, except that it replaces 3 of them (SLLI, SRLI and SRAI) with slight modifications: the “shamt” field is now 6 bits instead of 5, to accommodate 64-bit shifts.

In RV64I, the LW instruction loads 32 bits from memory, sign-extends the value to 64 bits and stores it in the destination register. RV64I adds the LWU instruction that does the same, except that it zero-extends the 32-bit value from memory. RV64I also adds the LD instruction to load 64 bits from memory into a register.

RV64I adds the SD instruction to store a 64-bit value from a register into memory.

In RV64I, ADDI, SLLI, SRLI, SRAI, ADD, SUB, SLL, SRL and SRA all operate on 64-bit values. RV64I adds ADDIW, SLLIW, SRLIW, SRAIW, ADDW, SUBW, SLLW, SRLW and SRAW to operate on the lower 32 bits of 64-bit register values.
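The difference between the 64-bit operations and their “W” counterparts, and between LW and LWU, can be seen in a short Python sketch. The function names are ours. The detail that the “W” instructions sign-extend their 32-bit result to 64 bits is taken from the Unprivileged ISA specification rather than from the text above.

```python
MASK64 = (1 << 64) - 1

def sext32_to_64(v):
    """Sign-extend the low 32 bits of v to a 64-bit pattern."""
    v &= 0xFFFFFFFF
    return (v | 0xFFFFFFFF00000000) if v & 0x80000000 else v

def add(a, b):        # RV64I ADD: full 64-bit addition
    return (a + b) & MASK64

def addw(a, b):       # RV64I ADDW: 32-bit addition, result sign-extended
    return sext32_to_64(a + b)

def lw(word32):       # value placed in rd by LW
    return sext32_to_64(word32)

def lwu(word32):      # value placed in rd by LWU
    return word32 & 0xFFFFFFFF
```

For example, `add(0x7FFFFFFF, 1)` is `0x0000000080000000`, whereas `addw(0x7FFFFFFF, 1)` is `0xFFFFFFFF80000000`, because the 32-bit result has its sign bit set. Similarly, `lw(0x80000000)` and `lwu(0x80000000)` differ only in the upper 32 bits.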

## **2.9 Continued Evolution of the RISC-V ISA (with your contribution?)**

The RISC-V ISA is not frozen. As shown in Figure 2.2, the ISA has consciously been designed in a modular way and to be modularly extensible. RISC-V International (RVI, <https://riscv.org>) runs an organized process for the continued maintenance and evolution of the ISA. Special-interest groups drawn widely from RVI membership constitute various committees under the aegis of RVI to propose, develop, and specify new extensions to the ISA (for cryptography, for image and video manipulation, for high-performance computing, for AI, and so on). There is a formal public-review and ratification process for any new proposed extensions.

RVI has the usual corporate membership tiers seen in many consortiums but, unusually, it also has inexpensive memberships for individuals (students, hobbyists, ...) and academic institutions. So please feel free to join, in order to monitor the RISC-V ecosystem closely or, even better, to actively contribute to its future.

# Chapter 3

## RISC-V interpreters: the Design Space from Software Functional Simulators to High-Performance Hardware

Any artefact/engine that executes the instructions of any ISA is an *interpreter* for that ISA. The classical meaning of an interpreter is an algorithm (program) that examines/traverses a data structure that is itself the representation of a target program, and performs actions accordingly. In our case, the target program is a RISC-V binary and the data structure is an array of RISC-V instructions. The algorithm examines RISC-V instructions in the array, conceptually one-instruction-at-a-time, and performs the instruction's actions.

Any algorithm can be implemented in software or in hardware. Further, the boundary is fluid: parts of the algorithm can be implemented in software, cooperating with other parts that are implemented in hardware (“accelerators”). The choice between software and hardware implementation is pragmatic (speed, power, cost, cost of debugging and modification, cost of redesign, *etc.*); functionally there is no theoretical difference.

When we implement an ISA interpreter in software, we call it a “simulator”. When we implement it in hardware, we call it a hardware implementation. Both software simulators and hardware implementations can vary widely in microarchitecture. Some design options are:

- Sequential or pipelined? One full instruction at-a-time, or multiple instructions flowing through a pipe, each at a more advanced step in its execution than the one behind it.
- Predictive (in pipelined implementations)? *E.g.*, predict what instructions to fetch while a BRANCH/JUMP flows through the pipe, before we know the actual next instruction determined by the BRANCH/JUMP.
- Superscalar/VLIW? Fetch and execute more than one instruction in parallel, taking care to preserve sequential ISA semantics.
- Out-of-order? Execute each instruction as soon as its input data is available, without waiting for prior instructions which may still be waiting for their inputs.

For the same microarchitecture, a software simulator is typically *much slower* than a hardware implementation. This is because it involves (at least) two layers of simulation. The software simulator is itself a program that is being interpreted, perhaps directly in hardware. That program (the simulator), in turn, is interpreting the target ISA. The two interpreters need not and may not be for the same ISA. For example, if we run a RISC-V software simulator on a modern server, the lower level may be an x86 or ARM interpreter (*i.e.*, the CPU in the server). A software simulator written in Python or Java involves three layers of ISAs, *e.g.*, hardware x86/ARM interpreting x86/ARM instructions representing a program to interpret bytecode (second level ISA), which, in turn represents an interpreter for RISC-V programs. Every additional layer of interpretation can slow down overall performance by possibly orders of magnitude.

Paradoxically, adding any of the microarchitectural details mentioned in the list above will normally slow down a software simulator but speed up a hardware implementation. This is because those microarchitectural details expose more *parallelism* and *concurrency* in the interpretation algorithm. Hardware implementations actually execute these parallel actions in parallel, whereas a software simulator (written, say, in C/C++) may execute them sequentially (*i.e.*, *modeling* parallelism but in fact being sequential). Of course, the extra hardware speed is not free: it needs more hardware and more complexity in the design (cost, power consumption).

### 3.1 The RISC-V designs in this book

In this book we will focus on two simple hardware implementations. Both designs are coded in BSV, a free, open-source, modern, High-Level Hardware Design Language (HLHDL). BSV code can be compiled into Verilog, which can then be run on any Verilog simulator, or can be further processed by FPGA tools to run on FPGAs, or by ASIC tools for ASIC implementations. For more discussion of our choice of BSV, please see Appendix B.

Our first hardware RISC-V implementation—“Drum”—will be a simple one-full-instruction-at-a-time interpreter, almost a direct transliteration into BSV code of the generic ISA execution algorithm to be described next in Section 3.2. It does not implement any interesting microarchitectural feature, not even pipelining, which is the most basic microarchitectural feature of most CPU implementations. Lacking microarchitectural features, in fact the BSV code will look very similar to what you might write in C/C++ for a purely functional RISC-V simulator. Being written in BSV, however, we can compile and run it on actual hardware (FPGAs, ASICs).

Drum will not be fast compared to other hardware CPUs, because it lacks microarchitectural features, but we should still be able to run it at several hundred MHz on an FPGA, which will make it faster than many software functional simulators. It will be small (silicon area, and therefore low power as well). Drum is covered from Chapter 7 through Chapter 11.

Our second implementation—“Fife”—adds pipelining. Pipelining introduces new complications because of potential interaction between instructions that are at different stages in the pipe. We can focus on these new complications because all the functional aspects of RISC-V ISA execution have already been addressed in Drum. In fact, we will reuse the functional code from Drum without change. Fife is covered in Chapter 16 through Chapter 17.

For both Drum and Fife, we will focus initially on only the RV32I option of the RISC-V ISA. Please refer to the specification document “The RISC-V Instruction Set Manual Volume I: Unprivileged ISA” [26]. In particular, look at Chapter 24 “RV32/64G Instruction Set Listings”, and the first table therein, entitled “RV32I Base Instruction Set”, showing forty instructions. These instructions are described in more detail in the same document in Chapter 2 “RV32I Base Integer Instruction Set, Version 2.1”.

We will extend this with just enough functionality to be able to recover from illegal instructions (*i.e.*, an instruction outside the set of forty RV32I instructions) and to handle interrupts. This minimal functionality will be taken from the specification document “The RISC-V Instruction Set Manual Volume II: Privileged Architecture”[27].

Beyond this book, we extend Drum and Fife to handle RV64I and more Unprivileged ISA options (M: integer multiply/divide, A: atomics, FD: single- and double-precision floating point, and C: compressed). We also handle more Privileged ISA options: privilege levels (M: Machine, S: Supervisor and U: User), the full complement of Control and Status Registers (CSRs), and Virtual Memory. With these extensions, Drum and Fife will be able to run a full-featured Operating System (OS), such as Linux.

### 3.2 Abstract algorithm for interpreting an ISA

From our previous study of the RISC-V ISA, we know that the basic integer “architectural state” of a RISC-V CPU is very simple:

- A “program counter” (PC) indicating the address in memory of the next instruction to be executed.
- A “register file” consisting of 32 general purpose registers (GPRs), each containing data.

The PC and each register are either 32-bits wide (in the RV32 option of RISC-V) or 64-bits wide (in the RV64 option). For simplicity, we’ll focus on RV32 here, but everything we discuss also applies to RV64.

Interpreting a program involves the repetition of a few simple steps,<sup>1</sup> illustrated in Figure 3.1:

- The “Fetch” step reads the current value of the PC and uses that value as an address in memory from which to read an instruction. Then, we proceed to the “Decode” step.
- The “Decode” step examines the fetched instruction to check if it is legal, to classify its major category (such as Control, Integer Arithmetic/Logic, or Memory), and to extract some properties such as which GPRs it reads (if any) and which GPR it writes (if any). Then, we proceed to the “Register-Read and Dispatch” step.
- The “Register-Read and Dispatch” step reads the GPRs for the instruction’s inputs. Then, we proceed to one of the “Execute” steps, based on the category of the opcode in the instruction (Branch/Jump, Integer Arithmetic/Logic, or Memory).

---

<sup>1</sup>We prefer the word “step” here instead of “stage”, which we will reserve to refer to stages in a hardware pipeline such as Fife.



Figure 3.1: Simple interpretation of RISC-V instructions

- The “Execute Control” step is used for conditional-branch and jump instructions. For the former, it evaluates the branch condition and, if true, updates the PC to the branch-target PC. For jump instructions it updates the PC to the jump-target PC. Then, it goes back to the Fetch step to interpret the next instruction.
- The “Execute Integer Arithmetic and Logic” step is used for integer arithmetic and logic operations (addition, subtraction, boolean ops, shifts, *etc.*). Then, we proceed to the “Register-Write and Increment PC” step.
- The “Execute Memory Ops” step calculates a memory address based on an input value (that was read from a GPR) and reads or writes memory at that address. Then, we proceed to the “Register-Write and Increment PC” step.
- The “Register-Write and Increment PC” step writes the result from the previous Execute step back into a GPR, and increments the PC. Then, it goes back to the Fetch step to interpret the next instruction.

Thus we repeat these steps forever, instruction after instruction, starting each time at the Fetch step.
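To make the loop concrete, here is a minimal Python sketch of the steps above. It is an illustration under simplifying assumptions, not the Drum design: it handles only three instructions (ADDI, ADD and BNE), models instruction memory as a Python list of 32-bit words starting at address 0, and simply stops when it meets any other instruction.

```python
MASK32 = 0xFFFFFFFF

def sext(value, bits):
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value

def run(program, max_steps=1000):
    pc, regs = 0, [0] * 32
    for _ in range(max_steps):
        # Fetch
        instr = program[pc // 4]
        # Decode
        opcode = instr & 0x7F
        rd, funct3 = (instr >> 7) & 0x1F, (instr >> 12) & 0x7
        rs1, rs2 = (instr >> 15) & 0x1F, (instr >> 20) & 0x1F
        # Register-Read and Dispatch
        v1, v2 = regs[rs1], regs[rs2]
        next_pc, result = pc + 4, None
        if opcode == 0x13 and funct3 == 0:                         # ADDI
            result = (v1 + sext(instr >> 20, 12)) & MASK32
        elif opcode == 0x33 and funct3 == 0 and instr >> 25 == 0:  # ADD
            result = (v1 + v2) & MASK32
        elif opcode == 0x63 and funct3 == 1:                       # BNE
            imm = (((instr >> 31) & 1) << 12) | (((instr >> 7) & 1) << 11) \
                | (((instr >> 25) & 0x3F) << 5) | (((instr >> 8) & 0xF) << 1)
            if v1 != v2:
                next_pc = (pc + sext(imm, 13)) & MASK32
        else:
            return pc, regs          # not handled in this sketch: stop here
        # Register-Write (writes to register 0 are ignored) and PC update
        if result is not None and rd != 0:
            regs[rd] = result
        pc = next_pc
    return pc, regs

program = [
    0x00500093,   # addi x1, x0, 5
    0x00000113,   # addi x2, x0, 0
    0x00110133,   # add  x2, x2, x1    <- loop
    0xFFF08093,   # addi x1, x1, -1
    0xFE009CE3,   # bne  x1, x0, loop  (offset -8)
    0x00000000,   # not a handled instruction: the sketch stops here
]
final_pc, final_regs = run(program)
```

The program sums 5+4+3+2+1 into x2, so on return `final_regs[2]` is 15, `final_regs[1]` is 0 and `final_pc` is 20. A real implementation would raise an illegal-instruction exception instead of stopping.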

### 3.2.1 Memory latency and split-phase memory transactions

In Figure 3.1, note that the memory accesses, Fetch—Memory—Decode and Execute DMem—Memory—Execute DMem, are each shown with two arrows, one going to and another returning from memory. This is because, in any computer system, memory access is never “instantaneous”. A request is sent to memory in one “clock tick” and the response is returned no earlier than the next clock tick. We also say: memory-access always has some *latency*.

In fact it may take several clock ticks, and a varying number of clock ticks. Memory accesses can go through *cache* systems, which can return data quickly on a “hit” and take several cycles on a “miss”. Memory accesses can go to locations in I/O devices, which may be much “further away” in the memory subsystem, again possibly taking many cycles. Further, whether it takes one tick or several is not predictable—it can depend on the address of the request, and on past history of accesses which may leave caches in various states. It can also depend on technology replacement: next year’s memory system may be faster than today’s.

Thus:

- It is best to think of the path to memory (requests) and back (responses) as a *pipeline* or *queue* (FIFO) of unspecified length. The arrows in Figure 3.1 are decorated with small FIFO icons to emphasize this view.<sup>2</sup>
- Our CPU should not depend on any particular memory latency, *i.e.*, it should be robust to changes and variations in latency.

We also say that all memory transactions are *split-phase*: a request, followed later by a response (perhaps significantly later).
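The following Python sketch models a split-phase, FIFO-ordered memory with a configurable latency, and a client that is written so as not to depend on that latency. The class and function names are hypothetical, and the model covers reads only.

```python
from collections import deque

class SplitPhaseMemory:
    """Read-only memory model: requests go in, responses come out later, in order."""
    def __init__(self, data, latency):
        self.data = data
        self.latency = latency          # in clock ticks, at least 1
        self.in_flight = deque()        # FIFO of (ready_tick, value)
        self.now = 0

    def request(self, addr):
        self.in_flight.append((self.now + self.latency, self.data[addr]))

    def tick(self):
        self.now += 1

    def response(self):
        """Return the oldest response if it is ready, else None."""
        if self.in_flight and self.in_flight[0][0] <= self.now:
            return self.in_flight.popleft()[1]
        return None

def read_all(mem, addrs):
    """Issue all requests, then collect responses whenever they arrive."""
    for a in addrs:
        mem.request(a)
    results = []
    while len(results) < len(addrs):
        mem.tick()
        r = mem.response()
        if r is not None:
            results.append(r)
    return results

data = [10, 11, 12, 13]
fast = read_all(SplitPhaseMemory(data, latency=1), [3, 0, 2])
slow = read_all(SplitPhaseMemory(data, latency=7), [3, 0, 2])
```

Both runs return `[13, 10, 12]`; only the number of ticks taken differs. The client never counts ticks, it only asks whether a response is available, which is what makes it robust to changes in latency.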

## 3.3 Plan for the order in which we tackle topics

This book serves two concurrent purposes: learning how to implement the RISC-V ISA and, specifically, how to implement it by coding it in BSV (“BSV learning”). The order in which we tackle topics is guided by the BSV-learning purpose, not by the step-by-step organization of Figure 3.1.

At the center of each step in Figure 3.1 are pure functions to decide what kind of instruction each 32-bit instruction is, perform arithmetic instructions, calculate addresses in memory, calculate conditions on whether to branch or not, etc. These pure functions are “combinational” functions, which we tackle in the next couple of chapters.

Note, we are *not* going to descend to the level of simple logic gates, how to optimize them, or how to implement higher-level combinational functions such as adders and multiplexers in terms of gates. These activities are today routinely handled by excellent compilers (“synthesis tools”). Our lowest-level combinational circuits, the ones we take as primitives, will be so-called “RTL-level” operators for arithmetic, shifts and logic on bit-vectors and booleans (`+`, `-`, `<<`, `>>`, `&&`, `||`, `^`, `!`, `~`, and so on).

---

<sup>2</sup>Advanced memory systems may not even be FIFO-like; the CPU may receive responses in a different order from the original requests, in the interests of returning a response as soon as the data is available, rather than waiting for its turn. In this book we will assume FIFO-like, in-order memory-accesses.



# Chapter 4

## BSV: Top-level view of a BSV program

## 4.1 Introduction

In this chapter we describe the top-level view of a BSV program. The goal is to quickly develop an ability to scan BSV code (from Fife or Drum or any other example) and to discern the structure and relationships of the components in the code.

At the syntactic level, a BSV design can be organized into multiple source files, each containing a BSV “package”. Packages contain top-level definitions of types, values, functions, interfaces and modules. The names of these entities are local to the package. Components defined in one package are visible in other packages using an “import” statement.

At the semantic level, a BSV design is organized into hardware “modules”. Each module presents, to the outside world (the environment), an “interface” containing “methods”, which constitute the API by which the environment interacts with the module.

A module may contain “rules”, which are independent processes. A rule interacts with instances of other modules by invoking their interface methods.

The sections of this chapter present an overview of these major components.

## 4.2 Packages and files

Like many programming languages, BSV has facilities to organize large programs into separately compilable, purposeful, reusable parts. A BSV program may be organized into one or more *files* or *packages*. This is illustrated in Figure 4.1.

A package can contain a number of top-level statements (more detail about this in Section 4.2.1). One kind of top-level statement is the “import” statement, illustrated in the diagram. This allows an importing package  $p_1$  to use all the identifiers exported by a package  $p_2$  (in addition to identifiers defined in  $p_1$  itself).

There is a one-to-one correspondence between files and packages. In each import statement in the diagram we provide a package name $p$; the *bsc* compiler uses the imported package name $p$ to find a file called $p$`.bsv` containing the definition of package $p$.



Figure 4.1: File-level view of a BSV program

### 4.2.1 What's in a Package?

Figure 4.2 illustrates the typical structure and contents of a package/file.



Figure 4.2: What's in a BSV package/file?

The `package-endpackage` keywords bracketing the text—the first and last items in the file—are optional; if omitted, the `bsc` compiler will implicitly create a package name using the file name. We recommend always using `package-endpackage`, except perhaps for small, experimental, one-off programs to test some small concept.

At the top-level of a package we find import/export statements, type declarations, value definitions, function definitions, interface declarations and module definitions.

Import/export statements and type and interface declarations are only allowed at the top-level of a package and not inside any nested scopes. An interface declaration is just a kind of type declaration; its declared interface identifier is used just like a type.

Value, function and module definitions appear at package top-level, but can also be inside nested scopes (inside function and module definitions, or other local scopes like `begin-end` and `action-endaction` blocks). Function and module definitions are, in fact, just value definitions of functional and module type, respectively.

The top-to-bottom order of entities in a package is not important, except that if an entity $x$ is defined in the package and used by another entity $y$ in the same package, then $x$ should be defined before $y$. We typically place import/export statements at the top of the file.

Each package/file is separately compiled by the *bsc* compiler, which saves information about the compilation so that it won't be recompiled unless the source file has subsequently been modified.

### 4.2.2 Visibility of names, exports and imports

The `package`, `import` and `export` statements work just like in SystemVerilog (in fact, using the same syntax). The `export` and `import` statements control visibility of names across packages. This is illustrated in Figure 4.3.



Figure 4.3: Namespace control with imports and exports

A package `P` that defines a name  $x$  can make it visible outside using an `export` statement. As a shorthand, if there is no `export` statement, then *all* the names defined in a package are made visible outside. For convenient readability, if there are many explicitly exported names, they can be provided using multiple `export` statements.

A package `Glurph` can also *re-export* names it has imported from some other package `Phrym`.

If a package `Phoux` needs to use a name  $x$  defined in some other package `Baz`, it must explicitly “`import`” the package with the syntax:

```
import Baz :: *;
```

which makes all the names exported by `Baz` available for use in `Phoux`.

**NOTE:** In a SystemVerilog `import` statement, instead of “`*`” one can list just the names actually needed; one does not have to import all the names exported by another package. This selective import is not currently supported in BSV.

### 4.2.3 Resolving ambiguous imports

If a package `Phoux` imports two other packages `Baz` and `Glurph`, and both those packages define an identifier `x`, then the identifier `x` is ambiguous in `Phoux`. This can always be resolved by replacing any use of `x` in `Phoux` by a so-called *fully qualified name*, `Baz::x` or `Glurph::x` to identify exactly which `x` is intended.
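As a small sketch of qualified names using standard library packages (rather than the hypothetical `Baz` and `Glurph`), the packages `Vector` and `List` from the *bsc* libraries each export a function called `replicate`. The package name `QualifiedNamesSketch` and the identifiers `v_ones` and `l_ones` are invented for this example.

```bsv
package QualifiedNamesSketch;

import Vector :: *;
import List   :: *;

// Both imported packages export a function called 'replicate';
// the fully qualified names say exactly which one is meant.
Vector #(4, Bit #(8)) v_ones = Vector::replicate (8'hFF);
List #(Bit #(8))      l_ones = List::replicate (4, 8'hFF);

endpackage
```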

### 4.2.4 Exporting types abstractly

When exporting a struct type `S` or enum type `E` using `export`, there is a difference between these two ways of exporting:

|                              |            |                        |
|------------------------------|------------|------------------------|
| <code>export S (..);</code> | <i>vs.</i> | <code>export S;</code> |
| <code>export E (..);</code> | <i>vs.</i> | <code>export E;</code> |

In the versions on the left the field names of the struct and the labels of the enum are also made visible; on the right, they are not. The latter case is useful in defining a so-called “*abstract data type*”, *i.e.*, a type that is known outside the package but whose internal details are hidden.

Since an interface is just like a struct type, if we explicitly export it we typically use:

```
export M_IFC (..);
```

since we normally want all the interface methods in the interface to be visible.
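The following sketch shows how these export forms might look together in one package. All names (`ExportSketch`, `Color`, `Point`, `make_point`, `point_x`) are hypothetical. `Color` is exported with its labels, while `Point` is exported abstractly, so importing packages can build and inspect points only through the two exported functions.

```bsv
package ExportSketch;

export Color (..);     // enum labels Red, Green, Blue are visible to importers
export Point;          // abstract: the field names x and y stay hidden
export make_point;     // importers construct a Point only via this function
export point_x;        // ... and read the x coordinate only via this one

typedef enum { Red, Green, Blue } Color deriving (Bits, Eq, FShow);

typedef struct { Int #(16) x; Int #(16) y; } Point deriving (Bits, Eq);

function Point make_point (Int #(16) x, Int #(16) y);
   return Point { x: x, y: y };
endfunction

function Int #(16) point_x (Point p);
   return p.x;
endfunction

endpackage
```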

## 4.3 Interface and Module Declarations

### 4.3.1 What’s in an interface declaration?

A BSV interface declaration can appear as a top-level declaration in a package. It is a syntactic specification of the “API” (Application Programming Interface) of one or more BSV modules, *i.e.*, it names the *methods* offered by a module, and their arguments and result types. Figure 4.4 shows examples of what one might find in an interface declaration (a top-level declaration inside a package).

The example declaration introduces a new interface type, `Baz_IFC`. Interface types may or may not have type parameters; the example shows two. Each type parameter can be of “numeric” type or ordinary “value” type. Numeric types are typically used for “sizes”; for example, the type “`Bit#(16)`” applies the type constructor “`Bit`” to the numeric type “16” to describe the type of bit-vectors of width 16 bits.

In places where we *use* this newly declared interface type, it may look something like this: `Baz_IFC#(3,Bool)`.

The order in which the method/sub-interface declarations are given is not important.

Figure 4.4: Typical components in an interface declaration

Some syntax notes for Figure 4.4:

- Syntax keywords: `interface`, `endinterface`, `method`, `numeric`, `type`
- Type names (begin with an uppercase letter):
  - `Action`, `ActionValue`, `Bit`, `Int`, `Bool` (all standard types in the *bsc* library)
  - `FIFO_O` (defined in an additional BSV library, and discussed in Section 8.5.4)
- Type variables (begin with a lower case letter): `n`, `t`
- Ordinary variables (begin with a lower case letter): `m1`, `x`, `y`, `z`, `m2`, `m3`, `fo_tags`
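For readers who want to see the shape of such a declaration in text form, here is a minimal sketch. It is not the contents of Figure 4.4; the interface name `Stack_IFC` and its methods are invented for illustration. It has one numeric type parameter and one ordinary type parameter, and one method of each kind.

```bsv
package InterfaceSketch;

// Hypothetical interface: n is a numeric type (a size), t is an ordinary type
interface Stack_IFC #(numeric type n, type t);
   method Action           push (t x);   // Action method: changes state
   method ActionValue #(t) pop;          // ActionValue method: changes state, returns a value
   method t                top;          // value method: returns a value, no state change
   method Bit #(n)         count;        // value method: an n-bit occupancy count
endinterface

endpackage
```

A use of this type might be written `Stack_IFC #(3, Bool)`, *i.e.*, a stack of `Bool` values with a 3-bit count.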

### 4.3.2 What's in a module declaration?

A module declaration specifies one of possibly many *implementations* of a module interface. If we think of an interface type as an API, then a module declaration describes an object (an actual circuit in hardware) that offers that interface. Figure 4.5 shows the typical contents of a module declaration.



Figure 4.5: Typical components in a module declaration

The order in which the components inside the module are given is not important except:

- If an item  $i_2$  uses an item  $i_1$  then  $i_1$ 's definition should precede  $i_2$ .
- By convention, sub-module instantiations are given first (the STATE section) followed by rules (the BEHAVIOR section).
- Method and sub-interface definitions must be the last items in the module (the INTERFACE section).

Some syntax notes for Figure 4.5:

- Syntax keywords: `module`, `endmodule`, `function`, `endfunction`, `rule`, `endrule`, `method`, `endmethod`, `interface`
- Type names (begin with an uppercase letter): `Bool`, `Baz_IFC`, `Int`, `Reg`, `FIFO`
- Ordinary variables (begin with a lower case letter): `mkBaz`, `verbosity`, `a`, `x`, `mkReg`, `f_tags`, `mkFIFO`, `foo`, `r1_R1`, `m1`, `fo_tags`
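The STATE, BEHAVIOR, INTERFACE ordering can be sketched with a small free-running counter. This is not the module shown in Figure 4.5; the names `Counter_IFC`, `mkCounter`, `rg_count` and `rl_incr` are hypothetical.

```bsv
package ModuleSketch;

interface Counter_IFC;
   method Action   clear;
   method Bit #(8) value;
endinterface

(* synthesize *)
module mkCounter (Counter_IFC);
   // STATE: sub-module instantiations
   Reg #(Bit #(8)) rg_count <- mkReg (0);

   // BEHAVIOR: rules
   rule rl_incr;
      rg_count <= rg_count + 1;
   endrule

   // INTERFACE: method definitions, last in the module
   method Action clear;
      rg_count <= 0;
   endmethod

   method Bit #(8) value;
      return rg_count;
   endmethod
endmodule

endpackage
```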

## 4.4 Rules and Interface Definitions

### 4.4.1 What's in a rule?

A “rule” is the fundamental behavioral construct in BSV<sup>1</sup>. Zero or more rules may appear at the top-level of a BSV module. Figure 4.6 illustrates the typical components of a rule.



Figure 4.6: Typical components of a rule

A rule condition is also known as its “explicit condition” and is always an expression of type `Bool` (and, therefore, by BSV’s strong type-checking rules, it cannot have a side-effect, *i.e.*, it cannot contain an Action).

The rule body as a whole is an expression of type `Action`, and may contain many sub-actions, also of type `Action` (`Action` is a recursively defined type).

BSV rules are discussed in more detail in Chapters 14 and 18. Rules are used in the Fife CPU (Chapter 17) and in a version of the Drum CPU (Chapter 15).

Some syntax notes for Figure 4.6:

- Syntax keywords: `rule`, `endrule`, `let`
- All the identifiers in the example are ordinary variables, beginning with a lower case letter.
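As a minimal sketch of rules with explicit conditions (not the rule shown in Figure 4.6), the following module has two rules whose conditions test a register. The module and rule names are hypothetical, and `$display` and `$finish` are used only so that the behavior is visible in simulation.

```bsv
package RuleSketch;

(* synthesize *)
module mkRuleSketch (Empty);
   Reg #(Bit #(4)) rg_step <- mkReg (0);

   // Can fire whenever its explicit condition (rg_step < 5) is true
   rule rl_count (rg_step < 5);
      let next = rg_step + 1;
      $display ("rl_count: step %0d -> %0d", rg_step, next);
      rg_step <= next;
   endrule

   // Can fire only once rl_count has stopped being enabled
   rule rl_done (rg_step == 5);
      $display ("rl_done");
      $finish (0);
   endrule
endmodule

endpackage
```

The body of each rule is a collection of Actions (a register write and system-task calls), while each rule condition is a pure `Bool` expression.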

### 4.4.2 What's in an interface definition?

Section 4.3.1 discussed interface type *declarations*. Here we discuss interface *definitions*. The former just specifies the names of the interface methods and their argument and result types (like a C “function prototype”). The latter specifies how the methods are actually implemented in a particular module (like a full C function).

<sup>1</sup>Whereas “`always @(posedge CLK)`” is the fundamental behavioral construct in Verilog and SystemVerilog.

Interface definitions occur at the top-level of a module body, at the end of the module body (just before the `endmodule` keyword). Interface definitions must provide a definition for each method and sub-interface mentioned in the interface type declaration (the `bsc` compiler will issue a warning if any of them are left undefined).

Figure 4.7 illustrates typical components in an interface definition.

Figure 4.7: Typical components of an interface definition

All method definitions have an *implicit condition* that is always an expression of type `Bool` and that may use identifiers defined earlier inside the module. If the implicit condition phrase is missing (as in the second method in the diagram), it is taken to be constantly true.

Action and ActionValue methods can contain Actions (*e.g.*, method `init`). Value methods cannot contain Actions (*e.g.*, method `read_epc`).

ActionValue and Value-methods must have a `return` statement specifying the returned value (*e.g.*, method `read_epc`).

Some syntax notes for Figure 4.7:

- Syntax keywords: `method`, `endmethod`, `if`, `return`
- Type names (begin with an uppercase letter): `Action`, `Bit`
- All the remaining identifiers in the example are ordinary variables, beginning with a lower case letter.
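To make implicit conditions concrete, here is a sketch of a one-element buffer. It is not the example in Figure 4.7; `Slot_IFC`, `mkSlot` and the register names are hypothetical. The `put` and `take` methods carry implicit conditions, and the `is_full` method has none, so it is always enabled.

```bsv
package MethodSketch;

interface Slot_IFC;
   method Action                   put (Bit #(32) x);
   method ActionValue #(Bit #(32)) take;
   method Bool                     is_full;
endinterface

module mkSlot (Slot_IFC);
   Reg #(Bool)      rg_full <- mkReg (False);
   Reg #(Bit #(32)) rg_data <- mkRegU;

   // Action method; implicit condition: the slot must be empty
   method Action put (Bit #(32) x) if (! rg_full);
      rg_data <= x;
      rg_full <= True;
   endmethod

   // ActionValue method; implicit condition: the slot must be full
   method ActionValue #(Bit #(32)) take () if (rg_full);
      rg_full <= False;
      return rg_data;
   endmethod

   // Value method; no implicit condition phrase, so it is constantly true
   method Bool is_full;
      return rg_full;
   endmethod
endmodule

endpackage
```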

## 4.5 Static Elaboration and Hardware Module Structure

In developing a piece of software we think of two distinct phases, *static* and *dynamic*. The former is what the compiler does: parsing, type-checking, analysis and transformation, optimization and code-generation. The latter is what happens when we actually run the object code produced by the compiler: allocate a stack-frame for the top-level function, start executing it, and allocate and de-allocate stack frames as we call and return from functions. Even in so-called interpreted languages (such as Python, Tcl, Perl, JavaScript) these two phases exist, although they may be performed adjacently one after the other and may not be distinguishable to the user.

In BSV (and in Verilog, SystemVerilog and VHDL), there are *three* distinct phases: static (compilation), *static-elaboration* and dynamic. The compilation phase is just like in a software language compiler: parsing, type-checking, analysis and transformation, optimization and code-generation.

The new phase is static elaboration. We can think of the input to this phase as a collection of (parsed, type-checked) module definitions. One of these is designated as the *top-level module*,  $M_{top}$ . The static elaboration phase creates a hierarchy (a tree-structure) of module *instances*, starting with an instance  $MI_{top}$  of  $M_{top}$  at the root of the hierarchy. Effectively, it *unfolds* the module definitions into an actual hardware structure.

The tool creates an instance $MI_{top}$ of the $M_{top}$ module definition. Inside $M_{top}$, the code may specify instantiations of sub-modules $M_1, M_2, \dots$. The tool creates the instances $MI_1, MI_2, \dots$ and connects them into $MI_{top}$. This process is followed recursively for sub-modules of $M_1, M_2, \dots$ until the tool reaches “leaf” modules that do not instantiate any sub-modules. This is illustrated in Figure 4.8.



Figure 4.8: Static elaboration

Note, as illustrated in the diagram, a module definition  $M_j$  may be instantiated more than once,  $MI_{ja}, MI_{jb}, MI_{jc}, \dots$  in various places in the hierarchy. In particular, the `mkReg` (register) and `mkFIFO` (FIFO) modules may be instantiated dozens of times, in dozens of modules.

In simulation (Bluesim or Verilog simulation), this static elaboration is performed once, at the start of the simulation.

When synthesizing to FPGA or ASIC, this static elaboration is performed by the synthesis tool once, at the start of its operations.
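The unfolding of module definitions into a hierarchy of instances can be sketched as follows. The module names `mkLeaf` and `mkTop` are hypothetical. The `for` loop is not a hardware loop: it is executed during static elaboration and leaves behind four separate `mkLeaf` instances, each containing its own `mkReg` instance.

```bsv
package ElabSketch;

// A leaf-level user module; its only sub-module is a library register.
// The parameter 'id' is a compile-time Integer, used here as the reset value.
module mkLeaf #(Integer id) (Empty);
   Reg #(Bit #(8)) rg <- mkReg (fromInteger (id));
endmodule

(* synthesize *)
module mkTop (Empty);
   // Static elaboration unrolls this loop into 4 distinct instances
   for (Integer j = 0; j < 4; j = j + 1) begin
      Empty leaf <- mkLeaf (j);
   end
endmodule

endpackage
```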

### 4.5.1 Module interaction via methods

Figure 4.9 illustrates the fundamental *behavioral* structures in hardware produced from BSV. Modules contain rules, and rules interact with other modules by invoking the methods in their interfaces.

Figure 4.9: Module interaction *via* methods

At this level of abstraction, this looks just like objects and interaction between objects in an object-oriented programming language, and this is reasonable as an approximate mental model of what happens in a BSV program. Some differences in BSV compared to traditional object-oriented programming languages:

- Objects cannot be allocated dynamically (modules cannot be instantiated dynamically). All module instantiation is done once, during static elaboration.
- Rules in BSV modules are (potentially) infinite processes. A rule can “fire” repeatedly whenever its explicit condition and the implicit conditions of the methods it invokes are true.
- All methods in BSV have implicit conditions that dictate when they can be invoked.
- Each rule “firing” is *atomic*—all its state updates are performed together, conceptually at the same instant, and semantically either before or after all other rule firings, *i.e.*, there is never any interleaving of the actions in one rule firing with the actions in another rule firing.

Atomicity is the most powerful principle known in Computer Science for reasoning about correctness in the presence of concurrency (such as rules in a BSV program or multithreading in a concurrent software program). It is one of the key features that distinguishes BSV from Verilog, SystemVerilog, and VHDL (indeed possibly all other hardware design languages).
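A classic small illustration of “all state updates performed together” is a register swap written without a temporary variable. The sketch below uses hypothetical names (`mkSwap`, `rg_x`, `rg_y`, `rg_n`); the counter `rg_n` exists only to stop the simulation after three firings.

```bsv
package AtomicitySketch;

(* synthesize *)
module mkSwap (Empty);
   Reg #(Bit #(8)) rg_x <- mkReg (1);
   Reg #(Bit #(8)) rg_y <- mkReg (2);
   Reg #(Bit #(2)) rg_n <- mkReg (0);

   rule rl_swap (rg_n < 3);
      // Both reads see the values from before this rule firing;
      // both writes take effect together, so no temporary is needed.
      rg_x <= rg_y;
      rg_y <= rg_x;
      rg_n <= rg_n + 1;
      $display ("before swap: x = %0d, y = %0d", rg_x, rg_y);
   endrule

   rule rl_finish (rg_n == 3);
      $finish (0);
   endrule
endmodule

endpackage
```

In a sequential software language the two assignments would leave both variables holding the same value; here the values of `rg_x` and `rg_y` exchange on every firing.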

## 4.6 Conclusion

This chapter has given a top-level view of BSV programs, starting with a collection of files. With this knowledge, it should be possible to peruse a BSV program in multiple files in multiple directories (such as Drum or Fife) and start understanding its major structural components and how they are put together, *i.e.*, this chapter gives you the top-level structural “vocabulary” of BSV programs. It does not yet describe *how* the resulting hardware works; that will begin in subsequent chapters.



# Chapter 5

## BSV: Combinational circuits for the RISC-V step functions

## 5.1 Introduction

It is useful to start with the Decode step of Figure 3.1 because it involves bit-vectors, operations on bit-vectors, conditionals to classify instructions into classes, and `enum` types to name and encode instruction classes.

The inputs to the Decode step as depicted in Figure 3.1 are:

- A 32-bit piece of data—a RISC-V instruction—that has become available by reading it from memory at the PC address.<sup>1</sup>
- Any additional information passed on from the Fetch step.

The outputs of the Decode step have information needed by the next step (Register-Read and Dispatch). For a RISC-V instruction, useful information includes:

- Was the Fetch itself successful, or did it encounter a memory error; if so, what kind of memory error?
- Is it a legal 32-bit instruction?
- If legal, what is its broad classification: Control (Branch or Jump)? Integer Arithmetic or Logic? Memory Access? This will help in choosing the next step to which we must dispatch to execute the instruction.
- Does it have zero, one or two input registers? If so, which ones? This will help the next step in reading registers.
- Does it have zero or one output registers? If so, which one? This will help the final Register Write step in writing back a value to a register.

To compute these values, we need to examine “slices” of the 32-bit instruction (“bit vector”), such as the 7-bit “opcode” slice, the 5-bit “rs1”, “rs2” and “rd” slices, and so on. We need to be able to compare these slices to constants (*e.g.*, “Is the opcode a BRANCH opcode?”). We need to do things conditionally, *e.g.*, if it is a BRANCH instruction, then it has an rs1 and rs2 slice but no rd slice, but if it is a JAL instruction it has an rd slice but no rs1 or rs2. Finally, as in all good programming languages, we need to be able to package all this functionality inside a “function” with clearly specified input(s) and output(s). In the next several sections (5.2, 5.5, 5.6 and 5.11) we will learn the BSV concepts needed to code these ideas.

---

<sup>1</sup>When implementing the so-called “C” RISC-V ISA extension (“compressed instructions”), instructions can also be 16 bits wide, but we ignore that for now.

## 5.2 Bit Vectors

In BSV, as in many programming languages, every value has a *type*. The simplest, and lowest-level type in BSV is the bit-vector (a vector made up of a particular number of bits). Later we will see that in BSV one can define more abstract types such as integers, booleans, vectors and arrays, lists, structs (records), tagged unions (algebraic types), trees, and so on. However, ultimately, any such value is represented in hardware as a bit-vector.

The BSV statement:

```
Bit #(32) pc_val;
```

declares the identifier `pc_val` to have the type `Bit#(32)`, *i.e.*, a bit-vector of 32 bits. The general syntax is similar to C or Verilog:

```
type identifier;
```

The BSV type `Bit#(32)` is roughly equivalent to the C type `uint32_t`. Unlike C, where only a few sizes are available, all multiples of 8 bits (`uint8_t`, `uint16_t`, `uint32_t` and `uint64_t`), bit-vectors in BSV can have any size (`Bit#(3)`, `Bit#(51)`, `Bit#(512)`, ...).

The bits in a BSV bit-vector of size  $n$  are indexed from  $n - 1$  (most-significant bit) to 0 (least-significant bit). You can extract a *slice* of a bit-vector using usual Verilog notation:

```
Bit #(32) pc_val;
Bit #(12) page_offset = pc_val [11:0];
```

In the second line, we extract 12 bits of `pc_val` to get a bit-vector of size 12. BSV is *strongly typed* with respect to sizes, *i.e.*, it is very strict about matching sizes. For example, this statement:

```
Bit #(12) page_offset = pc_val [10:0];
```

will be reported as a type-error by the `bsc` compiler because the slice-expression on the right-hand side has type `Bit#(11)` which does not match the declared type `Bit#(12)`.
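As a slightly larger sketch of slicing, the following package extracts the standard fields of a RISC-V R-type instruction word. The package name and the identifier names are chosen for this example; the constant is the encoding of `ADD x3, x1, x2`, and the field positions are those given in the RISC-V Unprivileged ISA specification.

```bsv
package SliceSketch;

// Example 32-bit instruction word: ADD x3, x1, x2
Bit #(32) instr = 32'h_0020_81B3;

// Each slice's width must match its declared type exactly
Bit #(7) opcode = instr [6:0];      // 7'b_011_0011
Bit #(5) rd     = instr [11:7];     // 3
Bit #(3) funct3 = instr [14:12];    // 0
Bit #(5) rs1    = instr [19:15];    // 1
Bit #(5) rs2    = instr [24:20];    // 2
Bit #(7) funct7 = instr [31:25];    // 0

endpackage
```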

### 5.2.1 Built-in Operators on Bit Vectors

BSV bit-vectors can be compared for equality and inequality. BSV bit-vectors are synonymous with unsigned integers, and so a number of other operations are also available on bit-vectors. Examples:

```

Bit #(12) x, a, b, c, d, e, f;

// Comparison ops: result type is Bool
if (a == b) ...;          // equality
if (a != b) ...;          // not-equal to
if (a < b) ...;           // less-than
if (a <= b) ...;          // less-than-or-equal-to
if (a > b) ...;           // greater-than
if (a >= b) ...;          // greater-than-or-equal-to

// Arithmetic ops: result type is Bit #(12)
x = a + b - c * d;        // add, subtract, multiply

// Bitwise logic ops: result type is Bit #(12)
//   & is AND, | is OR, ~ is unary INVERT, ^ is XOR, ^~ and ~^ are XNOR
x = a & b | (~ c) ^ d ^~ e ~^ f;

// Shifts
x = (a << 3) & (b >> 14); // left- and right-shift

```

Please see the *BSV Language Reference Guide* [2], Section 10.3, “Unary and binary operators” for a full list of available unary and binary operators. Unlike Haskell, in BSV you cannot define new unary or binary infix operators.

In such expressions, as usual, bit-vector sizes must match exactly, else we’ll get a type error, *e.g.*, we cannot compare a `Bit#(12)` value with a `Bit#(11)` value. Unlike C and Verilog, BSV does not implicitly extend or truncate bit-vectors to match sizes.

Two functions are available to zero-extend and truncate bit-vectors.

```

Bit #(12) a;
Bit #(10) b;
b = a;                       // Type error: mismatched sizes
a = b;                       // Type error: mismatched sizes
b = truncate (a);            // Ok; truncates a to Bit #(10), then assigns
a = zeroExtend (b);          // Ok; extends b to Bit #(12), then assigns
if (a == zeroExtend (b)) ... // Ok
if (truncate (a) < b) ...    // Ok

```

The functions `truncate()` and `zeroExtend()` are *polymorphic* in that they will truncate/extend by the appropriate amount as demanded by the context.

## 5.3 Integer types

BSV has four integer types, written as follows:

```

Bit #(n)      // bit-vectors, bounded to n bits
Int #(n)      // signed integers, bounded to n bits
UInt #(n)     // unsigned integers, bounded to n bits
Integer       // Mathematical integers (unbounded, no bit-width limit)

```

The most common type used in processor design (and possibly in all hardware design) is `Bit#(n)`. `UInt#(n)` can be used to represent unsigned integers, $n$ bits wide. But since essentially the same operators (`+`, `-`, `&`, `|`, shifts, ..., see Section 5.2.1) are defined on both these types, this author mostly uses `Bit#(n)` for unsigned integers, and rarely uses `UInt#(n)`.

`Int#(n)` is used whenever we need to represent negative numbers and perform signed operations. These are represented in bits in hardware in “2’s complement” representation (see <https://en.wikipedia.org/wiki/Two%27s_complement>).

`Integer` is used for true mathematical integers, *i.e.*, they are unbounded, ranging from minus infinity to plus infinity (in practice, limited by the amount of memory in your computer!). Being unbounded, they cannot be represented in any fixed-size hardware. `Integer` is used only for compile-time integers, where we do not need to fix a bound.

Fixed-width integer types all “wrap-around”. For example, if we add 1 to a `Bit#(3)` value `3'b_111`, the result will be `3'b_000`; if we subtract 1 from a `Bit#(3)` value `3'b_000`, the result will be `3'b_111`.

Note that all four type names begin with an upper-case letter. This is true of all types in BSV: type names and enumeration constants begin with an upper-case letter, other identifiers begin with a lower-case letter. (See also Section 5.9.)
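The difference between `Bit#(n)` and `Int#(n)`, and the wrap-around behavior, can be observed with a small simulation sketch. The module and variable names are hypothetical, and the standard function `unpack` (not discussed above) is used here as an assumption to reinterpret the same three bits as a signed value.

```bsv
package IntSketch;

(* synthesize *)
module mkIntSketch (Empty);
   rule rl_show;
      Bit #(3) b = 3'b111;
      Int #(3) i = unpack (b);       // same bits, read as 2's complement
      Bit #(3) b_wrapped = b + 1;    // wraps around to 3'b000

      Bool bit_lt = (b < 1);         // unsigned comparison
      Bool int_lt = (i < 1);         // signed comparison

      $display ("b = %0d, i = %0d, b + 1 = %0d", b, i, b_wrapped);
      $display ("b < 1: ", fshow (bit_lt), "   i < 1: ", fshow (int_lt));
      $finish (0);
   endrule
endmodule

endpackage
```

The bit pattern `3'b111` is 7 when read as an unsigned `Bit#(3)` value and -1 when read as an `Int#(3)` value, so the two comparisons give different answers even though the underlying bits are identical.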

## 5.4 Hexadecimal and Binary Notation for literal integers

BSV uses the same notation as Verilog and SystemVerilog for hexadecimal and binary literal integers. Some examples:

```

3'b010          // Binary literal, 3 bits wide
7'b_110_0011    // Binary literal, 7 bits wide
5'h3            // Hex literal, 5 bits wide
32'h3           // Hex literal, 32 bits wide
32'h_ffff_0f17  // Hex literal, 32 bits wide (an AUIPC instruction)
'h23            // Hex literal, context determines width

```

As these examples show, a hexadecimal or binary integer literal is introduced by an optional bit-width, then a “tick” (single-quote) character, and then the binary or hexadecimal digits for the number. The character “`_`” may be used freely to space out groups of digits to improve readability for humans (the compiler ignores these spacers).

The last line shows that we can omit the size prefix, in which case the size will be inferred by the compiler from the context. For example, if we had:

```

Bit #(32) pc_val = 'h_8000_0000;

```

then the literal is inferred to be 32 bits wide.

## 5.5 Boolean values

In BSV, `Bool` is the type of a Boolean value. It has the usual boolean operators `&&` (Boolean/logical AND), `||` (Boolean/logical OR) and `!` (Boolean/logical NOT).

### 5.5.1 Caution: Bool and Bit#(1) are different types

BSV is unlike languages like C and Python which are very loose about what can be used as a boolean value. For example in C, any non-zero numeric value or pointer is considered “True”.

In BSV, `Bool` and `Bit#(1)` are *distinct* types, *i.e.*, `bsc`’s type-checking will complain if one is used where the other is expected. This is because not all `Bit#(1)` values are meaningful as Boolean values.

The Boolean/logical operators mentioned above (such as `&&`) operate on `Bool` types and are distinct from the bit-wise logic operators mentioned earlier (such as `&`), which operate on `Bit#(n)` types.

Note that comparison operators on bit-vectors, such as in the example `if (a <= b) ...` shown in Section 5.2.1 above, take `Bit#(n)` arguments and produce `Bool` results.
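A short sketch of moving between the two types explicitly is shown below. The identifier names are hypothetical, and the use of the standard `pack` and `unpack` functions is an assumption of this example, since they have not been discussed at this point.

```bsv
package BoolBitSketch;

Bit #(1) one_bit = 1'b1;

// Bool b0 = one_bit;             // would be a type error: Bit #(1) is not Bool
Bool     b1 = (one_bit == 1'b1);  // an explicit comparison yields a Bool
Bool     b2 = unpack (one_bit);   // explicit conversion from bits to Bool
Bit #(1) x  = pack (b1);          // explicit conversion from Bool back to bits

Bool     both = b1 && b2;         // && operates on Bool values
Bit #(1) band = one_bit & x;      // &  operates on Bit #(n) values

endpackage
```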

### 5.5.2 Example: recognizing legal RISC-V BRANCH instructions

The RISC-V ISA has a family of six conditional-branch instructions. Figure 5.1 is an excerpt from the Unprivileged ISA specification document [26]. The first line just gives us the names of the various slices of a 32-bit BRANCH-type instruction, and the subsequent lines describe the six instructions. Note that they only differ in the `funct3` slice, where they use only six of the possible eight 3-bit codes.

| 31:25         | 24:20 | 19:15 | 14:12  | 11:7         | 6:0     |        |
|---------------|-------|-------|--------|--------------|---------|--------|
| imm[12\|10:5] | rs2   | rs1   | funct3 | imm[4:1\|11] | opcode  | B-type |
| imm[12\|10:5] | rs2   | rs1   | 000    | imm[4:1\|11] | 1100011 | BEQ    |
| imm[12\|10:5] | rs2   | rs1   | 001    | imm[4:1\|11] | 1100011 | BNE    |
| imm[12\|10:5] | rs2   | rs1   | 100    | imm[4:1\|11] | 1100011 | BLT    |
| imm[12\|10:5] | rs2   | rs1   | 101    | imm[4:1\|11] | 1100011 | BGE    |
| imm[12\|10:5] | rs2   | rs1   | 110    | imm[4:1\|11] | 1100011 | BLTU   |
| imm[12\|10:5] | rs2   | rs1   | 111    | imm[4:1\|11] | 1100011 | BGEU   |

Figure 5.1: RISC-V conditional BRANCH instructions

Assuming `instr` is a 32-bit instruction, we can write BSV code to compute whether `instr` is or is not a legal BRANCH instruction:

```

1 Bit #(7) opcode_BRANCH = 7'b_110_0011;
2
3 Bit #(7) opcode = instr [6:0];
4 Bit #(3) funct3 = instr [14:12];
5 Bool legal = ((opcode == opcode_BRANCH)
6               && (funct3 != 3'b010)
7               && (funct3 != 3'b011));

```

Line 1 defines `opcode_BRANCH` as a 7-bit constant whose binary value is 1100011. The `7'b` prefix indicates that the literal is 7 bits wide and should be read as a binary, not decimal, number. The “`_`” underscore characters are present merely for our (human) readability, and have no semantic significance. Lines 3-4 extract relevant slices, and finally lines 5-7 define the desired legality condition.

Figure 5.2 shows the hardware circuit described by the code. Some observations:



Figure 5.2: Testing for a legal BRANCH instruction

- Lines with arrow-heads in the figure represent bundles of one or more wires, also called “buses”. For buses that have more than one wire, we show a small diagonal cross-hatch labeled with the number of wires (such as “3” or “7”).
- Names/identifiers in BSV code that are bound to values are simply names for buses (in most software programming languages names represent memory locations; this is *not* the case in BSV).

### 5.5.3 Combinational circuits and primitives

Figure 5.2 is an example of a so-called *combinational* circuit. In general, a combinational circuit is any interconnection of combinational primitive “operators” that *does not contain cycles* (*i.e.*, a bus connecting back to an earlier part of the circuit). Examples of combinational primitive operators in BSV include comparisons (like `==` and `!=`), boolean operations (like `&&`), bit-slicing (`[n1:n2]`), truncation and extension, arithmetic (like `+`, `-`, `*`), shifts (`<<` and `>>`), and multiplexers (discussed in Section 5.11, later).

In BSV, and in Verilog/SystemVerilog RTL, we consider such operators as “primitive”. In fact, such operators must themselves be implemented using lower-level circuit primitives such as AND, OR, and NOT gates which, in turn, must be implemented with even lower-level circuit structures such as transistors. We do not concern ourselves with such lower-level implementation because nowadays this is performed for us automatically by excellent so-called “synthesis” tools.

#### 5.5.3.1 Combinational circuits have no side-effects (are “pure”)

There is no “storage” in a combinational circuit, nor any concept of “updating” any storage (no “side-effects”). When a 32-bit value is presented at the input (top) of the circuit in Figure 5.2, conceptually we “instantly” see the 1-bit result at the output (bottom) of the circuit, *i.e.*, a combinational circuit is conceptually a pure, instantaneous, mathematical function from inputs to outputs. If we change the 32-bit value presented at the input, conceptually the output changes instantaneously in response.

**NOTE:** Circuits are physical artefacts and must follow the laws of physics. Electrical signals will take some finite time to propagate from inputs to outputs through wires and silicon. This propagation delay will place a limit on the “clock speed” at which we are able to run a digital circuit. We ignore this for the moment, and discuss this in detail later.

## 5.6 Functions

The fragments of code shown above can be packaged into BSV functions, specifying precise types for argument(s) and result:

```
// src_Common/Instr_Bits.bsv: line 31 ...
function Bit #(7) instr_opcode (Bit #(32) instr);
    return instr [6:0];
endfunction
```

```
// src_Common/Instr_Bits.bsv: line 35 ...
function Bit #(3) instr_funct3 (Bit #(32) instr);
    return instr [14:12];
endfunction
```

```
// src_Common/Instr_Bits.bsv: line 150 ...
function Bool is_legal_BRANCH (Bit #(32) instr);
    let funct3 = instr_funct3 (instr);
    return ((instr_opcode (instr) == opcode_BRANCH)
            && (funct3 != 3'b010)
            && (funct3 != 3'b011));
endfunction
```

Functions are invoked using the “application” syntax commonly used in most programming languages:

```
Bit #(32) x, y;

Bool result_x = is_legal_BRANCH (x);
Bool result_y = is_legal_BRANCH (y);
```

BSV function definition and application syntax is essentially the same as in SystemVerilog.

### 5.6.1 Pure functions vs. functions with side-effects (Action, ActionValue)

BSV has a system of data types and type-checking similar to Haskell in that it systematically distinguishes expressions which are “pure” (guaranteed not to have any side-effects) from expressions that may have side-effects.

The detailed reason for this distinction need not detain us now; suffice it to say here briefly that BSV’s semantics are fundamentally based on a concept of “rules”; that rules are condition-action pairs; that conditions *must not* have side effects (change the state of any hardware); and that the compiler needs to guarantee this, *i.e.*, that conditions are pure boolean expressions. These points will be discussed in more detail in Chapter 14.

One useful side-effect during debugging is the `$display()` statement. Why is `$display()` considered a side-effect? A pure expression does not modify any state, and therefore it can be optimized away by the compiler (evaluated zero times) if its result is never used. The compiler can also duplicate a pure expression (so it is evaluated more than once) for reasons such as cost (cheaper to recompute a value at some location in the code than to communicate the value that was computed elsewhere). Neither of these properties is true of `$display()`! Thus, the compiler needs to know about the purity or otherwise of every expression.

Keeping track of the purity or otherwise of an expression is not a local property. An expression may invoke a function which, in turn, invokes another function, and so on. The expression is pure only if there are no side effects anywhere in such a call chain (those functions may be defined in separate files, in libraries, and so on). BSV uses a *monadic* type system (the same as in Haskell) to systematically track purity/impurity of expressions. Consider two function declarations:

```
function Bool f1 (...); ...

function ActionValue #(Bool) f2 (...); ...
```

Both of these functions return a boolean value. But an application `f1(x)` is *guaranteed* to be pure (by BSV's type system), whereas an application `f2(x)` is assumed possibly to have a side-effect. These functions also have different syntax for how they are invoked:

```
let result1 = f1 (x);

let result2 <- f2 (x);
```

The “`<-`” syntax can only be used to invoke functions with `ActionValue` type, and is also a good visual cue to indicate that the invocation is *performing some action* (side-effect) and also returns a value.

BSV also has an `Action` type, which is just a convenience for the special case of an expression that has a side effect and does not return any interesting value. You can think of `Action` as a synonym for `ActionValue #(void)`, where you can think of `void` as “uninteresting value”. For example, you can think of `$display()` as a built-in function of type `Action`.

### 5.6.2 Combinational circuits = “doesn't have Action or ActionValue type”

Another pleasing consequence of BSV's type system is that we can identify precisely which expressions become combinational circuits. If the type of the expression is *not* `ActionValue#(t)` or `Action`, then it *must* be a combinational circuit, and *vice versa*.

### 5.6.3 Using ActionValue on pure functions for \$display debugging

Sometimes when we write a complex pure function whose result type is `t`, we may deliberately write its result type as `ActionValue #(t)` so that we can insert `$display` statements for debugging inside the function body. If we merely inserted `$display` statements without changing the function type, the compiler would complain that the function does not type-check correctly.

We use this “trick” frequently in the Drum and Fife source code, for tracing and debugging. We will see many examples in the coming chapters.
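As a minimal sketch of this trick (not taken from the Drum or Fife sources), the following variant of the BRANCH legality check has result type `ActionValue #(Bool)` purely so that its body may contain a `$display`. The name `is_legal_BRANCH_dbg` is hypothetical, and the slices and opcode are written inline so that the fragment stands on its own:

```bsv
import StmtFSM :: *;

// Hypothetical debug variant of is_legal_BRANCH (the name is ours).
// The computation is pure; the result type is ActionValue #(Bool)
// only so that $display is allowed inside the body.
function ActionValue #(Bool) is_legal_BRANCH_dbg (Bit #(32) instr);
   actionvalue
      Bit #(7) opcode = instr [6:0];
      Bit #(3) funct3 = instr [14:12];
      Bool legal = ((opcode == 7'b_110_0011)
                    && (funct3 != 3'b010)
                    && (funct3 != 3'b011));
      $display ("is_legal_BRANCH_dbg: opcode %07b funct3 %03b => ",
                opcode, funct3, fshow (legal));
      return legal;
   endactionvalue
endfunction

(* synthesize *)
module mkTop (Empty);
   mkAutoFSM (
      seq
         action
            // "<-" is needed because the function has ActionValue type
            let legal <- is_legal_BRANCH_dbg (32'h_0094_01e3);
            $display ("legal = ", fshow (legal));
         endaction
      endseq);
endmodule
```

The price of the trick is that every caller must now use `<-` instead of `=`, and the function can no longer be used where a pure expression is required (for example, in a rule condition).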

## 5.7 A small testbench to test our code

Here is a small program to run our `is_legal_BRANCH()` function on a few tests:

```
import StmtFSM :: *;

function Bool is_legal_BRANCH (Bit #(32) instr);
    ... as shown earlier ...
endfunction

(* synthesize *)
module mkTop (Empty);

   mkAutoFSM (
       seq
           action
               Bit #(32) instr_BEQ = {7'h0, 5'h9, 5'h8, 3'b000, 5'h3, 7'b_110_0011};
               $display ("instr_BEQ %08h => %0d", instr_BEQ,
                         is_legal_BRANCH (instr_BEQ));
           endaction

           action
               Bit #(32) instr_BNE = {7'h0, 5'h9, 5'h8, 3'b001, 5'h3, 7'b_110_0011};
               $display ("instr_BNE %08h => ", instr_BNE,
                         fshow (is_legal_BRANCH (instr_BNE)));
           endaction

           action
               Bit #(32) instr_ILL_op = {7'h0, 5'h9, 5'h8, 3'b100, 5'h3, 7'b_110_0000};
               $display ("instr_ILL_op %08h => ", instr_ILL_op,
                         fshow (is_legal_BRANCH (instr_ILL_op)));
           endaction

           action
               Bit #(32) instr_ILL_f3 = {7'h0, 5'h9, 5'h8, 3'b010, 5'h3, 7'b_110_0011};
               $display ("instr_ILL_f3 %08h => %0d", instr_ILL_f3,
                         is_legal_BRANCH (instr_ILL_f3));
           endaction
       endseq);

endmodule
```

For the moment, don't try to understand all these boilerplate constructs in detail. Briefly, `mkAutoFSM` is like a sequential program (discussed in more detail in Section 11). It performs a sequence of four actions. In each action we define a 32-bit instruction with standard Verilog bit-concatenation syntax. For example, `instr_BEQ` is defined as a 32-bit value by concatenating a 7-bit hex 0 as an “immediate” value, a 5-bit hex 9 for rs2, a 5-bit hex 8 for rs1, a 3-bit 0 for funct3, a 5-bit hex 3 for rd, and a 7-bit binary value for the branch opcode. `instr_BEQ` and `instr_BNE` are legal branch instruction encodings. `instr_ILL_op` is not a legal branch instruction because it has the wrong 7-bit opcode in the opcode slice. `instr_ILL_f3` is not a legal branch instruction because it has an illegal 3-bit value in the funct3 slice.

In each action, the `$display()` prints the instruction in hex format, and prints the Bool result of applying `is_legal_BRANCH()` to the instruction. In two of the `$display()`s we print the Bool value as a decimal integer (`%0d` format). In the other two `$display()`s we use `fshow()` to print booleans as “True” or “False”.

Suppose this code is in a file `Top.bsv`. We can now compile, link and execute the design (in simulation) as follows:

```
# ---- Compile BSV source code
$ bsc -u -sim Top.bsv
checking package dependencies
compiling Top.bsv
code generation for mkTop starts
Elaborated module file created: mkTop.ba
All packages are up to date.

# ---- Link to form a simulation executable
$ bsc -sim -e mkTop -o ./exe_HW_bsim
Bluesim object created: mkTop.{h,o}
Bluesim object created: model_mkTop.{h,o}
Simulation shared library created: exe_HW_bsim.so
Simulation executable created: ./exe_HW_bsim

# ---- Execute the simulator
$ ./exe_HW_bsim
instr_BEQ 009401e3 => 1
instr_BNE 009411e3 => True
instr_ILL_op 009441e0 => False
instr_ILL_f3 009421e3 => 0
```

### Exercise 5.1:

Extend the testbench to test more 32-bit values with `is_legal_BRANCH()`.

### Exercise 5.2:

Refer to the “RV32I Base Instruction Set” listing in “Chapter 24 RV32/64G Instruction Set Listings” in the RISC-V Unprivileged ISA specification document [26]. It lists 40 RV32I instructions. Similar to `is_legal_BRANCH()`, write BSV code for the following functions:

```
function Bool is_legal_JAL (Bit #(32) instr);
    ... accepts JAL

function Bool is_legal_JALR (Bit #(32) instr);
    ... accepts JALR

function Bool is_legal_OP (Bit #(32) instr);
    ... accepts LUI, AUIPC, ADD, SLT, OR, AND, ...

function Bool is_legal_OP_IMM (Bit #(32) instr);
    ... accepts ADDI, SLTI, ..., ORI, ANDI, ...

function Bool is_legal_Mem (Bit #(32) instr);
    ... accepts LB, LH, LW, LBU, LHU, SB, SH, SW
```

Ignore FENCE, ECALL and EBREAK instructions; for the moment we'll treat them as illegal instructions.

### Exercise 5.3:

Extend the testbench to test more 32-bit values with all the `is_legal_XXX()` functions.

□

## 5.8 enum types

In Figure 3.1, the Register-Read and Dispatch step needs to know which of the four alternative downstream paths should be selected for executing the instruction. In particular, we need to know whether the incoming instruction is a system instruction, a control instruction (branch or jump), an integer arithmetic/logic instruction, or a memory-accessing instruction. We could think of coding these “classes” using numbers (0 for system, 1 for control, 2 for integer, 3 for Mem), but it is more readable, and cleaner, to use an “enum” type (similar to enum types in SystemVerilog and C):

```
// src_Common/Inter_Stage.bsv: line 40 ...
typedef enum {OPCLASS_SYSTEM,
              OPCLASS_CONTROL,     // BRANCH, JAL, JALR
              OPCLASS_INT,
              OPCLASS_MEM,         // LOAD, STORE, AMO
              OPCLASS_FENCE}       // FENCE
OpClass
deriving (Bits, Eq, FShow);
```

This defines a type `OpClass` containing five constants (the last two (`OPCLASS_MEM` and `OPCLASS_FENCE`) both indicate the memory-access path).

### 5.8.1 deriving (Bits)

Because we said “`deriving(Bits)`”, the *bsc* compiler will automatically represent them with the obvious codes 0, 1, 2, 3 and 4 in a minimal bit-width (`Bit#(3)`). If we wanted an alternative (non-default) coding for these constants, we would *not* say “`deriving(Bits)`”, and we would provide an explicit mapping function into codes (see “typeclass instances”, later).

### 5.8.2 deriving (Eq)

Because we said “`deriving(Eq)`”, the *bsc* compiler will automatically define the “equality” (and “inequality”) functions for values of this new type, in the natural and obvious way (here, just compare the bit representations for equality). For other definitions of equality/inequality, we would *not* say “`deriving(Eq)`”, and we would instead define equality/inequality explicitly (see “typeclass instances”, later).
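The following small testbench is a sketch of what these two `deriving` clauses provide. It repeats the `OpClass` definition so that it can be compiled on its own, and it uses `pack()` and `unpack()`, the standard conversion functions of the `Bits` typeclass, which have not yet been introduced in this chapter:

```bsv
import StmtFSM :: *;

typedef enum {OPCLASS_SYSTEM,
              OPCLASS_CONTROL,
              OPCLASS_INT,
              OPCLASS_MEM,
              OPCLASS_FENCE}
OpClass
deriving (Bits, Eq, FShow);

(* synthesize *)
module mkTop (Empty);
   mkAutoFSM (
      seq
         action
            OpClass  opclass = OPCLASS_MEM;
            Bit #(3) code    = pack (opclass);    // deriving (Bits): enum to bits
            OpClass  back    = unpack (3'd2);     // deriving (Bits): bits to enum

            $display ("pack (OPCLASS_MEM) = %0d", code);
            $display ("unpack (2) = ", fshow (back));

            // deriving (Eq): "==" and "!=" are defined on OpClass values
            $display ("opclass == back: ", fshow (opclass == back));
         endaction
      endseq);
endmodule
```

With the default coding, `OPCLASS_MEM` should pack to 3, the code 2 should unpack to `OPCLASS_INT`, and so the final comparison should report False.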

### 5.8.3 deriving (FShow)

Given a value `opclass` of type `OpClass`, if we directly print it, *e.g.*:

```
opclass = OPCLASS_MEM;
$display ("opclass = %d", opclass);
```

the output will be “3”, which is the numeric code the compiler assigns to the constant `OPCLASS_MEM` given its position in the list of labels in the `enum` definition shown in Section 5.8 (starting at zero).

Because we said “`deriving(FShow)`” in the `enum` declaration, the *bsc* compiler will automatically define an “`fshow()`” function for this type. If we print as follows:

```
opclass = OPCLASS_MEM;
$display ("opclass = ", fshow (opclass));
```

the output will be “`OPCLASS_MEM`”, *i.e.*, the symbolic name.

## 5.9 Syntax of Identifiers

The syntax of an identifier (name) in BSV follows the same conventions as in many programming languages: any sequence of letters, digits and underscore characters, with the first character always being a letter.

BSV follows the Haskell system where an identifier has a different role depending on whether its first letter is lower-case or upper-case. An upper-case first letter represents a *constant*, either a value constant or a type constant. All variables (value variables or type variables) begin with a lower-case letter.

In the enum type-definition in Section 5.8, the identifiers `OPCLASS_SYSTEM`, `OPCLASS_CONTROL`, `OPCLASS_INT`, `OPCLASS_MEM` and `OPCLASS_FENCE` are all value constants (they all begin with an upper-case letter). The identifier `OpClass` (and identifiers seen earlier: `Bool` and `Bit`) are all type constants. The identifiers `Bits`, `Eq`, and `FShow` are all typeclass constants.

Other variables seen earlier, like `x`, `y`, `a`, `b`, `opcode`, and `result_x` are all ordinary value variables.

## 5.10 Syntax of comments

Comments in BSV have the same syntactic conventions as in Verilog, SystemVerilog and C/C++:

- A pair of forward-slashes (“//”) begins a comment that spans to the end of the current line.

There are many examples of this in the code fragments already shown above.

- A region of text spanning multiple lines can be a comment if preceded by “/\*” (forward-slash and asterisk) and followed by “\*/” (asterisk and forward-slash).

This form is often used to “comment-out” a region of text during debugging or trying out alternatives.

## 5.11 if-then-else statements and hardware multiplexers

In most programming languages, “if-then-else” is a so-called “control” construct: depending on the boolean condition, either the then-arm or the else-arm is executed (*not both!*).

In BSV an “if-then-else” represents a hardware *multiplexer*. The then-arm and else-arm each represent hardware that computes some value. The if-then-else construct simply selects the output of one of the two arms and passes it on as its output. Stated another way, the if-then-else “multiplexes” the two arm-outputs into a single output. In programming-language terms, *both* arms of the conditional are always “executed”—each arm represents an actual piece of hardware that is continuously computing its output.

The data type of the condition in an if-then-else must always be exactly `Bool` (not `Bit#(1)`, not an integer, *etc.*). The types of the two arms of the conditional must be exactly the same, and this is also the type of the output of the if-then-else.

For example, here is a function that distinguishes CONTROL instructions from integer instructions, returning an `OpClass` (Section 5.8):

```
function OpClass instr_opclass (Bit #(32) instr);
   OpClass result;
   if (is_legal_BRANCH (instr)
       || is_legal_JAL (instr)
       || is_legal_JALR (instr))
      result = OPCLASS_CONTROL;
   else
      result = OPCLASS_INT;
   return result;
endfunction
```

This can also be written using so-called “conditional expressions” (using the same syntax as in SystemVerilog and C):

```
function OpClass instr_opclass (Bit #(32) instr);
   return ((is_legal_BRANCH (instr)
            || is_legal_JAL (instr)
            || is_legal_JALR (instr))
           ? OPCLASS_CONTROL
           : OPCLASS_INT);
endfunction
```

It's a matter of taste and style whether one uses if-then-else expressions or C-style conditional expressions. It may also depend on the size of the sub-expressions. The primary goal should be readability.

Both these code fragments represent the same hardware, shown in Figure 5.3. The 32-bit `instr` argument is fed into the circuits for `is_legal_BRANCH()` (hardware schematic in Figure 5.2), `is_legal_JAL()` and `is_legal_JALR()` which are OR'd to produce a Boolean output which, in turn, is used to select one of two 3-bit constant values, producing a final 3-bit result. The multiplexer, also called a “MUX” for short, is a primitive combinational circuit.

Figure 5.3: If-then-else is a multiplexer

If-then-elses and conditional expressions can of course be nested:

```
function OpClass instr_opclass (Bit #(32) instr);
   OpClass result;
   if (is_legal_BRANCH (instr)
       || is_legal_JAL (instr)
       || is_legal_JALR (instr))
      result = OPCLASS_CONTROL;
   else if (is_legal_OP (instr)
            || is_legal_OP_IMM (instr)
            || is_legal_LUI (instr)
            || is_legal_AUIPC (instr))
      result = OPCLASS_INT;
   else if (is_legal_LOAD (instr)
            || is_legal_STORE (instr))
      result = OPCLASS_MEM;
   else if (is_legal_ECALL (instr)
            || is_legal_EBREAK (instr)
            || is_legal_MRET (instr)
            || is_legal_CSRRxx (instr))
      result = OPCLASS_SYSTEM;
   return result;
endfunction
```

This represents a cascade of multiplexers in hardware, as shown in Figure 5.4.



Figure 5.4: Nested if-then-elses become cascaded multiplexers

### 5.11.1 Parallel multiplexers and MUX synthesis

The circuit in Figure 5.4 has a serial structure: the `OPCLASS_CONTROL` branch has priority, and only if its condition is False can one of the other results flow through. Also observe that the longest path length increases *linearly* with the number of classes; here, `OPCLASS_SYSTEM` flows through all four multiplexers.

But we know from RISC-V instruction encodings that the `OPCLASS_CONTROL`, `OPCLASS_INT` and `OPCLASS_MEM` conditions are *mutually exclusive*; no instruction simultaneously falls into more than one such class. In such situations (mutually exclusive conditions) it is possible to create a more efficient circuit called a *parallel MUX*. An exercise below shows how to create a parallel MUX explicitly, but in many cases downstream RTL-to-lower-level-hardware synthesis tools will do this automatically.

#### Exercise 5.4:

Write a testbench for the `instr_opclass()` function: pass in different 32-bit instructions to produce the op class, and print out the op class. When printing the class, try printing it as an integer, and also using `fshow()`.

#### Exercise 5.5:

Write a new version of the `instr_opclass()` function that expresses a parallel MUX instead of a priority MUX. The key ideas are:

- Define a value `x_CONTROL` that is either `OPCLASS_CONTROL`, or 0 (of the same bit-width) if the Bool values of `is_legal_BRANCH()`, `is_legal_JAL()` and `is_legal_JALR()` are all False. We can implement this by replicating the 1-bit Bool condition to the width of the `OpClass` type and bitwise-AND'ing this with `OPCLASS_CONTROL`.
- Similarly, define values `x_INT`, `x_MEM` and `x_SYSTEM` that are either `OPCLASS_INT`/ `OPCLASS_MEM`/ `OPCLASS_SYSTEM` or 0 depending on whether the instruction is an integer, memory or system instruction.
- Finally, bitwise-OR the four `x_XXX` values together to produce the result.

Figure 5.5 illustrates the desired circuit structure.



Figure 5.5: Nested if-then-elses using an AND-OR MUX (for mutually exclusive conditions)

This kind of MUX is also called an AND-OR MUX or “parallel” MUX because of its structure. It relies for correct operation on *precisely one* of the bitwise-OR arguments being True. Here, we are assured of this because of the mutual exclusivity of the conditions.

While the circuit-depth in the cascade-of-multiplexers is *linear* in the number of if-then-else arms, for the AND-OR MUX it is *logarithmic* (smaller circuit delay  $\Rightarrow$  possibly higher clock speed).
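To make the AND-OR structure concrete without giving away the exercise, here is a minimal sketch of a two-way AND-OR MUX on plain 8-bit values. The function name `and_or_mux2` and its arguments are hypothetical, and the sketch assumes the caller guarantees that `sel_a` and `sel_b` are never both True:

```bsv
// Hypothetical helper (not from the Drum/Fife sources).
// Assumes sel_a and sel_b are mutually exclusive.
function Bit #(8) and_or_mux2 (Bool sel_a, Bit #(8) a,
                               Bool sel_b, Bit #(8) b);
   // pack() turns each Bool into a Bit #(1); signExtend() then
   // replicates that single bit across all 8 bits of the mask.
   Bit #(8) mask_a = signExtend (pack (sel_a));
   Bit #(8) mask_b = signExtend (pack (sel_b));

   // AND each input with its mask, then OR the results together
   return ((mask_a & a) | (mask_b & b));
endfunction
```

If neither select is True the result is 0, and if both are True the result is the bitwise OR of `a` and `b`, which is why the mutual-exclusion assumption matters. Applying the same idea to `OpClass` additionally requires converting the enum constants to and from bits.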

#### Exercise 5.6:

Test your new version of the `instr_opclass()` function in your testbench.



## 5.12 Case-expressions

In the special case where a nested if-then-else is simply testing a value against a series of alternative constant values, we can use a `case` expression. For example:

```
// from src_Common/Fn_EX_Control.bsv
Bool branch_taken = case (instr_funct3 (instr))
                       funct3_BEQ:  (rs1_val == rs2_val);
                       funct3_BNE:  (rs1_val != rs2_val);
                       funct3_BLT:  signedLT (rs1_val, rs2_val);
                       funct3_BGE:  signedGE (rs1_val, rs2_val);
                       funct3_BLTU: (rs1_val < rs2_val);
                       funct3_BGEU: (rs1_val >= rs2_val);
                    endcase;
```

**NOTE:** BSV is like Verilog/SystemVerilog in that exactly one arm of the `case` expression is executed. This is unlike C/C++, where a case arm “falls through” to the next case arm, unless one has a `break`, `return` or `goto` statement.

In this example each case-arm is a pure value, and we call the whole construct a `case` expression. Often each case-arm is an `Action` (such as a register assignment) in which case we sometimes call it a `case` statement. They are really the same thing, differing only in the data types under consideration.
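For comparison, here is a sketch of the “statement” form, in which every arm is an `Action`. The function name `show_branch_kind` is hypothetical; the funct3 codes are the standard RV32I BRANCH encodings, written as literals so that the fragment does not depend on any other definitions:

```bsv
// Hypothetical helper (the name is ours): each case arm is an Action,
// so the whole case is an Action and the function has type Action.
function Action show_branch_kind (Bit #(3) funct3);
   action
      case (funct3)
         3'b000:  $display ("BEQ");
         3'b001:  $display ("BNE");
         3'b100:  $display ("BLT");
         3'b101:  $display ("BGE");
         3'b110:  $display ("BLTU");
         3'b111:  $display ("BGEU");
         default: $display ("not a legal BRANCH funct3: %03b", funct3);
      endcase
   endaction
endfunction
```

The `default` arm covers the two funct3 codes (010 and 011) that `is_legal_BRANCH()` rejects.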

## 5.13 Sharing code for RV32 and RV64 via parameterization

The RISC-V ISA is actually two ISAs—a 32-bit ISA called RV32 and a 64-bit ISA called RV64. These are not randomly different ISAs; they have been carefully engineered to overlap as much as possible:

- Most of the RV32 instructions are exactly the same in RV64
- Three RV32 instructions are slightly different in RV64: the shift instructions SLLI, SRLI and SRAI have 5-bit shift-amounts in RV32 (allowing shifts of up to 31 bits), whereas they have 6-bit shift-amounts in RV64 (allowing shifts of up to 63 bits).
- RV64 adds several new instructions that compute on 64-bit values.

Because of this large overlap of RV32 and RV64, we would like to share BSV code as much as possible between RV32 and RV64, *i.e.*, we would like to parameterize our BSV code so that it can be re-used between RV32 and RV64 implementations.

### 5.13.1 Numeric types

We have mentioned the type `Bit#(n)` frequently so far, representing bit-vectors of width  $n$  bits. All our examples showed a particular  $n$ , such as:

```
Bit #(32) instr;
Bit #(32) pc_val;
```

The first declaration is fine for both RV32 and RV64, since instructions are 32 bits wide in both. However, the second declaration only works in RV32, since the program counter is 64 bits wide in RV64 (type `Bit#(64)`).

The “32” or “64” argument to the `Bit#(n)` type is a *numeric type*. Although syntactically they look just like the *values* 32 and 64, when used inside a type-expression like `Bit#(n)`, they are not values, but numeric types. BSV’s type-system carefully distinguishes between these two cases because numeric-types usually say something about *hardware structure*, which cannot be changed once created! So, while we can perform arbitrary arithmetic on numeric *values*, we cannot do so on *numeric types*.<sup>2</sup>

### 5.13.2 Type synonyms

In BSV we can define a new symbolic name for an existing type, and then we can use that symbolic name in place of the existing type. Example, from RV32 code:

```
1 typedef 32 XLEN;           // new name for numeric type 32
2
3 Bit #(XLEN) pc_val;
4 Bit #(XLEN) rs1_val;    // Value read from register rs1 in register file
5 Bit #(XLEN) rs2_val;    // Value read from register rs2 in register file
6 Bit #(XLEN) rd_val;     // Value written to register rd in register file
```

By changing the single definition in line 1 to:

```
1 typedef 64 XLEN;           // new name for numeric type 64
```

the remaining code will work for RV64 as well.

### 5.13.3 The numeric value corresponding to a numeric type

Although BSV keeps a strict separation of numeric types and numeric values (and limits the available arithmetic on the former), it is always safe to convert a numeric type into the corresponding numeric value, since these values are all known statically (at compile time). The built-in pseudo-function `valueOf()` is provided for this:

```
// src_Common/Arch.bsv: line 26 ...
Integer xlen = valueOf (XLEN);
```

Here, `xlen` is an ordinary value variable whose integer value is the same as that expressed by the numeric type `XLEN`.

---

<sup>2</sup>A limited form of arithmetic is possible on numeric types. Consider a generic function that takes two arguments of type `Bit#(m)` and `Bit#(n)` and returns the concatenation of these bit-vectors: its output type is `Bit#(m+n)`. By limiting the available arithmetic *bsc* can resolve it completely “statically”, *i.e.*, at compile time, before it even compiles to Verilog RTL. We ignore this for now, and discuss it later.

### 5.13.4 Conditional compilation

Just like in Verilog, SystemVerilog and C/C++, the `bsc` compiler runs BSV source code through a “pre-processor” before compilation, which can perform simple text (“macro”) substitutions. Using this facility, we can pass an argument to the compiler that has the effect of configuring the source code for RV32 or RV64 (the following code is from Fife/Drum’s `Arch.bsv` file):

```
// src_Common/Arch.bsv: line 14 ...
`ifdef RV32

  typedef 32 XLEN;

`elsif RV64

  typedef 64 XLEN;

`endif
```

As in Verilog and SystemVerilog, pre-processor directives begin with a `` ` `` (back-tick) character; `` `ifdef `` is analogous to `#ifdef` in the C/C++ pre-processor.

When we invoke the `bsc` compiler, we can pass it command line arguments `-DRV32` or `-DRV64`; the pre-processor will then select the appropriate `typedef` line. Thus, we can write common code that will work for both RV32 and RV64. The integer value `xlen` will have the numeric value 32 or 64 depending on how it was compiled.

Pre-processor macros allow us to conditionally compile different source text based on the macro definitions we supply to the compiler. We can also compile alternatives based on the value `xlen`:

```
if (xlen == 32) begin
  ... code that must execute if we are in RV32 mode ...
end
else begin
  ... code that must execute if we are in RV64 mode ...
end
```

Whenever possible, it is preferable to use the `if(xlen==...)` form instead of the `` `ifdef `` form for conditional compilation because (a) the code is more readable and (b) as we experience in many languages, pre-processor macros can be quite dodgy (scoping, inadvertent variable capture, inadvertent surprises due to associativity of infix operators, ...).

Note that in the `if(xlen==...)` form both arms of the conditional must type-check correctly, whether `xlen` is 32 or 64. There are ways to achieve this with judicious use of bit-slicing, `extend()` and `truncate()`; we will point them out as we encounter them. If the two arms cannot both type-check whether `xlen` is 32 or 64, we may have to resort to the `` `ifdef `` form.

There is zero run-time overhead in using the `if(xlen==...)` form because the *bsc* compiler will evaluate the if-condition statically and reduce the if-then-else to just the relevant arm.
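As an illustrative sketch of the `if(xlen==...)` form, consider extracting the shift amount of SLLI, SRLI and SRAI, which is 5 bits wide in RV32 and 6 bits wide in RV64. The helper name `instr_shamt` is hypothetical (it is not taken from the Drum/Fife sources), and the sketch assumes the result is always delivered as 6 bits so that its type does not depend on `XLEN`:

```bsv
typedef 32 XLEN;     // or: typedef 64 XLEN;

Integer xlen = valueOf (XLEN);

// Hypothetical helper: shift amount of SLLI/SRLI/SRAI as a 6-bit value.
// Both arms of the "if" type-check whether xlen is 32 or 64.
function Bit #(6) instr_shamt (Bit #(32) instr);
   Bit #(5) shamt5 = instr [24:20];    // RV32: 5-bit shift amount
   Bit #(6) shamt6 = instr [25:20];    // RV64: 6-bit shift amount
   Bit #(6) result;
   if (xlen == 32)
      result = zeroExtend (shamt5);    // widen 5 bits to 6 bits
   else
      result = shamt6;
   return result;
endfunction
```

The slices are first bound to variables of explicit width because a bit-slice on its own does not fix the width that `zeroExtend()` should start from. Since `xlen` is known statically, only one of the two arms should remain in the generated hardware.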

# Chapter 6

## BSV: Struct types, tuples, and RISC-V: Memory requests and responses

## 6.1 RISC-V: structs communicated between steps

Various kinds of information need to be communicated between the steps of Figure 3.1: program counter values, instructions, values read from registers, values to be written back to registers, and so on. **struct** data types (short for “structures”) are suitable for bundling together heterogeneous collections of values. (This is the same concept as in C and SystemVerilog; it is also called a “record” in some programming languages.) Each component of a struct is called a “field” or a “member” of the struct. Figure 6.1 annotates Figure 3.1 with struct types communicated on each of the black arrows between steps, and each of the red arrows to and from memory. In this and the next few chapters we will flesh out the details of all these struct types. We will use exactly the same struct types for Fife and Drum, *i.e.*, whether the implementation is pipelined or not. All these `struct` declarations can be found in the file: `src_Common/Inter_Stage.bsv`.

Figure 6.1: Simple interpretation of RISC-V instructions (Fig. 3.1 with arrows annotated with **struct** types)

## 6.2 BSV: struct types

Consider the black arrow from the Decode step to the Register-read-and-Dispatch step of Figure 6.1. We want to communicate several values, including:

- The current PC. This will be needed by BRANCH, JAL, JALR and AUIPC instructions to compute addresses that are offset from the current PC. It will be needed for any traps (exceptions) that may occur, which save PC for the trap-handler.
- An `exception` flag, indicating:
  - whether an error was encountered in the Fetch-to-Memory-to-Decode path, or
  - whether the Decode step’s analysis indicates that the instruction is not legal (unrecognized 32-bit code).

When the exception flag is true, a `cause` field provides more detail.

- If there is no exception (no Fetch memory error; instruction is legal), other fields provide more analytical detail for subsequent steps:
  - The “fall-through” PC, *i.e.*, the address of the next instruction following this one in memory. For RV32I and RV64I, this will always be PC+4, since all instructions are 4 bytes long.<sup>1</sup> For most instructions, the fall-through PC is indeed the unique next PC. For conditional BRANCH instructions, this is the next PC if the BRANCH is not taken. For JAL and JALR instructions (unconditional jumps), this is the “return address” saved by the instruction in a register.
  - The instruction itself. This will be needed for opcode details, the rs1, rs2 and rd register indexes, immediate values, *etc.*

The next several fields are derived by analyzing the instruction. They could be re-derived wherever needed by re-analyzing the instruction, but we perform that work just once in the Decode step and communicate the results.

- The `OpClass`. This indicates to which next-step in Figure 6.1 we dispatch for subsequent actions.
- Whether the instruction reads rs1 and/or rs2 register values. This will be needed to control reading from the register file.
- Whether the instruction writes an rd register value. This will be needed to control writing to the register file.
- The “immediate” value in the instruction. Refer to the top of each page of “Table 24.2 Instruction listing for RISC-V” in the Unprivileged Spec [26], which shows that the I-, S-, B-, U- and J-type instructions have immediate values of different sizes and encode them in different ways (“bit-swizzled”). We untangle these once in the Decode stage, and pass on the clarified results in the `imm` field.

---

<sup>1</sup>If we implement the “C” RISC-V ISA extension (compressed instructions), the correct fall-through PC may be PC+2.

This heterogeneous collection of values is most conveniently expressed as a **struct** type:

```
// src_Common/Inter_Stage.bsv: line 49 ...
typedef struct {Bit #(XLEN)  pc;

                Bool         exception;   // Fetch exception/ decode illegal instr
                Bit #(4)     cause;
                Bit #(XLEN)  tval;

                // If not exception
                Bit #(XLEN)  fallthru_pc;
                Bit #(32)    instr;
                OpClass      opclass;
                Bool         has_rs1;
                Bool         has_rs2;
                Bool         has_rd;
                Bool         writes_mem;  // All mem ops other than LOAD
                Bit #(XLEN)  imm;         // Canonical (bit-swizzled)
                ...
} Decode_to_RR
deriving (Bits, FShow);
```

Because we said “**deriving(Bits)**”, the *bsc* compiler will automatically work out a representation for **Decode\_to\_RR** values in bits, using the straightforward method of simply concatenating the bit-vectors of each field into a bit-vector for the whole struct. The total bit-size of a **Decode\_to\_RR** struct value is simply the sum of the individual bit-sizes of the fields. If we had not said “**deriving(Bits)**”, we could explicitly provide some other custom representation in bits.<sup>2</sup>
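To make the derived representation concrete, here is a minimal sketch on a much smaller struct. The names `Demo_Struct` and `mkTb_Demo_Pack` are hypothetical and not part of the book's source code; `pack`, `unpack`, `SizeOf` and `valueOf` are standard BSV facilities. The sketch prints the total width (1 + 4 + 8 = 13 bits), the packed bits, and the struct recovered from those bits.

```bsv
// Hypothetical stand-in struct (not from the book's sources): 1 + 4 + 8 = 13 bits
typedef struct {
   Bool      exception;
   Bit #(4)  cause;
   Bit #(8)  payload;
} Demo_Struct
deriving (Bits, FShow);

(* synthesize *)
module mkTb_Demo_Pack (Empty);
   rule rl_once;
      Demo_Struct s = Demo_Struct {exception: True, cause: 4'h3, payload: 8'hA5};

      // pack() yields the derived bit-vector; its width is SizeOf #(Demo_Struct)
      Bit #(SizeOf #(Demo_Struct)) bits = pack (s);
      $display ("width  = %0d bits", valueOf (SizeOf #(Demo_Struct)));
      $display ("packed = %b", bits);

      // unpack() recovers a struct value from the bit-vector
      Demo_Struct s2 = unpack (bits);
      $display ("unpacked = ", fshow (s2));
      $finish (0);
   endrule
endmodule
```

With the derived representation, the first field occupies the most-significant bits of the packed vector and the last field the least-significant bits.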

**NOTE:** SystemVerilog makes a distinction between “packed” and “unpacked” values. In BSV, all struct and vector values are packed (no padding between fields/elements) unless the user has explicitly overridden the “**deriving (Bits)**” directive with their own bit-representation function.

Unfortunately Verilog, the target language for the *bsc* compiler, does not have any concept of structs. When debugging Verilog code that has been produced by *bsc* from BSV source code, a struct will appear as a flat bit-vector that aggregates all the fields.

**CAVEAT:** Experienced BSV coders sometimes (perhaps temporarily for debugging) rearrange the order and sizes of fields in a struct so that they are well-aligned with 8-bit byte boundaries. Then, during debugging, while printing the values in hexadecimal, or viewing the values in a waveform viewer, it becomes easier to read off the field values from the displayed hexadecimal digits.

Because we said “**deriving(FShow)**”, the *bsc* compiler will automatically define an “**fshow()**” function for this type: if we print **fshow(v)**, it will print something like this:

```
Decode_to_RR {pc=..., exception = ..., instr=..., ... }
```

---

<sup>2</sup>In C/C++, compilers will often “pad” out fields (insert unused bits between fields) to be aligned on byte and word boundaries, for more efficient access in byte-structured memories; thus, a struct’s size in C/C++ may be larger than the sum of the field sizes, and may even vary depending on the compiler’s target architecture. In hardware design, these values may reside in wires, registers, FIFOs, *etc.*, which have no “byte-structured” bias, and so we do not play any such “padding” games.

### 6.2.1 Creating struct values

We can create a new value of type `Decode_to_RR` with syntax like this:

```
Decode_to_RR x = Decode_to_RR {pc:           ... value of field ... ,
                               exception:    ... value of field ... ,
                               cause:        ... value of field ... ,
                               fallthru_pc:  ... value of field ... ,
                               instr:        ... value of field ... ,
                               ...};
```

The right-hand side is sometimes called a “struct expression”, *i.e.*, it is an expression which, when evaluated, produces a struct value.

The repetition of `Decode_to_RR` above seems verbose; the left-hand side instance is the type, and the right-hand side instance is the “struct constructor” (think of it as a function that takes the field values as arguments and returns a struct value). The `bsc` compiler’s type-analysis is able to infer the type from the right-hand side, so we can just use the keyword “`let`”:

```
let x = Decode_to_RR {pc:           ... value of field ... ,
                      exception:    ... value of field ... ,
                      cause:        ... value of field ... ,
                      fallthru_pc:  ... value of field ... ,
                      instr:        ... value of field ... ,
                      ...};
```

The order in which the field values are given does not matter; the `bsc` compiler will put the fields into the correct offsets in the struct value.

### 6.2.2 Don’t-care values

Not all field values need be given in a struct expression. The `bsc` compiler will issue a warning for each unspecified field, and insert an “unspecified” (and unpredictable) value there. You can indicate that a field is intentionally left unspecified (and suppress the compiler warning) using “`?`”, BSV’s notation for a “don’t care” value:

```
let x = Decode_to_RR {pc:           ... ,
                      exception:    False,
                      cause:        ?,
                      fallthru_pc:  ... ,
                      instr:        ... ,
                      ...};
```

In the above example, the exception cause field is meaningless when the exception field is false, and we indicate this explicitly with “?”.

“Don’t care” values are useful for several reasons. First, this conveys to the human reader that the value in this field is irrelevant.

Second, it can result in more efficient circuitry. If we had said “0”, for example, the *bsc* compiler would have to create circuitry ensuring that the field’s value is 0. By saying “?”, we allow the *bsc* compiler to omit all that circuitry.

Third, in places where it does not result in additional hardware, the *bsc* compiler usually injects the specific value `'h_AAAA_AAAA` (of suitable bit-width). While debugging, observing such a value in some computation is often a clue that something is wrong (it roughly plays the role of “X” values in Verilog/SystemVerilog/VHDL).

**NOTE:** Verilog, SystemVerilog and VHDL have a concept of “X” values. Each bit of a register or wire carries an “X” value until it has been assigned a specific binary value (0 or 1). However, note that this is *only in simulation*, where the simulator can and does model 3-valued logic (0, 1 and X) for each bit, and is able to propagate X values through operators, registers, etc. Hardware only implements 2-valued logic—every bit is either 0 or 1. Thus, this is an artefact that is only useful during debugging in simulation and static analysis.

BSV only has 2-valued logic; there is no concept of an “X” value. A BSV “?” expression has some specific, but potentially unpredictable, binary value.

### 6.2.3 Selecting struct fields

Struct fields can be selected using the usual “dot” notation common to SystemVerilog and C/C++:

```
x.pc
x.instr
```

### 6.2.4 Updating struct fields using assignment

Struct fields can be updated with assignment using the usual “dot” notation common to SystemVerilog and C/C++:

```
x.pc    = ... new value ... ;
x.instr = ... new value ... ;
```
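The struct operations of Sections 6.2.1 to 6.2.4 can be tried out together in a small testbench. This is only a sketch: `Demo_D_to_RR` is a hypothetical, much smaller stand-in for `Decode_to_RR`, and the module name and field values are chosen purely for illustration.

```bsv
// Hypothetical cut-down stand-in for Decode_to_RR (not from the book's sources)
typedef struct {
   Bit #(32) pc;
   Bool      exception;
   Bit #(4)  cause;
   Bit #(32) instr;
} Demo_D_to_RR
deriving (Bits, FShow);

(* synthesize *)
module mkTb_Demo_Struct (Empty);
   rule rl_once;
      // Struct expression: fields given in arbitrary order; 'cause' is a don't-care
      let x = Demo_D_to_RR {instr:     32'h0000_0013,   // ADDI x0, x0, 0 (NOP)
                            pc:        32'h8000_0000,
                            exception: False,
                            cause:     ?};

      // Field selection
      $display ("pc = %h  instr = %h", x.pc, x.instr);

      // Field update by assignment on the local variable x
      x.pc        = x.pc + 4;
      x.exception = True;
      x.cause     = 4'h2;
      $display ("updated: ", fshow (x));
      $finish (0);
   endrule
endmodule
```

Note that `x` here is an ordinary local variable inside the rule, not a register; the assignments simply produce a new struct value that the later `$display` observes.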

## 6.3 BSV: Tuples and the match statement

Tuples are a built-in set of struct-like constructs in BSV that are often convenient. For example, when a function needs to return a pair of values, instead of declaring a struct type with two fields for the result, it is often more convenient to just use the built-in `Tuple2` type. Example (from the `mkCSRs` module in file `src_Common/CSRs.bsv`):

```
function ActionValue #(Tuple2 #(Bool, Bit #(XLEN)))
   fav_csr_read (Bit #(12) csr_addr);
   ...
   let result = (exception, rd_val);
   return result;
endfunction
```

The function returns a pair of values which have types `Bool` and `Bit #(XLEN)`, respectively. The overall type of the result is `Tuple2 #(Bool, Bit #(XLEN))`.

The expression on the right-hand side of the `let` statement shows how one can construct a 2-tuple value: simply enclose the components in parentheses and separate them with commas.

Unlike structs, there is no assignment statement to update a field of a tuple, nor is there any “`.field`” notation for component selection.

To select an individual field of a tuple, one can use one of the built-in selector functions:

```
let xy  <- fav_csr_read (...);
let exc  = tpl_1 (xy);      // exc has type: Bool
let v    = tpl_2 (xy);      // v   has type: Bit #(XLEN)
```

Alternatively, to access multiple fields of a tuple at once, one can use a “`match`” statement. Example (from `src_Drum/CPU_FSM.bsv`):

```
match { .exc, .v } <- fav_csr_read (csr_addr);
```

The `match` keyword is followed by a *pattern* that should match the tuple, *i.e.*, here it should be a list of two variable names surrounded by braces with each variable preceded by a “`.`”. The leading “`.`” is important: “`x`” would represent the value of an existing variable to be matched with the corresponding value in the tuple, whereas “`.x`” *introduces a new variable* `x` to be bound to the corresponding value in the tuple; the latter is what we want. We refer the reader to the BSV Language Reference Guide [24] for many more capabilities of pattern-matching, including fields to be ignored, pattern-matching on structs and tagged unions, pattern-matching `case` expressions, and more.

BSV defines tuples of up to 8 components (`Tuple3`, `Tuple4`, ...) but we recommend not using tuples when you have more than, say, 2 or 3 components; define a struct instead, with readable and meaningful field names.
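As a self-contained sketch of both access styles, the following uses a hypothetical pure function `fn_add_with_carry` (not from the book's sources) that returns a `Tuple2`. It builds the result with the library constructor function `tuple2`, then reads it back once with `tpl_1`/`tpl_2` and once with `match`.

```bsv
// Hypothetical helper: 32-bit addition that also reports the carry-out
function Tuple2 #(Bool, Bit #(32)) fn_add_with_carry (Bit #(32) a, Bit #(32) b);
   Bit #(33) wide  = zeroExtend (a) + zeroExtend (b);
   Bool      carry = (wide [32] == 1'b1);
   return tuple2 (carry, truncate (wide));
endfunction

(* synthesize *)
module mkTb_Demo_Tuple (Empty);
   rule rl_once;
      // Component selection with the built-in selector functions
      let r = fn_add_with_carry (32'hFFFF_FFFF, 32'h0000_0001);
      $display ("sum = %h  carry = ", tpl_2 (r), fshow (tpl_1 (r)));

      // Pattern-matching; '=' rather than '<-' because the function is pure
      match {.carry, .sum} = fn_add_with_carry (32'h0000_0010, 32'h0000_0020);
      $display ("sum = %h  carry = ", sum, fshow (carry));
      $finish (0);
   endrule
endmodule
```

The only difference from the `fav_csr_read` example is the binding operator: `<-` is needed there because that function has an `ActionValue` type.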

## 6.4 RISC-V: Memory Requests and Responses; IMem and DMem

In Figure 6.1, the `Mem_Req` and `Mem_Rsp` structs are used in two places. The Fetch step issues a memory request, and the corresponding memory response is received by the Decode step. Similarly, the “Execute Memory Ops” step issues a memory request and consumes a memory response. We use the shorthand term “IMem” for the first context (for Instruction Memory) and “DMem” for the latter context (for Data Memory).

### 6.4.1 Separation of IMem and DMem (Harvard Architecture)

The separation of memory channels for instructions and data (loads/stores) is quite standard in modern CPU architectures, and is informally called a “Harvard Architecture”. The term refers to the architecture of the Harvard Mark I computer, designed and built by Harvard University and IBM in the 1940s (the term itself was coined much later). It sometimes refers just to separate, concurrent paths to memory for instructions and data, and sometimes also to physically separate memories for instructions and data (more discussion in Wikipedia: [https://en.wikipedia.org/wiki/Harvard\_architecture](https://en.wikipedia.org/wiki/Harvard_architecture)).

Modern software is typically not “self-modifying”, *i.e.*, instructions and data are placed in different areas of memory, and load/store instructions never write into the instruction area, *i.e.*, programs never over-write instructions in memory. This allows separate hardware for memory access for instructions *vs.* memory access for data, which can run concurrently, *i.e.*, we may fetch an instruction at the same time as we are accessing data memory for a previous load/store instruction (we will see this in Fife). We can also tune and optimize each memory path separately for their different dynamic behavioral patterns. In some systems we can also *protect* the instruction memory area, *i.e.*, enforce in hardware the policy of not over-writing instructions.

This view of strict separation of IMem and DMem has to be tempered somewhat when considering languages like JavaScript, Python *etc.* that employ so-called “JIT” compiling (“Just-In-Time”). The run-time systems of such languages generate instructions on-the-fly, *i.e.*, LOAD/STORE instructions produce *data* through the DMem channel that will (soon) be fetched as *instructions*. But even in these systems, there is a strict protocol of *phases*. During a code-generation phase, the data produced is considered as ordinary data. Then there is a deliberately executed phase-change, where the virtual memory protections of the data-pages just written are changed so that they are now viewed as read-only instruction pages, after which these new instructions can be fetched.

### 6.4.2 Memory Requests

A memory-request in RV32I is either a LOAD, a STORE or a FENCE. We could define an enum type for this:

```
typedef enum {MEM_REQ_LOAD,
              MEM_REQ_STORE,
              MEM_REQ_FENCE} Mem_Req_Type
deriving (Eq, FShow, Bits);
```

which will use a 2-bit encoding (0 for LOAD, 1 for STORE, 2 for FENCE). However, we look to the future where we might extend Drum and Fife to implement the “A” extension (for “Atomic Memory Operations”). The RISC-V ISA Unprivileged Spec document, Chapter 8, describes additional memory operations LR (Load-Reserved), SC (Store-Conditional), AMOSWAP, AMOADD, AMOXOR, AMOAND, AMOOR, AMOMIN, AMOMAX, AMOMINU, and AMOMAXU, each of which comes in a 32-bit version (in RV32 and RV64) and a 64-bit version (in RV64). These operations are coded with 5 bits in the instruction (`instr[31:27]`).

Accordingly, we use a 5-bit encoding for LOAD, STORE and FENCE, using 5-bit codes that are not used in the A extension:

```
src_Common/Instr_Bits.bsv: line 226 ...
Bit #(5) funct5_LOAD  = 5'b_11110;
Bit #(5) funct5_STORE = 5'b_11111;
Bit #(5) funct5_FENCE = 5'b_11101;
```

An IMem request is for one 32-bit instruction (four bytes).<sup>3</sup> A DMem request may be for one, two or four bytes. We express these request-size options using an enum type:

```
src_Common/Mem_Req_Rsp.bsv: line 39 ...
typedef enum {MEM_1B, MEM_2B, MEM_4B, MEM_8B} Mem_Req_Size
deriving (Eq, FShow, Bits);
```

A memory request bundles a request type, a size, and an address. For memory-writes, we also bundle the data to be stored. We express this bundle using a struct:

```
src_Common/Mem_Req_Rsp.bsv: line 42 ...
typedef struct {Mem_Req_Type  req_type;
                Mem_Req_Size  size;
                Bit #(64)     addr;
                Bit #(64)     data;    // CPU => mem data
                ...
} Mem_Req
deriving (Eq, FShow, Bits);
```

In `Mem_Req_Size`, the option `MEM_8B` is not possible in RV32I. Similarly, in `Mem_Req` the `addr` and `data` fields need only be 32 bits wide for RV32I. However, we have declared them as shown looking ahead into the future, where we may wish to implement RV64I, or where we may wish to implement the “D” ISA extension (double-precision floating point values, which are represented in 64 bits).

For STORE requests of 1 and 2 bytes (*i.e.*, smaller than the `data` field) we assume the data is passed in the least-significant bytes of the `data` field.

This is the information sent to Memory from the Read-PC-and-Fetch step and also from the Execute-Memory-Ops step in Figure 6.1.
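As an illustration of how such a request might be assembled, here is a minimal sketch of a 1-byte STORE request with the byte placed in the least-significant bits of `data`. All `Demo_`/`demo_` names are hypothetical stand-ins so that the fragment compiles on its own; in particular, treating the request type as a plain `Bit #(5)` is an assumption based on the 5-bit codes above, not a quotation of the book's `Mem_Req_Type` declaration.

```bsv
// Stand-in declarations (assumptions), mirroring the shapes shown above
typedef Bit #(5) Demo_Req_Type;
Demo_Req_Type demo_funct5_STORE = 5'b_11111;

typedef enum {DEMO_1B, DEMO_2B, DEMO_4B, DEMO_8B} Demo_Req_Size
deriving (Eq, FShow, Bits);

typedef struct {Demo_Req_Type  req_type;
                Demo_Req_Size  size;
                Bit #(64)      addr;
                Bit #(64)      data;    // CPU => mem data
} Demo_Mem_Req
deriving (Eq, FShow, Bits);

// Hypothetical helper: request to store one byte at a 32-bit address
function Demo_Mem_Req fn_demo_store_byte (Bit #(32) addr, Bit #(8) b);
   return Demo_Mem_Req {req_type: demo_funct5_STORE,
                        size:     DEMO_1B,
                        addr:     zeroExtend (addr),   // widen 32 -> 64 bits
                        data:     zeroExtend (b)};     // byte in data [7:0]
endfunction

(* synthesize *)
module mkTb_Demo_Mem_Req (Empty);
   rule rl_once;
      let req = fn_demo_store_byte (32'h8000_0001, 8'h5A);
      $display ("request: ", fshow (req));
      $finish (0);
   endrule
endmodule
```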

### 6.4.3 Address Alignment

Although nowadays we think of all computer memories in units of 8-bit bytes and being byte-addressed,<sup>4</sup> in practice in hardware, it is usually simpler if memory-requests are *aligned* to an address according to the request size. Specifically, the address for a 2-byte request should be even, *i.e.*, the least significant bit of the address, `addr[0]`, should be zero. The address for a 4-byte request should have zero in the two least significant bits (`addr[1:0]`) and the address for an 8-byte request should have zero in the three least significant bits (`addr[2:0]`).

---

<sup>3</sup>When implementing the “C” RISC-V ISA extension (compressed instructions), instructions can also be 16 bits (2 bytes). When implementing more sophisticated Fetch units, we may actually fetch much larger chunks, such as a full cache line.

<sup>4</sup>Some early computers, until about the late 1970s, had other memory granularities: multiples of 6, 7, 9 bits, *etc.* Those were the days of bespoke memories for each computer design. Mass-production of memory chips resulted in standardization to 8-bit bytes.

We can see why address-alignment is desirable. Memory implementations (chips) are usually architected to retrieve multiple bytes at a time (*e.g.*, 64 bytes) so that all those bytes can share addressing and control circuitry. With such an organization, a misaligned access request may straddle the boundaries of such “naturally sized” units and so may require two consecutive reads/writes. Caches are usually organized to hold multi-byte *cache lines* (*e.g.*, 32 bytes) in order to share the addressing and miss-handling circuitry, and to move data efficiently in and out of the cache. Again, a misaligned access request may straddle a cache-line boundary, and may require two consecutive accesses, which may hit or miss independently. Virtual memory systems are usually organized in *pages*, units of typically 4K-8K bytes, in order to share virtual-memory handling circuitry, and to move data efficiently between main memory and disks. Again, a misaligned access may straddle a page boundary, and may require two consecutive accesses, which may hit or page-fault independently and differently. In short, misaligned accesses add significant complexity to memory-system hardware design.

We can organize our software so that misaligned accesses are exceedingly rare. Most software is produced by compilers, and the compiler ensures that instructions and data are placed in memory at aligned addresses, possibly by padding gaps between “adjacent” smaller-sized data (such as a pair of 1-byte-sized fields in a struct). This padding may waste a few bytes of memory, but pays back in greater speed and reduced complexity.

Although misaligned accesses are rare, we cannot always guarantee their absence in software, since software can calculate an arbitrary address before performing a memory access. It seems wasteful to have to pay for extra hardware complexity (with attendant loss in overall performance) for such rare cases. In many computer systems, therefore, these rare misaligned accesses are relegated to software handling:

- The memory system simply refuses to handle a misaligned access, and returns a “misaligned” error instead.
- The CPU, receiving such an error response, undergoes a “trap” which directs it to a piece of software called a trap-handler (or exception handler). The trap-handler (in software) performs the memory access in multiple smaller pieces (in the worst case, of size 1 byte), each of which is aligned. In other words, the trap-handler “completes” the original memory access before resuming the main-line code that attempted it.

The RISC-V ISA specification does not forbid misaligned accesses nor prescribe how they should be handled. Some implementations will handle them in hardware, and other implementations will return a “misaligned” error and rely on a trap handler to complete the access. Some implementations may only run software that has been proven not to generate misaligned memory requests, and therefore may not even contain a trap-handler for misaligned accesses.
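The alignment rule stated at the start of this section translates directly into a small combinational function. The sketch below is illustrative only: `Demo_Req_Size` and `fn_demo_is_aligned` are hypothetical names, and this is not necessarily how the memory models accompanying Drum and Fife perform the check.

```bsv
// Stand-in for the request-size enum (assumption, so the fragment is self-contained)
typedef enum {DEMO_1B, DEMO_2B, DEMO_4B, DEMO_8B} Demo_Req_Size
deriving (Eq, FShow, Bits);

// True if 'addr' is aligned for a request of the given size
function Bool fn_demo_is_aligned (Demo_Req_Size size, Bit #(64) addr);
   return case (size)
             DEMO_1B: True;                       // any byte address is fine
             DEMO_2B: (addr [0]   == 1'b0);
             DEMO_4B: (addr [1:0] == 2'b00);
             DEMO_8B: (addr [2:0] == 3'b000);
          endcase;
endfunction

(* synthesize *)
module mkTb_Demo_Align (Empty);
   rule rl_once;
      Bit #(64) a = 64'h0000_0000_8000_0002;
      $display ("addr %h  2-byte aligned: ", a, fshow (fn_demo_is_aligned (DEMO_2B, a)));
      $display ("addr %h  4-byte aligned: ", a, fshow (fn_demo_is_aligned (DEMO_4B, a)));
      $finish (0);
   endrule
endmodule
```

The example address ends in binary `...0010`, so it satisfies the 2-byte rule (`addr[0]` is zero) but not the 4-byte rule (`addr[1:0]` is `10`).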

### 6.4.4 Memory Responses

The response from memory for any request may be to report success, an alignment error, or some other error. Examples of “other errors” are:

- Absence of memory at the given address. For example, although RV32I addresses are 32 bits wide, which can address 4 GiB of memory, we may provision our system with something smaller, say 1 GiB.
- An unsupported operation. *E.g.*, an attempt to write into a read-only memory (ROM).
- Corruption of data, due to electrical glitches, environmental electromagnetic pulses, *etc.* These errors are usually detected with some kind of error-detecting code, such as parity bits.

These different memory response-types can be encoded in an enum type:

```
src_Common/Mem_Req_Rsp.bsv: line 56 ...
typedef enum {MEM_RSP_OK,
              MEM_RSP_MISALIGNED,
              MEM_RSP_ERR,
              ...
} Mem_Rsp_Type
deriving (Eq, FShow, Bits);
```

A memory-response contains the response-type. For a LOAD request with an OK response, it also contains the data that was read from memory. This can be expressed in a struct:

```
src_Common/Mem_Req_Rsp.bsv: line 65 ...
typedef struct {Mem_Rsp_Type  rsp_type;
                Bit #(64)     data;        // mem => CPU data
                ...
} Mem_Rsp
deriving (Eq, FShow, Bits);
```

For LOAD requests of 1 and 2 bytes (*i.e.*, smaller than the **data** field) we assume the data is returned in the least-significant bytes of the **data** field.
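To show how a memory model might fill in such a response, here is a minimal sketch for a 2-byte LOAD. Everything named `Demo_`/`demo_` is hypothetical, and zero-filling the upper bits of `data` is our assumption for the sketch; the text above only requires that the loaded value appear in the least-significant bytes.

```bsv
// Stand-in declarations (assumptions), mirroring the shapes shown above
typedef enum {DEMO_RSP_OK,
              DEMO_RSP_MISALIGNED,
              DEMO_RSP_ERR
} Demo_Rsp_Type
deriving (Eq, FShow, Bits);

typedef struct {Demo_Rsp_Type  rsp_type;
                Bit #(64)      data;       // mem => CPU data
} Demo_Mem_Rsp
deriving (Eq, FShow, Bits);

// Hypothetical helper: response to a 2-byte LOAD, given the outcome of the
// alignment and address-range checks and the halfword read from storage
function Demo_Mem_Rsp fn_demo_load_2B_rsp (Bool aligned, Bool addr_ok, Bit #(16) h);
   Demo_Mem_Rsp rsp = Demo_Mem_Rsp {rsp_type: DEMO_RSP_OK,
                                    data:     zeroExtend (h)};   // halfword in data [15:0]
   if (! aligned)
      rsp.rsp_type = DEMO_RSP_MISALIGNED;
   else if (! addr_ok)
      rsp.rsp_type = DEMO_RSP_ERR;
   return rsp;
endfunction

(* synthesize *)
module mkTb_Demo_Mem_Rsp (Empty);
   rule rl_once;
      $display ("ok:         ", fshow (fn_demo_load_2B_rsp (True,  True, 16'hBEEF)));
      $display ("misaligned: ", fshow (fn_demo_load_2B_rsp (False, True, 16'hBEEF)));
      $finish (0);
   endrule
endmodule
```

A consumer of the response should look at `data` only when `rsp_type` is the OK value.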

# Chapter 7

## RISC-V: Core functions for RISC-V ISA execution (used in Drum and Fife)

## 7.1 Introduction

In this chapter we discuss the core functions of Figure 7.1: `fn_Fetch`, `fn_Decode`, `fn_Dispatch`, `fn_EX_Control` and `fn_EX_Int`.



Figure 7.1: Simple interpretation of RISC-V instructions

## 7.2 The function `fn_Fetch`

The Fetch function *per se* is fairly simple, even trivial. Its input is the current value of the program counter (PC), which is used as the address in a memory-request to IMem. It has two outputs:

- A *memory request* to memory, to read an instruction. We have already seen the definition of the `Mem_Req` struct in Section 6.4.2.
- Some additional information, `Fetch_to_Decode`, passed on to the Decode step.

The `Fetch_to_Decode` struct has only one interesting field, the PC:

```
src_Common/Inter_Stage.bsv: line 27 ...
typedef struct {
   Bit #(XLEN)  pc;
   ...
   Bit #(64)    inum;           // for debugging only
} Fetch_to_Decode
deriving (Bits, FShow);
```

The field `inum` holds the “instruction number”, which is a sequence number counting every instruction fetched. It is used only for debugging, to be able to identify a specific fetched instruction. (In the source code you will see two more fields `predicted_pc` and `epoch`; ignore these for now, they are not used in Drum, only in Fife.)

To pass both results of `fn_Fetch`, we simply use a *nested* struct, *i.e.*, a struct containing the two component structs:

```
src_Common/Fn_Fetch.bsv: line 26 ...
typedef struct {
   Fetch_to_Decode  to_D;
   Mem_Req          mem_req;
} Result_F
deriving (Bits, FShow);
```

Finally, as mentioned earlier, the function `fn_Fetch` is almost trivial: it merely fills in the two result structs based on argument values and returns them:

```
src_Common/Fn_Fetch.bsv: line 32 ...
1  // This is actually a pure function; is ActionValue only to allow
2  // $display insertion for debugging
3  function ActionValue #(Result_F)
4     fn_Fetch (Bit #(XLEN)  pc,
5               ...
6               Bit #(64)    inum,
7               ...
8     actionvalue
9        Result_F y = ?;
10       // Info to next stage
11       y.to_D = Fetch_to_Decode {pc:       pc,
12                                 ...
13       // Request to IMem
14       y.mem_req = Mem_Req {req_type: funct5_LOAD,
15                            size:     MEM_4B,
16                            addr:     zeroExtend (pc),
17                            data:     ?,
18                            // Debugging
19                            inum:     inum,
20                            ...
21       return y;
22    endactionvalue
23 endfunction
```

As the comment at the top of the code excerpt says, `fn_Fetch` is actually a pure function (no side-effects), and we could have written it as follows:

```
function Result_F
   fn_Fetch (Bit #(XLEN)  pc,
             ...
             Bit #(64)    inum);
   Result_F y = ?;
   ...
   return y;
endfunction
```

*i.e.*, the return-type is `Result_F` instead of `ActionValue#(Result_F)`, and we simply drop the `actionvalue`/`endactionvalue` bracketing keywords. Recall Section 5.6.1 where we discussed pure *vs.* side-effecting functions, and their distinction in the type-system through the absence/presence of the `ActionValue` type constructor. Even though this is a pure function, by couching it as an `ActionValue` type, we preserve the option of inserting a `$display` (which *is* a side-effect) for debugging purposes, should we need it in the future.
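The practical difference between the two styles shows up at the call site. The following sketch uses two hypothetical functions (not from the book's sources) that compute the same value, one pure and one `ActionValue`, and invokes each from a rule.

```bsv
// Pure version: no side effects; invoked with '='
function Bit #(32) fn_demo_next_pc (Bit #(32) pc);
   return pc + 4;
endfunction

// ActionValue version of the same computation; invoked with '<-'.
// The wrapper leaves room for a $display, which is a side effect.
function ActionValue #(Bit #(32)) fav_demo_next_pc (Bit #(32) pc);
   actionvalue
      $display ("fav_demo_next_pc: pc = %h", pc);
      return pc + 4;
   endactionvalue
endfunction

(* synthesize *)
module mkTb_Demo_AV (Empty);
   rule rl_once;
      let a  = fn_demo_next_pc  (32'h8000_0000);   // pure: plain binding
      let b <- fav_demo_next_pc (32'h8000_0000);   // ActionValue: perform action, bind result
      $display ("a = %h  b = %h", a, b);
      $finish (0);
   endrule
endmodule
```

Both bindings yield the same value; only the `ActionValue` version can print from inside the function.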

In line 9 we declare variable `y` of type `Result_F`, with unspecified initial value (recall Section 6.2.2 where we discussed using “?” to indicate an unspecified or don’t-care value). The two fields of `y` are then updated in lines 11 and 14, respectively, and `y` is finally returned as the result of the function.

We also use “?” in line 17 because the `data` field is only relevant in a STORE memory request, whereas this is a LOAD request (an instruction fetch).

### Exercise 7.1:

Write a testbench for `fn_Fetch()`, apply it to a number of 32-bit values (PC values) and print the results using `$display` and `fshow`, and visually check that the `Fetch_to_Decode` and `Mem_Req` outputs look correct.

□

## 7.3 The `fn_Decode` function

The core function for the Decode step is called `fn_Decode`. Its arguments are a struct of type `Fetch_to_Decode` from the Fetch step and a `Mem_Rsp` memory response struct from memory. Its output struct type, `Decode_to_RR`, was described in Section 6.2. The code for `fn_Decode` is mostly a big if-then-else that analyses the incoming instruction and produces some summary information:

```
src_Common/Fn_Decode.bsv: line 28 ...
1   function ActionValue #(Decode_to_RR)
2      fn_Decode (Fetch_to_Decode  x_F_to_D,
3                 Mem_Rsp          rsp_IMem,
4                 ...
5      actionvalue
6         Bit #(32) instr = truncate (rsp_IMem.data);
7         Bit #(5)  rd    = instr_rd (instr);
8
9         let fallthru_pc = x_F_to_D.pc + 4;
10
11        // Baseline info to next stage
12        let y = Decode_to_RR {pc:          x_F_to_D.pc,
13
14                              exception:   False,
15                              cause:       ?,
16                              tval:        0,
17
18                              // not-exception
19                              fallthru_pc: fallthru_pc,
20                              instr:       instr,
21                              opclass:     ?,
22                              has_rs1:     False,
23                              has_rs2:     False,
24                              has_rd:      False,
25                              writes_mem:  False,
26                              imm:         0,
27                              ...
28
29        Bool non_zero_rd = (rd != 0);
30
31        if (rsp_IMem.rsp_type == MEM_RSP_MISALIGNED) begin
32           y.exception = True;
33           y.cause     = cause_INSTRUCTION_ADDRESS_MISALIGNED;
34           y.tval      = truncate (rsp_IMem.addr);
35        end
36        else if (rsp_IMem.rsp_type == MEM_RSP_ERR) begin
37           y.exception = True;
38           y.cause     = cause_INSTRUCTION_ACCESS_FAULT;
39           y.tval      = truncate (rsp_IMem.addr);
40        end
41        ...
42        else if (is_legal_LUI (instr) || is_legal_AUIPC (instr)) begin
43           y.opclass = OPCLASS_INT;
44           y.has_rd  = non_zero_rd;
45           y.imm     = signExtend ({ instr_imm_U (instr), 12'h000 });
46        end
47        else if (is_legal_BRANCH (instr)) begin
48           y.opclass = OPCLASS_CONTROL;
49           y.has_rs1 = True;
50           y.has_rs2 = True;
51           y.imm     = signExtend (instr_imm_B (instr));
52        end
53        else if (is_legal_JAL (instr)) begin
54           y.opclass = OPCLASS_CONTROL;
55           y.has_rd  = non_zero_rd;
56           y.imm     = signExtend (instr_imm_J (instr));
57        end
58        else if (is_legal_JALR (instr)) begin
59           y.opclass = OPCLASS_CONTROL;
60           y.has_rs1 = True;
61           y.has_rd  = non_zero_rd;
62           y.imm     = signExtend (instr_imm_I (instr));
63        end
64        else if (is_legal_LOAD (instr)) begin
65           y.opclass = OPCLASS_MEM;
66           y.has_rs1 = True;
67           y.has_rd  = non_zero_rd;
68           y.imm     = signExtend (instr_imm_I (instr));
69        end
70        else if (is_legal_STORE (instr)) begin
71           y.opclass    = OPCLASS_MEM;
72           y.has_rs1    = True;
73           y.has_rs2    = True;
74           y.writes_mem = True;
75           y.imm        = signExtend (instr_imm_S (instr));
76        end
77        else if (is_legal_OP_IMM (instr)) begin
78           y.opclass = OPCLASS_INT;
79           y.has_rs1 = True;
80           y.has_rd  = non_zero_rd;
81           y.imm     = signExtend (instr_imm_I (instr));
82        end
83        else if (is_legal_OP (instr)) begin
84           y.opclass = OPCLASS_INT;
85           y.has_rs1 = True;
86           y.has_rs2 = True;
87           y.has_rd  = non_zero_rd;
88        end
89        else if (is_legal_ECALL (instr)
90                 || is_legal_EBREAK (instr)
91                 || is_legal_MRET (instr)) begin
92           y.opclass = OPCLASS_SYSTEM;
93        end
94        else if (is_legal_CSRRx (instr)) begin
95           y.opclass = OPCLASS_SYSTEM;
96           y.has_rs1 = (instr_funct3 (instr) [2] == 0);
97           y.has_rd  = non_zero_rd;
98        end
99        else if (is_legal_FENCE (instr)) begin
100          y.opclass = OPCLASS_FENCE;
101       end
102       else begin
103          y.exception = True;
104          y.cause     = cause_ILLEGAL_INSTRUCTION;
105          y.tval      = truncate (instr);
106       end
107
108       return y;
109    endactionvalue
110 endfunction
```

In line 6, we extract the instruction from the `Mem_Rsp` memory response `data` from the Fetch operation. The `truncate` operation is used to shrink the bit-vector width of `rsp_IMem.data` (64 bits) to the bit-vector width of `instr` (32 bits). (Section 6.4.2 discussed why we declared `rsp_IMem.data` to be 64 bits wide.) The `truncate` operation is polymorphic, accepting arguments of any bit-width that is at least as wide as the required output. Note that `truncate` keeps the least-significant bits and drops the most-significant bits.
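The width-changing operations used in `fn_Decode` can be explored in isolation. The following sketch (module name and values are purely illustrative) applies `truncate`, `signExtend` and `zeroExtend` to constants, so the effect on the bits is easy to see in the printed output.

```bsv
(* synthesize *)
module mkTb_Demo_Widths (Empty);
   rule rl_once;
      // truncate: keeps the least-significant bits
      Bit #(64) wide  = 64'hDEAD_BEEF_0000_0013;
      Bit #(32) instr = truncate (wide);          // value 32'h0000_0013

      // signExtend replicates the top bit; zeroExtend pads with zeros
      Bit #(12) imm12 = 12'hFFF;
      Bit #(32) s     = signExtend (imm12);       // value 32'hFFFF_FFFF
      Bit #(32) z     = zeroExtend (imm12);       // value 32'h0000_0FFF

      $display ("instr = %h", instr);
      $display ("signExtend = %h  zeroExtend = %h", s, z);
      $finish (0);
   endrule
endmodule
```

In each case the output width is not written at the call; the compiler infers it from the declared type of the variable being assigned.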

In line 7 we extract the `rd` (“destination register”) field from the instruction. In line 9 we compute the fall-through PC, `PC+4` (with the caveat that if we extend the implementation to support the “C” RISC-V ISA extension (“Compressed” instructions), it may be `PC+2`, which can be gleaned from the instruction encoding).

In lines 11-26 we create a baseline `Decode_to_RR` value which we will selectively modify in the if-then-else statements that follow.

In lines 31-40 we first handle the situations where the Fetch operation to memory itself returned an error. We mark the `exception` field True and fill in the appropriate `cause` and `tval` (these will be placed in the MCAUSE and MTVAL CSRs in the Retire step).

The rest of the code is a series of if-then-else clauses. Each clause identifies one class of instruction and updates the `opclass` field correspondingly. The repertoire of instructions that we consider are the forty instructions listed in the “RV32I Base Instruction Set” table of “Table 24.2: Instruction listing for RISC-V” of the Unprivileged Spec [26].

Each if-then-else clause also fills in the `has_rs1`, `has_rs2` and `has_rd` fields, as appropriate, for each class of instruction. Note that the `has_rd` field is set to False if `rd` is zero (recall that general-purpose register `x0` ignores writes and always reads as 0).

We also decode each kind of “immediate” and fill in the `imm` field in the struct. Recall from Section 2.5, including Figures 2.4 and 2.5, that different classes of instructions encode immediate values in different ways, and the immediate values can have different bit-widths. We use the functions `instr_imm_I()`, `instr_imm_S()`, `instr_imm_B()`, `instr_imm_U()` and `instr_imm_J()` to extract and rearrange the immediate bits appropriately for each class of instruction. Then, here in `fn_Decode`, we zero- or sign-extend each immediate as appropriate so that, from this point onwards, each immediate can be treated as an ordinary `Bit#(XLEN)` value.

An exercise below suggests that you write the code for these `instr_imm_X` functions; it’s good practice for the BSV beginner!

The final “else” clause is selected if the instruction does not match any of the forty RV32I instructions. In this case we set the `exception` field, and set the `cause` field to indicate an illegal instruction.

Observe that the entire `fn_Decode()` function is just a (large) combinational circuit—it is an acyclic composition of smaller combinational circuits, many of which we’ve seen earlier. The whole `fn_Decode()` function can be visualized as a box with incoming wires corresponding to the `Fetch_to_Decode` struct and the `Mem_Rsp` struct, outgoing wires corresponding to the `Decode_to_RR` struct, and filled with logic gates that compute each output wire as a function of the input wires.

---

### Exercise 7.2:

The provided source code includes the functions `instr_imm_I()`, `instr_imm_S()`, `instr_imm_B()`, `instr_imm_U()` and `instr_imm_J()` (in file `Instr_Bits.bsv`), but try to write them yourself first, and compare your solutions to the provided codes.

### Exercise 7.3:

Write a testbench for `fn_Decode()`, and apply it to a number of PC and instruction values. For each input PC value, construct a `Fetch_to_Decode` struct around it. For each input instruction, construct a `Mem_Rsp` struct around it, some with memory errors, some without. Apply `fn_Decode` to such pairs. Print the results using `$display` and `fshow`, and visually check that the `Decode_to_RR` outputs look correct.

□

---

## 7.4 The `fn_Dispatch` function after reading input registers

In the Register-Read and Dispatch step (“RR”, “RRD”), we first read the `rs1` and `rs2` values from the Register File. We will cover register files in Section 8.4, and register-reads when we discuss the Drum CPU itself, in Chapter 11.

With the `rs1` and `rs2` values in hand, we use function `fn_Dispatch` to determine the information to be passed to the various alternative steps (“flows”) that follow. Its argument is a `Decode_to_RR` struct from the Decode step, which was discussed in Section 6.2. Before we look at the result type of `fn_Dispatch`, let us discuss the four possible subsequent flows:

- “Direct”: Some information is sent directly from RR to Retire for *every* instruction. Crucially, RR sends a *tag* that indicates which of the four flows is being followed for the current instruction.

Additional direct information includes information from RR (PC, whether an exception has already been seen in the Fetch or Decode stages, `has_rd`, `writes_mem`, the instruction, the fall-through PC and the `rs1` value (needed for CSRRxx instructions)).

The direct flow is also used for all SYSTEM instructions, including CSRRxx, ECALL, EBREAK, and MRET.

- “Control”: If the instruction is a BRANCH, JAL or JALR, we produce information for the Execute Control flow.
- “Integer”: If the instruction is a LUI, AUIPC or integer arithmetic or logic (IALU) operation, we produce information for the Execute Integer flow.
- “DMem”: If the instruction is a LOAD, STORE or FENCE, we produce information for the Execute Memory flow.

The following enum type declaration defines four constants to identify the flow for this instruction:

```
// src_Common/Inter_Stage.bsv: line 75 ...
typedef enum {EXEC_TAG_DIRECT,
              EXEC_TAG_CONTROL,
              EXEC_TAG_INT,
              EXEC_TAG_DMEM
} Exec_Tag
deriving (Bits, Eq, FShow);
```
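To see what the `deriving` clause provides, the following small testbench is a sketch (the module name `mkTb_Exec_Tag` is our own, not part of the provided sources). It repeats the enum declaration so that it stands alone, then uses `pack` (from `Bits`) and `fshow` (from `FShow`) on a tag value.

```bsv
typedef enum {EXEC_TAG_DIRECT,
              EXEC_TAG_CONTROL,
              EXEC_TAG_INT,
              EXEC_TAG_DMEM
} Exec_Tag
deriving (Bits, Eq, FShow);

(* synthesize *)
module mkTb_Exec_Tag (Empty);
   rule rl_show;
      Exec_Tag  tag  = EXEC_TAG_INT;
      // Four constants need two bits; derived encodings are 0, 1, 2, 3 in order
      Bit #(2)  bits = pack (tag);
      $display ("encoding = %0d  tag = ", bits, fshow (tag));
      $finish (0);
   endrule
endmodule
```

With the derived `Bits` instance, `EXEC_TAG_INT` is the third constant and so is encoded as 2.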

The result type of `fn_Dispatch` is `Result_Dispatch`. It is just a nested struct that contains four different struct types for the four flows (the struct types can be seen on arc labels in Figure 7.1).

```
// src_Common/Fn_Dispatch.bsv: line 30 ...
typedef struct {
   RR_to_Retire      to_Retire;
   RR_to_EX_Control  to_EX_Control;
   RR_to_EX          to_EX;
   Mem_Req           to_EX_DMem;
} Result_Dispatch
deriving (Bits, FShow);
```

The first component is the information sent directly to Retire:

```
// src_Common/Inter_Stage.bsv: line 82 ...
typedef struct {Exec_Tag     exec_tag;    // 'flow' for this instr

                Bit #(XLEN)  pc;
                Bool         has_rd;      // From RR
                Bool         writes_mem;  // From RR

                Bool         exception;   // Fetch exception, decode illegal instr
                Bit #(4)     cause;
                Bit #(XLEN)  tval;

                // If not exception
                Bit #(32)    instr;
                Bit #(XLEN)  fallthru_pc;
                Bit #(XLEN)  rs1_val;     // For CSRRxx instrs
                ...

                Bit #(64)    inum;        // for debugging only
} RR_to_Retire
deriving (Bits, FShow);
```

The `exec_tag` informs Retire about the flow for this instruction. The `exception` and `cause` fields from `Decode_to_RR` are carried through as-is, since it is the Retire step that handles all exceptions. Note that, in addition to these exceptions passed from RR to Retire, the Execute steps (such as Execute Control) can also produce exceptions of their own. The `pc` field is needed in case Retire has to handle a trap or interrupt, which saves the `pc` before handling it.

The `has_rd` field is carried through to control whether or not Retire tries to write a value back to the register file. The `fallthru_pc` is used for most instructions that complete successfully (without raising an exception).

The second component of `Result_Dispatch` is the information sent to Execute Control (we will see how these fields are used in Section 7.5 when we discuss `fn_EX_Control`):

```
// src_Common/Inter_Stage.bsv: line 110 ...
typedef struct {Bit #(XLEN)  pc;
                Bit #(XLEN)  fallthru_pc;
                Bit #(32)    instr;
                Bit #(XLEN)  rs1_val;
                Bit #(XLEN)  rs2_val;
                Bit #(XLEN)  imm;
                Bit #(64)    inum;    // for debugging only
} RR_to_EX_Control
deriving (Bits, FShow);
```

The third component of `Result_Dispatch` is the information sent to Execute Int (we will see how these fields are used in Section 7.6 when we discuss `fn_EX_Int`):

```
// src_Common/Inter_Stage.bsv: line 141 ...
typedef struct {Bit #(32)    instr;
                Bit #(XLEN)  rs1_val;
                Bit #(XLEN)  rs2_val;
                Bit #(XLEN)  imm;
                ...
} RR_to_EX
deriving (Bits, FShow);
```

We choose to call it `RR_to_EX` instead of `RR_to_EX_Int` because the same struct type is likely to be used in future also for the other optional execution pipes shown in Figure 7.1: Integer Multiply Divide (“M” ISA extension), floating point (“F” and “D” ISA extensions) and Custom (non-standard ISA extensions).

The fourth component of `Result_Dispatch` is the information sent to Execute DMem. It is the same `Mem_Req` struct we have already discussed in Section 6.4.2, and which is also an output component of `fn_Fetch`.

The code for `fn_Dispatch` is shown below. Its arguments are the `Decode_to_RR` struct from the Decode stage, and values from the source registers `rs1` and `rs2`.

```
src_Common/Fn_Dispatch.bsv: line 40 ...
1  function ActionValue #(Result_Dispatch)
2     fn_Dispatch (Decode_to_RR  x,
3                  Bit #(XLEN)   rs1_val,
4                  Bit #(XLEN)   rs2_val,
5                  ...
6     actionvalue
7        // Compute tag to control merging at Retire
8        Exec_Tag exec_tag = EXEC_TAG_DIRECT;    // exceptions and OPCLASS_SYSTEM
9        if (!x.exception) begin
10          if      (x.opclass == OPCLASS_CONTROL) exec_tag = EXEC_TAG_CONTROL;
11          else if (x.opclass == OPCLASS_INT)     exec_tag = EXEC_TAG_INT;
12          else if (x.opclass == OPCLASS_MEM)     exec_tag = EXEC_TAG_DMEM;
13          else if (x.opclass == OPCLASS_FENCE)   exec_tag = EXEC_TAG_DMEM;
14       end
15
16       let to_Retire = RR_to_Retire {exec_tag:    exec_tag,
17                                     pc:          x.pc,
18                                     has_rd:      x.has_rd,
19                                     writes_mem:  x.writes_mem,
20
21                                     exception:   x.exception,
22                                     cause:       x.cause,
23                                     tval:        x.tval,
24
25                                     instr:       x.instr,
26                                     fallthru_pc: x.fallthru_pc,
27                                     rs1_val:     rs1_val,
28
29                                     ...
30
31       // -----
32       // Info for EX_Control
33       let to_EX_Control = RR_to_EX_Control {pc:          x.pc,
34                                             fallthru_pc: x.fallthru_pc,
35                                             instr:       x.instr,
36                                             rs1_val:     rs1_val,
37                                             rs2_val:     rs2_val,
38                                             imm:         x.imm,
39
40                                             ...
41       // -----
42       // Info for Execute Int pipe
43       let to_EX = RR_to_EX {pc:       x.pc,
44                             instr:    x.instr,
45                             rs1_val:  rs1_val,
46                             rs2_val:  rs2_val,
47                             imm:      x.imm,
48
49                             ...
50       // -----
51       // Info for Execute DMem pipe
52       Bit #(XLEN)  eaddr    = rs1_val + x.imm;
53       Mem_Req_Size mrq_size = unpack (x.instr [13:12]);    // B, H, W or D
54       Mem_Req_Type mrq_type = (is_LOAD (x.instr) ? funct5_LOAD
55                                : (is_STORE (x.instr) ? funct5_STORE
56                                   : (is_FENCE (x.instr) ? funct5_FENCE
57                                      : funct5_BOGUS)));
58       let to_EX_DMem = Mem_Req {req_type: mrq_type,
59                                 size:     mrq_size,
60                                 addr:     zeroExtend (eaddr),
61                                 data:     zeroExtend (rs2_val),
62                                 ...
63       // -----
64       // Construct and return final result
65       let result = Result_Dispatch {to_Retire:     to_Retire,
66                                     to_EX_Control: to_EX_Control,
67                                     to_EX:         to_EX,
68                                     to_EX_DMem:    to_EX_DMem};
69       return result;
70    endactionvalue
71 endfunction
```

Lines 7-14 compute `exec_tag`, *i.e.*, the flow to be followed.

Lines 16-29, 33-38, 42-47 and 51-61 construct the four flow-specific struct values. The latter three are meaningful only if `exec_tag` indicates Control, Integer or DMem, respectively, but there is no harm in constructing them even when they contain bogus data: they are used only when their flow is chosen, and ignored otherwise.

Lines 64-69 construct and return the final result with the four component structs.
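The “compute everything, select by tag” idea can be illustrated in isolation. The following sketch is not taken from the Drum or Fife sources: the helper name `fn_pick_by_tag` and its 32-bit arguments are assumptions made for illustration. All three candidate values arrive as inputs regardless of the tag; the tag merely decides which one is passed on.

```bsv
typedef enum {EXEC_TAG_DIRECT,
              EXEC_TAG_CONTROL,
              EXEC_TAG_INT,
              EXEC_TAG_DMEM
} Exec_Tag
deriving (Bits, Eq, FShow);

// Hypothetical helper: every candidate value is always computed by its
// producer; the tag only decides which one is used downstream.
function Bit #(32) fn_pick_by_tag (Exec_Tag   tag,
                                   Bit #(32)  v_control,
                                   Bit #(32)  v_int,
                                   Bit #(32)  v_dmem);
   Bit #(32) result = 0;
   case (tag)
      EXEC_TAG_CONTROL: result = v_control;
      EXEC_TAG_INT:     result = v_int;
      EXEC_TAG_DMEM:    result = v_dmem;
      default:          result = 0;    // EXEC_TAG_DIRECT: no execute-flow value
   endcase
   return result;
endfunction
```

In hardware this is simply a multiplexer: the unselected inputs carry whatever (possibly bogus) values their producers computed, and those values have no effect on the output.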

## 7.5 The `fn_EX_Control` function

This function is a simple one-input one-output function. The input type `RR_to_EX_Control` was described in Section 7.4, where it was an output type. The output type of `fn_EX_Control` is shown below:

```
// src_Common/Inter_Stage.bsv: line 121 ...
typedef struct {Bool         exception;    // Misaligned BRANCH/JAL/JALR target
                Bit #(4)     cause;
                Bit #(XLEN)  tval;

                Bit #(XLEN)  next_pc;
                Bit #(XLEN)  data;         // Return-PC for JAL/JALR
                ...
} EX_Control_to_Retire
deriving (Bits, FShow);
```

Here is the Execute Control function:

```
src_Common/Fn_EX_Control.bsv: line 32 ...
1  function ActionValue #(EX_Control_to_Retire)
2     fn_EX_Control (RR_to_EX_Control x,
3                    ...
4        Bit #(XLEN) next_pc   = ?;
5        Bool        exception = False;    // Misaligned target_pc
6
7        if (is_BRANCH (instr)) begin
8           Bool branch_taken = case (instr_funct3 (instr))
9                                  funct3_BEQ:  (rs1_val == rs2_val);
10                                 funct3_BNE:  (rs1_val != rs2_val);
11                                 funct3_BLT:  signedLT (rs1_val, rs2_val);
12                                 funct3_BGE:  signedGE (rs1_val, rs2_val);
13                                 funct3_BLTU: (rs1_val < rs2_val);
14                                 funct3_BGEU: (rs1_val >= rs2_val);
15                              endcase;
16          let target_pc = x.pc + x.imm;
17          next_pc   = (branch_taken ? target_pc : x.fallthru_pc);
18          exception = (branch_taken && (target_pc [1:0] != 0));
19          ...
20       end
21       else if (is_JAL (instr)) begin
22          next_pc   = x.pc + x.imm;
23          exception = (next_pc [1:0] != 0);
24          ...
25       end
26       else if (is_JALR (instr)) begin
27          // zero out LSB in target PC
28          next_pc   = ((rs1_val + x.imm) & ~1);
29          exception = (next_pc [1:0] != 0);
30          ...
31       end
32       ...
33
34       let y = EX_Control_to_Retire {exception: exception,
35                                     cause:     cause_INSTRUCTION_ADDRESS_MISALIGNED,
36                                     tval:      x.pc,
37                                     next_pc:   next_pc,
38                                     data:      x.fallthru_pc,
39                                     ...
40
41       return y;
42    endactionvalue
43 endfunction
```

The first “if” clause handles BRANCH (conditional branch) instructions. The `case` expression computes the boolean value `branch_taken`, *i.e.*, the decision whether to take the branch or not. A case-clause is chosen based on the 3-bit `funct3` field of the instruction that identifies the specific condition to be tested.<sup>1</sup> Note that for BLT and BGE, we use BSV library functions `signedLT` and `signedGE` that interpret `rs1_val` and `rs2_val` as 2’s-complement signed integers.

Line 16 computes the target PC should the branch be taken. Line 17 computes the next PC, which is either the target PC or the fall-through PC depending on whether or not the branch is taken. Finally, Line 18 checks, if the branch is taken, that the target PC is a properly aligned address (if not, we must raise an exception).

The next “else if” clause handles JAL (Jump and Link) instructions and the final “else if” clause handles the JALR (Jump and Link Register) instructions. They are both straightforward, unconditional calculations of a next PC, along with an alignment-check that the next PC is suitably aligned. As per the Unprivileged ISA specification document, JALR also zeroes the least-significant bit of target PC.

The final section constructs the `EX_Control_to_Retire` struct result and returns it.
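The JALR calculation is worth trying on a concrete value, because clearing bit 0 does not by itself guarantee 4-byte alignment. The following sketch is our own (the names `fn_jalr_sketch` and `mkTb_JALR` are hypothetical, and the PC width is fixed at 32 bits); it mirrors the two JALR lines of `fn_EX_Control` in a self-contained form.

```bsv
// Hypothetical helper: JALR target and misalignment flag for 32-bit PCs.
function Tuple2 #(Bit #(32), Bool) fn_jalr_sketch (Bit #(32) rs1_val,
                                                   Bit #(32) imm);
   Bit #(32) next_pc    = (rs1_val + imm) & ~1;    // clear bit 0
   Bool      misaligned = (next_pc [1:0] != 0);    // bit 1 may still be set
   return tuple2 (next_pc, misaligned);
endfunction

(* synthesize *)
module mkTb_JALR (Empty);
   rule rl_run;
      // rs1 = 0x1003, imm = 0: target becomes 0x1002, which is misaligned
      match {.npc, .bad} = fn_jalr_sketch ('h1003, 0);
      $display ("next_pc = %08h  misaligned = ", npc, fshow (bad));
      $finish (0);
   endrule
endmodule
```

Here `0x1003` has its least-significant bit cleared to give `0x1002`; bits [1:0] are then `2'b10`, so the misalignment flag is `True`.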

<sup>1</sup>Unlike C/C++, where a case-clause “falls through” to the next case-clause unless you have a `break` statement, in BSV only one case-clause is executed; there is no fall-through.

#### Exercise 7.4:

In `fn_EX_Control`, the BRANCH, JAL and JALR clauses set the `exception` field to true if the next PC is not aligned.

1. What would happen if we did not set the exception field here?
2. See “Section 2.5 Control Transfer Instructions” in the Unprivileged ISA specification document [26] for a discussion of why we set it here.

#### Exercise 7.5:

Prove (informally) that the three-way if-then-else shown above in `fn_EX_Control` will catch all cases, *i.e.*, that we never need a final “`else`” clause. This requires reviewing `fn_Decode` and tracking the flow of information through the Register-Read-and-Dispatch step (including `fn_Dispatch`), into `fn_EX_Control`.

#### Exercise 7.6:

In `fn_EX_Control`, can we change the final “`else if`”:

```
else if (is_JALR (instr)) begin
```

into a simple “`else`”, *i.e.*, omit the `is_JALR` check?

```
else begin
```

What might be the hardware implication of such a change?

#### Exercise 7.7:

In the final section of `fn_EX_Control`, why do we have: `data: x.fallthru_pc`?

*Hint:* review the semantics of JAL and JALR instructions.

What if this is a BRANCH instruction?

What if this is a JAL or JALR instruction with `rd = 0`?

□

## 7.6 The `fn_EX_Int` function

This function is a simple one-input one-output function. The input type `RR_to_EX` was described in Section 7.4, where it was an output type. The output type of `fn_EX_Int` is shown below:

```
// src_Common/Inter_Stage.bsv: line 154 ...
typedef struct {Bool         exception;
                Bit #(4)     cause;
                Bit #(XLEN)  tval;

                Bit #(XLEN)  data;
                ...
} EX_to_Retire
deriving (Bits, FShow);
```

We choose to call it `EX_to_Retire` instead of `EX_Int_to_Retire` because the same struct type is likely to be used in future also for the other optional execution pipes shown in Figure 7.1: Integer Multiply Divide (“M” ISA extension), floating point (“F” and “D” ISA extensions) and Custom (non-standard ISA extensions).

Here is the Execute Integer function:

```
// src_Common/Fn_EX_Int.bsv: line 31 ...
function ActionValue #(EX_to_Retire)
   fn_EX_Int (RR_to_EX  x,
              ...
   actionvalue
      let instr = x.instr;

      let y = EX_to_Retire {exception: False,
                            cause:     ?,
                            tval:      ?,
                            data:      ?,
                            ...

      if (is_LUI (instr)) begin
         y.data = x.imm;
         ...
      end
      else if (is_AUIPC (instr)) begin
         y.data = x.pc + x.imm;
         ...
      end
      else begin
         let result <- fn_IALU (instr, x.rs1_val, x.rs2_val, x.imm,
                                ...
         y.data = result;
         ...
      end
      return y;
   endactionvalue
endfunction
```

The first “if” handles LUI instructions (Load Upper Immediate). The following “else if” handles AUIPC instructions (Add Upper Immediate to PC). The final “else” handles all the remaining Integer ops by invoking an “ALU” function:

```
src_Common/IALU.bsv: line 28 ...
1  function ActionValue #(Bit #(XLEN))
2     fn_IALU (Bit #(32)    instr,
3              Bit #(XLEN)  v1,
4              Bit #(XLEN)  v2,
5              Bit #(32)    imm,
6              ...
7     actionvalue
8        Bit #(7)    opcode = instr_opcode (instr);
9        Bit #(3)    funct3 = instr_funct3 (instr);
10       // Signed int versions of v1, v2 and imm
11       Int #(XLEN) iv1    = unpack (v1);
12       Int #(XLEN) iv2    = unpack (v2);
13       Int #(XLEN) i_immm = unpack (imm);
14       ...
15       Bit #(XLEN) y_OP   = 0;
16       if (opcode == opcode_OP) begin
17          ...
18          Bit #(6) shamrt = (v2 [5:0] & ((xlen == 32) ? 'h1F : 'h3F));
19          case (funct3)
20             funct3_ADD:   y_OP = pack ((instr [30] == 1'b0)
21                                        ? (iv1 + iv2)
22                                        : (iv1 - iv2));
23             funct3_SLL:   y_OP = v1 << shamrt;
24             funct3_SLT:   y_OP = ((iv1 < iv2) ? 1 : 0);
25             funct3_SLTU:  y_OP = ((v1 < v2) ? 1 : 0);
26             funct3_XOR:   y_OP = v1 ^ v2;
27             funct3_SRL:   y_OP = v1 >> shamrt;
28             funct3_SRA:   y_OP = pack (iv1 >> shamrt);
29             funct3_OR:    y_OP = v1 | v2;
30             funct3_AND:   y_OP = v1 & v2;
31             ...
32          endcase
33          ...
34       end
35
36       Bit #(XLEN) y_OP_IMM = 0;
37       if (opcode == opcode_OP_IMM) begin
38          ...
39          Bit #(6) shamrt = (imm [5:0] & ((xlen == 32) ? 'h1F : 'h3F));
40          case (funct3)
41             funct3_ADDI:  y_OP_IMM = pack (iv1 + i_immm);
42             funct3_SLTI:  y_OP_IMM = ((iv1 < i_immm) ? 1 : 0);
43             funct3_SLTIU: y_OP_IMM = ((v1 < imm) ? 1 : 0);
44             funct3_XORI:  y_OP_IMM = v1 ^ imm;
45             funct3_ORI:   y_OP_IMM = v1 | imm;
46             funct3_ANDI:  y_OP_IMM = v1 & imm;
47             funct3_SLLI:  y_OP_IMM = v1 << shamrt;
48             funct3_SRLI:  y_OP_IMM = v1 >> shamrt;
49             funct3_SRAI:  y_OP_IMM = pack (iv1 >> shamrt);
50             ...
51          endcase
52          ...
53       end
54
55       Bit #(XLEN) result = y_OP | y_OP_IMM;
56       ...
57       return result;
58    endactionvalue
59 endfunction
```

We create a separate `fn_IALU` function, instead of inlining it into `fn_EX_Int`, because `fn_IALU` is not very RISC-V specific, and may be useful in other contexts that have nothing to do with RISC-V, Drum or Fife.

In lines 11-13, we define signed-integer versions `iv1`, `iv2` and `i_immm` of the unsigned integer values `v1`, `v2`, and `imm`, respectively, using the standard BSV `unpack` function. There is no hardware cost to this; these declarations simply “view” the same bits differently (as 2’s-complement coded integers). The difference arises later, when we apply certain operators to these values. For example, lines 24-25 compute the SLT (Set Less Than (signed)) and SLTU (Set Less Than Unsigned) operations. SLT uses the signed values `iv1` and `iv2`, whereas SLTU uses the unsigned values `v1` and `v2`. Between the `bsc` compiler and the Verilog back-end, different code will be generated for the “`<`” operator to perform the correct kind of comparison.
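The effect of “viewing the same bits differently” is easy to observe in a small testbench. This is a sketch of our own (the module name `mkTb_Signed_Compare` is hypothetical): it applies “`<`” to one bit pattern, first as `Bit #(32)` and then as `Int #(32)`.

```bsv
(* synthesize *)
module mkTb_Signed_Compare (Empty);
   rule rl_run;
      Bit #(32) v1  = 'hFFFF_FFFF;    // all ones
      Bit #(32) v2  = 1;
      Int #(32) iv1 = unpack (v1);    // same bits, viewed as -1
      Int #(32) iv2 = unpack (v2);    // same bits, viewed as +1

      // Unsigned view: 0xFFFF_FFFF is the largest value, so this is False
      $display ("unsigned v1  < v2  : ", fshow (v1 < v2));
      // Signed view: -1 < 1, so this is True
      $display ("signed   iv1 < iv2 : ", fshow (iv1 < iv2));
      $finish (0);
   endrule
endmodule
```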

Line 18 extracts a 5-bit “shift amount” from the `rs2` value for the shift operators SLL, SRL and SRA. Line 39 extracts a 5-bit “shift amount” from the `imm` value for the shift operators SLLI, SRLI and SRAI.

SRL (Shift Right Logical) and SRA (Shift Right Arithmetic) differ in whether they treat the argument as a signed or unsigned value, the difference being whether the new bits shifted into the most-significant bit are zero (SRL) or replicate the most-significant bit (SRA). SRLI and SRAI exhibit a similar difference.

In lines 28 (SRA) and 49 (SRAI) we finally apply the “`pack`” operator to produce the result. This is because the expression “`(iv1 >> shamt)`” has type `Int#(XLEN)` whereas the result needs to be of type `Bit#(XLEN)`. The “`pack`” operator performs this type-change for us.
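The following sketch (again our own; `mkTb_Shifts` is a hypothetical name) shows the same “`>>`” operator producing a logical shift on a `Bit` value and an arithmetic shift on an `Int` value, with `pack` converting the signed result back to `Bit #(32)`.

```bsv
(* synthesize *)
module mkTb_Shifts (Empty);
   rule rl_run;
      Bit #(32) v   = 'h8000_0010;         // most-significant bit is 1
      Int #(32) iv  = unpack (v);          // signed view of the same bits
      Bit #(5)  sh  = 4;                   // shift amount

      Bit #(32) srl = v >> sh;             // logical:    zeros shifted in
      Bit #(32) sra = pack (iv >> sh);     // arithmetic: sign bit replicated

      $display ("SRL: %08h", srl);         // 08000001
      $display ("SRA: %08h", sra);         // f8000001
      $finish (0);
   endrule
endmodule
```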

Lines 19-32 define the `y_OP` result when the opcode is `opcode_OP`, *i.e.*, `7'b_011_0011`: the “3-address” operators, where the inputs come from `rs1` and `rs2`. `y_OP` defaults to 0 when the opcode is not `opcode_OP`.

Lines 40-49 define the `y_OP_IMM` result when the opcode is `opcode_OP_IMM`, *i.e.*, `7'b_001_0011`: the “2-address” operators, where one input comes from `rs1` and the other input comes from an immediate value in the instruction. `y_OP_IMM` defaults to 0 when the opcode is not `opcode_OP_IMM`.

Finally, line 55 combines these results using the “OR” function. We rely on the fact that at most one of `y_OP` and `y_OP_IMM` can be relevant; the other one must be zero (and therefore has no effect through the OR’ing).

### Exercise 7.8:

Lines 15-55 could instead have been written this way:

```
Bit #(XLEN) y = 0;
if (opcode == opcode_OP) begin
    ...
    ... y = ...
end
else if (opcode == opcode_OP_IMM) begin
    ...
    ... y = ...
end
return y;
```

Discuss the hardware tradeoffs between writing it in these two ways. Hints: Consider:

- Sequentiality of if-then-else.
- Ability (or not) to prove exhaustiveness of conditions in nested if-then-else.
- Ability (or not) to prove mutual-exclusivity of conditions in nested if-then-else.
- Discussion in Section 5.11.1 on parallel and sequential multiplexers (mux). Note: in our code, we have explicitly coded a parallel mux.

**Exercise 7.9:**

Note that the ISA has ADD and ADDI instructions, but no corresponding SUB and SUBI (subtract) instructions. Why not?

**Exercise 7.10:**

Justify the presence or absence of the “pack” operator in each case of `fn_IALU`.

**Exercise 7.11:**

Suppose we want to extend `fn_IALU` so it also works when XLEN=64 (*i.e.*, for RV64I). What needs to change to accommodate this?

*Hint:* it only matters in the shift-amount of the shift instructions, where the shift-amount can be 6-bits wide instead of 5-bits (allowing a maximum of 63-bit shifts instead of 31 bits).

□

---

## 7.7 No separate functions for Execute DMem and Retire

We don’t need any separate functions for the Execute DMem step in Figure 7.1 because `fn_Dispatch` has created a `Mem_Req` struct that we can send directly to memory. Similarly, the memory returns a `Mem_Rsp` that is sent directly to the Retire step.

The code for the Retire step is discussed in detail separately in Chapter 11 for Drum and Chapter 17 for Fife. Although functionally the same, the two implementations are structurally different enough that it is not worth attempting to abstract any common functionality into a shared function.



# Chapter 8

## BSV: Modules and Interfaces: Registers, Register Files and FIFOs

### 8.1 Introduction

It is good engineering practice to organize the code for any non-trivial system, whether in hardware or software, into a well-structured composition of smaller, manageable *modules*. Each module should have a clear, independent specification so that it can be understood on its own, and so that it can transparently be substituted by another module with the same functionality but perhaps other desirable properties (*e.g.*, speed, area, power). The external specification of a module—its “interface”—should not rely on, and ideally not even mention, internal implementation details of the module. For example, each of the units shown in Figure 3.1 could be a separate module.

This chapter is mostly a BSV chapter; we discuss modules and interfaces in BSV, including a few that are provided by the BSV libraries and that we will use in subsequent chapters to implement our RISC-V CPUs.

### 8.2 Modules: state, interfaces and behavior

Modules encapsulate modifiable *state*. Examples include Registers, Register Files and FIFOs (all of which are discussed later in this chapter). Modules with state are the only entities containing values that *persist* over time, *i.e.*, a value “written” at one moment in time can be “read” at a later moment in time.

Modules also encapsulate external behavior, using *interface methods*. In this sense they are similar to “objects” in object-oriented programming languages such as C++, Java, and Python. A BSV module is like an object constructor; a module *instance* is like an object; its internal state is like the internal, private “members” of the object, and its interface is a *set of methods* that can be invoked like functions/procedures, and which can access the internal state.

Modules and interfaces clearly separate concerns of externally-visible functionality (“external API”; *what* a module does; a module *specification*) *vs.* internal implementation details (*how* the module does it).

BSV modules are typically organized in a *hierarchy*—a top-level module, which instantiates sub-modules which, in turn, instantiate lower-level modules, and so on. In Drum and Fife:

```

Top-level module
  CPU module
    CPU sub-modules
      Library modules: Registers, Register file, FIFOs, ...
  Memory system
    Memory module(s)
    MMIO device modules (e.g., UART, timer, interrupt controller)

```

In the next several sections we describe the concepts of BSV modules and interfaces. These sections may require re-reading a couple of times; the concepts become properly internalized only after seeing/using/creating several examples.

**NOTE:** We sometimes write BSV modules that do not themselves contain internal state, for stylistic and readability reasons. One example is seen in Section 8.5.6 where a module is used to encapsulate the logic of connecting two complementary interfaces.

### 8.2.1 Internal behavior (*rules*)

Unlike objects in most programming languages, BSV modules typically also contain *internal* free-running processes called *rules* that run concurrently with the rest of the system (all other rules in the system). Rules realize the independent, concurrent, internal behavior of a module.

Rules are discussed in more detail in Chapter 14. Before that, in Chapter 10 we will discuss BSV’s special notation for FSMs (Finite State Machines), which are simpler to use as a first step.

### 8.2.2 Interface declarations

An *interface declaration* in BSV declares a new interface, which is a new BSV *type*, and looks like this:

```

interface interface-type;
  ... method and sub-interface declarations ...
endinterface

```

The interface represents the external view of a module, *i.e.*, it declares a set of *methods* that can be invoked from an external context. Each method declaration only lists its arguments and their types, and the method’s overall result type. The *body* of the method is defined in the module definition of each module that offers this interface type.

Interfaces can be nested (can contain sub-interfaces which themselves have methods or sub-sub-interfaces, and so on). This is just a syntactic abstraction mechanism; ultimately, all interactions with a module are through its methods, whether at the top level of the interface type, in a sub-interface, in a sub-sub-interface, *etc.*

There can be many module definitions each of which offers the same interface, *i.e.*, these are different *implementations* of the same interface. For example, the BSV library contains a repertoire of FIFO modules, all of which have the same FIFO interface type. Different implementations of a particular interface type typically differ on some dimension such as performance (latency, bandwidth, MHz), silicon area/FPGA gates, power consumption, *etc.* One chooses a particular implementation based on such practical requirements.

Sections 8.3.1, 8.4.1 and 8.5.1 show the interface declaration for BSV library modules: Registers, Register Files and FIFOs, respectively.

#### 8.2.2.1 Hardware for an interface

An interface method has zero or more arguments. Its result-type falls in one of three categories, as introduced in Section 5.6.1:

- Type **Action**: possible side-effect
- Type **ActionValue #(t)**: possible side-effect that also returns a value of type *t*
- Type *t* (“value”) that is neither of the above: pure (no side-effect) returning a value of type *t*

The types of the method arguments, and the method result type, together completely determine the specific hardware (input/output wires and buses) at the module interface corresponding to that method. This is illustrated in Figure 8.1.



Figure 8.1: Hardware wires/buses for interface methods of **ActionValue**(*t*), **Action** and value result types

Every method has a READY output wire corresponding to its *implicit condition*. This is discussed in more detail in Chapter 14 but, briefly, it is a boolean value indicating when it is meaningful to invoke the method. For example, consider the “first” method of a FIFO, which returns the value at the head of the FIFO. This method is not meaningful if the FIFO is empty. Thus, its READY wire indicates that data is available in the FIFO.

**Action** and **ActionValue**(*t*) methods have an ENABLE input wire. The environment of the module asserts this wire (drives “1” on the wire) when it wants the module to perform the method’s action (write a register, enqueue an item, dequeue an item, ...). Value methods do not have any side-effect, *i.e.*, they do not perform any action, and so they do not have an ENABLE input.

`ActionValue(t)` and value methods have a result-data output bus (a bundle of wires). Action methods, since they return no result, have no result-data output bus.

For all three categories of methods, arguments are treated in the same way: each argument of type $t$ has an input data bus of the appropriate width to carry values of type $t$.
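To make the three categories concrete, here is a sketch of an interface with one method of each kind. The interface `Sketch_IFC` and its method names are hypothetical, chosen only for illustration; the comments restate, per method, the wires and buses described above.

```bsv
// Hypothetical interface, for illustration only
interface Sketch_IFC;
   // Action method:
   //   in:  8-bit argument bus for 'x', ENABLE
   //   out: READY                         (no result bus)
   method Action put (Bit #(8) x);

   // ActionValue method:
   //   in:  ENABLE
   //   out: READY, 8-bit result bus
   method ActionValue #(Bit #(8)) get ();

   // Value method:
   //   out: READY, 1-bit result bus       (no ENABLE)
   method Bool is_empty ();
endinterface
```

Note that nothing in the declaration says *how* these methods are implemented; that is left to each module offering this interface.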

### 8.2.3 Module definitions

A *module definition* in BSV describes a module with a particular interface type:

```
(* synthesize *)
module module-name ( interface-type );
    ... instantiation of module state (registers, FIFOs, other sub-modules) ...
    ... behavior (rules and FSMs) ...
    ... interface (API) method implementations ...
endmodule
```

The `(* synthesize *)` attribute at the top is optional. With this attribute, the `bsc` compiler creates a separate Verilog module for this BSV module. Without the attribute, the compiler will “inline” it into any parent module where it is instantiated. We recommend using this attribute on all modules.<sup>1</sup>

A module definition defines a module *constructor*. The constructor can be invoked multiple times to obtain multiple *instances* of the module.

NOTE: A notable difference between BSV and other HDLs (Verilog, SystemVerilog and VHDL) is that even a lowly register is not special; it is just another module, with an interface containing “`_read()`” and “`_write()`” methods.  
In fact BSV treats *all* “state elements” (components that store persistent values) uniformly as modules with interfaces.
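As a small illustration of the module-definition template above, here is a sketch of a complete module. The interface `Counter_IFC` and module `mkCounter` are our own hypothetical examples, not library modules; the only library component used is the `mkReg` register, accessed through its explicit `_read()` and `_write()` methods.

```bsv
// Hypothetical interface for this sketch
interface Counter_IFC;
   method Action   increment (Bit #(8) delta);
   method Bit #(8) value ();
endinterface

(* synthesize *)
module mkCounter (Counter_IFC);
   // Module state: one 8-bit register with reset value 0
   Reg #(Bit #(8)) rg_count <- mkReg (0);

   // Behavior: this sketch has no rules or FSMs

   // Interface method implementations
   method Action increment (Bit #(8) delta);
      rg_count._write (rg_count._read () + delta);
   endmethod

   method Bit #(8) value ();
      return rg_count._read ();
   endmethod
endmodule
```

The register inside `mkCounter` is itself a module instance, so even this tiny example already forms a two-level module hierarchy.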

### 8.2.4 Module instantiation and method invocation

A module is *instantiated* using this syntax:

```
interface-type x <- constructor (constructor-arg,...,constructor-arg);
```

This creates a new instance of the module and binds the offered interface to the identifier  $x$ . Some constructors have no arguments, in which case even the parentheses surrounding the arguments can be omitted.

Subsequently, methods of the module can be invoked using the syntax

```
x.method-name (method-arg,...,method-arg);
```

Some methods have no arguments, in which case even the parentheses surrounding the arguments can be omitted.
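Both syntaxes can be seen together in the following sketch (the module name `mkTb_Instantiate` is hypothetical). It uses the library register constructors `mkReg` and `mkRegU`, which are described in Section 8.3, and invokes register methods explicitly.

```bsv
(* synthesize *)
module mkTb_Instantiate (Empty);
   // Two independent instances; each '<-' invokes a constructor
   Reg #(Bit #(8)) rg_a <- mkReg (5);    // constructor with one argument
   Reg #(Bit #(8)) rg_b <- mkRegU;       // constructor with no arguments

   rule rl_run;
      // _write takes one argument; _read takes none (parentheses omitted)
      rg_b._write (rg_a._read + 1);
      $display ("rg_a = %0d", rg_a._read);
      $finish (0);
   endrule
endmodule
```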

---

<sup>1</sup>But see also Section 8.6.1 which describes some modules where this attribute is ignored by the compiler.

## 8.3 BSV Library Modules: Registers

A register is the simplest storage element in digital hardware, a single memory cell containing a single value (represented as a bit-vector). We can (over-)write with a new value, and we can read out the value stored by the most recent write.

### 8.3.1 Reg#(t), the register interface from the BSV library

The standard register interface type in BSV has two methods:

```
interface Reg #(t);
    method t _read();
    method Action _write (t x);
endinterface
```

Here, “t” is the type of value stored in the register (discussed in more detail below).

The `_read()` method (with no arguments) just returns the value stored in the register, of type `t`. The `_write()` method takes one argument, a value of type `t`, and stores it in the register, over-writing any previous value and holding the new value until over-written by the next `_write()`.

We will explain the `Action` type of the `_write` method in more detail later. For now, just think of it as the type of any method that is a pure side-effect, *i.e.*, the method modifies some internal state of the module, and does not return any value.

### 8.3.2 Registers are strongly typed

Unlike Verilog, SystemVerilog and VHDL, BSV registers are “strongly typed”. Each register instance can only hold values of one particular type, specified at the place where the register is instantiated.

Further, the register-contents type need not be `Bit#(n)`; it can more generally be *any* BSV type that has a representation in bits. Thus, the type of a value in a register can be an enum, a struct, a nested struct, *etc.*, if we have used a `deriving(Bits)` declaration (or its explicit analog) to ensure that it has a representation in bits.

Any attempt to read or write a value into a register that does not match the declared type will provoke a compile-time type-checking error from the `bsc` compiler.

### 8.3.3 mkReg(v), a register module (constructor) from the BSV library

A standard BSV library register module is `mkReg`. It is used to instantiate a new register, with a specified reset value, using a statement like this:

```
Reg #(Bit #(XLEN)) rg_pc <- mkReg (0);
```

Here we declare a new identifier `rg_pc` with interface type `Reg#(Bit #(XLEN))` (the register interface type) and bind it to the interface offered by a newly instantiated register. The “0” argument to `mkReg()` specifies the reset-value of the register, *i.e.*, the value held in the register immediately after the hardware has been reset.

An alternative register constructor provided by the BSV library is `mkRegU`, where the “U” indicates that it is uninitialized, *i.e.*, has no specified reset value:

```
Reg #(Bit #(XLEN)) rg_pc <- mkRegU;
```

`mkRegU` instantiates a register with an unspecified (unpredictable) reset value, and hence does not need an argument.
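To make the strong typing of Section 8.3.2 concrete, here is a minimal sketch that uses both constructors with non-`Bit` contents. The types `State_t` and `Status_t` and the module `mkTypedRegsDemo` are hypothetical names invented for this illustration; they are not part of Drum or Fife. The register writes are placed in a rule (rules are discussed in Chapter 14).

```bsv
// Hypothetical types, for illustration only (not from Drum or Fife)
typedef enum { IDLE, BUSY, DONE } State_t
deriving (Bits, Eq, FShow);

typedef struct {
   State_t   state;
   Bit #(8)  count;
} Status_t
deriving (Bits, FShow);

module mkTypedRegsDemo (Empty);
   Reg #(State_t)  rg_state  <- mkReg (IDLE);   // reset value is IDLE
   Reg #(Status_t) rg_status <- mkRegU;         // no specified reset value

   rule rl_start (rg_state == IDLE);
      rg_state  <= BUSY;
      rg_status <= Status_t { state: BUSY, count: 0 };
      // rg_state <= 2'b01;   // rejected by bsc: Bit #(2) is not State_t
   endrule
endmodule
```

The commented-out line shows the kind of mismatch that is caught at compile time: even though `State_t` is represented in two bits, a `Bit #(2)` value is not a `State_t` value.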

### 8.3.4 Syntactic shorthands for register access

Registers are so ubiquitous in digital design that BSV provides some special syntactic shorthands for reading and writing registers.

Just mentioning a register in an expression can be used as a shorthand for invoking its `_read` method. Thus, the expression:

```
rg_pc + 4
```

is shorthand for:

```
rg_pc._read + 4
```

To invoke the `_write` method on a register, one can use a conventional assignment statement. Thus, the expression:

```
rg_pc._write (v)
```

can be written like this:<sup>2</sup>

```
rg_pc <= v
```

A statement like this:

```
rg_pc <= rg_pc + 4
```

contains both shorthands:

```
rg_pc._write (rg_pc._read + 4)
```

**NOTE:** The use of the “`rg_`” prefix in the above examples is just our own syntactic convention, and not required in BSV syntax, where any legal identifier can be bound to a register interface. We will be mixing identifiers bound to ordinary values and identifiers bound to register interfaces in various expressions. The `rg_` prefix reminds us that there is an implicit “`_read`” on the latter.

<sup>2</sup>Rather than use “`=`” or “`:=`” common in software programming languages, we use “`<=`”, which is the Verilog/SystemVerilog notation for non-blocking (“delayed”) assignment.

## 8.4 BSV Library Modules: Register files

A register file is an array of registers with a common pair of methods to read or write a particular register identified by an index, which is an argument to the methods for reading and writing.

### 8.4.1 The register file interface `RegFile#(index_t,data_t)` from the BSV library

The standard register file interface type in the BSV library is:

```
interface RegFile #(type index_t, type data_t);
    method Action upd (index_t addr, data_t d);
    method data_t sub (index_t addr);
endinterface: RegFile
```

Here, “`index_t`” is the type for the index, which we use to identify one of the registers in the register file. For RISC-V, since we have 32 registers, we will use `Bit#(5)` as the index type.

“`data_t`” is the type of value stored in each of the registers. For RISC-V, this will be `Bit#(XLEN)`.

The `rf.upd(j,v)` method allows us to store the value  $v$  in the  $j$ ’th register of register file  $rf$ . The `rf.sub(j)` method returns the current value  $v$  in the  $j$ ’th register of register file  $rf$ .

**NOTE:** The index type `index_t` can be any type that has a representation in bits, *i.e.*, for which we have used the `deriving(Bits)` annotation in the type declaration (or for which we have provided a so-called `Bits` instance explicitly).

BSV register files, like BSV registers, are strongly typed. At the time of instantiation of a register file  $rf$ , we specify its `index_t` and `data_t` types. In subsequent uses of  $rf$ , the provided index and data value, and returned data value, must have exactly those types (else the `bsc` compiler will raise a compile-time type-error).

### 8.4.2 `mkRegFileFull`, a register file module (constructor) from the BSV library

The BSV library contains a couple of register file modules (constructors). For RISC-V we can use `mkRegFileFull`:

```
RegFile #(Bit #(5), Bit #(XLEN)) gprs <- mkRegFileFull;
```

Here we declare a new identifier `gprs` with interface type `RegFile#(Bit#(5),Bit#(XLEN))` (the register file interface type) and bind it to the interface offered by a newly instantiated register file. The number of registers in the register file is known from the full range of the index type `Bit#(5)`, *i.e.*, it will have 32 registers, indexed from 0 to 31. Each register is `XLEN` bits wide.
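As a small sketch of how the `sub` and `upd` methods are used together, the following module reads one register and writes a neighbouring one on each rule firing. The module name `mkRegFileDemo`, the register `rg_j` and the fixed 32-bit data width are assumptions made only for this illustration.

```bsv
import RegFile :: *;

module mkRegFileDemo (Empty);
   RegFile #(Bit #(5), Bit #(32)) rf   <- mkRegFileFull;
   Reg     #(Bit #(5))            rg_j <- mkReg (1);

   rule rl_fill;
      Bit #(32) v = rf.sub (rg_j - 1);   // read register j-1
      rf.upd (rg_j, v + 1);              // write register j
      rg_j <= rg_j + 1;                  // 5-bit index wraps from 31 to 0
      // rf.upd (rg_j, True);            // rejected by bsc: Bool is not Bit #(32)
   endrule
endmodule
```

Note that `mkRegFileFull` does not give the registers a reset value, so the value first read by `rf.sub` here is unspecified; the sketch only illustrates the method-call syntax and the type checking.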

## 8.5 BSV Library Modules: FIFOs

FIFOs (First-In-First-Out elements) are *ordered queues* of values and are broadly useful in many hardware designs (arguably as useful as registers). We can enqueue a new value into a FIFO at the tail (back) of the queue, and dequeue a value from the head (front) of the queue. Most BSV FIFOs are automatically “flow-controlled”, *i.e.*, it is impossible to enqueue into a full FIFO or to dequeue from an empty FIFO.

### 8.5.1 FIFOF#(t), the FIFO interface from the BSV library

A standard FIFO interface type in the BSV library is:<sup>3</sup>

```
interface FIFOF #(type t);
    method Bool notEmpty();
    method Bool notFull();
    method t first();
    method Action deq();
    method Action enq (t x);
    method Action clear();
endinterface
```

Here, “t” is the type of values stored in the FIFO (discussed in more detail below).

The `f.notEmpty()` and `f.notFull()` methods are simple predicates that test whether a FIFOF `f` is non-empty or non-full, respectively.

The `f.first()` and `f.deq()` methods are used to access the head of the queue. They are only available if the FIFO is not empty. The `first()` method returns the value at the head of the queue. This is non-destructive, *i.e.*, it does not modify the FIFO. The `f.deq()` method modifies the FIFO: it discards the value at head of the queue and advances the queue.

The `f.enq(x)` method is used to access the tail of the queue, and is only available if the FIFO is not full. It modifies the FIFO by appending the argument `x` to the tail of the queue.

The `clear` method is used to empty the queue immediately (discard all its contents).

Notice that the `FIFOF#(t)` interface does not indicate the *capacity* of the FIFO, *i.e.*, the number of elements it can hold from head to tail. This is deliberate; we may choose different capacities for each FIFO instance as required by its use context. We also want to be able flexibly and transparently to substitute a FIFO with another that has greater or less capacity.

#### 8.5.1.1 pop: a useful function combining first and deq

The `first` and `deq` methods are separated because there are many situations where we wish to examine the head of a FIFO and only consume the value if it matches some condition of interest. But there are many situations where we don't need this generality, and we invoke both methods together in the same action. For convenience we define a function for this:

```
function ActionValue #(t) pop (FIFOF #(t) fifo);
    actionvalue
        let x = fifo.first;
        fifo.deq;
        return x;
    endactionvalue
endfunction
```

This must be an `ActionValue` function because it performs a side-effect (`deq`) and returns a value (from `first`). This is then invoked with the usual syntax for invoking ActionValues:

```
let y <- pop (fifo);
```

---

<sup>3</sup>The BSV library also defines the `FIFO#(t)` interface, which is the same as `FIFOF#(t)` except that it omits the `notEmpty` and `notFull` methods. We prefer `FIFOF#(t)`, which provides more flexibility.

### 8.5.2 `mkFIFO`, a FIFO module (constructor) from the BSV library

The BSV library contains many different FIFO modules (constructors): single-element FIFOs, FIFOs of a specified depth (queue length), FIFOs with and without automatic flow-control, *etc.* In Drum we use `mkFIFOF`, the variant of `mkFIFO` that offers the `FIFOF#(t)` interface:

```
FIFOF #(Mem_Req) f_to_IMem   <- mkFIFOF;
FIFOF #(Mem_Rsp) f_from_IMem <- mkFIFOF;
```

Here we declare a new identifier `f_to_IMem` with interface type `FIFOF#(Mem_Req)` and bind it to the interface offered by a newly instantiated FIFO. Similarly, we declare a new identifier `f_from_IMem` with interface type `FIFOF#(Mem_Rsp)` and bind it to the interface offered by a newly instantiated FIFO. Due to BSV's strong-typing, the first FIFO can only hold items of type `Mem_Req` and the second FIFO can only hold items of type `Mem_Rsp`.

Different module-constructors may or may not have arguments. This example from Fife uses a different BSV library FIFO constructor:

```
FIFOF #(RR_to_Retire) f_RR_to_Retire <- mkSizedFIFOF (8);
```

This instantiates a FIFO whose queue capacity is 8. Note that module constructor arguments can play different roles. In `mkReg` above, the argument (0) became a dynamic value, the value held by the register after reset. Here, the argument (8) only describes *structure*, *i.e.*, the size of the FIFO.

`mkFIFO` happens to have capacity 2, although it will support sustained simultaneous enqueueing and dequeuing only when its average occupancy is  $\leq 1$  (zero or one element in the queue).

All these FIFO modules are initially empty (containing zero items) on reset, *i.e.*, at the start of time for the circuit.
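The following sketch puts the FIFO methods together and shows why it is useful that `first` and `deq` are separate. It assumes the library constructor `mkFIFOF` (the one offering the `FIFOF#(t)` interface); the module name, the rule names and the even/odd condition are invented purely for this illustration.

```bsv
import FIFOF :: *;

module mkFIFODemo (Empty);
   FIFOF #(Bit #(8)) f       <- mkFIFOF;
   Reg   #(Bit #(8)) rg_next <- mkReg (0);

   // Producer: flow control lets this rule fire only when f is not full
   rule rl_produce;
      f.enq (rg_next);
      rg_next <= rg_next + 1;
   endrule

   // Consumers: examine the head with 'first' (non-destructive), then
   // decide what to do before discarding it with 'deq'.
   // Both rules can fire only when f is not empty.
   rule rl_consume_even ((f.first & 1) == 0);
      $display ("even value: %0d", f.first);
      f.deq;
   endrule

   rule rl_drop_odd ((f.first & 1) == 1);
      f.deq;
   endrule
endmodule
```

None of the rules tests `notFull` or `notEmpty` explicitly; the automatic flow control described above is what prevents `enq` on a full FIFO and `first`/`deq` on an empty one.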

### 8.5.3 FIFOs are strongly typed

Each BSV FIFO instance can only hold values of one particular type.

Further, the FIFO-contents type need not be `Bit#(n)`; it can more generally be *any* BSV type that has a representation in bits. Thus, the type of values in a FIFO can be an enum, a struct, a nested struct, *etc.*, if we have used a `deriving(Bits)` declaration (or its explicit analog) to ensure that it has a representation in bits.

### 8.5.4 Semi-FIFO interfaces for each end of a FIFO

FIFOs are often used to connect two separate modules together, for one module to communicate values to the next one. For example, the Fetch step communicates memory requests to memory. In this situation, one module only interacts with the “enqueue” side, and the other module only interacts with the “dequeue” side.

In these situations we will also find it useful to use the following “Semi-FIFO” interfaces for each “end” of a FIFO queue:

```
interface FIFOF_O #(type t);
    method Bool notEmpty();
    method t first();
    method Action deq();
endinterface
```

```
interface FIFOF_I #(type t);
    method Bool notFull();
    method Action enq (t x);
endinterface
```

There is no extra hardware implied here; these are simply limited “views”, or abstractions of an existing FIFO interface.

It is convenient to define a `pop_o` function to combine the `first` and `deq` methods of a `FIFOF_O`, just like we defined the `pop` function on `FIFOF`s in Section 8.5.1.1.

```
function ActionValue #(t) pop_o (FIFOF_O #(t) fo);
    actionvalue
        let x = fo.first;
        fo.deq;
        return x;
    endactionvalue
endfunction
```

This is then invoked with the usual syntax for invoking ActionValues:

```
let y <- pop_o (fo);
```

### 8.5.5 Interface-transformer functions

The idea of “viewing” the output-side of a `FIFOF` interface as a `FIFOF_O` interface can be expressed in a BSV function:

```
function FIFOF_O #(t) to_FIFOF_O (FIFOF #(t) f);
    // Build and return a FIFOF_O interface whose methods
    // simply forward to the corresponding methods of f
    return interface FIFOF_O;
               method Bool notEmpty();
                   return f.notEmpty;
               endmethod

               method t first();
                   return f.first;
               endmethod

               method Action deq();
                   f.deq;
               endmethod
           endinterface;
endfunction
```
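As a sketch of where such a transformer is typically applied, the module below keeps a full `FIFOF` internally and exposes only its output side to the outside world. The interface `Producer_IFC`, the module `mkProducer` and their contents are hypothetical names chosen for this illustration; the `FIFOF_O` interface and a `to_FIFOF_O` function are repeated so that the fragment stands on its own.

```bsv
import FIFOF :: *;

// Semi-FIFO output interface, as in Section 8.5.4
interface FIFOF_O #(type t);
   method Bool   notEmpty ();
   method t      first ();
   method Action deq ();
endinterface

// Interface transformer: view the output side of a FIFOF as a FIFOF_O
function FIFOF_O #(t) to_FIFOF_O (FIFOF #(t) f);
   return interface FIFOF_O;
             method Bool notEmpty ();
                return f.notEmpty;
             endmethod
             method t first ();
                return f.first;
             endmethod
             method Action deq ();
                f.deq;
             endmethod
          endinterface;
endfunction

// Hypothetical module interface with one semi-FIFO sub-interface
interface Producer_IFC;
   interface FIFOF_O #(Bit #(32)) fo_out;
endinterface

module mkProducer (Producer_IFC);
   FIFOF #(Bit #(32)) f_out <- mkFIFOF;
   Reg   #(Bit #(32)) rg_n  <- mkReg (0);

   // Internally, the module uses the enqueue side of the FIFO
   rule rl_generate;
      f_out.enq (rg_n);
      rg_n <= rg_n + 1;
   endrule

   // Externally, only the dequeue side is visible
   interface fo_out = to_FIFOF_O (f_out);
endmodule
```

A client of `mkProducer` can call `first`, `deq` and `notEmpty` on `fo_out`, but has no way to enqueue into or clear the internal FIFO.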

#### Exercise 8.1:

Write a similar function to transform a FIFOF interface into a FIFOF\_I interface, with `notFull` and `enq` methods.

□

### 8.5.6 Connecting FIFOs

We will frequently want to connect the output of one FIFO to the input of another FIFO. For example, in Fife, the interface of the Fetch stage includes this semi-FIFO sub-interface to communicate `F_to_D` values to the Decode unit:

```
interface Fetch_IFC;
    ...
    interface FIFOF_O #(F_to_D) fo_Fetch_to_Decode;
    ...
endinterface
```

The interface of the Decode stage has this corresponding sub-interface to receive those values:

```
interface Decode_IFC;
    ...
    interface FIFOF_I #(F_to_D) fi_Fetch_to_Decode;
    ...
endinterface
```

The CPU module, at the next level up, instantiates the Fetch and Decode stages. Then, they can be connected with a simple `mkConnection` one-liner:

```
module mkCPU (CPU_IFC);
    ...
    // Instantiate Fetch and Decode stages
    Fetch_IFC  stage_F <- mkFetch;
    Decode_IFC stage_D <- mkDecode;
    ...
    // Connect the Fetch_to_Decode flow
    mkConnection (stage_F.fo_Fetch_to_Decode, stage_D.fi_Fetch_to_Decode);
    ...
endmodule
```

There is actually no magic in this! First, `mkConnection` is just another BSV module which happens to have an “Empty” interface with no interface methods, so the `mkConnection` line is actually shorthand for:

```
Empty tmp <- mkConnection (stage_F.fo_Fetch_to_Decode, stage_D.fi_Fetch_to_Decode);
```

(The shorthand omits the “`Empty tmp <-`” left-hand side.)

The module `mkConnection` is such a useful construct that it is provided in the *bsc* library. It can be implemented in BSV itself, and is very simple:

```
module mkConnection #(FIFOF_O #(F_to_D) f,      // module argument
                      FIFOF_I #(F_to_D) d)      // module argument
                     (Empty);                   // module interface
    rule rl_connect;
        let x = f.first;
        f.deq;
        d.enq (x);
    endrule
endmodule
```

`mkConnection` is a module with two arguments `f` and `d` and producing an empty interface. In BSV, Verilog and SystemVerilog syntax, a module’s arguments are provided in `#(...)` and its interface follows in `(...)`.

The module contains a *rule* (Chapter 14), which is an infinite process. It binds `x` to `f.first`, the head of the `f` queue, and discards it from the queue (`f.deq`). It enqueues the value `x` into `d`. Being an infinite process, it repeats this every time this is possible.

Because of the automatic flow-control in BSV FIFOs, this rule will only execute when `f` is non-empty (contains an item, available to dequeue) and `d` is not full (has space, available to enqueue).

**Advanced BSV topic:** What if we want to connect the two semi-FIFO interfaces with the arguments in the opposite order, *i.e.*, a `FIFOF_I` interface to a `FIFOF_O` interface? We could write a corresponding `mkConnection_I_to_O` module. What if we want to connect an ARM AXI4 M interface to an ARM AXI4 S interface? We could write a corresponding `mkConnection_AXI4_M_to_S` module.

NOTE: When there are many different kinds of connection, inventing new module names `mkConnection_X_to_Y` for each pair of interface types X and Y becomes tedious.

BSV contains a mechanism called “Typeclasses” and “Typeclass instances” that allows us to reuse the name `mkConnection` for the connection module for every such pair of interface types.

In programming-language design, this issue and its solutions go by the name “overloading”.
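As a sketch of how this overloading can look, the BSV library's `Connectable` package declares a typeclass `Connectable #(a, b)` whose single member is a module `mkConnection`. The instances below, for the semi-FIFO interfaces of Section 8.5.4, are an illustration written for this discussion and are not necessarily the exact code used in Drum or Fife; the interface declarations are repeated so that the fragment stands on its own.

```bsv
import Connectable :: *;

// Semi-FIFO interfaces, as in Section 8.5.4
interface FIFOF_O #(type t);
   method Bool   notEmpty ();
   method t      first ();
   method Action deq ();
endinterface

interface FIFOF_I #(type t);
   method Bool   notFull ();
   method Action enq (t x);
endinterface

// mkConnection for the pair (FIFOF_O, FIFOF_I), for any element type t
instance Connectable #(FIFOF_O #(t), FIFOF_I #(t));
   module mkConnection #(FIFOF_O #(t) fo, FIFOF_I #(t) fi) (Empty);
      rule rl_connect;
         fi.enq (fo.first);
         fo.deq;
      endrule
   endmodule
endinstance

// mkConnection for the arguments in the opposite order:
// simply reuse the instance above
instance Connectable #(FIFOF_I #(t), FIFOF_O #(t));
   module mkConnection #(FIFOF_I #(t) fi, FIFOF_O #(t) fo) (Empty);
      mkConnection (fo, fi);
   endmodule
endinstance
```

With these instances in scope, the same name `mkConnection` works for either argument order, and the `bsc` compiler selects the instance from the argument types.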

### 8.5.7 `mkPipelineFIFO` and `mkBypassFIFO`: constructors from the BSV library

We mention two more FIFO module constructors from the BSV library—`mkPipelineFIFO` and `mkBypassFIFO`—because they are used heavily in Fife code shown in Chapter 17.

To a first approximation, they can be considered merely as substitutes for `mkFIFO`. They have subtle *performance* differences from `mkFIFO`, which is why we use them in Fife. We will discuss these performance properties in more detail in Sec 18.5.

## 8.6 Polymorphic and Monomorphic Types

The previous sections showed several *polymorphic* types:

- `Reg #(t)`
- `RegFile #(index_t, data_t)`
- `GPRs_IFC #(xlen)`
- `FIFOF #(t)`
- `FIFOF_I #(t)`
- `FIFOF_O #(t)`

In each case, there is a *type constructor* (`Reg`, `RegFile`, ..., `FIFOF_O`) applied to some arguments (`t`, `index_t`, `data_t`, `xlen`). The general syntax is:

*type-constructor* `#(arg, ..., arg)`

(When a type constructor has no arguments, we can omit “`#()`”.)

When the argument of a type-constructor is an identifier beginning with a lower-case letter, this indicates that it is a *type variable*, *i.e.*, it is a place-holder for some specific type that will be instantiated later/elsewhere.

A polymorphic type (a type containing one or more type-variables) represents all possible types one can obtain by instantiating the type variables with specific, concrete types. A type with no type-variables is also called *monomorphic*.
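The distinction can be sketched with a few declarations. The names `Range`, `Addr_Range` and `fn_can_transfer` are hypothetical, introduced only for this illustration.

```bsv
import FIFOF :: *;

// Polymorphic struct type: 't' is a type variable
typedef struct {
   t lo;
   t hi;
} Range #(type t)
deriving (Bits, FShow);

// Monomorphic type: the type variable is instantiated to Bit #(32)
typedef Range #(Bit #(32)) Addr_Range;

// Polymorphic function: works on FIFOFs carrying any element type t,
// as long as both FIFOFs carry the same type
function Bool fn_can_transfer (FIFOF #(t) src, FIFOF #(t) dst);
   return (src.notEmpty && dst.notFull);
endfunction
```

`Range #(t)` stands for a whole family of types (`Range #(Bit #(32))`, `Range #(Bit #(64))`, and so on), whereas `Addr_Range` names exactly one of them.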

Note that it is not just library types (`Reg`, `RegFile`, `FIFOF`) that may be polymorphic. The types `FIFOF_I #(t)` and `FIFOF_O #(t)` that we defined in Section 8.5.4, and `GPRs_IFC #(xlen)` (Section 9.2), are also polymorphic.

**Exercise 8.2:**

The types `Mem_Req` and `Mem_Rsp` (Sections 6.4.2 and 6.4.4) are monomorphic. Write polymorphic versions of these types that are parameterized on `xlen`.

□

### 8.6.1 Polymorphic Modules and Synthesizability into Verilog

In Section 8.2.3 we mentioned without discussion that the “`(* synthesize *)`” attribute preceding a `module` definition controls whether the `bsc` compiler generates a Verilog module for this BSV module or whether it is inlined into its parent module wherever it is instantiated.

Not all BSV modules can be compiled one-to-one into Verilog modules. Broadly speaking, polymorphic modules cannot be separately compiled into Verilog modules. The reason is that polymorphism in BSV is very powerful, and beyond the expressive power of Verilog.

This does not mean that we cannot use polymorphic modules in a BSV design; of course we can! It just means that, at each instance of the module where we have instantiated it and fully specified the types (“monomorphized” the types), the `bsc` compiler at that place has enough information to generate Verilog for that instance.

For example, consider the polymorphic `mkGPRs` module of Section 9.2. We can directly instantiate this module in a CPU module:

```
module mkCPU (CPU_IFC);
    ...
    GPRs_IFC #(XLEN) gprs <- mkGPRs;
    ...
endmodule
```

At this instantiation position, the `bsc` compiler knows the concrete value (`XLEN`) of the type-variable `xlen`, and so can generate Verilog code for this `mkGPRs` instance. In the final Verilog code, there will not be any separately identifiable Verilog code for the register file; it will just be part of the `mkCPU` module's Verilog.

If we really wanted to generate a separately identifiable Verilog module for the register file, we can write a monomorphic wrapper module:

```
(* synthesize *)
module mkGPRs_V (GPRs_IFC #(XLEN));
    GPRs_IFC #(XLEN) ifc <- mkGPRs;
    return ifc;
endmodule
```

This module `mkGPRs_V` is monomorphic because the type-variable `xlen` has been instantiated to the concrete type `XLEN`, and so the “`(* synthesize *)`” attribute will be honored by the `bsc` compiler to produce a corresponding Verilog module.

We can then replace our earlier instantiation of `mkGPRs` in the CPU module with this monomorphic module:

```
module mkCPU (CPU_IFC);
    ...
    GPRs_IFC #(XLEN) gprs <- mkGPRs_V;
    ...
endmodule
```

---

**Exercise 8.3:**

In `mkGPRs_V` replace the explicit type declaration of `ifc` with the `let` keyword.

**Exercise 8.4:**

Write a monomorphic wrapper for the BSV library `mkFIFO` module that specializes it into a FIFO that only carries `Bool` values. Add an annotation so the wrapper becomes a separate Verilog module. Compile it and study the generated Verilog.

□

---



# Chapter 9

## RISC-V: Modules for GPRs and CSRs

### 9.1 Introduction

In this chapter we discuss modules for the RISC-V GPRs (General Purpose Registers) and CSRs (Control and Status Registers).

### 9.2 A register file for GPRs, with special treatment of x0

In RISC-V, register `x0` (index 0) is defined as “always zero”. Any value written to `x0` is ignored/discarded, and any read from `x0` always returns 0. So, presumably, we do not need an actual register to hold this value, just some circuitry to ensure that we always get 0 when we try to “read” from `x0`.

In Section 8.4.2, we used the module `mkRegFileFull` to instantiate a register file with 32 registers (inferring 32 from the full range of the index type `Bit#(5)`). Instead, we could use an alternate register file module from the BSV library, `mkRegFile`, that allows us to provide, as module constructor arguments, the lower and upper indexes of interest. The following instantiates exactly 31 registers indexed from 1 to 31, thereby saving `XLEN` bits of register state in our hardware:

```
RegFile #(Bit #(5), Bit #(XLEN)) gprs <- mkRegFile (1, 31);
```

Regardless of whether we instantiated 31 or 32 registers, RISC-V instructions can (and do) use `x0` as a source or destination register, so we need circuitry to deal with attempts to read/write `x0`. One possible solution is to make a “wrapper” module `mkGPRs` around the library register file module.

Although we could have reused the `RegFile #(t1,t2)` interface, we take the opportunity to define a new interface `GPRs_IFC` that has some RISC-V specific method and argument names, for reading the `rs1`, `rs2` values (register source 1 and 2) and writing the `rd` value (register destination):

```
// src_Common/GPRs.bsv: line 25 ...
interface GPRs_IFC #(numeric type xlen);
    method Bit #(xlen) read_rs1 (Bit #(5) rs1);
    method Bit #(xlen) read_rs2 (Bit #(5) rs2);
    method Action      write_rd (Bit #(5) rd, Bit #(xlen) rd_val);
endinterface
```

Here is the module implementing the interface:

```
// src_Common/GPRs.bsv: line 41 ...
module mkGPRs (GPRs_IFC #(xlen));
    RegFile #(Bit #(5), Bit #(xlen)) rf <- mkRegFileFull;

    method Bit #(xlen) read_rs1 (Bit #(5) rs1);
        return ((rs1 == 0) ? 0 : rf.sub (rs1));
    endmethod

    method Bit #(xlen) read_rs2 (Bit #(5) rs2);
        return ((rs2 == 0) ? 0 : rf.sub (rs2));
    endmethod

    method Action write_rd (Bit #(5) rd, Bit #(xlen) rd_val);
        rf.upd (rd, rd_val);
    endmethod
endmodule
```

The module instantiates a library register file `rf`, and its methods simply invoke the underlying `rf` methods, except that the read methods return 0 when the index is 0, regardless of what is stored in `rf`.
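A small testbench sketch shows the `x0` behavior from the outside. It assumes that `GPRs_IFC` and `mkGPRs` are importable from a package named `GPRs` (inferred from the file name `GPRs.bsv`); the testbench module `mkTb_GPRs`, its step counter and the 32-bit instantiation are hypothetical and written only for this illustration.

```bsv
import GPRs :: *;   // assumption: package providing GPRs_IFC and mkGPRs

module mkTb_GPRs (Empty);
   GPRs_IFC #(32)  gprs    <- mkGPRs;      // xlen instantiated to 32
   Reg #(Bit #(2)) rg_step <- mkReg (0);   // sequences the three steps

   rule rl_write_x5 (rg_step == 0);
      gprs.write_rd (5, 'h1234);
      rg_step <= 1;
   endrule

   rule rl_write_x0 (rg_step == 1);
      gprs.write_rd (0, 'hFFFF);   // stored in rf, but never visible on reads
      rg_step <= 2;
   endrule

   rule rl_read (rg_step == 2);
      $display ("x5 = %0h", gprs.read_rs1 (5));   // the value written above
      $display ("x0 = %0h", gprs.read_rs2 (0));   // always 0
      $finish (0);
   endrule
endmodule
```

The writes are spread over separate rules, one per step, because `write_rd` is a single method and each rule invokes it once.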

---

**Exercise 9.1:** In `mkGPRs` we write the value even when `rd` is zero, but we ignore it on reads. Write a variant where, in `write_rd`, we always write 0 when the index `rd` is zero, and the read methods no longer check if `rs1` or `rs2` are 0.

In this variant, what happens if we try to read `x0` before it is written for the first time?

Compare the circuitry generated in the original and in the variant. Why might we choose one over the other?

**Exercise 9.2:** In `mkGPRs` suppose we use `mkRegFile(1,31)` instead of `mkRegFileFull`. What needs to change to accommodate this?

□

---

### 9.2.1 Inlined vs. separate module `mkGPRs`

The above interface and module definitions were parameterized with “`xlen`” which is a type-variable (starting with a lower-case letter). The interface and module are thus polymorphic in the width of the data stored in the registers. Thus, this module can be instantiated in RV32 and RV64 designs, instantiating `xlen` to 32 and 64, respectively.

Being polymorphic, it will also be inlined wherever it is instantiated. If you look at the generated Verilog, there will be no sign of any `mkGPRs` module; the module code will have been integrated (inlined) into the parent module’s code.

A useful trick is to write a thin, non-polymorphic wrapper for the module; being non-polymorphic, it can be compiled without inlining into a distinct Verilog module:

```
// src_Common/GPRs.bsv: line 60 ...
(* synthesize *)
module mkGPRs_synth (GPRs_IFC #(XLEN));
    let ifc <- mkGPRs;
    return ifc;
endmodule
```

Here, we have instantiated the module using `XLEN`, which is not a type-variable; it is defined specifically as 32 or 64 (see Sec 5.13.4). The `(* synthesize *)` attribute can now be respected by the `bsc` compiler and it will produce a Verilog module `mkGPRs_synth`.

### 9.3 A register file for RISC-V CSRs

The RISC-V CSRs (Control and Status Registers) are not “just another register file”. Here are some significant differences:

- There are 32 GPRs, addressed with a 5-bit index. All of them (except one, `x0`) are used in programs. We can thus use a *dense*, packed implementation (`RegFile` from the BSV library).

  CSRs are addressed with a 12-bit index (CSR address). But in our implementation we will use just a handful of CSRs (`mtvec`, `mepc`, `mcause`, `mtval` and a few more), with non-consecutive addresses, *i.e.*, the addresses are *sparse*; many (most) 12-bit address values are unused.

- Each GPR (except for `x0`) is “memory-like”. When a value is written, it remains available for all subsequent reads until the next write. All those reads return the same value—the value most recently written.

  CSR reads can, in general, have side effects. CSR writes, in addition to writing a value, can have other side effects as well. A CSR read may not return the same value as the value most recently written.

For these reasons, we implement each CSR with a separate, ordinary register. Here are the CSRs we need for exception-handling:

```
// src_Common/CSRs.bsv: line 61 ...
Reg #(Bit #(XLEN)) csr_mtvec  <- mkReg (0);
Reg #(Bit #(XLEN)) csr_mepc   <- mkReg (0);
Reg #(Bit #(XLEN)) csr_mcause <- mkReg (0);
Reg #(Bit #(XLEN)) csr_mtval  <- mkReg (0);
```

Here are the definitions of standard RISC-V 12-bit addresses for these CSRs:

```
// src_Common/CSR_Bits.bsv: line 23 ...
Bit #(12) csr_addr_MTVEC  = 'h305;
Bit #(12) csr_addr_MEPC   = 'h341;
Bit #(12) csr_addr_MCAUSE = 'h342;
Bit #(12) csr_addr_MTVAL  = 'h343;
```

To write a CSR we use a case statement that selects on the CSR address and writes to the particular CSR.

```
// src_Common/CSRs.bsv: line 89 ...
function ActionValue #(Bool)
         fav_csr_write (Bit #(12) csr_addr, Bit #(XLEN) csr_val);
    actionvalue
        ...
        Bool exception = False;
        case (csr_addr)
            ...
            csr_addr_MTVEC:     csr_mtvec  <= csr_val;
            csr_addr_MEPC:      csr_mepc   <= csr_val;
            csr_addr_MCAUSE:    csr_mcause <= csr_val;
            csr_addr_MTVAL:     csr_mtval  <= csr_val;

            csr_addr_MCYCLE:    if (xlen == 32)
                                    csr_mcycle [1] <= {csr_mcycle [1] [63:32],
                                                       csr_val [31:0] };
                                else
                                    csr_mcycle [1] <= zeroExtend (csr_val);
            csr_addr_MCYCLEH:   if (xlen == 32)
                                    csr_mcycle [1] <= {csr_val [31:0],
                                                       csr_mcycle [1] [31:0] };
                                else
                                    exception = True;

            csr_addr_MINSTRET:  if (xlen == 32)
                                    csr_minstret <= {csr_minstret [63:32], csr_val [31:0]};
                                else
                                    csr_minstret <= zeroExtend (csr_val);
            csr_addr_MINSTRETH: if (xlen == 32)
                                    csr_minstret <= {csr_val [31:0], csr_minstret [31:0]};
                                else
                                    exception = True;

            default:            exception = True;
        endcase
        return exception;
    endactionvalue
endfunction
```

This function returns the boolean `exception`: False on success, and True on failure. For the moment, failure only means that the argument CSR address was bad, *i.e.*, it referred to some unknown CSR. Failure can also occur if we try to write to a “read-only” CSR (we will see an example later, CSR TIME).

Similarly, to read a CSR we use a case statement that selects on the CSR address and reads the particular CSR.

```
// src_Common/CSRs.bsv: line 135 ...
function ActionValue #(Tuple2 #(Bool, Bit #(XLEN)))
         fav_csr_read (Bit #(12) csr_addr);
    actionvalue
        ...
        Bool        exception = False;
        Bit #(XLEN) y         = ?;
        case (csr_addr)
            ...
            csr_addr_MTVEC:     y = csr_mtvec;
            csr_addr_MEPC:      y = csr_mepc;
            csr_addr_MCAUSE:    y = csr_mcause;
            csr_addr_MTVAL:     y = csr_mtval;

            csr_addr_CYCLE:     y = truncate (csr_mcycle [1]);
            csr_addr_CYCLEH:    if (xlen == 32)
                                    y = csr_mcycle [1] [63:32];
                                else
                                    exception = True;

            csr_addr_MCYCLE:    y = truncate (csr_mcycle [1]);
            csr_addr_MCYCLEH:   if (xlen == 32)
                                    y = csr_mcycle [1] [63:32];
                                else
                                    exception = True;

            csr_addr_TIME:      y = truncate (crg_csr_TIME [1]);
            csr_addr_TIMEH:     if (xlen == 32)
                                    y = crg_csr_TIME [1] [63:32];
                                else
                                    exception = True;

            csr_addr_MINSTRET:  y = truncate (csr_minstret);
            csr_addr_MINSTRETH: if (xlen == 32)
                                    y = csr_minstret [63:32];
                                else
                                    exception = True;
            csr_addr_INSTRET:   y = truncate (csr_minstret);
            csr_addr_INSTRETH:  if (xlen == 32)
                                    y = csr_minstret [63:32];
                                else
                                    exception = True;
            default:            exception = True;
        endcase
        ...
        return tuple2 (exception, y);
    endactionvalue
endfunction
```

This function returns a pair of values (a 2-tuple). The first component, like the result of the CSR-write function, is a boolean exception flag: False on success and True on failure. The second component is the value read from the CSR.
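To show how a caller unpacks such a result, here is a cut-down, self-contained sketch with just two CSRs, fixed at 32 bits. The module `mkMiniCSRsDemo`, the function `fav_mini_csr_read` and the rule `rl_demo` are hypothetical names for this illustration and are not part of `CSRs.bsv`.

```bsv
module mkMiniCSRsDemo (Empty);
   // Two of the CSRs from the text, fixed here at 32 bits
   Reg #(Bit #(32)) csr_mepc   <- mkReg (0);
   Reg #(Bit #(32)) csr_mcause <- mkReg (0);

   Bit #(12) csr_addr_MEPC   = 'h341;
   Bit #(12) csr_addr_MCAUSE = 'h342;

   // Cut-down reader: returns (True, ?) on exception, else (False, value)
   function ActionValue #(Tuple2 #(Bool, Bit #(32)))
            fav_mini_csr_read (Bit #(12) csr_addr);
      actionvalue
         Bool      exception = False;
         Bit #(32) y         = ?;
         case (csr_addr)
            csr_addr_MEPC:   y = csr_mepc;
            csr_addr_MCAUSE: y = csr_mcause;
            default:         exception = True;
         endcase
         return tuple2 (exception, y);
      endactionvalue
   endfunction

   rule rl_demo;
      // 'match' unpacks the 2-tuple returned by the ActionValue
      match { .exc1, .v1 } <- fav_mini_csr_read (csr_addr_MEPC);
      match { .exc2, .v2 } <- fav_mini_csr_read ('hFFF);   // unknown CSR

      if (! exc1) $display ("mepc = %0h", v1);
      if (exc2)   $display ("CSR address 'hFFF raised an exception");
      $finish (0);
   endrule
endmodule
```

The caller must test the exception flag before using the value, since the value component is unspecified (`?`) when the address matches no CSR.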

For the CSRs module we do not export the above read and write functions directly; they are used internally inside the module. The interface declaration looks like this:

```
// src_Common/CSRs.bsv: line 29 ...
interface CSRs_IFC;
    method Action init (Initial_Params initial_params);

    // CSRRXX instruction execution
    // Returns (True, ?) if exception else (False, rd_val)
    method ActionValue #(Tuple2 #(Bool, Bit #(XLEN)))
           mav_csrrxx (Bit #(32) instr, Bit #(XLEN) rs1_val);

    // Trap actions
    // Returns PC from MTVEC for trap handler
    method ActionValue #(Bit #(XLEN))
           mav_exception (Bit #(XLEN) epc,
                          Bool        is_interrupt,
                          Bit #(4)    cause,
                          Bit #(XLEN) tval);

    method Bit #(XLEN) read_epc;
    method Action ma_incr_instret;

    // Set TIME
    (* always_ready, always_enabled *)
    method Action set_TIME (Bit #(64) t);
endinterface
```

We will discuss these interface methods in more detail when we discuss trap-handling in Chapter 11. Suffice it to say, for now, that:

- the `mav_csrrxx` method directly implements the CSRRxx instructions;
- the `mav_exception` method directly implements the CSR reads and writes needed when taking a trap, and
- the `read_epc` method directly reads the MEPC CSR as needed by the MRET instruction.

The first was described in Section 2.7.2, and the latter two were described in Section 2.7.


# Chapter 10

## BSV: FSMs

### 10.1 Introduction

So far, we have only been discussing pure combinational functions, for which there is no concept of time. Combinational functions are just pure mathematical functions, “instantaneously” transforming input values to output values. In actual circuits, there is of course a finite delay through wires and gates, as dictated by physics, but in the digital abstraction of time we think of this as instantaneous.

However, a CPU, as shown in Figure 10.1, represents a *process*, a behavior that evolves over time. For example, the Drum CPU executes one full instruction after another, repeating forever the flow along the black arrows in the diagram. For each instruction, it first performs a Fetch operation, which sends a request to memory. Some time later, the memory sends back a response, which is then processed by the Decode step, the Register-Read-and-Dispatch step and then one of the Execute steps. The Execute DMem step sends a request to memory. Some time later, the memory sends back a response, which is processed in the Retire step. Finally, the process loops back to the Fetch step, and repeats for the next instruction.

Figure 10.1: Simple interpretation of RISC-V instructions (same as Fig. 6.1)

The simplest temporal process in hardware systems, perhaps the most classical, is the *FSM* (Finite State Machine). A classical notation for FSMs is the so-called “bubble-and-arrow” diagram: each bubble represents a distinct *state* of the system, and the arrows connecting bubbles represent *transitions* between states. Transitions are typically enabled by some boolean condition on the current state, and/or by the availability of some particular input from the environment. A transition moves the system to another state, and may also output something to the environment. In bubble-and-arrow diagrams, each arrow is often labeled with the condition that enables the transition, and with any outputs produced by the transition.

Figure 10.1 can be interpreted as a bubble-and-arrow FSM diagram: each yellow rectangle (bubble) is a state, and the process transitions from state to state, thereby executing RISC-V instructions. This is exactly what the Drum CPU does. In our diagram, the arrow labels are not conditions and actions, but rather the information (a struct) that is produced in one state and consumed by another.

In this chapter we describe some special constructs for FSMs in BSV. In the next chapter we will discuss using these FSM constructs to code the Drum CPU.

**NOTE:** In the literature one may read about FSMs where, from a state, we could have a choice of transitions to different destination states. These are called *non-deterministic* FSMs. In our designs we are only concerned with *deterministic* FSMs where the conditions on the arrows emerging from a state are mutually exclusive, so that they always identify a unique possible next state.

### 10.1.1 Sequential FSMs, Concurrent FSMs, and Digital Hardware

Classical FSMs in the literature are *sequential* FSMs—every transition is from the current state to a unique, particular next state.<sup>1</sup>

Any digital system (including an entire computer) can theoretically be viewed as a single (possibly giant!) FSM. The current-state is the current collective state of every register bit and every memory bit in the system. State transitions depend on the current state and any external inputs. Although theoretically correct, this is an impractical, non-scalable, and not very useful way of viewing complex digital systems.

Most non-trivial digital hardware systems are better viewed as composed of multiple *concurrent, communicating FSMs*, *i.e.*, multiple classical FSMs running concurrently and independently and communicating with each other (an output from one FSM may be an input to another FSM). Different BSV module instances are separate FSMs, each running their own process(es). These separate FSMs may communicate with each other *via* shared state (registers, FIFOs, register files, *etc.*). This is a more *modular* and *scalable* way to think about complex digital systems.

Even though Drum CPU execution is a sequential FSM, the overall system can be viewed as a pair of concurrent FSMs: Drum and the Memory System (see Figure 13.1). Each has its own internal FSM process and behavior, and they communicate memory requests and responses back and forth.

<sup>1</sup>Even in non-deterministic FSMs, though there may be several possible next-states, exactly one next-state is non-deterministically chosen.

For Fife, we will interpret *each* yellow box in Figure 10.1 as its own FSM. They all run concurrently (and concurrently with the memory system), and communicate various struct values (labels in Figure 10.1) between the FSMs.

## 10.2 Rules and StmtFSM in BSV

The fundamental behavioral/temporal primitive in BSV, used to specify processes, is the *rule*, but we will postpone a detailed discussion of rules until Chapters 14 and 18. Rules can express arbitrary process flows, both structured and unstructured. BSV’s **StmtFSM** is a higher-level notation that captures the idea of certain *structured* processes. **StmtFSM** is ultimately implemented (by the *bsc* compiler) with rules, so it does not add any fundamental semantic power to the language. Section 14.7 discusses when to use **StmtFSM** and when to use rules.

**HISTORICAL NOTE: Structured Programming**

In software program execution, the fundamental temporal “flow” primitive (in assembly language, for example) is from one instruction to the next, or a BRANCH or JUMP that can go to any instruction. With these, we can create arbitrary control flows.

However, in most programming languages, we work at a much higher level of abstraction. Instead of using BRANCH and JUMP willy-nilly, we use linguistic structures such as if-then-else, case statements, while loops, repeat loops, for loops, function calls and returns, *etc.* Further, these structures *compose* nicely, *i.e.*, they can be nested cleanly inside each other.

Ultimately all these constructs are implemented (by a compiler) in terms of the primitives, instruction sequences/BRANCH/JUMP, and in that sense they do not add any fundamental new semantic power to programming. However, these structured constructs are useful for humans: for readability, for guiding our thinking, and for reasoning about the correctness of programs.

Early programmers, *e.g.*, using the first versions of the Fortran programming language, worked directly with statement sequences (the analog of instruction sequences) and “GOTO” statements (the analog of BRANCH/JUMP). Because GOTO could go to any statement in the program, programs often became a mess of control transfers without any structure; such programs are sometimes called “spaghetti code”.

In the 1960s, Edsger Dijkstra, a Dutch computer scientist (later to win the ACM Turing Award), recognized the importance of clear *structure* in composing programs. In 1968 he wrote a famous letter to the editor of the Communications of the ACM journal (Association for Computing Machinery) titled “Go To Statement Considered Harmful”, a searing indictment of unstructured code.

The idea of constructing programs cleanly with nestable (composable) structures (if-then-else, while-loops, *etc.*), so-called “structured programming”, grew out of that thinking; it is something we take for granted in today’s programming languages.

**StmtFSM** is a sub-language within BSV for structured FSMs. The part of this sub-language that we use in Drum has the following grammar:<sup>2</sup>

```
Stmt ::= an Action
      | a sequence (seq block) of Stmt
      | an if-then-else of Stmt (conditionals)
      | a while-loop around a Stmt
```

We can see that the basic primitives are **Action**s, from which we can build FSMs by nesting in sequences, conditionals and loops. We will use these **StmtFSM** constructs to code Drum, which is a simple sequential process. **StmtFSM** will not be adequate to code Fife, which is a collection of concurrent FSMs; for that, we will use rules explicitly.

**NOTE:** We already briefly encountered a simple **StmtFSM**, with just a sequential block, in the testbench in Section 5.7.
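To make the grammar concrete, here is a minimal sketch that nests all four forms (actions, a `seq` block, an if-then-else and a while-loop) in a single `Stmt`. The package, module and register names (`Stmt_Sketch`, `mkStmt_Sketch`, `rg_count`) are invented for this illustration and are not part of the Drum sources; `mkAutoFSM` (Section 10.8) is used only to make the example runnable.

```bsv
package Stmt_Sketch;

import StmtFSM :: *;

(* synthesize *)
module mkStmt_Sketch (Empty);
   // Hypothetical loop counter, for illustration only
   Reg #(Bit #(4)) rg_count <- mkReg (0);

   Stmt s =
      seq                                  // a sequence of Stmt
         $display ("start");               // an Action
         while (rg_count < 3)              // a while-loop around a Stmt
            seq
               if (rg_count [0] == 0)      // an if-then-else of Stmt
                  $display ("count %0d is even", rg_count);
               else
                  $display ("count %0d is odd", rg_count);
               rg_count <= rg_count + 1;   // an Action (register write)
            endseq
         $display ("done");
      endseq;

   mkAutoFSM (s);
endmodule

endpackage
```

Inside the loop body, the conditional and the register update are performed one after the other, not simultaneously, because they are separate items of a `seq` block.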

### 10.3 Actions and the Action type

The fundamental building-block for **StmtFSM** is the “action”, which is a statement/expression of type **Action**. Some common examples:

```
rg_pc <= rg_pc + 4;            // Assignment to a register
f_Fetch_to_Decode.deq;         // Dequeue a fifo
f_Decode_to_RR.enq (v);        // Enqueue into a fifo
$display ("Hello, World!");    // Print something (in simulation only)
```

We discussed the **Action** (and **ActionValue**) types in Section 5.6.1. We used them in the return type of the **fn\_Fetch** function in Section 7.2. To recap, an expression with **Action** or **ActionValue** type is one that potentially has a side effect, as in each of the above example statements.

As discussed in Section 8.3.4 the first assignment statement is syntactic shorthand for:

```
rg_pc._write (rg_pc._read + 4)
```

i.e., it is an invocation of the register **\_write** method which, as described in Section 8.3.1, has type **Action**. Similarly, as described in Section 8.5.1, fifo **enq** and **deq** methods have return-type **Action**, so the statements **f\_Fetch\_to\_Decode.deq** and **f\_Decode\_to\_RR.enq (v)** have type **Action**.

**\$display()** is a built-in construct in BSV that also has type **Action**.

#### 10.3.1 Action blocks: composing actions into larger actions

The **Action** type is recursive: it is either a primitive action (like those described above), or it is a collection of things of type **Action**, collected using an **action** block (bracketed by the BSV keywords **action** and **endaction**). For example, the above primitive actions can be collected into a single entity which itself has type **Action**:

```
1   action
2     rg_pc <= rg_pc + 4;           // Assignment to a register
3     f_Fetch_to_Decode.deq;        // Dequeue a fifo
4     f_Decode_to_RR.enq (v);       // Enqueue into a fifo
5     $display ("Hello, World!");   // Print something (in simulation only)
6   endaction
```

<sup>2</sup>Section 10.10 describes more available features in **StmtFSM**.

Although the actions in an `action` block must be written in some textual order, `action` blocks do not involve any sequencing. There is no temporal ordering of the actions in an `action` block. All the actions in an `action` block (either directly in the block or, recursively, in a sub-block) occur “instantly” and “simultaneously”. In the example above, lines 2-5 could have been written in any order with no change in meaning/behavior.
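A classic way to see this simultaneity is a register swap. The sketch below is illustrative only: the names `Swap_Sketch`, `rg_a`, `rg_b` and `a_swap` are hypothetical, and `seq` and `mkAutoFSM` (Sections 10.4 and 10.8) serve merely to print the register values before and after the block.

```bsv
package Swap_Sketch;

import StmtFSM :: *;

(* synthesize *)
module mkSwap_Sketch (Empty);
   // Hypothetical registers, for illustration only
   Reg #(Bit #(8)) rg_a <- mkReg (1);
   Reg #(Bit #(8)) rg_b <- mkReg (2);

   // Both register reads observe the values from before the block,
   // and both writes take effect together. Exchanging the two lines
   // textually does not change the behavior.
   Action a_swap =
      action
         rg_a <= rg_b;
         rg_b <= rg_a;
      endaction;

   Stmt s =
      seq
         $display ("before: a = %0d, b = %0d", rg_a, rg_b);
         a_swap;
         $display ("after:  a = %0d, b = %0d", rg_a, rg_b);
      endseq;

   mkAutoFSM (s);
endmodule

endpackage
```

If the two assignments were performed one after the other, both registers would end up holding the same value; because they occur simultaneously, the values are exchanged (`a` becomes 2 and `b` becomes 1).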

### 10.3.2 Binding names in Action blocks

It is often convenient to give a meaningful name to a sub-expression in an `Action` block. For example:

```

1   action
2     Bit #(XLEN) next_pc = rg_pc + 4;
3     rg_pc <= next_pc;
4     $display ("Next PC is %08h", next_pc);
5   endaction

```

Here, we bind the identifier `next_pc` in line 2, and then use it in lines 3 and 4. We can often replace the type in the binding with the keyword `let`, if the type is obvious from the context (the `bsc` compiler will figure it out):

```

1   action
2     let next_pc = rg_pc + 4;
3     rg_pc <= next_pc;
4     $display ("Next PC is %08h", next_pc);
5   endaction

```

The *scope* of the identifier, i.e., the region of program text where it is available for use, is the remainder of the `action` block (including inside any syntactically nested statements).

A binding, per se, is not an action! It is just a convenience, giving a name to the value of the right-hand side expression.

Bindings (whether with a type or with `let`) impose some syntactic ordering on statements in the block: a binding of an identifier must precede any use of that identifier. In the previous two examples, line 2 (the binding) must precede lines 3 and 4 (the actions), but lines 3 and 4 could be written in the opposite order. Note, this is not a *temporal* sequencing; all actions in the block still occur simultaneously.

## 10.4 StmtFSM: sequences of actions

Our first construct that has temporal behavior is the `seq-endseq` block. Each item in the block is an entity of type `Action` or `Stmt`, and they are performed sequentially, one after another.

```
seq
  ... an Action or Stmt ... ;
  ...
  ... an Action or Stmt ... ;
endseq
```

The entire `seq` block itself has type `Stmt`, and so it can be nested inside other `StmtFSM` constructs.

The testbench in Section 5.7 contains an example of a `seq` block.

## 10.5 StmtFSM: conditionals (if-then-else)

Conditional process execution can be expressed with traditional if-then-else notation:

```
if (... Bool expression ...)
  ... an Action or Stmt ...
else
  ... an Action or Stmt ...
```

The entire if-then-else construct itself has type `Stmt`, and so it can be nested inside other `StmtFSM` constructs.

In Section 5.11 we described ordinary BSV if-then-else expressions, which often represent hardware multiplexers, where both arms are “evaluated” (the hardware exists for both sides) and the multiplexer merely selects one of the two outputs.

**NOTE:** StmtFSM uses the same notation, but here it represents a *process*, and only one of the two arms is executed (like if-then-else in most software programming languages).

There is no ambiguity in these two uses of if-then-else notation—the context always clearly distinguishes what we mean, because there is no overlap between ordinary expressions and StmtFSM constructions.
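The contrast between the two uses can be sketched as follows. This is only an illustration under assumed names (`If_Sketch`, `fn_select`, `rg_sel`, `rg_x` are all hypothetical): the function uses if-then-else as a value-selecting expression, while the `Stmt` uses it as a process construct whose two arms take different numbers of steps.

```bsv
package If_Sketch;

import StmtFSM :: *;

(* synthesize *)
module mkIf_Sketch (Empty);
   // Hypothetical state, for illustration only
   Reg #(Bool)     rg_sel <- mkReg (True);
   Reg #(Bit #(8)) rg_x   <- mkReg (0);

   // Ordinary if-then-else in a function: both values exist, and the
   // conditional merely selects one of them (a multiplexer).
   function Bit #(8) fn_select (Bool sel, Bit #(8) a, Bit #(8) b);
      if (sel) return a;
      else     return b;
   endfunction

   Stmt s =
      seq
         rg_x <= fn_select (rg_sel, 10, 20);

         // StmtFSM if-then-else: only one arm is executed, and the
         // two arms may take different numbers of steps.
         if (rg_sel)
            seq
               $display ("then-arm, first step: x = %0d", rg_x);
               $display ("then-arm, second step");
            endseq
         else
            $display ("else-arm: x = %0d", rg_x);

         $display ("after the conditional");
      endseq;

   mkAutoFSM (s);
endmodule

endpackage
```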

## 10.6 StmtFSM: while-loops

Repetitive processes can be expressed with traditional while-loop notation:

```
while (... Bool expression ...)
  ... expression of type Stmt ...
```
The entire while-loop construct itself has type `Stmt`, and so it can be nested inside other `StmtFSM` constructs.

## 10.7 StmtFSM: pausing until some condition holds

An action in a `StmtFSM` can be the `await(b)` action, which simply waits until the boolean expression in its argument evaluates to true:

```
await (... Bool expression ...);
```

Of course, while it is paused at an `await`, the enclosing `Stmt` cannot itself make the condition true, since it is not performing any actions that could change the state on which the expression depends. The state change must therefore be effected by some other part of the BSV design (a concurrent FSM, or a rule), not by this particular `Stmt`.
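As a minimal sketch of this division of labor, the hypothetical module below (the names `Await_Sketch`, `rg_ticks` and `rl_tick` are invented for illustration) uses a separate rule to change the state, while the FSM merely waits for the condition to become true.

```bsv
package Await_Sketch;

import StmtFSM :: *;

(* synthesize *)
module mkAwait_Sketch (Empty);
   // Hypothetical free-running counter, for illustration only
   Reg #(Bit #(8)) rg_ticks <- mkReg (0);

   // This rule, not the FSM, changes the state read by the await condition
   rule rl_tick;
      rg_ticks <= rg_ticks + 1;
   endrule

   Stmt s =
      seq
         $display ("FSM: waiting");
         await (rg_ticks >= 10);     // pauses here until the condition holds
         $display ("FSM: resumed, ticks = %0d", rg_ticks);
      endseq;

   mkAutoFSM (s);
endmodule

endpackage
```

Without `rl_tick` (or some other concurrent writer of `rg_ticks`), the FSM would remain paused at the `await` forever.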

## 10.8 StmtFSM: mkAutoFSM: a simple FSM module constructor

Creating an FSM using `StmtFSM` in BSV is a two-step process:

- Define the desired FSM behavior as an entity of type `Stmt`. Think of this as a *specification* of desired behavior.
- Instantiate a module that takes this specification and implements the behavior.

There are several `Stmt` → module constructors available in the BSV library. In this book we use only one of them, `mkAutoFSM`:

```
mkAutoFSM (... argument expression of type Stmt ...);
```

This creates an FSM with the behavior specified by the `Stmt` argument. The FSM starts running immediately as we come out of reset, starting at the first statement, and terminates the entire simulation when we fall through the last statement. Of course, it may never terminate if it contains an infinite `while` loop.

Note: “terminating” the simulation means it executes a `$finish()` action which stops the complete simulation. In hardware (where there is no concept of `$finish()`) the FSM simply goes idle.<sup>3</sup> There may be other FSMs and rules in the design that continue running.

## 10.9 StmtFSM in testbenches

`StmtFSM` is often used in testbenches, because we want to sequence the delivery of stimulus to the Design-Under-Test (DUT) and sequence the collection and display/checking/analysis of the resulting outputs; `StmtFSM` is a quick and simple way to write such sequences. We showed an example of a small testbench using `StmtFSM` in Section 5.7. When the DUT is pipelined, we use two FSMs, one to deliver a sequence of stimulus inputs while the other one concurrently processes the sequence of output results.

`StmtFSM` can also be used in designs; indeed, that is what we are going to describe in Chapter 11 for the Drum CPU. Section 14.7 discusses when to use `StmtFSM` and when to use rules.

---

<sup>3</sup>Technically, when none of the rules implementing the FSM can fire any more.

## 10.10 StmtFSM: many more features

In this chapter we have covered all the features of `Stmt` that we need for coding Drum. There are many additional constructors for `Stmt` that we skip here, such as for-loops and repeat-loops. Perhaps the most interesting are fork-join concurrent blocks (`par-endpar`), which specify concurrent FSMs.

The type `Stmt` is a first-class type in BSV. You can write functions that have arguments and results of type `Stmt`. For example, you might write a function that takes a numeric argument and returns a `Stmt`; this function can be invoked more than once to produce multiple `Stmts` that differ because of the parameter. Each of these `Stmts` can be instantiated into its own FSM. This further emphasizes the view that `Stmt` is a specification of an FSM behavior that is then instantiated in a module to produce that behavior.
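The following sketch illustrates a function that returns a `Stmt`. The names (`Stmt_Fn_Sketch`, `fn_two_step_stmt`) are hypothetical. For brevity, the two resulting specifications are composed into one larger `Stmt` and instantiated as a single FSM, rather than each being instantiated into its own FSM.

```bsv
package Stmt_Fn_Sketch;

import StmtFSM :: *;

// A function whose result is an FSM specification (hypothetical name)
function Stmt fn_two_step_stmt (Integer n);
   Stmt s =
      seq
         $display ("spec %0d: first step", n);
         $display ("spec %0d: second step", n);
      endseq;
   return s;
endfunction

(* synthesize *)
module mkStmt_Fn_Sketch (Empty);
   // Two specifications obtained from the same function, differing
   // only in the parameter, nested inside a larger specification
   Stmt s =
      seq
         fn_two_step_stmt (1);
         fn_two_step_stmt (2);
      endseq;

   mkAutoFSM (s);
endmodule

endpackage
```

The function is evaluated during static elaboration; only the module constructor (`mkAutoFSM` here) turns a specification into actual FSM hardware.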

The `StmtFSM` package also defines many FSM module constructors (alternatives to `mkAutoFSM`), including modules with explicit start, stop and restart methods, modules that can be paused based on a boolean argument, and more.

We refer the reader to the FSM chapter in the *bsc Libraries Reference Guide* [25].

# Chapter 11

## RISC-V: the Drum unpipelined CPU (an FSM)

### 11.1 Introduction

In this chapter we code Drum's behavior, illustrated in Figure 11.1, using BSV's StmtFSM construct.



Figure 11.1: Simple interpretation of RISC-V instructions (same as Fig. 6.1)

### 11.2 The Drum CPU module interface

The Drum CPU interface is shown below.

```
interface CPU_IFC;
   method Action init (Initial_Params initial_params);
   // IMem
   interface FIFOF_O #(Mem_Req) fo_IMem_req;
   interface FIFOF_I #(Mem_Rsp) fi_IMem_rsp;
   ...
   // DMem, non-speculative
   interface FIFOF_O #(Mem_Req) fo_DMem_req;
   interface FIFOF_I #(Mem_Rsp) fi_DMem_rsp;

   // Set TIME
   (* always_ready, always_enabled *)
   method Action set_TIME (Bit #(64) t);

endinterface
```

The interface is simple:

- The `init` method carries an `Initial_Params` struct containing any initial values needed by the CPU. A typical field is the initial value of the PC, since different software systems make different assumptions about the “starting address” for code. In many RV32I example codes, the starting address is `'h_8000_0000`.
- A `FIFOF_O` interface to carry memory requests for instructions (out-bound from the CPU to the memory);
- A `FIFOF_I` interface to carry corresponding memory responses containing instructions (in-bound from memory to the CPU);
- A `FIFOF_O` interface to carry memory requests from load/store instructions (out-bound from the CPU to the memory);
- A `FIFOF_I` interface to carry corresponding load/store memory responses (in-bound from memory to the CPU).

(Please re-read Section 6.4.1 for the discussion on Harvard architectures, which have separate memory-access channels for instructions (Fetch, IMem) and for data (LOAD/STORE, DMem)).

Later we will see that Fife shares this interface, so that the Fife and Drum CPUs are easily interchangeable in any system or testbench.

### 11.3 The Drum CPU module

The STATE and INTERFACE sections of the Drum CPU module are shown below (we will discuss the elided BEHAVIOR section shortly).

```
src_Drum/CPU.bsv: line 76 ...
(* synthesize *)
module mkCPU (CPU_IFC);
   // =====
   // STATE

   // Don't run until the PC (and other things) are initialized
   Reg #(Bool) rg_running <- mkReg (False);

   // The Program Counter
   Reg #(Bit #(XLEN)) rg_pc <- mkReg (0);

   // General-Purpose Registers (GPRs)
   GPRs_IFC #(XLEN) gprs <- mkGPRs_synth;

   // Control-and-Status Registers (CSRs)
   CSRs_IFC csrs <- mkCSRs;

   // Inter-step registers
   Reg #(Fetch_to_Decode)      rg_Fetch_to_Decode      <- mkRegU;
   Reg #(Decode_to_RR)         rg_Decode_to_RR         <- mkRegU;
   Reg #(Result_Dispatch)      rg_Dispatch             <- mkRegU;
   Reg #(EX_Control_to_Retire) rg_EX_Control_to_Retire <- mkRegU;
   Reg #(EX_to_Retire)         rg_EX_to_Retire         <- mkRegU;

   // Paths to and from memory
   FIFOF #(Mem_Req) f_IMem_req <- mkFIFOF;
   FIFOF #(Mem_Rsp) f_IMem_rsp <- mkFIFOF;

   FIFOF #(Mem_Req) f_DMem_req <- mkFIFOF;
   FIFOF #(Mem_Rsp) f_DMem_rsp <- mkFIFOF;

   // Regs to set up exception handling
   Reg #(Bool)        rg_exception <- mkReg (False);
   Reg #(Bit #(XLEN)) rg_epc       <- mkRegU;
   Reg #(Bit #(4))    rg_cause     <- mkRegU;
   Reg #(Bit #(XLEN)) rg_tval      <- mkRegU;
```

(... BEHAVIOR elided, to be discussed shortly ...)

```
src_Drum/CPU.bsv: line 390 ...
   method Action init (Initial_Params initial_params);

      ...
      rg_pc      <= initial_params.pc_reset_value;
      rg_running <= True;
   endmethod

   // IMem
   interface fo_IMem_req = to_FIFOF_O (f_IMem_req);
   interface fi_IMem_rsp = to_FIFOF_I (f_IMem_rsp);

   // DMem, non-speculative
   interface fo_DMem_req = to_FIFOF_O (f_DMem_req);
   interface fi_DMem_rsp = to_FIFOF_I (f_DMem_rsp);

   // Set TIME
   method Action set_TIME (Bit #(64) t) = csrs.set_TIME (t);
endmodule
```

The STATE section first instantiates a register `rg_running`, initially False, that will be set to True by the `init` method that initializes the PC to a specific initial value. The FSM will wait for this before starting the first instruction-fetch.

Then, `mkCPU` instantiates a register for the PC, and then the GPRs module (described in Section 9.2) and the CSRs module (described in Section 9.3). Then, it instantiates a set of registers to hold values between temporal FSM steps (with struct types shown in Figure 11.1). For example, the Fetch step will write a value into `rg_Fetch_to_Decode` which will be read later by the Decode step. Then, it instantiates four FIFOs for IMem requests (outgoing) and responses (incoming), and for DMem requests (outgoing) and responses (incoming). As mentioned before, we do not make any assumption about the *latency* of memory requests, *i.e.*, how long it takes the external memory subsystem to consume a request from one of the request FIFOs and enqueue a response into the corresponding response FIFO. Finally, it instantiates the four registers needed for trap-handling, described in Section 2.7.

In the display above we have elided the BEHAVIOR section of the module, which we will describe shortly.

In the INTERFACE section, the `init` method initializes the PC and sets `rg_running` to true, releasing the BEHAVIOR section to start executing. We use the interface transformers discussed in Section 8.5.5 that produce Semi-FIFO “views” of FIFOs to lift the FIFO interfaces into the module interface.
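The "wait until initialized" handshake between the `init` method and the FSM can be sketched in isolation as follows. This is not the Drum code: the interface `Tiny_CPU_IFC`, the modules `mkTiny_CPU` and `mkTop`, and the use of `await` to hold the FSM are all assumptions made for this illustration.

```bsv
package Init_Sketch;

import StmtFSM :: *;

// Hypothetical, cut-down interface with only an init method
interface Tiny_CPU_IFC;
   method Action init (Bit #(32) pc_reset_value);
endinterface

module mkTiny_CPU (Tiny_CPU_IFC);
   Reg #(Bool)      rg_running <- mkReg (False);
   Reg #(Bit #(32)) rg_pc      <- mkReg (0);

   // BEHAVIOR: do nothing until the init method has been invoked
   Stmt s =
      seq
         await (rg_running);
         $display ("running; first fetch would use PC %08h", rg_pc);
      endseq;

   mkAutoFSM (s);

   // INTERFACE
   method Action init (Bit #(32) pc_reset_value);
      rg_pc      <= pc_reset_value;
      rg_running <= True;
   endmethod
endmodule

(* synthesize *)
module mkTop (Empty);
   Tiny_CPU_IFC cpu     <- mkTiny_CPU;
   Reg #(Bool)  rg_done <- mkReg (False);

   // The environment invokes init exactly once
   rule rl_init (! rg_done);
      cpu.init ('h8000_0000);
      rg_done <= True;
   endrule
endmodule

endpackage
```

The FSM and the method communicate only through the register `rg_running`, which is the same pattern that the text describes for `mkCPU`.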

## 11.4 Help-functions for the Drum CPU module behavior

Before we look at the FSM implementing Drum behavior, we first have some Action functions that encapsulate some common actions performed in several states in the FSM.

**NOTE:** In BSV, function definitions do not have to be at the top level of a file; in fact, they can be defined at any nested level. Here, we define these inside a module.

The following function writes an rd-value into the GPRs, in those cases where the instruction has an rd (`fn_Decode`, described in Section 7.3, computed `has_rd` for each kind of instruction):

```
src_Drum/CPU.bsv: line 124 ...
function Action fa_update_rd (RR_to_Retire x1,
                              Bit #(XLEN) rd_val);
   action
      if (x1.has_rd) begin
         let rd = instr_rd (x1.instr);
         gprs.write_rd (rd, rd_val);
         ...
      end
   endaction
endfunction
```

The following function is the last action during an instruction’s execution, updating the PC and the instruction number:

```
src_Drum/CPU.bsv: line 138 ...
function Action fa_redirect_Fetch (Bit #(XLEN) next_pc);
   action
      rg_pc   <= next_pc;
      rg_inum <= rg_inum + 1;
   endaction
endfunction
```

The following function saves values into registers `rg_epc`, `rg_cause` and `rg_tval`, when a trap is detected. These will later be written into the corresponding CSR registers during trap-handling.

```
src_Drum/CPU.bsv: line 145 ...
function Action fa_setup_exception (Bit #(XLEN) epc,
                                    Bit #(4)    cause,
                                    Bit #(XLEN) tval);
   action
      rg_exception <= True;
      rg_epc       <= epc;
      rg_cause     <= cause;
      rg_tval      <= tval;
   endaction
endfunction
```

## 11.5 The main behavior actions in the Drum CPU module

The following sub-sections show the main actions performed by Drum, corresponding to the components in Figure 11.1.

**NOTE:** In BSV, the type `Action` is a first-class type. This means we can write “action expressions” and bind the result (of type `Action`) to a name, and then refer to that action later by that name. We use this capability below first to define the Drum FSM actions; then we will embed these actions in the Drum FSM in Section 11.6. We also reuse exactly the same actions in the alternate specification of Drum’s behavior using Rules instead of StmtFSM in Chapter 15.

### 11.5.1 FSM action for Fetch

```
src_Drum/CPU.bsv: line 160 ...
Action a_Fetch =
   action
      ...
      let y <- fn_Fetch (rg_pc,
                         ...
      rg_Fetch_to_Decode <= y.to_D;
      f_IMem_req.enq (y.mem_req);
      ...
   endaction;
```

This action applies `fn_Fetch()` (Section 7.2) to the PC, stores the `to_D` part of the result in register `rg_Fetch_to_Decode`, and sends the `mem_req` part of the result to the Instruction Memory by enqueueing it on the outgoing `f_IMem_req` FIFO. The other end (dequeue end) of the FIFO is in the module interface. The outer environment of this module will dequeue it and forward it to memory.

### 11.5.2 FSM action for Decode

```
src_Drum/CPU.bsv: line 175 ...
Action a_Decode =
   action
      let mem_rsp <- pop_o (to_FIFOF_O (f_IMem_rsp));
      let y       <- fn_Decode (rg_Fetch_to_Decode, mem_rsp, rg_flog);
      rg_Decode_to_RR <= y;
      ...
   endaction;
```

We pop the response from IMem (bind `mem_rsp` to the value at head of the FIFO, and also remove it from the FIFO), then apply `fn_Decode()` (Section 7.3) to the `Fetch_to_Decode` value from the Fetch step and the IMem response, and store the function result in register `rg_Decode_to_RR`.

### 11.5.3 FSM action for Dispatch

The next FSM action is the Register-Read and Dispatch step:

```
src_Drum/CPU.bsv: line 185 ...
Action a_Register_Read_and_Dispatch =
   action
      // Read GPRs
      // Ok that read_rs1 and read_rs2 may return junk values
      //           since not all instrs have rs1/rs2.
      let x       = rg_Decode_to_RR;
      let rs1_val = gprs.read_rs1 (instr_rs1 (x.instr));
      let rs2_val = gprs.read_rs2 (instr_rs2 (x.instr));

      Result_Dispatch y <- fn_Dispatch (x, rs1_val, rs2_val, rg_flog);
      rg_Dispatch       <= y;
      ...
   endaction;
```

Using the information in `rg_Decode_to_RR` created in the Decode step, we read the two registers `rs1` and `rs2`. We apply the function `fn_Dispatch` (Section 7.4) and store the result in register `rg_Dispatch`.

Note that we blindly read both registers `rs1` and `rs2`, even though many instructions do not have one or both of these fields. In those situations we'll be reading bogus/irrelevant values, but it does not matter: in the Execute step we will only use these values in instructions that need them. Recall that while in a software programming language this may seem unnecessary or wasted work, in this hardware context nothing is wasted: the hardware for reading both registers exists, and nothing is lost in using it.

### 11.5.4 FSM actions for Execute and Retire

After the Dispatch FSM step, we move on to the Execute and Retire FSM steps. Figure 11.2 shows the details of the four “flows” that may follow. Each of the flows can have an exception or a normal (OK) result. The Direct flow may need to execute a SYSTEM instruction (ECALL, EBREAK, MRET and CSRRxx). Executing a CSRRxx instruction may also have an exception or normal result. For each flow, the colored ovals show the actions that may be performed; these correspond exactly to the three help-functions discussed earlier in Section 11.4.

Figure 11.2: Execute and Retire actions in Drum

We next elaborate each of these four flows.

#### 11.5.4.1 FSM actions in Direct flow of Execute and Retire

```
src_Drum/CPU.bsv: line 215 ...
Action a_Retire_direct =
   action
      let x_direct = rg_Dispatch.to_Retire;
      if (x_direct.exception) begin
         fa_setup_exception (x_direct.pc,       // epc
                             x_direct.cause,    // cause
                             x_direct.tval);    // tval
         log_Retire_Direct_exc (rg_flog, x_direct);
      end
      // -----
      else if (is_legal_CSRRxx (x_direct.instr)) begin
         match { .exc, .rd_val } <- csrs.mav_csrrxx (x_direct.instr,
                                                      x_direct.rs1_val);
         if (exc)
            fa_setup_exception (x_direct.pc,                 // epc
                                cause_ILLEGAL_INSTRUCTION,   // cause
                                x_direct.instr);             // tval
         else begin
            fa_update_rd (x_direct, rd_val);
            fa_redirect_Fetch (x_direct.fallthru_pc);
         end
         log_Retire_CSRRxx (rg_flog, exc, x_direct);
      end
      // -----
      else if (is_legal_MRET (x_direct.instr)) begin
         fa_redirect_Fetch (csrs.read_epc);
         csrs.ma_incr_instret;
         log_Retire_MRET (rg_flog, x_direct);
      end
      // -----
      else if (is_legal_ECALL (x_direct.instr)
               || is_legal_EBREAK (x_direct.instr))
      begin
         let cause = ((x_direct.instr [20] == 0)
                      ? cause_ECALL_FROM_M
                      : cause_BREAKPOINT);
         fa_setup_exception (x_direct.pc,    // epc
                             cause,
                             0);             // tval
         csrs.ma_incr_instret;
         log_Retire_ECALL_EBREAK (rg_flog, x_direct);
      end
      else begin
         wr_log2 (rg_flog, $format ("CPU.EX.Direct: IMPOSSIBLE"));
         $finish (1);
      end
   endaction;
```

First, for convenience, we give the value `rg_Dispatch.to_Retire` a shorter and more convenient name, `x_direct`, to be used in the rest of this `action-endaction` block.

If we received an exception from the Dispatch step, we invoke the help-function `fa_setup_exception()` (Section 11.4) to set register `rg_exception` to true and to record the relevant values in the registers `rg_epc`, `rg_cause` and `rg_tval`.

If we received a CSRRxx instruction from the Dispatch step, we invoke the `mav_csrrxx()` method in the `csrs` module to perform the instruction on the relevant CSRs. The result is a 2-tuple (Section 6.3) and we use a `match` construct to bind the names `exc` and `rd_val` to the two components. If the CSRRxx instruction failed (`exc` is true), once again we invoke `fa_setup_exception()` to record this. Otherwise, we invoke `fa_update_rd()` to write `rd_val` to a GPR, and we invoke `fa_redirect_Fetch()` to update the PC to the fall-through PC value.

If we received an MRET instruction from the Dispatch step, we simply invoke `fa_redirect_Fetch()` to resume executing at the PC value provided in `rg_mtvec`.

If we received an ECALL or EBREAK instruction from the Dispatch step, recall from Section 2.7.1 that these instructions merely invoke exceptions, with “cause codes” `cause_ECALL_FROM_M` and `cause_BREAKPOINT`, respectively (Section 2.7 discussed cause codes). We simply invoke `fa_setup_exception()` to record this. Inside this clause, since we are already assured that the current instruction is either ECALL or EBREAK, we only need to check one bit (`instr[20]`) to distinguish them in order to select the correct cause code.
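The one-bit check can be isolated as a small pure function. This is only an illustrative sketch: the function name and the local constants are hypothetical (the Drum code uses `cause_ECALL_FROM_M` and `cause_BREAKPOINT` directly), and it assumes the standard RISC-V encodings, where ECALL is `0x00000073`, EBREAK is `0x00100073`, and the cause codes are 11 and 3, respectively.

```bsv
// Hypothetical helper, for illustration only (not from the Drum sources).
// Assumes 'instr' is already known to be ECALL (32'h0000_0073)
// or EBREAK (32'h0010_0073); the two encodings differ only in bit 20.
function Bit #(4) fn_cause_ECALL_or_EBREAK (Bit #(32) instr);
   Bit #(4) cause_ecall_from_m = 11;   // cause code: environment call from M-mode
   Bit #(4) cause_breakpoint   = 3;    // cause code: breakpoint
   return ((instr [20] == 1'b0) ? cause_ecall_from_m : cause_breakpoint);
endfunction
```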

There is a final “else” clause, but between the functionality of the Decode and Dispatch steps, it should be impossible to enter this clause (this is just a bit of defensive programming and can safely be removed; even if not removed, it does not produce any hardware).

#### 11.5.4.2 Counting retired instructions in CSR `minstret`

The RISC-V specification says that there is a CSR called `minstret` that keeps a count of all retired instructions. This is often used for performance measurement; for example `minstret` divided by time gives us a measure of CPU speed, “instructions/second”.

There are only two nuances for which we must take special care. First, any instruction that raises an exception is considered to be “not retired”, so we do not increment `minstret` in that case.

Second, suppose `minstret` has the value 1000, and the next instruction we retire is a CSRRxx instruction that writes 600 into CSR `minstret`. Should the final value of `minstret` be 1001? Or 600? Or 601? Or something else? The RISC-V spec says it should be 600, *i.e.*, an explicit CSRRxx write into `minstret` should override any implicit increment of the CSR.

In the Drum code, we invoke:

```
csrs.ma_incr_instret;
```

to increment `minstret` inside the `csrs` module in almost all cases:

|                                  |                                               |
|----------------------------------|-----------------------------------------------|
| in <code>a_Retire_direct</code>  | for MRET, ECALL and EBREAK but not for CSRRxx |
| in <code>a_Retire_Int</code>     | when not an exception                         |
| in <code>a_Retire_Control</code> | when not an exception                         |
| in <code>a_Retire_DMem</code>    | when not an exception                         |

In the remaining case, CSRRxx in `a_Retire_direct`, we instead invoke the method `csrs.mav_csrrxx()`, which performs the CSRRxx instruction and itself increments `minstret` if there is no exception and if the instruction did not write to CSR `minstret`.
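The priority among these cases can be summarized as a pure function. The following is only a sketch of the rule described above, not the code inside the Drum `csrs` module; the function name and its arguments are hypothetical, and `minstret` is modeled here as a single 64-bit value.

```bsv
// Hypothetical sketch (not from the Drum sources): the value of minstret
// after one instruction reaches the Retire step.
function Bit #(64) fn_next_minstret (Bit #(64) old_val,         // current minstret
                                     Bool      exception,       // instruction raised an exception
                                     Bool      wrote_minstret,  // CSRRxx wrote to minstret
                                     Bit #(64) written_val);    // the value it wrote
   Bit #(64) result = old_val + 1;                  // normal case: implicit increment
   if (exception)           result = old_val;       // "not retired": no increment
   else if (wrote_minstret) result = written_val;   // explicit write overrides the increment
   return result;
endfunction
```

With the numbers from the example above, `fn_next_minstret (1000, False, True, 600)` evaluates to 600.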

#### 11.5.4.3 FSM actions in Execute and Retire Control flow

The Control path is a sequence of two actions. First, we have an Execute action that invokes `fn_EX_Control` (Section 7.5) on the information provided by Dispatch, and stores the result in register `rg_EX_Control_to_Retire`:

```
// src_Drum/CPU.bsv: line 264 ...
Action a_EX_Control =
  action
    let x = rg_Dispatch.to_EX_Control;
    let y <- fn_EX_Control (x, rg_flog);
    rg_EX_Control_to_Retire <= y;
    ...
  endaction;
```

Then, the Retire action checks if Execute resulted in an exception or not. If an exception, it invokes `fa_setup_exception` to record that. If not an exception, it invokes `fa_update_rd` to store the output value (this can only be a “return address” computed for JAL and JALR), and it invokes `fa_redirect_Fetch` to resume execution at the `next_pc` value computed for the BRANCH, JAL or JALR instruction.

```
// src_Drum/CPU.bsv: line 274 ...
Action a_Retire_Control =
  action
    let x_direct = rg_Dispatch.to_Retire;
    let x_control = rg_EX_Control_to_Retire;
    if (x_control.exception)
      fa_setup_exception (x_direct.pc,
                          x_control.cause,
                          x_control.tval);
    else begin
      fa_update_rd (x_direct, x_control.data);
      fa_redirect_Fetch (x_control.next_pc);
      csrs.ma_incr_instret;
    end
    ...
  endaction;
```

#### 11.5.4.4 FSM actions in Execute and Retire Integer flow

This sub-FSM is similar to the Control sub-FSM: it is also a sequence of two actions. The Execute action invokes `fn_EX_Int` (Section 7.6) on the information provided by Dispatch, and stores the result in register `rg_EX_to_Retire`.

```
// src_Drum/CPU.bsv: line 293 ...
Action a_EX_Int =
  action
    let x = rg_Dispatch.to_EX;
    let y <- fn_EX_Int (x, rg_flog);
    rg_EX_to_Retire <= y;
    ...
  endaction;
```

Then, the Retire action checks if Execute resulted in an exception or not. If an exception, it invokes `fa_setup_exception` to record that. If not an exception, it invokes `fa_update_rd` to store the output value and `fa_redirect_Fetch` to resume execution at the fall-through PC.

```
// src_Drum/CPU.bsv: line 303 ...
Action a_Retire_Int =
  action
    if (rg_EX_to_Retire.exception)
      fa_setup_exception (rg_Dispatch.to_Retire.pc,
                          rg_EX_to_Retire.cause,
                          rg_EX_to_Retire.tval);
    else begin
      fa_update_rd (rg_Dispatch.to_Retire, rg_EX_to_Retire.data);
      fa_redirect_Fetch (rg_Dispatch.to_Retire.fallthru_pc);
      csrs.ma_incr_instret;
    end
    ...
  endaction;
```

#### 11.5.4.5 FSM actions in Execute and Retire DMem flow

This sub-FSM is similar to the Control and Integer sub-FSMs: it is also a sequence of two actions. The Execute action just sends the memory request provided by the Dispatch step to memory by enqueueing it on the outgoing `f_DMem_req` FIFO.

```
// src_Drum/CPU.bsv: line 320 ...
Action a_EX_DMem =
  action
    Mem_Req y = rg_Dispatch.to_EX_DMem;
    f_DMem_req.enq (y);
    ...
  endaction;
```

Then, the Retire action pops the memory response from the incoming `f_DMem_rsp` FIFO. If the memory returned an exception, we compute the proper cause code and then invoke `fa_setup_exception()` to record it. If the memory did not return an exception, we invoke `fa_update_rd()` to store any loaded value into the GPRs, and invoke `fa_redirect_Fetch()` to resume execution at the fall-through PC.

```
// src_Drum/CPU.bsv: line 329 ...
Action a_Retire_DMem =
  action
    let x_direct = rg_Dispatch.to_Retire;
    let mem_rsp <- pop_o (to_FIFOF_O (f_DMem_rsp));

    dynamicAssert ((mem_rsp.rsp_type != MEM_REQ_DEFERRED),
                   "Mem req not speculative but got DEFERRED mem response");

    Bool exception = ((mem_rsp.rsp_type == MEM_RSP_ERR)
                      || (mem_rsp.rsp_type == MEM_RSP_MISALIGNED));

    if (exception) begin
      // Cause depends on the kind of error and on LOAD vs. STORE/AMO
      Bit #(4) cause = ((mem_rsp.rsp_type == MEM_RSP_MISALIGNED)
                        ? (is_LOAD (x_direct.instr)
                           ? cause_LOAD_ADDRESS_MISALIGNED
                           : cause_STORE_AMO_ADDRESS_MISALIGNED)
                        : (is_LOAD (x_direct.instr)
                           ? cause_LOAD_ACCESS_FAULT
                           : cause_STORE_AMO_ACCESS_FAULT));

      fa_setup_exception (x_direct.pc,               // epc
                          cause,
                          truncate (mem_rsp.addr));  // tval
    end
    else begin
      fa_update_rd (x_direct, truncate (mem_rsp.data));
      fa_redirect_Fetch (x_direct.fallthru_pc);
      csrs.ma_incr_instret;
    end
    ...
  endaction;
```

### 11.5.5 FSM actions for exceptions

The final action of the Drum FSM is used in case any of the Retire flows recorded an exception. It invokes the `mav_exception()` method in the CSRs module, clears `rg_exception`, and invokes `fa_redirect_Fetch` so that execution resumes at the trap-vector PC provided in CSR `mtvec`.

```
// src_Drum/CPU.bsv: line 361 ...
Action a_exception =
  action
    Bool is_interrupt = False;
    Bit #(XLEN) tvec_pc <- csrs.mav_exception (rg_epc,
                                               is_interrupt,
                                               rg_cause,
                                               rg_tval);
    rg_exception <= False;
    fa_redirect_Fetch (tvec_pc);
    ...
  endaction;
```

## 11.6 The Drum CPU module behavior

Having defined all the actions needed by Drum, we can now assemble them into the Drum FSM. First we define an FSM specification `exec_one_instr` (of type `Stmt`) for executing a single instruction.

Then, we embed that `Stmt` in a next-level `Stmt` in an infinite while-loop, preceded by an `await` action that waits for the initial PC to be set by the `init` method. Finally, we instantiate that `Stmt` using a `mkAutoFSM` module that will implement the behavior starting immediately when the circuit emerges from reset. The FSM never stops, because of the infinite while-loop.

```
module mkCPU (CPU_IFC);
  // STATE
  ...
  // ****
  // BEHAVIOR
  ...

  // src_Drum/Drum_FSM.bsv: line 10 ...
  Stmt exec_one_instr =
    seq
      a_Fetch;
      a_Decode;
      a_Register_Read_and_Dispatch;

      // Execute and Retire
      if (rg_Dispatch.to_Retire.exec_tag == EXEC_TAG_DIRECT)
        a_Retire_direct;
      else if (rg_Dispatch.to_Retire.exec_tag == EXEC_TAG_CONTROL)
        seq    // BRANCH, JAL, JALR
          a_EX_Control;
          a_Retire_Control;
        endseq
      else if (rg_Dispatch.to_Retire.exec_tag == EXEC_TAG_INT)
        seq    // LUI, AUIPC, IALU
          a_EX_Int;
          a_Retire_Int;
        endseq
      else if (rg_Dispatch.to_Retire.exec_tag == EXEC_TAG_DMEM)
        seq
          a_EX_DMem;
          a_Retire_DMem;
        endseq
      else    // IMPOSSIBLE
        noAction;

      if (rg_exception)
        a_exception;
    endseq;

  // src_Drum/Drum_FSM.bsv: line 41 ...
  mkAutoFSM (seq
               await (rg_running);
               while (True) exec_one_instr;
             endseq);

  // ****
  // INTERFACE
  ...
endmodule
```

Recall that each `Action` is instantaneous, and that FSMs sequence actions. In the FSM, the Fetch action `a_Fetch` takes place in one instant; the Decode action `a_Decode` takes place at a later instant; the `a_Register_Read_and_Dispatch` action takes place at an even later instant. After that, the FSM goes down one of four alternative paths (Direct, Control, Int and DMem). Finally, if `rg_exception` is true, it performs `a_exception`.
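The same structure (an `await`, then a loop whose body sequences actions and branches with if-then-else) can be tried out in miniature. The following is a sketch for experimentation only; the module name `mkMiniFSM`, the registers and the rule `rl_start` are hypothetical stand-ins and are not part of Drum. Unlike Drum, the loop here is finite, so that the simulation ends.

```bsv
import StmtFSM :: *;

// Hypothetical miniature of the Drum FSM structure (not from the Drum sources).
module mkMiniFSM (Empty);
   Reg #(Bool)     rg_running <- mkReg (False);   // stands in for Drum's rg_running
   Reg #(Bit #(8)) rg_n       <- mkReg (0);

   // Stands in for the 'init' method: lets the FSM proceed past 'await'
   rule rl_start (! rg_running);
      rg_running <= True;
   endrule

   // One loop iteration: each statement in the 'seq' is a separate, later instant
   Stmt one_iter =
      seq
         rg_n <= rg_n + 1;
         if (rg_n [0] == 1)
            $display ("odd  %0d", rg_n);
         else
            $display ("even %0d", rg_n);
      endseq;

   mkAutoFSM (seq
                 await (rg_running);
                 while (rg_n < 6) one_iter;
              endseq);
endmodule
```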

The `a_Fetch` action tries to enqueue an IMem request to memory and enqueue a `Fetch_to_Decode` struct on an output FIFO. The action is not “enabled” until both FIFOs have space available (are not full). When the Fetch action fires, both enqueues are performed, atomically.

The `a_Decode` action tries to dequeue an IMem response from memory, dequeue a `Fetch_to_Decode` struct from an input FIFO, and enqueue a `Decode_to_RR` struct on an output FIFO. The action is not “enabled” until the input FIFOs have data available (are not empty) and the output FIFO has space available (is not full). When the Decode action fires, both dequeues and the enqueue are performed, atomically.

Thus, the time interval between the Fetch instant and the Decode instant is unpredictable; it depends on when input FIFOs are not empty and output FIFOs are not full (see discussion in Section 3.2.1 on unpredictable memory latency).

NOTE:

Looking ahead to a topic we'll discuss in more detail in Chapter 14, each module interface method has a so-called *implicit condition*, *i.e.*, an accompanying boolean value that indicates whether the method is READY or not or, to say it another way, whether the method is ENABLED or not. For the FIFO methods “`.first`” and “`.deq`”, which are used by “`pop_o`” above, the methods are ready/enabled only when the FIFO is not empty. The “`enq`” method is ready/enabled only when the FIFO is not full.

The Decode action in the FSM is translated by the *bsc* compiler into a BSV *rule*. A rule does not “fire” until all the implicit conditions in its method-calls are enabled. This is why the Decode action implicitly waits until something is available in the FIFO.
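A small standalone module can make this implicit waiting visible. This is a sketch under stated assumptions, not Drum code: the module and rule names are hypothetical, and it uses the standard `mkFIFOF` from the `FIFOF` library package rather than Drum's own FIFO interfaces. Note that `rl_transform` has no explicit rule condition at all.

```bsv
import FIFOF :: *;

// Hypothetical demo (not from the Drum sources) of implicit conditions.
module mkImplicitCondDemo (Empty);
   FIFOF #(Bit #(8)) f_in  <- mkFIFOF;
   FIFOF #(Bit #(8)) f_out <- mkFIFOF;
   Reg #(Bit #(8))   rg_n  <- mkReg (0);

   // Explicit condition (rg_n < 4) plus the implicit condition of 'enq'
   rule rl_produce (rg_n < 4);
      f_in.enq (rg_n);
      rg_n <= rg_n + 1;
   endrule

   // No explicit condition: fires only when f_in is not empty
   // (first, deq) and f_out is not full (enq)
   rule rl_transform;
      let x = f_in.first;
      f_in.deq;
      f_out.enq (x + 1);
   endrule

   // Fires only when f_out is not empty
   rule rl_consume;
      $display ("out = %0d", f_out.first);
      f_out.deq;
      if (f_out.first == 4) $finish (0);
   endrule
endmodule
```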

### Exercise 11.1:

What might happen if we omitted the “`await (rg_running)`” statement in the Drum CPU? (Try it in simulation!)

*Hint:* The FSM may start running before the PC has been initialized ...

□

## 11.7 Conclusion

And that is the complete Drum CPU! We can compile it with *bsc* into Verilog, connect the CPU interface to a memory system, and run the system to execute RISC-V programs compiled for the RV32I ISA subset (and, further, Drum can recover from illegal instructions due to trap-handling).

### 11.7.1 But Drum code looks just like C!? Why not code it in C?

Looking at the BSV FSM code for Drum, it seems like we could convert it into C code by simply replacing a few BSV idioms with corresponding C idioms (like replacing `begin-end` and `action-endaction` with “{” and “}”, replacing BSV register declarations with C variable declarations, *etc.*). Similarly, each of the pure functions `fn_Fetch`, `fn_Decode`, ... can easily be slightly tweaked into C functions.

This can indeed be done, and the result would be a C simulator for RISC-V! We could compile it with any C compiler and run the software on any platform.

So, why not code Drum in C instead of BSV, and change the *bsc* compiler to accept C syntax? It is difficult to compile general-purpose C into hardware. C has many constructs that have no obvious mapping into hardware, such as pointers, the address-of operator “&”, address arithmetic, `malloc`, and so on. But can't we define a restricted subset of C suitable for hardware design? Compilation is still very difficult. In the suggested re-coding in C above, we lose the distinction between variables used to hold state (registers, register files, FIFOs) and ordinary variables for temporary values (which in hardware are just wires). We also lose the distinction between non-temporal statements (statements within an action block) and temporally sequenced statements (actions in an FSM). A C-to-hardware compiler would have to reconstruct this information.

But the killer reason is sequentiality *vs.* concurrency. With Drum it is deceptively easy to imagine a simple correspondence with C, because Drum is a sequential FSM, and C is a sequential language. When we advance to concurrent FSMs, including pipelined processors like Fife, we depart substantially and deeply from sequential process semantics, and C becomes more and more unsuitable for hardware description.

NOTE:

There are products in the marketplace under the rubric of “High Level Synthesis” (HLS) that translate C source codes into synthesizable Verilog. These indeed work on a carefully defined subset of C, not the full C language. They work best on simple programs that contain nested loops working on dense, rectangular arrays (many image-processing and linear algebra applications can be so characterized), because the compiler is able to analyze the originally sequential semantics of such programs and transform them into highly parallel, highly structured representations that can then be mapped into very stylized hardware (data path plus control FSM). Even with this capability, the programmer may need to provide many additional directives to the compiler to guide it towards good quality hardware. We do not consider this to be a general purpose approach to hardware design.

See also Appendix B for more discussion on this topic.



# Chapter 12

## BSV: Verifying BSV designs

## 12.1 Introduction

In this chapter we describe general techniques used to verify BSV designs of any kind. The next chapter focuses on techniques for verification of CPU implementations.

## 12.2 BSV: Testbenches and DUTs

To debug a BSV design, we typically set up a system similar to that shown in Figure 12.1. The BSV design that is being tested is usually called the “Design Under Test”, or DUT for short. The surrounding and/or adjacent modules are called the “Testbench” (or “Test Harness” or “Test Environment”).

Figure 12.1: A testbench connected to a DUT

The top-level module, `mkTop`, has the `Empty` interface, which is just an interface with no methods, and is pre-defined in the `bsc` library.

The testbench interacts with the DUT *via* its interface methods. The DUT interface and the testbench interface are often opposite “duals”, *e.g.*, if `DUT_IFC` has a `FIFO_I` sub-interface for input data, the testbench might have a corresponding `FIFO_O` that delivers that input data; these are connected together by `mkTop`.

Code inside the testbench produces input data for the DUT; this part of the testbench is called the “stimulus generator”. Stimulus data can be generated in the testbench, or read from data files.

Other code inside the testbench collects output data from the DUT and checks if it has the expected value corresponding to the stimulus. This part of the testbench is called a “checker”. Output data can be checked immediately in the testbench, or recorded into files for offline manual or automated checking. Alternatively, the DUT may just print outputs to the screen (during simulation), which can be visually examined for correctness or recorded for offline manual or automated checking.
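The following sketch shows the smallest version of this arrangement, with the stimulus generator and the checker written as two rules directly inside `mkTop` rather than in a separate testbench module. Everything in it is hypothetical and chosen only for illustration: the toy DUT (which adds 1 to each input), the `put`/`get` method names, and the rule names.

```bsv
import FIFOF :: *;

// Hypothetical toy DUT interface (not from the Drum sources)
interface DUT_IFC;
   method Action put (Bit #(32) x);
   method ActionValue #(Bit #(32)) get;
endinterface

// Toy DUT: outputs (input + 1)
module mkDUT (DUT_IFC);
   FIFOF #(Bit #(32)) f <- mkFIFOF;

   method Action put (Bit #(32) x);
      f.enq (x + 1);
   endmethod

   method ActionValue #(Bit #(32)) get;
      f.deq;
      return f.first;
   endmethod
endmodule

module mkTop (Empty);
   DUT_IFC dut <- mkDUT;
   Reg #(Bit #(32)) rg_i <- mkReg (0);   // next stimulus value
   Reg #(Bit #(32)) rg_j <- mkReg (0);   // stimulus value whose output is expected next

   // Stimulus generator: feeds 0..9 into the DUT
   rule rl_stimulus (rg_i < 10);
      dut.put (rg_i);
      rg_i <= rg_i + 1;
   endrule

   // Checker: compares each DUT output with the expected value
   rule rl_checker;
      let y <- dut.get;
      if (y != rg_j + 1) begin
         $display ("FAIL: input %0d, got %0d", rg_j, y);
         $finish (1);
      end
      else if (rg_j == 9) begin
         $display ("PASS");
         $finish (0);
      end
      rg_j <= rg_j + 1;
   endrule
endmodule
```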

BSV designers typically write their testbenches in BSV. If the testbench is only used in simulation (not in actual hardware, FPGA or ASIC) then it can read and write files. It can also import C code (see Appendix D) to reuse existing C algorithms and models, and to have full access to operating system services (files, networking, *etc.*).

In the SystemVerilog community there is a mature standard methodology for testbenches called UVM (Universal Verification Methodology) which exploits the “object-oriented programming” aspects of SystemVerilog for reusability (see Glossary C for some more detail). BSV designs can also use UVM testbenches. The whole of Figure 12.1 can be in Verilog/SystemVerilog, with the DUT using the Verilog produced from a BSV design using the *bsc* compiler.

## 12.3 BSV: “printf”-style Debugging

A popular style of debugging BSV designs is the same as in debugging software in any programming language: insert “print” statements in the BSV code at various places to print out values of interest during simulation. We examine these outputs to identify suspicious values and then perhaps insert and remove print statements to zero-in on the exact place in the BSV code where something wrong was computed.

BSV, like Verilog and SystemVerilog, has the following built-in functions to write to files during simulation, analogous to “printf” in C. All of them have Action type, and so they can occur in any Action context: bodies of rules, bodies of Action and ActionValue methods, bodies of Action and ActionValue functions.

```
$write   (      format-string, arg, ..., arg )
$display (      format-string, arg, ..., arg )

$fwrite  ( file, format-string, arg, ..., arg )
$fdisplay ( file, format-string, arg, ..., arg )
```

The first two write to “standard output” (*i.e.*, the terminal), and the latter two write to a specific file which has previously been opened with an Action statement like this:

```
file <- $fopen ("log.txt", "w");
```

The difference between “write” and “display” is merely that the latter appends a newline at the end of the output.

These are similar to C’s `printf` and `fprintf` functions. The format string is a string (in double-quotes) with formatting directives for the arguments that follow (%d for signed integers, %b for binary numbers, %h for hexadecimal numbers, *etc.*).

These BSV statements for printing only exist in simulation code. They are omitted completely by synthesis tools that target actual hardware (ASIC or FPGA).<sup>1</sup>
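Putting these together, here is a minimal sketch of a simulation-only module that opens a file and prints the same value to the terminal and to the file. The module, rule and register names are hypothetical; the sketch assumes the file handle is kept in a register of type `File` so that a later rule can use it.

```bsv
// Hypothetical demo (not from the Drum sources) of $display/$fdisplay.
module mkPrintDemo (Empty);
   Reg #(File)     rg_file <- mkReg (InvalidFile);
   Reg #(Bit #(2)) rg_step <- mkReg (0);

   // Step 0: open the log file and remember the handle
   rule rl_open (rg_step == 0);
      File f <- $fopen ("log.txt", "w");
      rg_file <= f;
      rg_step <= 1;
   endrule

   // Step 1: print to standard output and to the file, then stop
   rule rl_print (rg_step == 1);
      Bit #(8) x = 8'hA5;
      $display  ("x = %0d (hex %h, binary %b)", x, x, x);
      $fdisplay (rg_file, "x = %0d (hex %h, binary %b)", x, x, x);
      $fclose (rg_file);
      $finish (0);
   endrule
endmodule
```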

### 12.3.1 FShow for “pretty-printing” enums and structs

In any enum or struct type declaration in BSV, one can attach a “`deriving(FShow)`” clause to request the `bsc` compiler to define an “`fshow`” function for the type, with some default formatting. For example, the Drum/Fife code has such a clause in the declaration of the `Decode_to_RR` struct type:

```
// src_Common/Inter_Stage.bsv: line 49 ...
typedef struct {Bit #(XLEN)  pc;

                Bool         exception;    // Fetch exception/ decode illegal instr
                Bit #(4)     cause;
                Bit #(XLEN)  tval;

                // If not exception
                Bit #(XLEN)  fallthru_pc;
                Bit #(32)    instr;
                OpClass      opclass;
                Bool         has_rs1;
                Bool         has_rs2;
                Bool         has_rd;
                Bool         writes_mem;   // All mem ops other than LOAD
                Bit #(XLEN)  imm;          // Canonical (bit-swizzled)
                ...
} Decode_to_RR
deriving (Bits, FShow);
```

The result-type of `fshow()` is “`Fmt`”, and this type is also allowed as an argument to `$display` (and its variants). For example:

```
Decode_to_RR y = ...
Fmt          f = fshow (y);
$display (      "Decode result is ", f);
$fwrite  (file, "Decode result is ", f);
```

### 12.3.2 Fmt formatted values

BSV’s formatting facilities are actually more powerful than `printf` in C/C++ or `$display` in Verilog/SystemVerilog.

---

<sup>1</sup>This is not because of any fundamental synthesizability difficulty; it is only because there is no standard concept of “file” or “output stream” in hardware designs, not even a serial port.

The result-type of `fshow()` is a standard type in BSV called `Fmt`, and this is also an acceptable argument in BSV's `$display` functions (as demonstrated above in the last section).

We can create new `Fmt` objects using the built-in pure function `$format()`, and we can combine them (concatenate them) using an infix "+" operator. `Fmt` is a first-class type, so we can bind it to variables, pass it as arguments and results of functions, and so on. The following example illustrates defining another function to format a `Decode_to_RR` struct type, an alternative to `fshow` to format it in some other preferred way:

```
// src_Common/Inter_Stage.bsv: line 214 ...
function Fmt fshow_Decode_to_RR (Decode_to_RR x);
   Fmt f = $format ("    Decode_to_RR{");
   f = f + $format ("I_%0d", x.inum);
   f = f + $format (" pc:%08h", x.pc);
   f = f + $format (" instr:%08h", x.instr);
   f = f + $format (" pred:%08h epoch:%0d\n", x.predicted_pc, x.epoch);
   f = f + $format ("          ");
   f = f + $format ("fallthru:%08h", x.fallthru_pc);
   if (x.exception) begin
      f = f + $format (" ", fshow_cause (x.cause));
      f = f + $format (" tval:%0h", x.tval);
   end
   else begin
      f = f + $format (fshow (x.opclass));
      f = f + $format (" has_{rs1,rs2,rd}:{%0d,%0d,%0d} writes_mem:%0d, imm:%0h",
                       x.has_rs1, x.has_rs2, x.has_rd, x.writes_mem, x.imm);
   end
   f = f + $format ("}");
   return f;
endfunction
```

Note the use of if-then-else to customize the `Fmt` object according to the actual data in the struct (which would not happen in the default `fshow()`, which would simply print all the struct fields). This function can be used just like, and in place of, `fshow`:

```
Decode_to_RR y = ...
Fmt          f = fshow_Decode_to_RR (y);
$display (      "Decode result is ", f);
$fwrite  (file, "Decode result is ", f);
```

This shows one immediate advantage of the `Fmt` facilities: we can format something once and write it to multiple output streams. By abstracting formatting into a common function, it can be modified easily and all `$display` statements using it can share the benefit.

A second advantage is that actual formatting code, which is often quite verbose, *ad hoc* and messy with a lot of fragments of character strings, can be lifted to a different location in the code, keeping the `$display` statement short and sweet.
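The excerpts above depend on the rest of the Drum/Fife sources. For experimenting with `Fmt` in isolation, here is a self-contained sketch that contrasts the derived `fshow` with a custom formatting function. The struct `Mini_Result`, the function `fshow_Mini_Result` and the module name are hypothetical and exist only for this illustration.

```bsv
// Hypothetical cut-down struct (not from the Drum sources)
typedef struct {Bit #(32) pc;
                Bool      exception;
                Bit #(4)  cause;
} Mini_Result
deriving (Bits, FShow);

// Custom formatter: shows 'cause' only when there is an exception
function Fmt fshow_Mini_Result (Mini_Result x);
   Fmt f = $format ("Mini_Result{pc:%08h", x.pc);
   if (x.exception)
      f = f + $format (" cause:%0d", x.cause);
   f = f + $format ("}");
   return f;
endfunction

module mkFmtDemo (Empty);
   rule rl_once;
      Mini_Result r = Mini_Result {pc: 32'h8000_0000, exception: True, cause: 2};
      $display ("default: ", fshow (r));               // derived formatting: all fields
      $display ("custom:  ", fshow_Mini_Result (r));   // custom formatting
      $finish (0);
   endrule
endmodule
```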

## 12.4 BSV: Dynamic assertions

The BSV libraries offer a package:

```
import Assert :: *;
```

which contains the following function:

```
function Action dynamicAssert(Bool b, String s);
```

This can be used in any Action context (*e.g.*, a rule body) to check an expected property each time that Action context is executed. For example the Drum CPU code includes the following excerpt:

```
// src_Drum/CPU.bsv
Action a_Retire_DMem =
  action
    ...
    let mem_rsp <- pop_o (to_FIFOF_O (f_DMem_rsp));
    dynamicAssert ((mem_rsp.rsp_type != MEM_REQ_DEFERRED),
                   "Mem req not speculative but got DEFERRED mem response");
    ...
  endaction;
```

which checks for the unexpected situation where the DMem memory response was “deferred” (deferred memory responses are only possible for speculative memory accesses, which only occur in Fife and not in Drum). Every time this action is executed (on every DMem response), the boolean condition is tested and, if false, it aborts the simulation after printing the associated string.

These `dynamicAssert` statements have no cost in final real hardware because, like `$display` statements, they exist only in simulation code. Thus, one should not hesitate to use them liberally.
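As a self-contained illustration, the following sketch deliberately violates its own assertion after a few clock cycles, so that the failure behavior can be observed in simulation. The module, rule and register names, and the bound of 10, are hypothetical.

```bsv
import Assert :: *;

// Hypothetical demo (not from the Drum sources): the assertion holds
// while rg_count is 0..9 and fails once rg_count reaches 10.
module mkAssertDemo (Empty);
   Reg #(Bit #(4)) rg_count <- mkReg (0);

   rule rl_count;
      dynamicAssert ((rg_count < 10), "rg_count exceeded expected bound");
      rg_count <= rg_count + 1;
   endrule
endmodule
```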

## 12.5 BSV: Waveform-style debugging

Many hardware designers like to debug designs using “waveforms”, which are a graphical display of how values on buses (bundles of wires) in the design vary over time.

All Verilog, SystemVerilog and VHDL simulators have a facility to write out a “Value Change Dump” (VCD) file, which is a record of how each bus (bundle of wires) in the design changed over time (measured with clock ticks). VCD files can then be viewed as a graphical display in any waveform viewer. Waveform viewers are bundled with most commercial RTL simulators, but the free and open-source *gtkwave* viewer is also popular.

When simulating in a Verilog simulator, most simulators have command-line or interactive controls to switch VCD dumping on or off.

When simulating in Bluesim interactively, the following commands control VCD dumping:

|                          |                           |
|--------------------------|---------------------------|
| <code>sim vcd on</code>  | enables writing out VCDs  |
| <code>sim vcd off</code> | disables writing out VCDs |

VCD dumping can also be controlled from within a BSV program, using these three Actions:

|                   |                          |
|-------------------|--------------------------|
| <code>$dumpvars</code> | Starts writing out VCDs  |
| <code>$dumpoff</code>  | Stops writing out VCDs   |
| <code>$dumpon</code>   | Resumes writing out VCDs |
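For example, a testbench can restrict VCD dumping to chosen windows of time by counting clock cycles. This is only a sketch: the module, rule and register names are hypothetical, and the cycle numbers are arbitrary.

```bsv
// Hypothetical demo (not from the Drum sources): dump VCDs during
// cycles 0..9, pause during cycles 10..19, resume at cycle 20, stop at 30.
module mkVCDDemo (Empty);
   Reg #(Bit #(8)) rg_cycle <- mkReg (0);

   rule rl_tick;
      if      (rg_cycle == 0)  $dumpvars;     // start writing VCDs
      else if (rg_cycle == 10) $dumpoff;      // stop writing VCDs
      else if (rg_cycle == 20) $dumpon;       // resume writing VCDs
      else if (rg_cycle == 30) $finish (0);
      rg_cycle <= rg_cycle + 1;
   endrule
endmodule
```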

# Chapter 13

## RISC-V: Functional verification of CPUs

## 13.1 Introduction

The last chapter described general techniques used to verify BSV designs of any kind. This chapter focuses on techniques for verification of CPU implementations.

Debugging and verifying a CPU implementation (whether it is a hardware implementation or a software simulator) is hard because, as discussed in Chapter 3, it is not just a program, but is itself an *interpreter* of programs. Thus we are confronted with two levels, not one. The first-level program (P1) is the CPU implementation (the BSV program), which we are simulating or running in hardware. P1, in turn, is interpreting the RISC-V program (P2) that was loaded into the RISC-V CPU's memory. When we observe an error in P2's outputs or behavior, it could be a bug in P2 (the RISC-V program), or a bug in P1 (our RISC-V implementation), or both.

In other words, when we have an interpreter of programs, we conceptually have an exponentially larger space of things to test (test cases). Every RISC-V program P2 is potentially a test case for our implementation P1, along with every potential input to each P2!

In addition to this large space of potential tests, the duration of a test can be very long. A bug may exhibit itself only inside an application program running under an operating system. To reach this point we may have to execute hundreds of millions, or billions of P2 instructions on P1.

Fortunately, for small to medium-sized isolated test programs, we can exploit a key property of the RISC-V ISA (and most ISAs): except for interrupts, which are asynchronous events, and a few instructions (such as `rdtime` and `rdcycle`), the sequence of instructions executed by a RISC-V program on a particular initial memory contents is completely deterministic and repeatable. This means that if we execute a RISC-V program P2 repeatedly in our test setup, or even run P2 on different implementations (such as a software simulator and Drum and Fife), they should all exhibit *exactly* the same sequence of instructions, or “*instruction trace*”. If one of them takes a particular conditional BRANCH, then the others should, as well. If one of them traps due to, say, an illegal instruction, then the others should, as well, and they all should vector to exactly the same instruction address for the trap handler. If one of them traps due to, say, an unimplemented memory location, then the others should, too, assuming they have the same (or equivalent) memory setup. We call this a *functional equivalence* because different implementations perform the same function on such a test program, even though they may execute it at vastly different speeds.

Beyond small and medium-sized deterministic test programs, we need to test the CPU in an environment that more closely resembles its eventual operating environment. This may include running operating systems and interacting with devices that may generate timer and device interrupts asynchronously, *i.e.*, at unpredictable moments in time relative to instruction execution. This topic is discussed in Section 13.6.

This chapter only addresses the question of *functional correctness* (“Does the CPU compute correct data?”). An equally important topic is *performance correctness* (“Does the CPU compute results in the allowed/expected time?”). That topic is addressed in Chapter 19.

## 13.2 Trusted functional simulators (“golden reference models”)

Most CPU design teams rely on a “trusted” software functional simulator that can execute RISC-V programs. Such a simulator is written, and is maintained, with a focus that prioritizes clarity and correctness, not speed. This simulator acts as a reference standard, and is also called a “Golden Reference Model”. It answers questions such as:

- What is a correct execution when executing a RISC-V program P2?
- What is the correct state of the registers and memory before and after instruction  $n$ ?

We can compare our hardware implementation against this reference to see if our hardware implementation is also behaving correctly.

Often, a trusted simulator is the first thing one implements when designing a new ISA, before starting any hardware design. Once the ISA is frozen, because the functional simulator is only defining functional correctness (not speed), it can be used across multiple hardware CPU design projects, *i.e.*, it is a long-lived resource that can be shared across multiple projects.

For the RISC-V ISA, there are several free and open-source software functional simulators available that we can use as a trusted reference. Appendix A.3 provides URL links to the two most well-known, the *Spike* simulator and the *Sail* simulator. Each of them can be configured for a particular subset of the RISC-V ISA (such as RV32IM, or RV64IMAFDC, with or without privilege levels and virtual memory, *etc.*), and then run RISC-V programs, producing a log/trace of the instructions executed.

It can be useful to be able to modify and customize a trusted software functional simulator. For example, if we are extending the RISC-V ISA with new, custom instructions, then we would like the simulator to support them. Or, we may want extra detail in the simulator’s output trace that was not originally provided. Or, we may use the simulator in non-standard ways, such as the asymmetric tandem verification mode described in Section 13.6.

Being open-source, the Spike and Sail simulators can be modified for custom purposes. Many companies also have their own trusted software RISC-V functional simulators for greater modifiability and customization (such as *Cissr* from Bluespec, Inc.).

Golden reference models are usually written as RISC-V interpreters in a software programming language (for example, Spike is written in C++). In this case they only run as software on a standard computer.

Golden reference models can also be written in a high-level hardware design language, such as BSV. For example, we can think of Drum in this way. Because it is a simple FSM, it is extremely simple and clear. The advantage of a hardware golden reference model is that it can be synthesized into hardware and plugged into any system environment where a more powerful CPU implementation will go.

## 13.3 RISC-V test programs for verification

An advantage of an open (non-proprietary) ISA like RISC-V is that there is a vast and growing worldwide community of hardware CPU developers—companies, universities, research organizations, students and hobbyists—who share a common need for RISC-V test programs for verification. Many of the test programs they create are shared in free and open-source form.

### 13.3.1 ISA tests

The oldest and most stable set of test programs originated at the University of California, Berkeley with the original RISC-V team, and is now maintained and expanded by RISC-V International (RVI) on their GitHub site at <https://github.com/riscv-software-src/riscv-tests>. These are a set of several hundred test programs written in RISC-V Assembly Language. Each program contains a small set of tests for some particular instruction in the RISC-V ISA. The tests cover RV32I and RV64I and the A, M, F, D and C extensions; machine, supervisor and user privilege levels; and some system features like PMP (Physical Memory Protection). Most user-level tests can be built to run with or without virtual memory. The tests are provided in source-code form (RISC-V Assembly Language) and need to be compiled with a C compiler (*e.g.*, *gcc*) to produce ELF binaries.

The tests are small (a few thousand instructions) and are all *self-checking*, *i.e.*, they compute something and then check if the result has the expected value. At the end of the test, each program outputs a PASS/FAIL indicator. PASS means it passed all the tests in the file. In the FAIL case, it also outputs the test number within the file that failed.

The tests are small enough that it is feasible to manually compare a trace from our CPU implementation against the assembly-language source code (or the assembly language in the *gcc*-produced “objdump” disassembly of the ELF file) and analyze it for errors.

The tests also run sufficiently quickly (a few minutes at most) that we routinely and automatically rerun all the ISA tests every time we make any significant change to the CPU design’s source code or test environment.

### 13.3.2 ACTs and other test suites

RVI (RISC-V International) is developing a set of tests called ACTs (Architecture Compatibility Tests). Each test is a RISC-V program that runs and produces a final “signature”, which is a reliable hash of the final state (PC, registers, CSRs, memory, *etc.*).

If a candidate RISC-V CPU implementation claims to support, say, RV32IM, then it must run the relevant ACTs for RV32IM, and the produced signatures should match the official signature for that test published by RVI (the official signatures, in turn, are produced by RVI by running on a trusted simulator such as Spike or Sail).

This is an ongoing project (and always looking for volunteers!). Some relevant links:

<https://wiki.riscv.org/pages/viewpage.action?pageId=49872986>

<https://github.com/riscv-non-isa/riscv-arch-test>

<https://riscof.readthedocs.io/en/stable/>

Another test suite, called “riscv-dv”, was developed initially at Google; subsequent stewardship was taken over by the ChipsAlliance non-profit group. These tests generate “random” RISC-V programs (constrained by a certain ISA subset), run the program on a candidate implementation which should generate a trace, and compare against a trace from a trusted functional simulator:

<https://github.com/chipsalliance/riscv-dv>

### 13.3.3 What does “verified” mean? Levels of assurance; coverage

In the software community, the term “verified program” has historically meant “formally verified program”, *i.e.*, that we have a *proof of correctness* of the program against a formal specification of the program. A formal specification is usually written in a specialized formal specification language, typically a highly mathematical, declarative language. A formal proof of correctness uses formal tools (theorem provers, mechanized logics) to show that a candidate implementation correctly computes something prescribed by the formal spec. Formal verification may not involve executing the implementation at all, just analysis of the code for the implementation. Formal verification may be partial (prove that the implementation is correct for certain inputs) or complete (for all inputs). Although there has been huge progress in formal verification over the decades, as of 2024 it is still not able to handle very large and complex programs; it is used very successfully for individual or small collections of modules, and for partial correctness.

The hardware community has also been exploring formal verification of hardware implementations, and there are several tools offered commercially. But, as in the software community, as of 2024 it is still not able to handle very large and complex designs; it is used quite successfully for individual or small collections of modules, and for partial correctness.

In the hardware community, the term “verification” has historically meant “extensive testing with a large suite of test programs”, not formal verification. One can think of the term “level of assurance” as a point on a scale ranging from “no assurance” (untested, unproved) to “full assurance” (formally proved correct).

As we increase the number and variety of test programs that we run (the test suite), we may increase the level of assurance that we have a correctly implemented design. In the rare and unlikely case that the set of inputs is small enough that we can test all of them, we essentially have full assurance, or a proof of correctness by enumeration. Normally, the space of possible tests is too large even to enumerate, let alone run. Thus the test suite needs to be carefully engineered and selected to *cover* as much of the space as feasible, particularly the space of inputs expected when the design is deployed in the field.

The term “coverage” is a kind of measure of level of assurance. It can be used both in a functional way (“How much of the space of possible inputs has been tested?”) or in a more implementation-specific way (“What fraction of the lines of code in the BSV/Verilog/SystemVerilog source have we exercised with our tests?”).

Here is a sequence of testing regimes with increasing levels of assurance of the correctness of our CPU implementation:

- Run all the standard “ISA tests” mentioned in Section 13.3.1.
- Run the ACTs and riscv-dv test suites mentioned in Section 13.3.2.
- Run a small operating system (such as FreeRTOS or Zephyr). This will check correct handling of timer interrupts, and possibly memory maps and physical memory protection (PMPs).
- Run the kernel of a full-service operating system (such as the Linux kernel).
- Run a standard distribution of a full-service operating system (such as Debian Linux or Ubuntu Linux), *i.e.*, the OS kernel *plus* the pre-load of all the applications and service programs that come with the distribution (including block devices and networking).

## 13.4 A testbench for Drum and Fife

Figure 13.1 illustrates the structure of the testbench provided with this book for Drum and Fife. On the left we see the top-level of the module hierarchy, and on the right we see some module interactions. The top-level module, `mkTop`, has the `Empty` interface.

Figure 13.1: Top-level simulation setup for the Drum and Fife CPUs

The interface for Drum and Fife, `CPU_IFC`, was described in Section 11.2. The most important sub-interfaces are the FIFOs carrying `IMem` and `DMem` memory requests and responses. `mkTop` instantiates the `mkCPU` module (which can be either Drum or Fife) and the `mkMems_Devices` module. An excerpt from `mkTop` is shown below:

```
// src_Top/Top.bsv: line 30 ...
(* synthesize *)
module mkTop (Empty);
   ...
   // Instantiate the CPU
   CPU_IFC cpu <- mkCPU;

   // Instantiate the memory model
   Mems_Devices_IFC mems_devices <- mkMems_Devices (cpu.fo_IMem_req,
                                                    cpu.fi_IMem_rsp,
                                                    ...
                                                    cpu.fo_DMem_req,
                                                    cpu.fi_DMem_rsp);
```

Note that we pass the CPU's `IMem` and `DMem` sub-interfaces directly to the `mkMems_Devices` module as module parameters; as a result, rules in `mkMems_Devices` can directly access the `IMem` and `DMem` FIFO interfaces of the CPU to collect `IMem` and `DMem` requests and send back `IMem` and `DMem` responses.

The `mkMems_Devices` module implements models for memory, a UART (Universal Asynchronous Receiver/Transmitter, also known as a serial port), a real-time clock, and any other devices expected by the CPU and the RISC-V code running on the CPU. Some of these (memory, UART, Fife store-buffer) are implemented by importing C code.

The memory system in the testbench also needs a way to be pre-loaded with the binary RISC-V code of the program that we want the CPU to execute, before the CPU begins executing.

A rule in `mkTop` that fires at the beginning of simulation invokes the methods `cpu.init` and `mems_devices.init`, passing them a struct with some initialization parameters, such as the reset value for the PC (address from which the first instruction will be fetched), and a file descriptor into which logs should be written.

```
// src_Top/Top.bsv: line 73 ...
// Initialize modules
rule rl_step1 (rg_top_step == 1);
   let init_params = Initial_Params {flog: rg_logfile,
                                     pc_reset_value: 'h_8000_0000};
   cpu.init (init_params);
   mems_devices.init (init_params);
   ...
```

Another rule in `mkTop` fires on every clock. On each clock, it retrieves a value representing the wall-clock time from `mems_devices` and relays it into the CPU:

```
// src_Top/Top.bsv: line 110 ...
(* fire_when_enabled, no_implicit_conditions *)
rule rl_relay_MTIME;
   let t <- mems_devices.rd_MTIME;
   cpu.set_TIME (t);
endrule
```

Other than that, `mkTop` plays no further role in execution. Rules inside `mkCPU` operate the CPU, putting out IMem and DMem requests. Rules inside `mkMems_Devices` operate the memory and device models, returning IMem and DMem responses.

We do not intend to describe `mkMems_Devices` in any more detail here, since it is outside the main focus of this book, the Drum and Fife CPUs. Section D has details on how to import C code into BSV. The interested student is welcome to peruse the code in the `src_Top/` directory.

## 13.5 Symmetric Tandem Verification of CPU implementations

In Section 13.1 we mentioned that the instruction trace for a RISC-V program is mostly deterministic and repeatable across RISC-V implementations. We can exploit this property in a setup called *symmetric tandem verification*, illustrated in Figure 13.2.



Figure 13.2: Tandem verification

We load the same ELF RISC-V binary file into the memories of a “trusted” simulator and our DUT (Design Under Test) setup, and have them both run the program. We arrange to have both simulators write out a trace of their respective program executions. Then, we compare the two traces. Any difference in the traces is an indicator of a potential bug in the DUT.

For example, we might find that, at a certain conditional BRANCH instruction in P2, the trusted simulator took the branch whereas our implementation did not. This would indicate either that there is some problem with our implementation of the conditional BRANCH instruction, or that there was a problem in some previous instruction that computed one or both of the register values used by the conditional BRANCH instruction. Thus, debugging is a process of identifying exactly which instruction in our implementation “went wrong”, i.e., did something different from what the trusted simulator did.
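The comparison step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the trace format (one `(pc, instruction)` record per executed instruction) and the name `first_divergence` are invented for this sketch, and are not the formats or tools used by Spike, Sail or the testbench in this book.

```python
from itertools import zip_longest

def first_divergence(trusted_trace, dut_trace):
    """Return (index, trusted_item, dut_item) for the first mismatch, or None.

    Each trace is a sequence of per-instruction records (hypothetical format).
    A trace that ends early shows up as a mismatch against None.
    """
    for i, (t, d) in enumerate(zip_longest(trusted_trace, dut_trace)):
        if t != d:
            return (i, t, d)
    return None

# Toy traces: the trusted simulator takes the branch, the DUT falls through.
trusted = [(0x80000000, "addi x1, x0, 5"),
           (0x80000004, "bne x1, x0, +8"),
           (0x8000000C, "nop")]
dut     = [(0x80000000, "addi x1, x0, 5"),
           (0x80000004, "bne x1, x0, +8"),
           (0x80000008, "nop")]

print(first_divergence(trusted, dut))
```

The reported index is where the traces first differ; as discussed above, the instruction that actually “went wrong” may be that one or an earlier one.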

### 13.5.1 Configuration

The trusted simulator and the DUT setup should be configured “identically”:

- They should support the same RISC-V ISA subset, so that an illegal instruction in one is also an illegal instruction in the other.
- They should have the same (or equivalent) memory systems, so that an illegal or misaligned memory access in one has the same response in the other.
- They do not need to have the same *temporal* behavior. For example, memory access latency in one need not match memory latency in the other, and the “time” taken to execute any particular instruction in one need not match the “time” taken by the other.

### 13.5.2 Level of detail in traces

Traces can be produced at varying levels of detail. For example, for each instruction executed, we could record:

- just the PC;
- or also the instruction itself;
- or also any values it reads from or writes to registers.

More detail means more simulation overhead (slower simulation) to produce the trace. Traces can become very large (*e.g.*, gigabytes when tracing, say, the booting of an OS). But more detail can also provide better “resolution” in identifying the location of a bug. For example, suppose one instruction (I1) computes a wrong value and stores it to memory, where it sits for thousands, perhaps millions of instructions before it is loaded into a register (instruction I2) and, some instructions later, this is used in a conditional BRANCH instruction (instruction I3). The wrong value may cause the trusted simulator and the DUT to diverge, where one takes the branch, the other does not, which we detect because the next PCs are different. If our trace only records PCs, then we will detect a difference only at I3. However if we also record register values, we will detect the difference earlier, at I1 or I2.
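The effect of trace detail on “resolution” can be seen in a small Python sketch. The record layout (`pc`, `rd_val`) and the five-instruction scenario are assumptions made for illustration: record 0 plays the role of I1 (computes a wrong value in the DUT), record 1 is the store, record 2 is I2 (the load), record 3 is I3 (the branch), and record 4 is the instruction reached after the branch.

```python
def first_mismatch(trace_a, trace_b, fields):
    """Index of the first record that differs on the given fields, else None."""
    for i, (a, b) in enumerate(zip(trace_a, trace_b)):
        if any(a[f] != b[f] for f in fields):
            return i
    return None

def rec(pc, rd_val=None):
    """One hypothetical trace record: PC plus the value written to a register."""
    return {"pc": pc, "rd_val": rd_val}

# Trusted simulator computes 42; the DUT (with a bug in I1) computes 41.
trusted = [rec(0x100, 42), rec(0x104), rec(0x108, 42), rec(0x10C), rec(0x120)]
dut     = [rec(0x100, 41), rec(0x104), rec(0x108, 41), rec(0x10C), rec(0x110)]

pc_only   = first_mismatch(trusted, dut, ["pc"])
with_regs = first_mismatch(trusted, dut, ["pc", "rd_val"])
print(pc_only, with_regs)
```

With PCs only, the mismatch appears at the record following the branch; with register values included, it appears at the record for I1.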

One way to reduce excessive detail is to record traces only for a certain window of instructions, say starting at the 10 millionth instruction and for the following one thousand instructions. This would be fine if we were guaranteed that there was no divergence until the 10 millionth instruction, but we have no way of knowing that. A way to address this is:

- Run the trusted simulator without producing any trace until 10M instructions, and record a “snapshot” of *the entire architectural state*—PC, all registers, CSRs, all of memory. Then, continue for 1K instructions, producing a trusted trace.
- Initialize the DUT’s PC, registers, CSRs and memory with the values in the snapshot, and then let the DUT execute for 1K instructions, producing a DUT trace to compare with the trusted trace.

This requires infrastructure support in the trusted simulator to be able to dump a snapshot, and in the testbench to be able to initialize its state using the snapshot.
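The snapshot-and-window flow can be sketched as follows. Everything here is a toy stand-in: `step` is a made-up state update rather than a RISC-V interpreter, the state dictionary is a hypothetical representation of architectural state, and both the “trusted simulator” and the “DUT” use the same `step` (so the traces trivially match; in a real setup they are independent implementations).

```python
import copy

def step(state):
    """Toy stand-in for executing one instruction; returns a trace record."""
    state["regs"][1] = (state["regs"][1] + state["pc"]) & 0xFFFFFFFF
    state["pc"] += 4
    return (state["pc"], state["regs"][1])

def run(state, n, trace=None):
    """Execute n steps; append records to `trace` only if one is supplied."""
    for _ in range(n):
        record = step(state)
        if trace is not None:
            trace.append(record)

WARMUP, WINDOW = 10_000, 1_000   # small stand-ins for the 10M / 1K in the text

# Trusted simulator: run untraced, snapshot, then trace the window.
trusted = {"pc": 0x80000000, "regs": [0] * 32, "csrs": {}, "mem": {}}
run(trusted, WARMUP)
snapshot = copy.deepcopy(trusted)          # entire architectural state
trusted_trace = []
run(trusted, WINDOW, trusted_trace)

# DUT: start from the snapshot and trace the same window.
dut = copy.deepcopy(snapshot)
dut_trace = []
run(dut, WINDOW, dut_trace)

print(trusted_trace == dut_trace)
```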

### 13.5.3 Online vs. offline tandem verification

The setup in Figure 13.2 can be performed offline or online:

- **Offline:** Each of the two simulators records its trace in a file, and these files are compared later, manually, or with “`diff`”, or with a specialized comparison tool. In fact, for a given test program, the reference simulator need be run only once, and its trace can be saved for future comparisons.
- **Online:** The two simulators are run concurrently (*e.g.*, two processes in an operating system). They each generate their trace into a *stream* such as an operating system “pipe” or “tty” or network connection. A comparison tool runs concurrently as a third process, continually consuming the two trace streams and comparing them immediately.

The online setup requires more infrastructure and tooling, but has some advantages. First, it can abort both simulations as soon as it detects a divergence, so we don’t unnecessarily continue simulating for a long time. Second, since the compare tool is continually consuming the traces, they need not be recorded in files, thereby eliminating the “trace-file size” problem.
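A minimal Python sketch of the online idea, using generators in place of operating-system pipes: the comparator consumes both streams in lock-step, stores nothing, and stops at the first divergence. The function names and the PC-only trace items are assumptions for illustration.

```python
def online_compare(trusted_stream, dut_stream):
    """Consume two trace streams in lock-step; stop at the first divergence.

    Returns (items_matched, divergence), where divergence is None or a
    (trusted_item, dut_item) pair. Note that zip() stops when either stream
    ends, so a stream that ends early is not reported by this sketch.
    """
    matched = 0
    for t, d in zip(trusted_stream, dut_stream):
        if t != d:
            return matched, (t, d)
        matched += 1
    return matched, None

def pcs(start, bad_at=None):
    """Endless toy trace of PCs; optionally corrupt the item at index bad_at."""
    i = 0
    while True:
        pc = start + 4 * i
        yield pc + 4 if i == bad_at else pc
        i += 1

# Both streams are unbounded; the comparison still terminates at the divergence.
print(online_compare(pcs(0x80000000), pcs(0x80000000, bad_at=1000)))
```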

## 13.6 Asymmetric tandem verification and “full-system” verification

We said in Section 13.5 that, except for interrupts (asynchronous events), the two traces (from the trusted simulator and the DUT) should be identical. There are some nuances to this claim.

### 13.6.1 Instructions with non-deterministic results

If a RISC-V program uses the `rdcycle` or `rdtime` instructions, then the results, loaded into registers, will likely be different in the two setups.

### 13.6.2 Reading uninitialized memory

If the memories in the trusted simulator and the DUT setup have not been initialized identically, then a LOAD instruction can return different results in the two setups, even in “bug-free” RISC-V programs. For example, when traversing a C string (one character per byte), the program may only LOAD aligned 8-byte doublewords, for more efficiency. At the end of the string, only a prefix of the 8 loaded bytes may be part of the string, and the remaining bytes may be “uninitialized” or random. The C program may correctly examine only those bytes that are in the string. However, a tandem verifier will not be aware of this, and may compare the full 8 bytes loaded by the trusted simulator and the DUT, and falsely identify a difference because bytes outside the string happen to be different.
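The false mismatch can be reproduced with plain byte strings in Python. The two 8-byte values below are invented for illustration: both hold the same C string `"hi"`, but the bytes after the NUL terminator differ, as they might in differently initialized memories.

```python
def c_string(dword_bytes):
    """Bytes of the C string in an 8-byte doubleword, up to the NUL terminator."""
    return dword_bytes.split(b"\x00", 1)[0]

# Same string in both memories; the 5 bytes after the NUL were never initialized.
trusted_dword = b"hi\x00" + b"\x00\x00\x00\x00\x00"
dut_dword     = b"hi\x00" + b"\xde\xad\xbe\xef\x55"

# What a naive tandem verifier compares: the full 8 bytes loaded.
naive_match = (trusted_dword == dut_dword)

# What the C program actually examines: only the bytes in the string.
string_match = (c_string(trusted_dword) == c_string(dut_dword))

print(naive_match, string_match)
```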

### 13.6.3 Devices and interrupts

Simulating the DUT may require running code that interacts with devices or device models because they are fundamental to the CPU's intended applications. Thus, the DUT simulation needs to model more of the "full system" in which it is intended to be used. Devices or device models may be unique or proprietary to the intended application domain. Devices contain memory-mapped locations accessed by the RISC-V program, and may generate interrupts. We cannot expect the trusted simulator (which is often created and maintained by others) to model all these devices.

### 13.6.4 Asymmetric mode

All the above issues can be handled if we take an "asymmetric" view of tandem verification, and if we can configure the trusted simulator to work in "tandem verification mode". This is illustrated in Figure 13.3.



Figure 13.3: Asymmetric tandem verification: dealing with "minor" differences, interrupts (asynchronous events), devices, etc.

Here, the DUT simulation produces a CPU execution trace, including a record of the interrupts received and taken. This trace is fed to the trusted simulator which runs in tandem verification mode. In this mode, before simulating each instruction, it examines the next ITEM in the trace, and:

- If ITEM is an interrupt received (bit set in the DUT CPU's CSR MIP), then also set that bit in the trusted simulator's CSR MIP. This brings the trusted simulator's CSR back "in sync" with the DUT, from where we proceed.
- If ITEM is an interrupt taken: check if the trusted simulator can also take this same interrupt at this time (correctly set interrupt bits in CSR MIP, correctly set interrupt-enable bits in CSRs MIE, MSTATUS etc.). If so, then perform the interrupt-taking actions in the trusted simulator (save values in CSRs mepc, mcause and mtval, update CSR MSTATUS, set the PC to the value in CSR mtvec, etc.). If this interrupt cannot be taken here, report a divergence and stop.
- Otherwise (ITEM is an instruction-execution), perform the next instruction in the trusted simulator and compare results with ITEM.

If the instruction is a LOAD, RDTIME or RDCYCLE and the loaded values are different, report the difference but do not stop; instead, do a “fixup” and continue:

- Update the loaded value in the trusted simulator to be the same as in the DUT trace. This brings the trusted simulator’s registers back “in sync” with the DUT, from where we proceed.

In summary: the trusted simulator need not model any devices at all, just memory. Instead of generating a trace, it checks the DUT-produced trace for correctness.
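The checking loop described above might be sketched as follows. This is a toy model under many assumptions: the three-operation “ISA” (`li`, `load`, `rdtime`), the trace item formats, and all names are invented for illustration; interrupt enabling is reduced to a single `mie` mask (no MSTATUS), and only `mepc` and the PC are updated when an interrupt is taken.

```python
FIXUP_OPS = {"load", "rdtime", "rdcycle"}

def execute(instr, st):
    """Run one toy instruction on the trusted state; return the value written."""
    op, rd, arg = instr
    if op == "li":
        value = arg
    elif op == "load":
        value = st["mem"].get(arg, 0)      # trusted model has memory only
    else:                                  # rdtime / rdcycle: trusted model's own time
        value = st["time"]
    st["regs"][rd] = value
    st["pc"] += 4
    return value

def tandem_check(program, trace, mtvec=0x100):
    """Check a DUT trace against the trusted model; return a log of events."""
    st = {"pc": 0, "regs": {}, "mem": {}, "time": 0,
          "mip": 0, "mie": 0xFFFF, "mepc": 0}
    log = []
    for item in trace:
        if item[0] == "irq_received":      # sync MIP with the DUT
            st["mip"] |= 1 << item[1]
        elif item[0] == "irq_taken":       # may only be taken if pending and enabled
            if not (st["mip"] & st["mie"] & (1 << item[1])):
                log.append(("divergence", item))
                break
            st["mepc"], st["pc"] = st["pc"], mtvec
        else:                              # ("instr", pc, rd, value written by the DUT)
            _, pc, rd, dut_value = item
            if pc != st["pc"]:
                log.append(("divergence", item))
                break
            op = program[pc][0]
            if execute(program[pc], st) != dut_value:
                if op not in FIXUP_OPS:
                    log.append(("divergence", item))
                    break
                st["regs"][rd] = dut_value  # fixup: adopt the DUT's value
                log.append(("fixup", item))
    return log

PROGRAM = {0x000: ("li", 1, 7), 0x004: ("load", 2, 0x2000),
           0x008: ("rdtime", 3, None), 0x100: ("li", 4, 1)}
DUT_TRACE = [("instr", 0x000, 1, 7),
             ("instr", 0x004, 2, 0xAB),    # e.g. a device register the model lacks
             ("irq_received", 7), ("irq_taken", 7),
             ("instr", 0x100, 4, 1)]
print(tandem_check(PROGRAM, DUT_TRACE))
```

In this toy run the only logged event is a fixup for the LOAD; the interrupt is accepted because it was first recorded as received.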

## 13.7 Tandem verification with real hardware (FPGA or ASIC)

The discussion above on Tandem Verification assumed that the trusted model and the DUT were both being run in simulation. But, of course, there is nothing simulation-specific about the technique.

If the RISC-V CPU *hardware* is capable of generating traces, and the infrastructure supports streaming that trace out of the hardware to a host machine then, once again, the trace can be compared with that of a trusted simulator.

The trusted model can itself be written in a synthesizable hardware description language like BSV. In fact, the Drum CPU could be a candidate, because it is an extremely simple implementation, not designed for speed, and so possibly trustworthy. In this case, the entire setup in Figure 13.3 could be in hardware, which is likely to run several orders of magnitude faster than any simulation.



# Chapter 14

## BSV: Rules and their Semantics

## 14.1 Introduction

“Rules” are the fundamental constructs in BSV to specify dynamic behavior. Rules appear in the body of BSV modules, and they may invoke methods in interfaces of other modules. An interface method is simply a “mini-rule” that is incorporated into a rule from which it is invoked. Rule semantics can be understood in two incremental steps:

- Semantics of a rule in isolation (Section 14.3)
- Semantics of the collection of rules in a BSV program (Section 14.4)

Finally, the performance of rules (how long does a computation take?) can be understood by understanding how rules are mapped to a clock (Section 14.5).

## 14.2 Syntax of a rule and the data types of its components

Figure 14.1 shows the syntactic structure of a rule, with the keywords `rule` and `endrule`.



Figure 14.1: Syntactic structure of a rule

The rule’s *explicit* condition is an expression of type `Bool`. Recall from Section 5.6.1 that, since it is not of `Action` or `ActionValue` type, it is therefore guaranteed by BSV’s type system to be a pure computation with no side-effects.

The rule body, as a whole, is an expression of type `Action`. It typically consists of multiple statements, including register-writes, FIFO enqueues/dequeues, module method invocations, `let`-bindings, `$display`s, if-then-elses, and so on. Many of the statements/sub-expressions will themselves be of type `Action` or `ActionValue`. The overall action of a rule is the set of all sub-actions performed by a rule.

Both the rule condition and the body may contain invocations of methods in interfaces of other modules. Each method has an “implicit condition”, also of type `Bool`, indicating whether the method is currently enabled or not. This implicit method output is also called its READY signal. For example, for a standard FIFO `f`, the `f.first` and `f.deq` methods have implicit conditions that are true only when the FIFO is non-empty. A method like `f.first`, being pure (not `Action` or `ActionValue`), may be invoked both in rule conditions and in rule bodies. A method like `f.deq`, being of type `Action`, can never be invoked in a rule condition, only in a rule body.

The overall rule condition, also known as its “CAN\_FIRE” condition, is a conjunction of the rule’s explicit condition and implicit conditions of any invoked methods (whether those methods are in the rule condition or in the rule body). For example, if a rule invokes `f.first` or `f.deq`, the rule’s CAN\_FIRE condition cannot be true if `f` is empty.
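As a rough software analogy (not BSV, and not how *bsc* represents anything), the CAN\_FIRE conjunction can be modeled in Python. The `Fifo` class and its `*_ready` methods are hypothetical stand-ins for a FIFO's implicit conditions.

```python
from collections import deque

class Fifo:
    """Toy bounded FIFO; the *_ready methods model implicit conditions."""
    def __init__(self, depth=2):
        self.items, self.depth = deque(), depth
    def first_ready(self):
        return len(self.items) > 0          # first needs a non-empty FIFO
    def deq_ready(self):
        return len(self.items) > 0          # deq needs a non-empty FIFO
    def enq_ready(self):
        return len(self.items) < self.depth  # enq needs a non-full FIFO

def can_fire(explicit_condition, implicit_conditions):
    """Conjunction of the explicit condition and all implicit conditions."""
    return explicit_condition and all(implicit_conditions)

f = Fifo()
# A rule that invokes f.first and f.deq, with explicit condition True:
empty_case = can_fire(True, [f.first_ready(), f.deq_ready()])
f.items.append(3)
nonempty_case = can_fire(True, [f.first_ready(), f.deq_ready()])
print(empty_case, nonempty_case)
```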

## 14.3 Semantics of a rule in isolation

In this section we discuss the semantics of a rule in isolation; in Section 14.4 we will consider the collection of rules that constitute a BSV design.

Each rule can be viewed as a pure function (therefore, a combinational circuit) whose inputs come from various methods and which produces outputs for various `Action` and `ActionValue` methods. Each `Action` or `ActionValue` method has an implicit boolean ENABLE argument (separate from its normal arguments and result). An `Action` or `ActionValue` method *performs* its action if its ENABLE argument is true.

Consider the following rule (artificial, just for illustration, not taken from code of Drum or Fife or any actual design):

```
rule rl_compute ((y != 0) && got_x && (f.first == 3));
   if (y [0] == 1) w <= w + x;
   x <= x << 1;
   g.enq (w * f.first);
endrule
```

Figure 14.2 illustrates the semantics of this rule in isolation. In the discussion below, we use the words “input” and “output” relative to a method (input to the method, or output from the method).

At the top of the diagram we see outputs from methods feeding the rule: four register-reads<sup>1</sup> and one FIFO `.first`. The latter provides two outputs: a data output from the head of the FIFO (black line) and a READY value (green line) which is the implicit condition of the method, which is true only if the FIFO is not empty. Actually *all* methods have implicit conditions, but a register’s `._read` and `._write` methods are always READY, *i.e.*, their implicit conditions are constant true, so we omit them in diagrams.

At the bottom of the diagram we see three `Action` methods used in the rule: two register-writes and one FIFO `.enq`. Each method has an input data value (black line) and an implicit input ENABLE value (red line). The FIFO `.enq` method also has a READY output (green line), which is its implicit condition (true only when the FIFO is not full).

<sup>1</sup>Recall from Section 8.3.1 that mentioning a register in an expression is equivalent to using its `._read` method, and assigning a value to a register is equivalent to using its `._write` method.

Figure 14.2: Semantics of a rule in isolation

In between the top and the bottom is a pure function (therefore, a combinational circuit). On the left half we see that all the explicit-condition expressions are calculated and combined (with “`&&`”), and then further combined with the implicit conditions, to produce the CAN\_FIRE signal. This is fed directly to the `g.enq` and `x._write` ENABLE inputs. The CAN\_FIRE signal is further combined with the calculation of “`y[0]==1`” and this is fed to the ENABLE input of `w._write`.

The data inputs to the three `Action` methods are straightforward calculations from inputs. When the ENABLE input to an `Action` method is true, the method actually performs the action (enqueue into a FIFO, store into a register).

From the diagram we can see that the ENABLEs of `g.enq` and `x._write` are true when FIFO `f` has data available (not empty), when FIFO `g` has space available (not full) and when the explicit condition is true—the rule “fires” and the actions are performed. The ENABLE of `w._write` is true only if `y[0]==1` is also true.

From this description, several things should be clear:

- All `Actions` in a rule are performed *simultaneously*, no matter what textual order they may appear in the rule body. We also say that the actions all occur *in parallel*.
- All `Actions` in a rule are performed *instantaneously*.
- Explicit rule conditions and method implicit conditions are combined to determine whether the rule executes at all.
- Some actions in a rule may be further restricted by if-then-else conditions in the rule.
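The points above can be mimicked by a small Python model of `rl_compute`. It is an analogy only: the dictionary-based state, the list-based FIFOs and the `g_depth` field are assumptions of this sketch, and bit-widths are ignored. The key point is that every right-hand side reads the *old* state, and all updates are applied together.

```python
def rl_compute(state):
    """One firing attempt of rl_compute; returns the new state (or the old one)."""
    f, g = state["f"], state["g"]
    # Implicit conditions (READY): f.first needs f non-empty, g.enq needs g not full.
    ready = len(f) > 0 and len(g) < state["g_depth"]
    # CAN_FIRE: implicit conditions AND the explicit condition.
    can_fire = ready and state["y"] != 0 and state["got_x"] and f[0] == 3
    if not can_fire:
        return state
    new = dict(state)
    # All right-hand sides read the OLD state; order of these lines is irrelevant.
    if state["y"] & 1 == 1:                 # y[0] == 1 further gates the write to w
        new["w"] = state["w"] + state["x"]
    new["x"] = state["x"] << 1
    new["g"] = g + [state["w"] * f[0]]      # enqueues old w times f.first
    return new

before = {"y": 5, "got_x": True, "w": 10, "x": 2, "f": [3], "g": [], "g_depth": 2}
after = rl_compute(before)
print(after["w"], after["x"], after["g"])
```

Note that the value enqueued into `g` uses the old value of `w`, even though `w` is also updated in the same firing.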

### 14.3.1 Hardware representation of a rule in isolation

Figure 14.3 overlays standard symbols in digital hardware for registers and FIFOs onto Figure 14.2, to show how the semantics can map to real hardware in a straightforward way.



Figure 14.3: Hardware representation of a rule in isolation

BSV registers map directly into Verilog registers which, in turn, map into “D flip flops”. The `._read` method output is the same as the “Q” outputs of D flip flops. The `._write` method inputs are the same as the “D” and “EN” inputs of D flip flops. Each register also has a “clock” (CLK) input. On each edge of CLK (*e.g.*, on the positive edge, or so-called “posedge”, when going from low to high), if EN is true, then the value on the D input is copied into the register, replacing the previous value. The value in the register is continuously presented on the Q output wires.

BSV FIFOs are implemented as Verilog modules using registers. The details need not concern us here<sup>2</sup> but, suffice it to say, on a clock edge, if EN is true, the data value is loaded into a register in the FIFO representing the tail of the queue.

In the semantics, we said that all actions of the rule are “simultaneous”. In hardware terms: they occur on the same clock edge.

In the semantics, we said that all actions are “instantaneous”. In hardware terms this is the standard digital abstraction, as if clock edges are instantaneous, and as if registers load their values at that instant. In practice, because of physics, clock edges are steep but not vertical (they have small but finite rise times), but the standard digital abstraction suppresses this detail.

The standard digital abstraction also treats combinational circuits as instantaneous, as if there is zero delay in producing outputs from inputs. In practice, signals take a small but finite time to propagate through wires and gates. Thus, the ENABLE for `w._write` will be available slightly later than the ENABLE for the other two methods, because it goes through a longer combinational path. But in the standard digital abstraction we idealize all this as zero delay.

**NOTE:** Although Figure 14.3 is useful in developing intuitions, it is important to understand that BSV semantics stands alone, and does not depend on any particular mapping to hardware! Different compilers may map BSV to hardware in different ways, for example using multiple clocks for some or all rules. Indeed a compiler could choose to map BSV code into so-called “asynchronous logic” (which does not have clocks at all).

<sup>2</sup>If you are curious, you can look at the Verilog code for FIFOs in the *bsc* library.

**Exercise 14.1:**

Consider the following two alternative ways of writing a rule (that differ only in the order of the rule-body statements):

```
rule rl_r1 (... condition ...);
   x <= y + 1;
   y <= x + 2;
endrule
```

```
rule rl_r2 (... condition ...);
   y <= x + 2;
   x <= y + 1;
endrule
```

Sketch the semantic view (and possible hardware) for the two alternatives. Is there any semantic difference between the two rules?

If the values of registers `x` and `y` are 10 and 20 respectively, what are their values after the rule fires once? What would the answer(s) be with the following similar-looking C statements?

```
x = y + 1;
y = x + 2;
```

```
y = x + 2;
x = y + 1;
```

□

### 14.3.2 A rule firing cannot perform the same action more than once

From our description that a rule's actions are semantically simultaneous, it should be clear that the same action cannot be performed more than once in a single rule firing. For example:

```
rule rl_foo (...);
   x <= 2;         x <= 3;
   f.enq (2);      f.enq (3);
   g.deq;          g.deq;
endrule
```

Each line shows an absurdity: we cannot write two values into the same register at the same instant, nor enqueue two values into the same FIFO at the same instant, nor dequeue two values from the same FIFO at the same instant.

The *bsc* compiler will flag such errors in a program with a message like “Cannot compose actions in parallel”.

This may seem a somewhat minor point (would anyone really write such absurdities in their programs?), but understanding it will help when we discuss mapping multiple rules to a clock in Section 14.5.

## 14.4 Semantics of a collection of rules

The semantics of a collection of rules is deceptively simple: it simply repeats, forever, the single-rule semantics of Section 14.3:

```

while True
    Choose any rule whose CAN_FIRE is true
        Perform the actions in that rule's body

```

Of course, when a rule fires, its actions will have modified some state (a register, a FIFO, etc.). Thus, in the next iteration of this while-loop, a different set of rules may have true CAN\_FIRE conditions.

Revisiting our FIFO example, if a rule R1 invokes `f.first` or `f.deq`, it cannot fire if `f` is empty. Some other enabled rule R2 may fire and invoke `f.enq`; the FIFO then becomes non-empty, at which point R1's CAN\_FIRE may become true, and R1 may become eligible to fire.
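The following is a minimal sketch of this situation, assuming the standard `FIFO` package from the *bsc* library; the module name, the rule names and the register `rg_next` are hypothetical. Neither rule has an explicit condition: each rule's CAN\_FIRE comes entirely from the implicit conditions of the FIFO methods it invokes.

```bsv
package Fifo_Rules;

import FIFO :: *;

(* synthesize *)
module mkFifoRules (Empty);
   FIFO #(Bit #(8)) f       <- mkFIFO;
   Reg #(Bit #(8))  rg_next <- mkReg (0);

   // R2: can fire only when f is not full (implicit condition of enq)
   rule rl_R2;
      f.enq (rg_next);
      rg_next <= rg_next + 1;
   endrule

   // R1: can fire only when f is not empty (implicit condition of first/deq)
   rule rl_R1;
      $display ("R1 consumed %0d", f.first);
      f.deq;
      if (f.first == 9) $finish (0);
   endrule
endmodule

endpackage
```

Initially `f` is empty, so only `rl_R2` is enabled; once it has fired, `rl_R1` becomes enabled as well.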

Observe that the semantics is *one rule at a time*. We emphasize that this is only at the semantic level, in the same sense that RISC-V ISA semantics is one-instruction-at-a-time and C/C++ semantics is one-statement-at-a-time. Any *implementation* is free to speed things up with concurrency and/or reordering, provided they are consistent with the one-at-a-time semantics so that the programmer has no surprises.

NOTE:

Observe that rule-level semantics is non-deterministic: if several rules' CAN\_FIRE is true, we can choose any one. This is sometimes shocking and scary to the BSV newcomer, but it is in fact very common in formal specification systems (including all those cited below), because forcing a schedule (a particular way to choose enabled rules) is usually an *over-specification* and should instead be left as implementation leeway. Proving any correctness property of a program using the non-deterministic semantics proves it for *all* possible schedules, and is thus more general than a proof for a specific schedule.

Nevertheless, please keep calm and carry on; the `bsc` compiler removes all non-determinism (in a predictable way) when it compiles to hardware.

We cannot emphasize enough that *the semantics is enough for reasoning about functional correctness of all BSV programs*<sup>3</sup>, i.e., “does the program compute what we expect it to compute?”, without appealing to clocks and clocked digital hardware! This includes all of Drum and Fife; in fact, the BSV code for Drum and Fife does not mention any clock, and in this book we do not mention clocks and temporal properties of Drum and Fife until much later, in Chapters 18 and 19.

NOTE:

This, again, can be shocking and scary to the newcomer to BSV who has already learned some digital hardware design. In traditional teaching of digital hardware design, one often introduces clocks and clocked logic practically in Lecture 1, and this suffuses the thinking completely from that point onward.

Again, please keep calm and carry on; the BSV view will grow on you and, over time, become the simpler and more intuitive view!

---

<sup>3</sup>There is a nuance regarding how one defines functional correctness when we compute just one deterministic outcome of a non-deterministic program, but we ignore that here.

This semantics of rules is well-known in the Theoretical Computer Science literature, broadly falling under the rubric of “*Term Rewriting Systems*”, because it is a very simple, clean, self-contained computation model (like Turing Machines and Lambda Calculus) ([1, 14, 15, 23]).

Several formal specification systems for concurrent programming use this computation model; well-known examples include:

- Guarded Commands Language (Dijkstra [7])
- UNITY (Chandy and Misra [4])
- Event-B (Metayer, Abrial and Voisin [17])
- TLA+ (Lamport [16])

## 14.5 Mapping rules to a clock, for real-time behavior

With the rule semantics of Section 14.4, we only have an abstract view of time—in an execution we can say whether this rule firing was *before* that rule firing, or *after*; no more. In particular, we cannot ascribe any real-time measure to it (microseconds, nanoseconds, ...).

Digital hardware systems are driven by one or more *clock signals*. Figure 14.4 illustrates a clock signal. A clock signal is an electrical signal (a voltage on a wire), where the voltage change over time has the shape of a so-called “square wave”, which repeats, or “cycles”, at a fixed time interval (the “cycle time”), indefinitely. Each cycle has a falling edge (or negedge) and a rising edge (or posedge). Modern FPGAs typically run from 10s to 100s of MHz (“megahertz”, or millions of cycles per second), and modern ASICs can run at up to several GHz (“gigahertz”, or billions of cycles per second). Thus, a clock is a real-time reference signal.

Figure 14.4: A clock signal

Standard digital state elements (*e.g.*, registers) update their values only and exactly on a clock edge (posedge or negedge), and hold that value for the full next cycle, when they may change again. Usually all elements in a circuit react to the same edge, either the posedge or the negedge; for simplicity we’ll assume the posedge.

Specifically, if a D flip flop’s EN input is high (1, true) at the posedge, it updates its state to contain the value on the D input; if the EN input is low (0, false), it does nothing, retaining its previous value. The value in the flip flop is continuously driven on its Q output.

### 14.5.1 Constraints on mapping rules to a clock

To map the collection of rules in a BSV design to clocked digital hardware, as a first approximation, suppose that each rule were mapped as suggested in Figure 14.3. Then, conceptually, at each posedge, all enabled rules (whose CAN\_FIRE is true) will fire and perform their output actions.

But this would be wrong, for two reasons:

- *Action Conflicts* or *Resource Conflicts*: Two different rules may invoke the same action/actionvalue method (try to write the same register, enqueue onto the same FIFO, dequeue the same FIFO, *etc.*). As discussed in Section 14.3.2, this is clearly not feasible at the same instant (same posedge).
- *Ordering Conflicts*: The *ordering* can be inconsistent with rule semantics. Consider the execution of these two rules:



According to the one-rule-at-a-time semantics, either rule `rl_r1` precedes `rl_r2` or *vice versa*. In either case, the register state-update by one rule is observed by the other rule. Whereas, if we execute the rule actions at the same instant, neither rule observes the update by the other rule. Thus, sequential execution and parallel, instantaneous execution will produce inconsistent results. (See also the Exercise at the end of Section 14.3.1.)
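As a concrete, hypothetical instance of an ordering conflict (not necessarily the pair of rules the text has in mind), consider the sketch below, in which each rule reads the register that the other rule writes. The module name and the initial values are assumptions made for illustration.

```bsv
package Ordering_Conflict;

(* synthesize *)
module mkOrderingConflict (Empty);
   Reg #(Bit #(8)) x <- mkReg (1);
   Reg #(Bit #(8)) y <- mkReg (2);

   // Reads y, writes x
   rule rl_r1;
      x <= y;
   endrule

   // Reads x, writes y
   rule rl_r2;
      y <= x;
   endrule
endmodule

endpackage
```

Starting from x = 1 and y = 2, firing `rl_r1` and then `rl_r2` leaves x = 2 and y = 2; firing `rl_r2` and then `rl_r1` leaves x = 1 and y = 1. Executing both at the same instant would instead swap the registers (x = 2, y = 1), a result that neither one-rule-at-a-time order can produce. We would therefore expect *bsc* to report a conflict between these rules and to let only one of them fire on any given clock.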

Recall in Figure 14.2 that a rule's CAN\_FIRE signal controls its execution completely—if it is false, the rule does nothing. We take advantage of this observation in Figure 14.5, where we pass the CAN\_FIRE values of all rules through a *Rule Controller* function which returns, for each CAN\_FIRE, a corresponding WILL\_FIRE value.



Figure 14.5: Controlling rule execution (CAN\_FIRE excerpt from Figure 14.3)

Whenever a pair of rules `rl_r1` and `rl_r2` would conflict if executed simultaneously, the Rule Controller ensures that it never happens: if both CAN\_FIREs are true, the Rule Controller forces one of the WILL\_FIREs to be false, *i.e.*, it suppresses one of the two rules.

### 14.5.2 The Rule Controller produced by the *bsc* compiler, and reasoning about performance

The abstract description of the Rule Controller above allows for many possible implementations of the Rule Controller; the only criterion for correctness is that the resulting rule execution in hardware should be consistent with one-rule-at-a-time semantics.

The *bsc* compiler produces a simple, state-free, combinational circuit for the Rule Controller. First, it produces a *linear* ordering of all the rules in the program, that results in minimal conflicts (for example, if rule *rl*<sub>1</sub> reads a register and rule *rl*<sub>2</sub> writes the same register, then it tries to place *rl*<sub>1</sub> earlier than *rl*<sub>2</sub> in the ordering, because executing the rules simultaneously would be consistent with that order, *i.e.*, would not conflict).

Second, for any rule *rl*<sub>1</sub> before rule *rl*<sub>2</sub> in the ordering and which would conflict if executed simultaneously, the WILL\_FIRE of *rl*<sub>1</sub> is negated and AND-ed with the CAN\_FIRE of *rl*<sub>2</sub> to suppress the latter rule.
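For just two rules, the suppression logic described above can be sketched as a pure combinational function. This is only an illustration under stated assumptions: the names `fn_rule_controller`, `conflict` and `mkTb` are hypothetical, `rl_r1` is assumed to come first in the linear ordering, and this is not the code that *bsc* actually emits.

```bsv
package Rule_Controller_Sketch;

// Hypothetical two-rule controller; rule 1 precedes rule 2 in the ordering.
// 'conflict' is True if the two rules must not execute in the same clock.
function Tuple2 #(Bool, Bool) fn_rule_controller (Bool conflict,
                                                  Bool can_fire_1,
                                                  Bool can_fire_2);
   Bool will_fire_1 = can_fire_1;
   // Negated WILL_FIRE of rule 1, AND-ed with CAN_FIRE of rule 2
   Bool will_fire_2 = can_fire_2 && ! (conflict && will_fire_1);
   return tuple2 (will_fire_1, will_fire_2);
endfunction

(* synthesize *)
module mkTb (Empty);
   rule rl_show;
      // Both rules enabled and conflicting: rule 2 is suppressed
      match {.w1, .w2} = fn_rule_controller (True, True, True);
      $display ("WILL_FIRE_1 = ", fshow (w1), "  WILL_FIRE_2 = ", fshow (w2));
      $finish (0);
   endrule
endmodule

endpackage
```

Because the function has no state, its outputs depend only on the current CAN\_FIRE values, which matches the "state-free, combinational" description above.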

A compile-time flag to the *bsc* compiler (`-show-schedule`) will make it dump the “schedule”, *i.e.*, the linear ordering of rules and the conditions under which one rule may suppress another.

With this information, we can reason about the real-time performance of a BSV program, *i.e.*, to address the question: “does the program compute something within the number of clocks we expect it to be computed?” Note, our unit of time is clock cycles, not real-time *per se* (seconds), because BSV has no way to know the actual speed (1 MHz? 1 GHz?) of the clock you might supply to your hardware circuit.

### 14.5.3 Explicit control of rule ordering, and controller optimizations

We mentioned in the previous section that the *bsc* compiler produces a linear ordering of all the rules in a program, attempting to minimize the number of conflicts. This is a heuristic, because producing an “optimal” ordering is undecidable. Further, when the compiler sees a conflict between two rules, it may have no criterion to choose which one has priority, *i.e.*, which one will suppress the other. In such situations, the BSV programmer can provide explicit *attributes* on rules to guide the compiler’s choice.

For example:

```
rule rl_r1 (...);
  ...
endrule

(* descending_urgency = "rl_r1, rl_r2" *)
rule rl_r2 (...);
  ...
endrule
```

Here, the `descending_urgency` attribute advises the compiler to treat rule `rl_r1` with higher priority. The attribute is written just before a rule; it can name this rule and any rules earlier in the source text.

Another attribute advises the compiler to force `rl_r1` to suppress `rl_r2` even when there is no conflict between them:

```
(* preempts = "rl_r1, rl_r2" *)
```

A third attribute asserts to the compiler that the CAN\_FIRE of two rules are mutually exclusive, *i.e.*, can never simultaneously be true (the compiler tries to prove such properties by itself, but cannot always do so), which permits the compiler to simplify the controller.

```
(* mutually_exclusive = "rl_r1, rl_r2" *)
```

In this case, the compiler also generates code that checks the mutual-exclusivity property during simulation.

## 14.6 StmtFSM can always be translated into rules

As mentioned in Section 10.2, StmtFSM does not add any fundamental semantic power to BSV; anything expressible with StmtFSM can also be expressed using rules. Here we show a small example:

```
Stmt s = while (b)
    seq
        action1;
        if (c)
            action2;
        else
            action3;
        action4;
    endseq
```

Here is a translation into rules:

```
// rg_step == 100 means "idle"
// To start the FSM, environment sets rg_step <= 0
// (rg_step is 7 bits wide so that it can hold the value 100)
Reg #(Bit #(7)) rg_step <- mkReg (100);

rule rl_while (rg_step == 0);
    rg_step <= (b ? 1 : 100);
endrule

rule rl_A1 (rg_step == 1);
    action1;
    rg_step <= (c ? 2 : 3);      // if-then-else
endrule

rule rl_A2 (rg_step == 2);
    action2;
    rg_step <= 4;
endrule

rule rl_A3 (rg_step == 3);
    action3;
    rg_step <= 4;
endrule

rule rl_A4 (rg_step == 4);
    action4;
    rg_step <= 0;      // back to top of loop
endrule
```

This translation illustrates (a) that `StmtFSM` is just a higher-level abstraction on rules, and (b) that `StmtFSM` can often be much more readable than the corresponding rules—the while-loop, the sequencing, and the if-then-else structure are not readily apparent from the rules.

In Chapter 15 we will show how the FSM for Drum from Chapter 11 can be hand-translated directly into rules. Indeed, the `bsc` compiler performs a similar translation of `StmtFSM` into rules.

## 14.7 When to use `StmtFSM` and when to use rules

This raises the question: when should we use `StmtFSM` and when should we use rules?

For some, it is a matter of personal taste; some like the higher-level abstraction of `StmtFSM`, while others like the uniformity of using rules exclusively.

At the very small scale of a single `Action` (perhaps containing sub-actions), this can be the body of a single rule, and this is often more succinct and clear than using `StmtFSM` for just one action. Many modules have a single rule in their body (we will see many such examples in Fife, in later chapters).

Ability to debug is sometimes another consideration. For explicit rules, the programmer supplies rule names, which are presumably carefully chosen for clarity. The rules generated by the `bsc` compiler to implement a `StmtFSM`, on the other hand, have names that are compiler-generated, and may not be so clear (to help legibility, the compiler includes the source file line number in the rule name). Readability of rule names affects us most in compiler-generated error and warning messages, *e.g.*, a message that two sub-actions cannot be placed in the same rule, or a message about the relative schedule-order of two rules.

Another difference is in performance. When the `bsc` compiler translates `StmtFSM` into rules, it may have to make pessimistic assumptions about when it can iterate a loop, or when it can terminate a `par-endpar` block. In these cases, it may be possible to save a clock cycle by writing with rules instead.

### 14.7.1 Use rules for unstructured processes

The situation where rules are often much clearer than `StmtFSM` is when we have an unstructured process, *i.e.*, a process that is not easily expressed with a proper nesting of `Stmt` constructs (sequencing, if-then-else, and loops).

Suppose we modify the example in Section 14.6 as follows: after action2, if a certain condition “c2” holds, we can immediately go back to the top of the loop.

It is possible, but not easy, to modify the Stmt to express this “goto top of loop” requirement.<sup>4</sup> On the other hand, the rules version is almost trivial to modify: we just have to change one rule:

```
rule rl_A2 (rg_step == 2);
    action2;
    rg_step <= (c2 ? 0 : 4);
endrule
```


---

<sup>4</sup>There is a long-standing theoretical Computer Science result from the 1960s that arbitrary control-flows can be converted into “goto-free” structured programs; but this transformation involves introducing many boolean variables to remember deeply nested conditions, and the resulting programs are not necessarily very readable or efficient.

# Chapter 15

## RISC-V: the Drum unpipelined CPU (using Rules instead of StmtFSM)

### 15.1 Introduction

In Section 11.6 we showed the complete behavior of the Drum CPU module, coded using BSV's `StmtFSM` construct for FSMs.

In this short chapter we show a manual translation of that FSM into BSV Rules. Our first objective is to reinforce the fact that `StmtFSM` does not add any fundamental new computational capability to BSV—anything that could be coded with `StmtFSM` can also be coded with Rules. `StmtFSM` is just a convenient, more readable view of rules, when the FSM follows a structured flow.

In other situations, where the FSM does not follow a structured flow, and certainly where we have multiple cooperating FSMs, Rules are often the more convenient medium of expression. These ideas are explored in the exercises at the end of the next section.

### 15.2 The Drum CPU module behavior with Rules

First we define an enumeration type to give symbolic labels to all the Drum FSM actions:

```
// src_Drum/CPU.bsv: line 60 ...
typedef enum {
    A_FETCH,
    A_DECODE,
    A_REGISTER_READ_AND_DISPATCH,
    A_EX,
    A_RETIRE_CONTROL,
    A_RETIRE_INT,
    A_RETIRE_DMEM,
    A_EXCEPTION
} CPU_ACTION
deriving (Bits, Eq, FShow);
```

Then, we instantiate a register `rg_action` to hold the label of the next action to be performed, initialized to `A_FETCH`. We manually transform each FSM action into a rule. Each rule's explicit condition checks `rg_action`, and the rule body updates `rg_action` to enable the next action. The final rule, which handles any exception, assigns `A_FETCH` to `rg_action` again, and the whole process repeats itself, forever.

```
// src_Drum/Drum_Rules.bsv: line 10 ...
Reg #(CPU_ACTION) rg_action <- mkReg (A_FETCH);

rule rl_fetch (rg_running && (rg_action == A_FETCH));
    a_Fetch;
    rg_action <= A_DECODE;
endrule

rule rl_decode (rg_action == A_DECODE);
    a_Decode;
    rg_action <= A_REGISTER_READ_AND_DISPATCH;
endrule

rule rl_register_read_and_dispatch (rg_action == A_REGISTER_READ_AND_DISPATCH);
    a_Register_Read_and_Dispatch;
    rg_action <= A_EX;
endrule

rule rl_Retire_direct ((rg_action == A_EX)
                       && (rg_Dispatch.to_Retire.exec_tag == EXEC_TAG_DIRECT));
    a_Retire_direct;
    rg_action <= A_EXCEPTION;
endrule

// BRANCH, JAL, JALR
rule rl_EX_Control ((rg_action == A_EX)
                    && (rg_Dispatch.to_Retire.exec_tag == EXEC_TAG_CONTROL));
    a_EX_Control;
    rg_action <= A_RETIRE_CONTROL;
endrule

rule rl_Retire_Control (rg_action == A_RETIRE_CONTROL);
    a_Retire_Control;
    rg_action <= A_EXCEPTION;
endrule

// LUI, AUIPC, IALU
rule rl_EX_Int ((rg_action == A_EX)
                && (rg_Dispatch.to_Retire.exec_tag == EXEC_TAG_INT));
    a_EX_Int;
    rg_action <= A_RETIRE_INT;
endrule

rule rl_Retire_Int (rg_action == A_RETIRE_INT);
    a_Retire_Int;
    rg_action <= A_EXCEPTION;
endrule

// rule rl_EX_DMem (rg_action == A_EX_DMEM);
rule rl_EX_DMem ((rg_action == A_EX)
                 && (rg_Dispatch.to_Retire.exec_tag == EXEC_TAG_DMEM));
    a_EX_DMem;
    rg_action <= A_RETIRE_DMEM;
endrule

rule rl_Retire_DMem (rg_action == A_RETIRE_DMEM);
    a_Retire_DMem;
    rg_action <= A_EXCEPTION;
endrule

rule rl_exception (rg_action == A_EXCEPTION);
    if (rg_exception)
        a_exception;
    rg_action <= A_FETCH;
endrule
```

Please study the FSM version (in Section 11.6) and the above code side-by-side, to see how straightforward it is to translate a StmtFSM into Rules manually.

### 15.2.1 Optimizing the Drum rules

If we execute the Drum CPU on a RISC-V program and study the detailed, cycle-by-cycle execution trace, we will see many possibilities to eliminate “wasted” cycles.

- In rule `rl_decode` we may detect an exception: the IMem response itself could be an exception, or Decode may find the fetched instruction to be illegal. In case of an exception, instead of proceeding to Register-Read-and-Dispatch, we could directly process the exception and return to Fetch.
- Similarly, rules `rl_EX_Control` and `rl_EX_Int` may raise exceptions; in these cases, too, we could directly deal with the exception and return to Fetch.
- The Retire Direct CSRRxx action may raise an exception, which is handled in the final exception action, which also accesses the CSRs (to update trap registers and return the trap-vector PC). These could be fused into a single method in the CSR module, after which we could return directly to Fetch.
- Because we are using (a subset of) standard CSRs, the only exceptions in a CSRRxx instruction can be because we are addressing an unsupported CSR, or we are attempting to update a read-only CSR. Both these questions can be decided in Decode, in which case we can assume that, in Retire, the CSRRxx treatment will not raise an exception, and can go straight to Fetch.
- The Decode and Register-Read-and-Dispatch rules could be “fused” into a single rule. Currently, the former rule leaves information in a register which is used by the latter rule. If fused, the information can be directly passed as a value, and the intervening register could be eliminated.

Fusing rules together like this comes at a cost: the combinational logic in a rule body becomes more complex. This, in turn, may require us to clock it at a slower speed. Thus, while we may have “saved time” by eliminating a “tick”, we may have made the ticks longer (slower clock). Careful analysis of the actual numbers is needed to judge whether such a transformation is worth it or not.

These optimizations are easier to implement in the Rules version of Drum (instead of the StmtFSM version), because the flow now becomes less structured, with ad-hoc jumps from some rules back to Fetch, *etc.* StmtFSM can only express structured, properly-nested flows.

---

**Exercise 15.1:**

Implement some or all the ideas above and measure the results (circuit size, achievable clock speed, application program speed).

□

---

### 15.3 Conclusion

And that is the complete Drum CPU behavior, written using Rules! The reason this was so easy, and why this chapter is so short, is that we reused all the Action definitions defined earlier for the FSM version of Drum.

# Chapter 16

## RISC-V: the Fife pipelined CPU: Principles

### 16.1 Introduction

In this chapter we turn our attention to Fife, the pipelined CPU. Our focus here is purely on identifying new problems raised by pipelining, and proposing solutions. In the next chapter we will look at BSV code to implement Fife.

Figure 16.1 annotates the abstract execution algorithm in Figure 3.1 with some specifics for the pipelined implementation in Fife. Unlike Drum, where each yellow box was one step in a sequential process, now we interpret each yellow box as containing its own infinite process. There are now half-a-dozen or more processes in the diagram (one for each yellow box), all running *concurrently*.

Figure 16.1: Pipelined interpretation of RISC-V instructions (Fig. 3.1 with some annotations)

For each of the yellow boxes, we use the word “*step*” for Drum and “*stage*” for Fife. For Fife, each black arrow in the diagram represents a flow of *messages* sent from one stage to another. These messages are sent via FIFO buffers (depicted by annotations in the figure).

Each stage is an infinite loop, consuming incoming messages and producing outgoing messages. Thus, while the Retire stage is working on instruction  $n$ , the Execute, Register-Read, Decode and Fetch stage(s) may be working on instructions  $n + 1$ ,  $n + 2$ ,  $n + 3$ , and  $n + 4$ , respectively. Thus, there is a sequence, or train, of instructions flowing through the diagram from left to right.
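The idea of stages as concurrent infinite processes connected by FIFOs can be sketched in a few lines of BSV. This is a toy illustration, not Fife code: the three stages, the FIFO names and the use of a plain counter as the "instruction" are all assumptions made for the sketch.

```bsv
package Pipe_Sketch;

import FIFO :: *;

(* synthesize *)
module mkPipeSketch (Empty);
   FIFO #(Bit #(32)) f_A_to_B <- mkFIFO;
   FIFO #(Bit #(32)) f_B_to_C <- mkFIFO;
   Reg #(Bit #(32))  rg_n     <- mkReg (0);

   // Stage A: produces message n, then moves on to n+1
   rule rl_stage_A;
      f_A_to_B.enq (rg_n);
      rg_n <= rg_n + 1;
   endrule

   // Stage B: consumes a message from A, forwards it to C
   rule rl_stage_B;
      let n = f_A_to_B.first;
      f_A_to_B.deq;
      f_B_to_C.enq (n);
   endrule

   // Stage C: consumes messages in the order they were produced
   rule rl_stage_C;
      let n = f_B_to_C.first;
      f_B_to_C.deq;
      $display ("stage C: item %0d", n);
      if (n == 9) $finish (0);
   endrule
endmodule

endpackage
```

Each rule is an independent process that runs whenever its input FIFO has data and its output FIFO has room, so at any moment the three stages may be working on three different items.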

Pipelining raises four new problems, and these are the focus of this chapter:

- Keeping the Fetch Stage Working with PC Prediction and Epochs
- Managing Register Read/Write Hazards with a Scoreboard
- Retiring outputs of the Execute Stages in Order with Tags
- Allowing Memory Ops to be Pipelined with a Store Buffer

## 16.2 Keeping the Fetch Stage Working with PC Prediction and Epochs

What should the Fetch stage do after issuing a request to IMem for instruction  $n$ ? To issue another request, it needs to know the PC of instruction  $n + 1$ , but there are several uncertainties about that next PC:

- The current Fetch itself can raise an exception (trap) if its PC is misaligned, is an unsupported memory address, *etc.* In this case the next PC will be the trap-handler PC instead of the “normal” next PC.
- If it is a BRANCH instruction, until it reaches the Execute Control stage we do not know the branch target PC address, nor if the branch is taken or not.
- If it is a JAL or JALR instruction, until it reaches the Execute Control stage we do not know the target PC address.
- Many instructions can raise an exception (trap) (illegal instructions, BRANCH/JAL/JALR with misaligned target addresses, DMem ops with misaligned addresses or unsupported addresses, *etc.*); in these cases the next PC will be the trap-handler PC instead of the “normal” next PC.
- The CPU may choose to respond to an external interrupt, in which case the next PC will be the interrupt-handler’s address.

Note, the Fetch stage knows nothing about instruction  $n$  other than its PC. The instruction itself is not known until IMem sends its response to the Decode stage (and assuming the Fetch does not raise an exception).

### 16.2.1 PC Prediction in the Fetch Stage

A standard solution is for the Fetch stage to *predict* the next PC, *i.e.*, make a guess about the next PC. Since all RISC-V RV32 instructions are 32 bits wide (4 bytes), and *most* of them “fall-through” to the next adjacent instruction, a simple prediction is: PC+4. This prediction will be correct for most instructions, but will be wrong for BRANCH instructions that take the branch, for JAL/JALR instructions, and for any instruction that traps. When the prediction is wrong, the instructions that follow immediately are called “mispredicted” or “wrong-path” instructions.

RISC-V instructions are all 32 bits wide, so PC+4 is a reasonably good guess. In ISAs that have variable-length instructions, prediction may be more complicated. Even in RISC-V, when implementing the “C” (Compressed Instructions) extension, some instructions may be 16 bits wide, raising similar complications.

**NOTE:** Earlier we said “the Fetch stage knows nothing about instruction  $n$  other than its PC”. This is not strictly true—the CPU may have fetched this PC before (*e.g.*, this PC is inside a loop, or in a procedure that is called repeatedly). Knowledge of past behavior can improve the current prediction. Most predictors in modern processors use past history to improve and “tune” their branch predictors dynamically while executing the program. Designing good branch predictors is a deep topic for which there are many good textbooks (for example, [9]).

PC prediction can be seen as a kind of “machine learning”. The CPU’s past execution history constitutes the “training data” for a model, and the model is then asked to predict the next PC for the current PC.

### 16.2.2 Identifying and Flushing Wrong-path Instructions

Clearly, we need to identify and flush wrong-path instructions from the pipeline.

Suppose the Fetch stage issues requests for two instructions  $i_1$  at address  $a_1$  and  $i_2$  at address  $a_2$ , where  $a_2$  is predicted from  $a_1$ . When issuing the request for  $a_1$ , the Fetch stage can pass along  $a_2$  to the Decode stage, from which point it can accompany  $i_1$  as it journeys through the pipeline (*i.e.*, every instruction is accompanied by its next-PC prediction).

When  $i_1$  reaches the Retire stage, we know the *correct* next PC (Trap handler PC? PC+4? Branch-taken target? JAL/JALR target?). By comparing this actual next PC with  $a_2$ , we know whether the successor to  $i_1$  was predicted correctly or not.

If we find that the prediction was correct, there is nothing more to be done; we allow the pipeline flow to proceed.

If we find that the prediction was wrong, then two things must happen:

- We need to *redirect* the Fetch stage to start fetching from the correct next-PC. This involves sending a message from the Retire stage back to the Fetch stage containing the correct next-PC. Suppose the first instruction fetched after this redirection is  $j_1$ .
- Instruction  $i_2$ , and possibly following instructions  $i_3, i_4, \dots$  until  $j_1$  are wrong-path instructions, and must be flushed.

When the Retire stage starts flushing wrong-path instructions  $i_2, i_3, i_4, \dots$  how does it know when it has reached the end of the wrong-path sequence? In other words, how does it know when it sees  $j_1$ ? This is precisely the purpose of the `rg_epoch` register shown in Figure 16.1.

Think of `rg_epoch` as a counter that continuously counts upward. Suppose the current value is  $e_1$ . As described above, when the Retire stage recognizes an instruction whose successor has been mispredicted, we send a redirection message to the Fetch stage with the corrected PC. The Retire stage also increments  $e_1$  and sends the incremented value as part of the redirection. Each time the Fetch stage is redirected, it remembers the new epoch value. It also sends this value down the pipeline, accompanying each instruction fetched with this value.

Now, flushing wrong-path instructions in the Retire stage is easy: $i_2, i_3, i_4, \dots$ will be accompanied by the old epoch value $e_1$, whereas the first correct-path instruction $j_1$ will be accompanied by the new epoch value $e_1 + 1$. Thus, the Retire stage knows exactly which instructions are wrong-path and it can discard them.
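The epoch mechanism can be sketched as a single Retire rule. This is a simplified illustration, not Fife's actual code: the struct types `To_Retire` and `Redirect`, their fields, the module name and the epoch width are all hypothetical, and we assume that the correct next PC is already available in the incoming message.

```bsv
package Epoch_Sketch;

import FIFO :: *;

typedef Bit #(8) Epoch;   // width chosen arbitrarily for this sketch

// Hypothetical message arriving at the Retire stage
typedef struct {
   Bit #(32) predicted_next_pc;   // the prediction made in Fetch
   Bit #(32) actual_next_pc;      // assumed known by Retire time
   Epoch     epoch;               // epoch under which it was fetched
} To_Retire
deriving (Bits, FShow);

// Hypothetical redirection message from Retire back to Fetch
typedef struct {
   Bit #(32) next_pc;
   Epoch     epoch;
} Redirect
deriving (Bits, FShow);

module mkRetireSketch #(FIFO #(To_Retire) f_to_retire,
                        FIFO #(Redirect)  f_redirect) (Empty);
   Reg #(Epoch) rg_epoch <- mkReg (0);

   rule rl_retire;
      let x = f_to_retire.first;
      f_to_retire.deq;

      Bool wrong_path   = (x.epoch != rg_epoch);
      Bool mispredicted = (x.predicted_next_pc != x.actual_next_pc);

      if (wrong_path)
         // Stale epoch: discard without any side-effects
         $display ("discard: ", fshow (x));
      else begin
         // ... commit the instruction's side-effects here (omitted) ...
         if (mispredicted) begin
            // Start a new epoch and tell Fetch where to resume
            Epoch new_epoch = rg_epoch + 1;
            rg_epoch <= new_epoch;
            f_redirect.enq (Redirect {next_pc: x.actual_next_pc,
                                      epoch:   new_epoch});
         end
      end
   endrule
endmodule

endpackage
```

The Fetch side (not shown) would remember the epoch received in each `Redirect` and attach it to every instruction it subsequently fetches.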

---

#### **Exercise 16.1:**

We have described `rg_epoch` as a counter that is incremented on each recognition of a misprediction. If the register contents have type `Bit #(n)`, then this will wrap around to 0 after $2^n$ increments. Is this a problem?

#### **Exercise 16.2:**

If `rg_epoch` contains a `Bit #(n)` value, how small can  $n$  be?

□

---

### **16.2.3 Terminology: Speculative Instructions and Commits**

Before an instruction has reached the Retire stage, there is always the possibility that an earlier instruction that is ahead of it in the pipeline will branch/jump to a PC that was not predicted, or will trap, making this instruction irrelevant. Until this moment, we say that the instruction is still “*speculative*”. When it reaches the Retire stage, we say that its side-effects can be “*committed*”.

### **16.2.4 Speculative instructions should not have any side-effects**

It is not enough for the Retire stage just to discard mispredicted instructions. Instructions have side-effects: they may modify registers and write to memory. We must ensure that speculative instructions make no modifications that are visible to right-path instructions that follow, until they reach the Retire stage. The details of how this is accomplished will be seen in Section 16.5 and Section 16.6.

## **16.3 Managing Register Read/Write Hazards with a Scoreboard**

Suppose instruction $i_1$ writes to register $x_7$, and the following instruction $i_2$ reads from register $x_7$. Instruction $i_1$’s write to $x_7$ only happens in the Retire stage. If $i_2$ were to follow immediately behind $i_1$, it would be in the Exec stage at that point, and would have already read $x_7$ earlier, when it was in the Register-Read stage. In other words, it would have read a *stale* or obsolete value of $x_7$. This is called a Read-Write *hazard*, or a read-after-write *dependency*.

In this situation, we need to make  $i_2$  wait in the Register-Read stage until  $i_1$  has completed its update of  $x_7$ . This is precisely the purpose of the **scoreboard** shown in Figure 16.1.

The **scoreboard** is an array of 32 1-bit registers (one bit for each GPR). When an instruction (such as  $i_1$ ) passes through the Register-Read stage, if it writes to register  $x_7$ , we set the corresponding bit 7 in the scoreboard to 1, indicating that  $x_7$  is “busy”. When  $i_1$  reaches the Retire stage and writes to the register, it also resets the scoreboard bit 7 to 0, indicating that  $x_7$  is “not busy”.

When an instruction (such as  $i_2$ ) reaches the Register-Read stage and wants to read a register (such as  $x_7$ ), if the corresponding scoreboard bit says it is busy, then the Register-Read stage “*stalls*”, *i.e.*, it waits until the scoreboard condition is cleared (by  $i_1$  in the Retire stage).

While  $i_2$  is waiting in the stalled Register-Read stage, note that the following Execute stage may become “empty”, *i.e.*, there is no instruction occupying that stage. We refer to this as a “*pipeline bubble*”.
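As a rough illustration, the scoreboard protocol described above can be modeled in a few lines of Python. This is a behavioral sketch only, not the BSV scoreboard module; the class and method names (`Scoreboard`, `can_issue`, `reserve`, `release`) are invented here.

```python
class Scoreboard:
    """One 'busy' bit per GPR (x0..x31)."""

    def __init__(self):
        self.busy = [False] * 32

    def can_issue(self, regs):
        """True if none of the registers the instruction needs is busy."""
        return not any(self.busy[r] for r in regs)

    def reserve(self, rd):
        """Register-Read stage: mark the destination register busy."""
        self.busy[rd] = True

    def release(self, rd):
        """Retire stage: mark the destination register not busy."""
        self.busy[rd] = False


def scoreboard_demo():
    sb = Scoreboard()
    log = []
    sb.reserve(7)                      # i1 (writes x7) passes Register-Read
    log.append(sb.can_issue([7]))      # i2 (reads x7) must stall
    log.append(sb.can_issue([5, 6]))   # an instruction not using x7 may proceed
    sb.release(7)                      # i1 reaches Retire and writes x7
    log.append(sb.can_issue([7]))      # i2 may now proceed
    return log

print(scoreboard_demo())   # [False, True, True]
```

While `can_issue` returns `False` for $i_2$, nothing enters the Execute stage behind $i_1$; that gap is the pipeline bubble.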

---

#### **Exercise 16.3:**

For two consecutive instructions  $i_1$  and  $i_2$ ,

- $i_1$  may want to write register  $x_7$  and  $i_2$  may want to read  $x_7$ ,
- $i_1$  may want to write register  $x_7$  and  $i_2$  may want to write  $x_7$,
- $i_1$  may want to read register  $x_7$  and  $i_2$  may want to read  $x_7$ ,
- $i_1$  may want to read register  $x_7$  and  $i_2$  may want to write  $x_7$ .

Above, we motivated scoreboards with the first scenario. What about the other three scenarios?

#### **Exercise 16.4:**

Write-write hazards can be treated just like read-after-write hazards. Alternatively, the 1-bit entry in the scoreboard for a register (say $x_7$) can be generalized into an $n$-bit up/down counter, indicating the number of instructions that have been allowed into Execute pipelines that intend to write $x_7$. The Retire stage decrements this counter; the Register-Read stage stalls an instruction if these counters (for its input registers) are non-zero; and the Register-Read stage increments this counter for an instruction’s destination register.

Implement a scoreboard module with this scheme. What should happen if a counter reaches its maximum value? How many bits should each counter have?

#### **Exercise 16.5:**

In the counter-based scoreboard of the previous exercise, if there are multiple instructions in the Execute stages that intend to write  $x_7$ , in what order can those writes occur? What would be the consequences of a wrong order?

*Hint:* The answer is in Section 16.4.



### 16.3.1 Releasing Scoreboard Reservations for Uncommitted Instructions

When the Retire stage discards an instruction due to mis-speculation, that instruction may be one which normally would have written a result value into register $x_j$ in the register file. If so, when it passed through the Register-Read-and-Dispatch stage, it would have marked register $x_j$ as “busy” in the scoreboard. Now, when we discard the instruction, even though we do not write any value into register $x_j$, we still need to release the “busy” reservation on $x_j$ (otherwise, any succeeding instruction trying to read $x_j$ will get stuck in the Register-Read stage). Said another way, the reservation in the scoreboard is a side-effect of this instruction that needs to be rolled back.

### 16.3.2 Bypassing

Digital hardware usually runs in time units of “clock cycles”. The Retire stage writes a GPR (possibly) and writes the scoreboard (to mark it “not-busy”). The Register-Read stage reads zero to two GPRs, reads the scoreboard (to check “not-busy”) and writes the scoreboard (to set “busy”).

For ordinary registers a write is only visible on the next clock cycle. Thus, if the scoreboard is just an ordinary register, the Register-Read stage cannot observe “not-busy” until one clock after the Retire stage has marked “not-busy”. This does not affect correctness, but slows the performance of the CPU.

It is possible to design some extra circuitry around the scoreboard so that the Register-Read stage can observe “not-busy” on the *same* clock cycle as when Retire marks it “not-busy”. This technique is generically called “*bypassing*” or “*short-circuiting*”.
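The one-clock difference can be seen in a small Python timing model. This is a sketch under stated assumptions, not a hardware description: `ClockedBit` and `first_cycle_not_busy` are hypothetical names, and we assume that, within a clock cycle, Retire's write is logically ordered before Register-Read's read.

```python
class ClockedBit:
    """A 1-bit register: a write becomes visible at the next clock edge.
    With bypass=True, a read in the same cycle sees the pending write."""

    def __init__(self, init, bypass=False):
        self.value = init
        self.pending = None
        self.bypass = bypass

    def write(self, v):
        self.pending = v

    def read(self):
        if self.bypass and self.pending is not None:
            return self.pending          # short-circuit around the register
        return self.value

    def tick(self):
        """Clock edge: commit any pending write."""
        if self.pending is not None:
            self.value = self.pending
            self.pending = None


def first_cycle_not_busy(bypass):
    """Retire clears the busy bit in cycle 0. Return the first cycle in
    which the Register-Read stage observes 'not busy'."""
    busy = ClockedBit(True, bypass=bypass)
    for cycle in range(3):
        if cycle == 0:
            busy.write(False)            # Retire marks "not busy"
        if not busy.read():              # Register-Read checks the scoreboard
            return cycle
        busy.tick()
    return None

print(first_cycle_not_busy(bypass=False))   # 1
print(first_cycle_not_busy(bypass=True))    # 0
```

The bypassed version saves one cycle per stall, at the cost of a combinational path from the writer to the reader.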

#### **Exercise 16.6:**

Implement a scoreboard unit with bypassing/short-circuiting as described in the above note.

*Hint:* Needs BSV “Concurrent Registers” (advanced topic!)

#### **Exercise 16.7:**

What are the implications of bypassing/short-circuiting on the length of combinational paths in a design, and the consequent effect on achievable clock frequency?

□

An even more advanced form of bypassing (with much more circuit complexity) would be:

- Eliminate the scoreboard; do not stall an instruction in the Register-Read stage, but allow it to move into its appropriate Execute stage, and stall it there if necessary. This frees up the Register-Read stage to process the next instruction, which may move into a different Execute stage.
- When Retire writes a register value, broadcast it to the different Execute stages to enable instructions there that are stalled on this register value.

## 16.4 Retiring outputs of the Execute Stages in Order with Tags

In Figure 16.1, each yellow box in the Execute stage is an independent pipeline handling a certain subset of the instruction set. For example, “Execute Control” handles BRANCH, JAL and JALR instructions. “Execute Integer Arithmetic and Logic Ops” handles LUI, AUIPC, and all arithmetic and logic instructions. “Exec Mem Op” handles LOAD and STORE instructions. If we extend Fife to handle the “M” ISA extension, we would have a pipeline for integer multiply and divide instructions. If we extend Fife to handle the “F” and “D” ISA extensions, we would have a pipeline for floating point arithmetic. The Register-Read-and-Dispatch stage sends information into these pipes depending on the kind of instruction.

Instructions may have different latencies in traversing these Execute pipes. For example, Control and Integer ops may typically traverse in one clock, but multiplication, division, floating point and memory ops may take more clocks. The latency variation may be data dependent: for example multiplication/division may recognize the special case where an operand is 0 or 1 and return a result quickly. A memory op may return quickly on a cache hit, and take more time on a cache miss.

The Retire stage needs to gather the outputs from the Execute stages and retire them in the proper order. But, because of varying latency, availability of data is not an indication of the proper order.

The solution to this “ordering” problem is *tags*. In Figure 16.1 we see there is also a *direct path* from Register-Read-and-Dispatch to Retire. We pass a tag on this path *for every instruction*. For example if the instruction is a BRANCH instruction, the Register-Read-and-Dispatch stage sends information into Execute Control, but it also sends a tag EXEC\_TAG\_CONTROL on the direct path to Retire, indicating that it has just dispatched an instruction into Execute Control.

Thus, the sequence of tags on the direct path tells Retire exactly the order in which to service the various Execute pipes. Retire always looks at the tag on the direct path first. For example, if Retire sees an EXEC\_TAG\_DMEM tag on the direct path, it knows that it must next look for an output from the Exec Memory Ops pipe, even if outputs are already available on the Execute Control and/or Execute Integer pipes from later instructions.
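A small Python model may help make the tag mechanism concrete. It is only a sketch: `retire_in_order` is a made-up name, the tag `EXEC_TAG_INT` is assumed by analogy with the two tags named above, and each Execute pipe is modeled as a simple FIFO whose results arrive in a caller-supplied order.

```python
from collections import deque

def retire_in_order(dispatched, arrival_order):
    """
    dispatched:    list of (inum, tag) in program order (the direct path).
    arrival_order: inums in the order their results leave the Execute pipes.
    Returns the inums in the order in which Retire processes them.
    """
    tag_of = dict(dispatched)
    direct_path = deque(tag for _, tag in dispatched)
    pipes = {tag: deque() for tag in set(tag_of.values())}  # one output FIFO per pipe
    arrivals = deque(arrival_order)
    retired = []
    while direct_path:
        tag = direct_path[0]                  # always consult the direct path first
        if pipes[tag]:
            retired.append(pipes[tag].popleft())
            direct_path.popleft()
        else:
            # The pipe named by the tag has no output yet: Retire waits,
            # and meanwhile another result arrives at some pipe's output.
            inum = arrivals.popleft()
            pipes[tag_of[inum]].append(inum)
    return retired

# Instruction 1 is a slow LOAD; instructions 2..4 finish before it.
dispatched = [(1, "EXEC_TAG_DMEM"), (2, "EXEC_TAG_INT"),
              (3, "EXEC_TAG_CONTROL"), (4, "EXEC_TAG_INT")]
print(retire_in_order(dispatched, arrival_order=[2, 3, 4, 1]))   # [1, 2, 3, 4]
```

Even though the results of instructions 2, 3 and 4 are available first, Retire holds them until the LOAD named by the head tag has produced its output.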

## 16.5 Allowing Memory Ops to be Pipelined, with a Store Buffer

Consider the Execute Memory Ops stage in Figure 16.1, which issues LOAD and STORE requests to memory and collects their responses. At this stage, the instruction is still speculative; it may be discarded when it reaches the Retire stage. We must ensure that STORE instructions do not yet modify memory permanently.

The mechanism for this purpose is the “*store buffer*” shown in Figure 16.1. This is a buffer in front of the memory system (between the CPU and the memory system).

- The store buffer is itself a queue.
- When we execute a STORE instruction, the address, data and size are appended to the store-buffer queue. When the instruction finally reaches the Retire stage, the Retire stage sends a final “commit/discard” message to the store-buffer. If a commit, the STORE at the head of the store-buffer is committed to memory and we dequeue it from the store-buffer. If a discard, we just dequeue it from the store-buffer.

- When we execute a LOAD instruction, we first check the store-buffer for any recent updates to the address in question, and then go to the memory system behind it if necessary.
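The store-buffer behavior listed above can be sketched as a Python model. This is a simplification under stated assumptions, not the Fife store buffer: the names `StoreBuffer`, `store`, `load` and `finalize` are invented, memory is a dictionary keyed by address, access sizes are ignored, and a LOAD takes its value from the youngest matching pending STORE.

```python
from collections import deque

class StoreBuffer:
    def __init__(self, memory):
        self.memory = memory        # dict: address -> value (the memory system)
        self.pending = deque()      # speculative STOREs, oldest first

    def store(self, addr, data):
        """Speculative STORE: buffer it; memory is not modified yet."""
        self.pending.append((addr, data))

    def load(self, addr):
        """Speculative LOAD: check pending STOREs first, then memory."""
        for a, d in reversed(self.pending):      # youngest matching STORE wins
            if a == addr:
                return d
        return self.memory.get(addr, 0)

    def finalize(self, commit):
        """Commit/discard message from Retire, applied to the head entry."""
        addr, data = self.pending.popleft()
        if commit:
            self.memory[addr] = data


mem = {0x100: 1}
sbuf = StoreBuffer(mem)
sbuf.store(0x100, 2)
print(sbuf.load(0x100), mem[0x100])   # 2 1  (visible to LOADs, not yet in memory)
sbuf.finalize(commit=False)           # the STORE was on the wrong path
print(sbuf.load(0x100), mem[0x100])   # 1 1
sbuf.store(0x100, 3)
sbuf.finalize(commit=True)            # the STORE was on the correct path
print(mem[0x100])                     # 3
```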

### 16.5.1 What about LOADs and STOREs to non-memory-like devices (MMIO)?

In RISC-V there are no separate instructions for input and output to devices. Devices contain “device registers” which are placed at particular “memory addresses” and are accessed from the CPU just like memory, with LOAD and STORE instructions. We say that the device registers are “mapped” to those addresses. These accesses can control and configure the device, and move data between the CPU and the device. Such a scheme is known as MMIO, for Memory-Mapped Input-Output.

Device registers, although addressed like memory locations, may behave quite differently from memory, in several ways:

- LOADs may have side-effects. For example, a LOAD from a memory location does not (observably) disturb anything, but a LOAD from a device could switch on an LED, or increment a counter.
- LOADs may not be idempotent. For example, two successive LOADs from a memory location return the same value, whereas two successive LOADs from, say, a UART’s “receiver buffer register” may return two successive (and different) characters from a keyboard.
- STOREs may have additional side-effects. A STORE to a memory location merely stores the value there. A STORE to a device register may display it on a screen, start a motor, release a wheel brake, and so on.
- In memory, a LOAD returns the value stored there by the most recent STORE. For a UART device, on the other hand, a STORE may display a character on a screen whereas a subsequent LOAD from the same address may return a character from the keyboard.

For these reasons, it is dangerous to perform any LOAD or STORE speculatively on non-memory devices. The “Execute Memory Ops” stage does not even attempt the access; it simply defers such a request for future execution by the Retire stage. The decision whether to defer a request or not is based on the address.

Once the LOAD/STORE instruction has reached the Retire stage and we know for sure that it is not speculative, the Retire stage performs the memory operation: it sends the request to memory, collects the response, and retires the instruction.
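The address-based decision can be sketched as follows, in Python. The address map here (`MEM_BASE`, `MEM_SIZE`) and the function names are purely hypothetical; a real system would use its own memory map.

```python
# Hypothetical address map, for illustration only.
MEM_BASE = 0x8000_0000
MEM_SIZE = 0x1000_0000

def is_memory_like(addr):
    """True if the address falls in ordinary (side-effect-free) memory."""
    return MEM_BASE <= addr < MEM_BASE + MEM_SIZE

def exec_mem_op(addr):
    """Decision taken in the Execute Memory Ops stage."""
    if is_memory_like(addr):
        return "speculative"   # issue now; STOREs go via the store-buffer
    return "deferred"          # MMIO: Retire performs it once non-speculative

print(exec_mem_op(0x8000_1000))   # speculative
print(exec_mem_op(0x1000_0000))   # deferred
```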

#### **Exercise 16.8:**

When the Retire stage sends a “commit/discard” message, how do we know that it applies to the pending STORE at the head of the store-buffer queue?

#### **Exercise 16.9:**

For a speculative LOAD, its value may come from memory or from the store-buffer. In what order should it scan these sources?

#### **Exercise 16.10:**

Does the Retire stage need to send both “commit” and “discard” messages? If requests are accompanied by an instruction-number, the Retire stage could send only “commit” messages, each accompanied by the instruction-number. The store-buffer can then discard all pending STOREs with earlier instruction numbers.

Discuss the implications of such a design on the size of the store-buffer.

□

## 16.6 The Retire Stage

Figure 16.2 summarizes the actions to be taken by Fife’s Retire stage.



Figure 16.2: Retire actions in Fife

Much of this is the same as in Figure 11.2 for Drum. The new additional details are:

- The “update PC” operation now sends a message to redirect Fetch *only* if the successor to the current PC had been mispredicted. Recall that we carry the predicted PC value along with the instruction through the pipe, so here we can compare the now-known actual next PC with the predicted PC.

If we redirect, we increment the epoch number and send the new epoch number along with the redirection. Any subsequent instructions with the old epoch number are “wrong path” instructions and must be discarded.

- Wrong path: any of the Execute pipes can produce a wrong-path instruction (accompanying epoch does not match current epoch). In each such case, we discard the instruction, but there are two more needed actions:
  - If the wrong-path instruction has an Rd, we would have taken a scoreboard reservation for it in the Register-Read stage; we must now release that scoreboard reservation. We extend `update_rd` to perform a scoreboard-release without a register-write.
  - In the case of a non-deferred STORE instruction, we would have placed an entry in the store-buffer. We use “finalize store buffer” to discard that entry.
- Correct path non-deferred STORE instruction: we would have placed an entry in the store-buffer. We use “finalize store buffer” to commit that entry to memory.
- “Deferred” memory request from DMem (because it was for a non-memory-like address such as MMIO): If so, the second Exec Mem Ops box in the figure now performs the DMem operation.
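The additional Retire actions listed above can be summarized in a Python sketch. It is a simplified decision function, not the contents of Figure 16.2: traps and exceptions are omitted, the function name `retire_actions` and the dictionary field names are invented, and the actions are returned as descriptive strings rather than performed.

```python
def retire_actions(instr, cur_epoch):
    """Return (actions, new_epoch) for one instruction arriving at Retire.
    instr is a dict with keys: epoch, has_rd, is_store, deferred,
    pred_pc (predicted next PC) and next_pc (actual next PC)."""
    actions = []

    if instr["epoch"] != cur_epoch:                      # wrong path: discard
        if instr["has_rd"]:
            actions.append("release scoreboard (no register write)")
        if instr["is_store"] and not instr["deferred"]:
            actions.append("finalize store buffer: discard")
        return actions, cur_epoch

    # Correct path
    if instr["deferred"]:
        actions.append("perform deferred DMem op")       # e.g. MMIO access
    elif instr["is_store"]:
        actions.append("finalize store buffer: commit")
    if instr["has_rd"]:
        actions.append("write rd and release scoreboard")
    if instr["next_pc"] != instr["pred_pc"]:             # misprediction
        cur_epoch += 1
        actions.append("redirect Fetch with new epoch")
    return actions, cur_epoch


branch = dict(epoch=1, has_rd=False, is_store=False, deferred=False,
              pred_pc=0x104, next_pc=0x200)
print(retire_actions(branch, cur_epoch=1))
# (['redirect Fetch with new epoch'], 2)

stale_store = dict(epoch=1, has_rd=False, is_store=True, deferred=False,
                   pred_pc=0x10C, next_pc=0x10C)
print(retire_actions(stale_store, cur_epoch=2))
# (['finalize store buffer: discard'], 2)
```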

# Chapter 17

## RISC-V: the Fife pipelined CPU code

## 17.1 Introduction

In this chapter we study BSV code to implement the principles that were discussed in Chapter 16. We repeat Figure 16.1 here, for reference.



Figure 17.1: Pipelined interpretation of RISC-V instructions (Fig. 3.1 with some annotations)

## 17.2 The Fife top-level CPU module

The code for the top-level Fife CPU module is actually simpler than the code for the Drum CPU module, because it simply instantiates sub-modules for each stage and connects them:

```
// src_Fife/CPU.bsv: line 45 ...
(* synthesize *)
module mkCPU (CPU_IFC);
   // =====
   // STATE
   ...
   Fetch_IFC      stage_F          <- mkFetch;
   Decode_IFC     stage_D          <- mkDecode;
   RR_RW_IFC      stage_RR_RW      <- mkRR_RW;
   EX_Control_IFC stage_EX_Control <- mkEX_Control;    // Branch, JAL, JALR
   EX_Int_IFC     stage_EX_Int     <- mkEX_Int;        // Integer ops
   Retire_IFC     stage_Retire     <- mkRetire;

   // -----
   // Forward flow connections

   // Fetch->Decode->RR-Dispatch, and direct path RR-Dispatch->Retire
   mkConnection (stage_F.fo_Fetch_to_Decode,  stage_D.fi_Fetch_to_Decode);
   mkConnection (stage_D.fo_Decode_to_RR,     stage_RR_RW.fi_Decode_to_RR);
   mkConnection (stage_RR_RW.fo_RR_to_Retire, stage_Retire.fi_RR_to_Retire);

   // RR-Dispatch->various EX
   mkConnection (stage_RR_RW.fo_RR_to_EX_Control,
                 stage_EX_Control.fi_RR_to_EX_Control);
   mkConnection (stage_RR_RW.fo_RR_to_EX_Int,
                 stage_EX_Int.fi_RR_to_EX_Int);

   // Various EX->Retire
   mkConnection (stage_EX_Control.fo_EX_Control_to_Retire,
                 stage_Retire.fi_EX_Control_to_Retire);
   mkConnection (stage_EX_Int.fo_EX_Int_to_Retire,
                 stage_Retire.fi_EX_Int_to_Retire);

   // -----
   // Backward flow connections

   // Fetch<-Retire (redirection)
   mkConnection (stage_Retire.fo_Fetch_from_Retire, stage_F.fi_Fetch_from_Retire);
   // RR-Dispatch<-Retire (register writeback)
   mkConnection (stage_Retire.fo_RW_from_Retire, stage_RR_RW.fi_RW_from_Retire);

   // =====
   // BEHAVIOR: all behavior is inside the above modules
   // =====
   // INTERFACE

   method Action init (Initial_Params initial_params);
      ...
      stage_F.init (initial_params);
      stage_D.init (initial_params);
      stage_RR_RW.init (initial_params);
      stage_EX_Control.init (initial_params);
      stage_EX_Int.init (initial_params);
      stage_Retire.init (initial_params);
   endmethod

   // IMem
   interface fo_IMem_req = stage_F.fo_Fetch_to_IMem;
   interface fi_IMem_rsp = stage_D.fi_IMem_to_Decode;

   // DMem, speculative
   interface fo_DMem_S_req    = stage_RR_RW.fo_DMem_S_req;
   interface fi_DMem_S_rsp    = stage_Retire.fi_DMem_S_rsp;
   interface fo_DMem_S_commit = stage_Retire.fo_DMem_S_commit;

   // DMem, non-speculative
   interface fo_DMem_req = stage_Retire.fo_DMem_req;
   interface fi_DMem_rsp = stage_Retire.fi_DMem_rsp;

   // Set TIME
   method Action set_TIME (Bit #(64) t) = stage_Retire.set_TIME (t);
endmodule
```

This is practically a direct textual description of Figure 17.1. The STATE section first instantiates the pipeline stages shown in the figure. There is no explicit module corresponding to Execute Memory Ops—the DMem request is sent out directly from `stage_RR_RW` and the DMem response is collected directly by `stage_Retire`.

The STATE section then instantiates the “forward-flow” connections between modules (left to right in the figure), using the `bsc` library module `mkConnection` to connect a `FIFOF_O` interface (producer) to a `FIFOF_I` interface (consumer), which was discussed in Section 8.5.6. These module connections are in the STATE section because, as discussed in Section 8.5.6, `mkConnection` is just a module instantiation.

The next few lines instantiate the “backward-flow” connections.

In the INTERFACE section, after the `init` method, the next two lines are the flows of IMem requests from the Fetch stage to memory and IMem responses from memory to the Decode stage. These just lift interfaces from `stage_F` and `stage_D` to the CPU interface, as is.

The next three lines are for *speculative* DMem access, which we discussed in Section 16.5: the flow of DMem requests from `stage_RR_RW` to memory, the flow of DMem responses from memory to `stage_Retire`, and the flow of “commit/discard” messages from `stage_Retire` to the store-buffer to discharge STOREs that are waiting in the store-buffer.

The last two lines are for *non-speculative* DMem access, which we discussed in Section 16.5.1.

Note that the module interface `CPU_IFC` is exactly the same as in Drum (although Drum has no need for, and does not use, the `DMem_S` speculative interfaces). Thus, in a system context, we can directly substitute Drum for Fife and vice versa. Generalizing this idea, we can develop other CPUs and substitute them as well.

We next go through the code for the individual stage modules.

## 17.3 How we connect stages

Each of the FIFO-labeled black arrows in Figure 17.1 is a FIFO-like connection between stages. The general scheme we use is illustrated in Figure 17.2.



Figure 17.2: How we connect Fife stages

In the upstream (producer) stage we instantiate a `mkPipelineFIFO`. The rules in the stage enqueue outgoing information into the `FIFO_I` side of the FIFO using the `enq` method, and we lift the `FIFO_O` side to the stage-module interface.

Conversely, in the downstream (consumer) stage we instantiate a `mkBypassFIFO`. The rules in the stage consume incoming information from the `FIFO_O` side of the FIFO using the `first` and `deq` methods, and we lift the `FIFO_I` side to the stage-module interface.

Finally, in the parent module (`mkCPU`), we use `mkConnection` to connect the exposed `FIFO_O` and `FIFO_I` ends of the FIFOs.

PipelineFIFOs, BypassFIFOs and this way of connecting stages are discussed in great detail in Section 18.5. Suffice it to say, here, that:

- Despite there being two FIFOs, data can traverse from producer to consumer in 1 tick, as desired.
- The structure allows the producer and consumer to be compiled independently by *bsc*, with no “rule-scheduling” constraints leaking across stage boundaries.
- There are no combinational paths crossing the stage boundary (through the two FIFOs).
- The structure allows us to reason about (and prove) correctness of each stage completely independently of other stages.

All of these are pleasing “modularity” properties of the design.

## 17.4 The Fetch stage

The Fetch stage module interface is shown below. Apart from the `init` method, the remaining sub-interfaces correspond to the FIFO-labeled arrows in Figure 17.1: the outgoing `FIFOF_O` interfaces to memory (IMem) and the Decode stage, and the incoming `FIFOF_I` for redirections from the Retire stage.

```
// src_Fife/S1_Fetch.bsv: line 33 ...
interface Fetch_IFC;
   method Action init (Initial_Params initial_params);

   // Forward out
   interface FIFOF_O #(Fetch_to_Decode) fo_Fetch_to_Decode;
   interface FIFOF_O #(Mem_Req)         fo_Fetch_to_IMem;

   // Backward in
   interface FIFOF_I #(Fetch_from_Retire) fi_Fetch_from_Retire;
endinterface
```

The Fetch stage module code is shown below.

```
// src_Fife/S1_Fetch.bsv: line 47 ...
(* synthesize *)
module mkFetch (Fetch_IFC);
   // -----
   // STATE
   Reg #(File) rg_flog    <- mkReg (InvalidFile);      // Debugging
   Reg #(Bool) rg_running <- mkReg (False);

   // Forward out
   FIFOF #(Fetch_to_Decode) f_Fetch_to_Decode <- mkBypassFIFOF;
   FIFOF #(Mem_Req)         f_Fetch_to_IMem   <- mkBypassFIFOF;

   // Backward in
   FIFOF #(Fetch_from_Retire) f_Fetch_from_Retire <- mkPipelineFIFOF;

   // inum, PC and epoch registers
   Reg #(Bit #(64))      rg_inum  <- mkReg (0);
   Reg #(Bit #(XLEN))    rg_pc    <- mkReg (0);
   Reg #(Bit #(W_Epoch)) rg_epoch <- mkReg (0);
   ...

   // -----
   // BEHAVIOR

   // Forward flow
   rule rl_Fetch_req (rg_running
                      && (! f_Fetch_from_Retire.notEmpty)
                      ...
      // Predict next PC
      let pred_pc = rg_pc + 4;

      let y <- fn_Fetch (rg_pc, pred_pc, rg_epoch, rg_inum, rg_flog);
      f_Fetch_to_Decode.enq (y.to_D);
      f_Fetch_to_IMem.enq (y.mem_req);

      rg_pc   <= pred_pc;
      rg_inum <= rg_inum + 1;
      ...
   endrule

   // Backward flow: redirection from Retire
   rule rl_Fetch_from_Retire ((! rg_oiaat) || (! rg_oiaat_fetch));
      let x <- pop_o (to_FIFOF_O (f_Fetch_from_Retire));
      rg_pc    <= x.next_pc;
      rg_epoch <= x.next_epoch;
      ...
   endrule

   // -----
   // INTERFACE

   method Action init (Initial_Params initial_params) if (! rg_running);
      rg_flog    <= initial_params.flog;
      rg_pc      <= initial_params.pc_reset_value;
      rg_running <= True;
   endmethod

   // Forward out
   interface fo_Fetch_to_Decode = to_FIFOF_O (f_Fetch_to_Decode);
   interface fo_Fetch_to_IMem   = to_FIFOF_O (f_Fetch_to_IMem);

   // Backward in
   interface fi_Fetch_from_Retire = to_FIFOF_I (f_Fetch_from_Retire);
endmodule
```

The STATE section instantiates various registers and FIFOs. Next, the BEHAVIOR section contains two *rules* (Chapter 14). In rule `rl_Fetch_req`, the explicit condition is the expression:

```
(rg_running && (! f_Fetch_from_Retire.notEmpty))
```

Rule `rl_Fetch_req`'s implicit condition comes from the methods that it invokes, `f_Fetch_to_Decode.enq ()` and `f_Fetch_to_IMem.enq ()`. The rule will fire only when the explicit condition is true, and when both FIFOs are enabled to enqueue (have space). When it fires, it performs the composite action that comprises all the actions in the rule body:

- enqueue the value of `y.to_D` into FIFO `f_Fetch_to_Decode`;
- enqueue the value of `y.mem_req` into FIFO `f_Fetch_to_IMem`;
- write the predicted PC value `pred_pc` into register `rg_pc`, and
- write the value of `rg_inum+1` into the register `rg_inum`.

The right-hand side expressions (`rg_pc+4`, `fn_Fetch(...)` and `rg_inum+1`) are all combinational circuits, and “y” is not a register, just a name for the wires carrying the output value of `fn_Fetch()`.

**NOTE:** The function `fn_Fetch()` is exactly the same as the one used in the Fetch step of Drum (described in Section 7.2).

The types of the messages passed to the Decode stage (`y.to_D` of type `Fetch_to_Decode`) and to memory (`y.mem_req` of type `Mem_Req`) are the same as in Drum.

All these actions are semantically *instantaneous* and *simultaneous*. Note that when the rule's implicit and explicit conditions are true, all the actions are performed; if false, none of them are performed, *i.e.*, the rule is “atomic”.

In summary, rule `rl_Fetch_req` computes an IMem memory request from the PC and sends it to memory; it sends auxiliary information to the Decode stage; and it updates the PC and inum in preparation for the next Fetch (the next firing of the rule).
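The all-or-nothing firing behavior of `rl_Fetch_req` can be mimicked in Python. This is only a behavioral sketch, not generated from the BSV: `BoundedFifo` and `FetchModel` are invented names, the FIFO capacities of 1 are arbitrary, and the tuple enqueued toward Decode merely stands in for the (unspecified here) output of `fn_Fetch()`.

```python
from collections import deque

class BoundedFifo:
    def __init__(self, capacity):
        self.items, self.capacity = deque(), capacity

    def not_full(self):
        return len(self.items) < self.capacity

    def not_empty(self):
        return len(self.items) > 0

    def enq(self, x):
        assert self.not_full()      # enq's implicit condition
        self.items.append(x)


class FetchModel:
    def __init__(self, pc):
        self.running, self.pc, self.inum, self.epoch = True, pc, 0, 0
        self.to_decode = BoundedFifo(1)
        self.to_imem = BoundedFifo(1)
        self.from_retire = BoundedFifo(1)

    def rl_fetch_req(self):
        """Attempt to fire the rule; return True iff it fired."""
        explicit = self.running and not self.from_retire.not_empty()
        implicit = self.to_decode.not_full() and self.to_imem.not_full()
        if not (explicit and implicit):
            return False            # rule does not fire: no state changes at all
        pred_pc = self.pc + 4
        self.to_decode.enq((self.inum, self.pc, pred_pc, self.epoch))
        self.to_imem.enq(self.pc)
        self.pc = pred_pc
        self.inum += 1
        return True


f = FetchModel(pc=0x8000_0000)
print(f.rl_fetch_req(), hex(f.pc), f.inum)   # True 0x80000004 1
print(f.rl_fetch_req(), hex(f.pc), f.inum)   # False 0x80000004 1  (both FIFOs full)
f.to_decode.items.popleft()                  # Decode drains only one FIFO
print(f.rl_fetch_req(), hex(f.pc), f.inum)   # False 0x80000004 1  (IMem FIFO still full)
```

In the last step one `enq` would have been possible, but the rule still does nothing, because a rule never performs only part of its actions.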

The second rule in the module is `rl_Fetch_from_Retire`. It receives, in `x`, a redirection message from the Retire stage, and updates the PC and epoch accordingly. This rule has no explicit conditions; its single implicit condition comes from `f_Fetch_from_Retire`'s implicit condition that we cannot pop a value from the FIFO until it is non-empty, *i.e.*, this rule only fires when a redirection message is available. When it fires, it performs three actions atomically/instantaneously/simultaneously:

- It dequeues `x` from the FIFO `f_Fetch_from_Retire` (the dequeue action is inside the `pop_o` function),
- It updates `rg_pc` with the new PC in the redirection message,
- It updates `rg_epoch` with the new epoch in the redirection message.

Note that `rl_Fetch_from_Retire` updates two registers, `rg_pc` and `rg_epoch`, and, *concurrently*, `rl_Fetch_req` reads both those registers. Because rule actions are atomic, we are guaranteed that `rl_Fetch_req` will not see inconsistent values in those two registers, where one has been updated but the other has not yet been updated.

Finally, the INTERFACE section of the module is simple. After the `init` method, we simply lift the FIFO interfaces to the `mkFetch` module interface.

### 17.4.1 Prioritizing rule `rl_Fetch_from_Retire` over `rl_Fetch_req`

What would happen if we omit the condition “`(!f_Fetch_from_Retire.notEmpty)`” in `rl_Fetch_req`? It would not affect the correctness of this module at all. If we omit the condition, then both rules could be enabled simultaneously, and it is non-deterministic which one will fire first. Actually, the hardware is deterministic because the `bsc` compiler will make a prioritization choice, but its choice is unspecified, and so semantically it is non-deterministic (*e.g.*, a new release of the compiler, or a later change in other code in the module may prioritize differently). But, for correctness, the choice *does not matter*. Because of the atomicity of rules, the rules will never *interleave* their behavior—either one goes before the other, or vice versa. We can quickly reason that, in either case, we get functionally correct behavior.

The only difference will be in performance: if both rules are enabled, then, by definition, `rl_Fetch_req` is fetching “wrong-path” instructions due to an earlier misprediction (which is why we have a redirection). We know that wrong-path instructions will be discarded and have no effect. So if `rl_Fetch_req` is prioritized, it will fetch one more wrong-path instruction, with no impact on correctness.

By adding the “`(!f_Fetch_from_Retire.notEmpty)`” condition to `rl_Fetch_req`, we are explicitly prioritizing `rl_Fetch_from_Retire` when both are enabled, thus avoiding `rl_Fetch_req` performing a wasted wrong-path fetch.

## 17.5 The Decode stage

The Decode stage module interface is shown below. Apart from the `init` method, the remaining sub-interfaces correspond to the FIFO-labeled arrows in Figure 17.1: the incoming `FIFOF_I` interfaces from the Fetch stage and memory (IMem), and the outgoing `FIFOF_O` interface to the Register-Read stage.

```
// src_Fife/S2_Decode.bsv: line 33 ...
interface Decode_IFC;
   method Action init (Initial_Params initial_params);

   // Forward in
   interface FIFOF_I #(Fetch_to_Decode) fi_Fetch_to_Decode;
   interface FIFOF_I #(Mem_Rsp)         fi_IMem_to_Decode;

   // Forward out
   interface FIFOF_O #(Decode_to_RR) fo_Decode_to_RR;
endinterface
```

The Decode stage module code is shown below.

```
// src_Fife/S2_Decode.bsv: line 46 ...
(* synthesize *)
module mkDecode (Decode_IFC);
   // =====
   // STATE
   Reg #(File) rg_flog <- mkReg (InvalidFile);    // debugging

   // Forward flows in
   // Depth should be > F=>IMem=>D path latency
   FIFOF #(Fetch_to_Decode) f_Fetch_to_Decode <- mkSizedFIFOF (4);
   FIFOF #(Mem_Rsp)         f_IMem_to_Decode  <- mkPipelineFIFOF;

   // Forward flow out
   FIFOF #(Decode_to_RR) f_Decode_to_RR <- mkBypassFIFOF;

   // =====
   // BEHAVIOR

   rule rl_Decode;
      Fetch_to_Decode x <- pop_o (to_FIFOF_O (f_Fetch_to_Decode));
      Mem_Rsp rsp_IMem  <- pop_o (to_FIFOF_O (f_IMem_to_Decode));

      Decode_to_RR y <- fn_Decode (x, rsp_IMem, rg_flog);

      f_Decode_to_RR.enq (y);
      ...

   endrule

   // =====
   // INTERFACE

   method Action init (Initial_Params initial_params);
      rg_flog <= initial_params.flog;
   endmethod

   // Forward flows in
   interface fi_Fetch_to_Decode = to_FIFOF_I (f_Fetch_to_Decode);
   interface fi_IMem_to_Decode  = to_FIFOF_I (f_IMem_to_Decode);
   // Forward flows out
   interface fo_Decode_to_RR = to_FIFOF_O (f_Decode_to_RR);
endmodule
```

The STATE section instantiates FIFOs for incoming and outgoing flows.

The BEHAVIOR section has the single rule `rl_Decode`, whose implicit conditions will make it wait for both incoming FIFOs `f_Fetch_to_Decode` and `f_IMem_to_Decode` to be non-empty, and for its outgoing FIFO `f_Decode_to_RR` to have space. When the rule fires, it:

- pops `x` and `rsp_IMem` from the two FIFOs, respectively;
- applies function `fn_Decode()` to those values (this is the *same* `fn_Decode()` that was used in the Decode step of Drum, and described in Section 7.3), and
- sends the result `y` of type `Decode_to_RR` on to the Register-Read stage.

Note that `fn_Decode`'s `Decode_to_RR` output has a boolean `has_rd` field which is:

- false for those instructions that do not have an rd field;
- true for those instructions that do have an rd field, *and where rd is not zero.*
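The reason for excluding rd = 0 is that register x0 is hardwired to zero in RISC-V, so an instruction naming x0 as its destination needs neither a register write nor a scoreboard reservation. The Python sketch below illustrates the idea on raw 32-bit instruction words. It is not the book's `fn_Decode()`: as a simplifying assumption, only the BRANCH and STORE opcode groups are treated as having no rd field.

```python
OPCODE_BRANCH = 0b1100011
OPCODE_STORE  = 0b0100011

def has_rd(instr):
    """True if the 32-bit instruction writes a destination register other than x0."""
    opcode = instr & 0x7F            # bits [6:0]
    rd     = (instr >> 7) & 0x1F     # bits [11:7]
    if opcode in (OPCODE_BRANCH, OPCODE_STORE):
        return False                 # bits [11:7] hold immediate bits, not an rd
    return rd != 0

print(has_rd(0x00100393))   # addi x7, x0, 1  -> True
print(has_rd(0x00000013))   # addi x0, x0, 0  -> False (rd is x0)
print(has_rd(0x0072A423))   # sw x7, 8(x5)    -> False (no rd field)
```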

The INTERFACE section is again straightforward, just lifting the FIFO interfaces to this module's interface.

### 17.5.1 Balancing concurrent paths in a pipeline

In Figure 17.1 we see that there are two concurrent FIFO-like paths from Fetch to Decode: one direct, and the other via memory (IMem). Fetch produces two outputs together, one for each path. The paths re-converge at Decode, where Fetch's first output “meets” the corresponding response from memory.

The path via IMem is also FIFO-like because good memory systems are themselves pipelined; it is possible for Fetch to issue several consecutive memory requests before the first response arrives at Decode. We say that multiple IMem transactions may be “in flight”. Also, in Section 3.2.1 we discussed how memory latency can be variable and unpredictable, so the number of transactions in flight may vary.

If we wish to allow up to  $n$  IMem transactions in flight, then the direct Fetch-to-Decode path must be capable of buffering up to  $n$  `Fetch_to_Decode` items. If not, rule `r1_Fetch` in Fetch will get stuck: its `f_Fetch_to_Decode` FIFO will be full, and so the “`enq`” method's implicit condition will become false, preventing the rule from firing.

This is the reason, in Decode, the incoming `f_Fetch_to_Decode` FIFO is instantiated with `mkSizedFIFO(4)`, which is a FIFO with the capacity to hold four items<sup>1</sup>. We also use the term “balancing” for this—the number of items that can be “in flight” in the two paths should be matched.

Why did we choose the number 4? Why not something smaller, or larger? Greater capacity requires more hardware resources, which argues for a smaller number, but a smaller number may hurt performance when Fetch has to wait only because of limited capacity. We would like to choose the smallest number that covers the number of IMem transactions that may be in flight (memory latency, or depth of the IMem pipeline). But that number varies with memory system design and, even for a particular design, it varies over time: it depends on the particular RISC-V program running, the memory system's caches, hits and misses, cache interactions, cache miss latencies, and so on<sup>2</sup>. Thus, there is no obvious “optimal” choice for `mkSizedFIFO` capacity; we must “tune” the number by running the CPU on the desired applications with different FIFO capacities, observing the effect on performance and resources, and picking an “acceptable” capacity.
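As a minimal sketch of this balancing idea (not part of Fife; the package, module, rule and constant names are hypothetical, and two plain FIFOs stand in for the direct path and the IMem path), a single capacity parameter can size both paths so that they are tuned together:

```bsv
package Balance_Sketch;

import FIFOF :: *;

// Hypothetical tuning knob: the number of IMem transactions we allow in flight
Integer imem_max_in_flight = 4;

(* synthesize *)
module mkBalance_Sketch (Empty);
   // Direct path: must be able to buffer as many items as the memory path
   FIFOF #(Bit #(32)) f_direct <- mkSizedFIFOF (imem_max_in_flight);
   // Stand-in for the IMem path (a real memory has variable latency)
   FIFOF #(Bit #(32)) f_imem   <- mkSizedFIFOF (imem_max_in_flight);

   Reg #(Bit #(32)) rg_pc    <- mkReg (0);
   Reg #(Bit #(8))  rg_count <- mkReg (0);

   // "Fetch": produces one item for each path, atomically.
   // Implicit condition: both FIFOs have room.
   rule rl_produce;
      f_direct.enq (rg_pc);
      f_imem.enq (rg_pc + 'h1000);    // fake "instruction" value
      rg_pc <= rg_pc + 4;
   endrule

   // "Decode": pairs each direct item with its memory response
   rule rl_consume;
      let x = f_direct.first;  f_direct.deq;
      let r = f_imem.first;    f_imem.deq;
      $display ("pc %0h  rsp %0h", x, r);
      rg_count <= rg_count + 1;
      if (rg_count == 9) $finish (0);
   endrule
endmodule

endpackage
```

If `f_direct` were given a smaller capacity than the memory path can hold, `rl_produce` would be the rule that stops firing first, which is the "stuck Fetch" situation described above.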

---

<sup>1</sup>We could also have, equivalently, instantiated the `f_Fetch_to_Decode` FIFO in Fetch with `mkSizedFIFO`, or split the buffering capacity between the two stages.

<sup>2</sup>With virtual memory, it will also depend on TLB misses and the latency of page table walks.

## 17.6 The Register-Read and Dispatch (and Register-Write) stage

The Register-Read and Dispatch module contains (instantiates) the GPRs register file (Section 9.2) and the scoreboard to manage read-write hazards (Section 16.3). In the forward flow,

- it stalls (waits) if the instruction has rs1, rs2 or rd, and these are busy according to the scoreboard;
- it reads rs1 and rs2 registers for the current instruction;
- it sets the scoreboard for the current instruction's rd to 1, marking it “busy” (if the instruction has an rd);
- it uses information from the Decode stage to dispatch to the four Execute pipes. For every instruction, it sends an execution tag and additional information on the direct path to Retire. Depending on the instruction, it may also send information into one of the other Execute pipes:
  - Execute Control
  - Execute Integer
  - memory (a DMem request)

This stage also participates in a backward flow, because the GPRs register file and scoreboard are instantiated here. When an instruction is completed, the Retire stage sends a message to this module to update those components: release an rd reservation in the scoreboard and write-back an rd register value.

The interface for the Register-Read and Dispatch stage module is shown below. Apart from the `init` method, the remaining sub-interfaces correspond to the FIFO-labeled arrows in Figure 17.1: the incoming `FIFOF_I` interface from the Decode stage, the outgoing `FIFOF_O` interfaces direct to Retire and to Execute Control, Execute Int and to DMem, and an incoming `FIFOF_I` from Retire for updating the register file and scoreboard, which are instantiated inside this module.

```
src_Fife/S3_RR_RW.bsv: line 40 ...
interface RR_RW_IFC;
   method Action init (Initial_Params initial_params);

   // Forward in
   interface FIFOF_I #(Decode_to_RR) fi_Decode_to_RR;

   // Forward out
   interface FIFOF_O #(RR_to_Retire)     fo_RR_to_Retire;
   interface FIFOF_O #(RR_to_EX_Control) fo_RR_to_EX_Control;
   interface FIFOF_O #(RR_to_EX)         fo_RR_to_EX_Int;
   interface FIFOF_O #(Mem_Req)          fo_DMem_S_req;

   // Backward in
   interface FIFOF_I #(RW_from_Retire) fi_RW_from_Retire;
endinterface
```

### 17.6.1 BSV: Vectors for the Scoreboard

In Section 16.3 we discussed the general principles of a scoreboard, and described it as an array of 1-bit values. In BSV the following type is used to represent an array of  $n$  items, each of which is of type  $t$ :

```
Vector #(n, t);
```

Note: in order to use this type in any BSV code file, the file must import the `Vector` library:

```
import Vector :: *;
```

So, we can define a `Scoreboard` type as follows:

```
typedef Vector #(32, Bit #(1)) Scoreboard;
```

The BSV Vector-library function:

```
replicate (v)
```

creates a value of type `Vector #(n, t)` where  $n$  is inferred from the context and  $v$  has the type  $t$ . All  $n$  items in the value are equal to  $v$ . Thus, we can instantiate a scoreboard like this, where all the vector elements are initialized to 0:

```
Reg #(Scoreboard) rg_scoreboard <- mkReg (replicate (0));
```

In hardware, a `Vector#(32,Bit#(1))` value occupies exactly 32 bits, *i.e.*, the size of the vector times the size of each element. So, why not use `Bit #(32)` instead? It's a matter of programming taste:

- The same syntax  $v[j]$  works both for bit-selection from `Bit#(n)` and `Vector#(n, Bit#(1))`.
- With `Bit#(n)`, a  $j^{th}$  bit can also be selected using shift-and-mask operations:  $((v >> j) \& 1)$ .
- The  $j^{th}$  bit of `Vector#(n, Bit#(1))` can be updated using simple assignment  
 $v[j] = new\_value;$
- The $j^{th}$ bit of `Bit#(n)` can be updated using shift-and-mask operations:  
 $(v | (1 << j))$ to set the $j^{th}$ bit to 1, and  
 $(v \& (\sim (1 << j)))$ to reset the $j^{th}$ bit to 0.

We can convert a `Vector #(32, Bit #(1))` value into a `Bit#(32)` value with:

```
pack (v)
```

and we can convert a `Bit#(32)` value into a `Vector #(32, Bit #(1))` value with:

```
unpack (v)
```
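The two views can be compared side by side. The following is a small sketch, not code from Fife; the package, module and rule names are hypothetical. It performs the same "set bit $j$" update once on the `Vector` view and once on the `Bit #(32)` view, and then checks that `pack` and `unpack` relate the two results:

```bsv
package Scoreboard_Sketch;

import Vector :: *;

typedef Vector #(32, Bit #(1)) Scoreboard;

(* synthesize *)
module mkScoreboard_Sketch (Empty);
   Reg #(Scoreboard) rg_scoreboard <- mkReg (replicate (0));

   rule rl_demo;
      Bit #(5) j = 5;

      // Vector view: update element j by simple assignment
      Scoreboard sb = rg_scoreboard;
      sb [j] = 1;

      // Bit #(32) view: the same update using shift-and-mask
      Bit #(32) bits = pack (rg_scoreboard);
      bits = bits | (1 << j);

      // Convert between the views and compare
      Bool       same_bits = (pack (sb) == bits);
      Scoreboard sb2       = unpack (bits);
      Bool       same_vec  = (sb2 == sb);

      $display ("vector view: %h  bits view: %h", pack (sb), bits);
      $display ("same_bits: ", fshow (same_bits), "  same_vec: ", fshow (same_vec));
      $finish (0);
   endrule
endmodule

endpackage
```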

### 17.6.2 The Register-Read and Dispatch (and Register-Write) module

The first part of the `mkRR_RW` module, instantiating its STATE, is shown below:

```
src_Fife/S3_RR_RW.bsv: line 58 ...
(* synthesize *)
module mkRR_RW (RR_RW_IFC);
   ...
   // Forward in
   FIFOF #(Decode_to_RR) f_Decode_to_RR <- mkPipelineFIFO;

   // Forward out
   FIFOF #(RR_to_Retire)     f_RR_to_Retire     <- mkBypassFIFO;    // Direct
   FIFOF #(RR_to_EX_Control) f_RR_to_EX_Control <- mkBypassFIFO;
   FIFOF #(RR_to_EX)         f_RR_to_EX_Int     <- mkBypassFIFO;
   FIFOF #(Mem_Req)          f_DMem_S_req       <- mkBypassFIFO;

   // Backward in
   FIFOF #(RW_from_Retire) f_RW_from_Retire <- mkPipelineFIFO;

   // General-Purpose Registers (GPRs)
   GPRs_IFC #(XLEN) gprs <- mkGPRs_synth;

   // Scoreboard for GPRs
   Reg #(Scoreboard) rg_scoreboard <- mkReg (replicate (0));
```

It instantiates FIFOs for all the forward and backward flows, and it instantiates the GPRs register file `gprs` (Section 9.2) and the scoreboard register `rg_scoreboard` (Section 16.3). The second part of the `mkRR_RW` module, implementing its BEHAVIOR for the forward flow, is shown below:

```
src_Fife/S3_RR_RW.bsv: line 82 ...
1 // =====
2 // BEHAVIOR: Forward
3
4 rule rl_RR_Dispatch (! f_RW_from_Retire.notEmpty);
5     ...
6     let x       = f_Decode_to_RR.first;
7     let instr   = x.instr;
8     let opclass = x.opclass;
9     let rs1     = instr_rs1 (instr);
10    let rs2     = instr_rs2 (instr);
11    let rd      = instr_rd  (instr);
12
13    let scoreboard = rg_scoreboard;
14    let busy_rs1   = (x.has_rs1 && (scoreboard [rs1] != 0));
15    let busy_rs2   = (x.has_rs2 && (scoreboard [rs2] != 0));
16    let busy_rd    = (x.has_rd  && (scoreboard [rd]  != 0));
17    Bool stall     = (busy_rs1 || busy_rs2 || busy_rd);
18
19    if (stall) begin
20        // No action
21        ...
22    end
23    else begin
24        ...
25        f_Decode_to_RR.deq;
26
27        // Read GPRs.
28        // Ok even if instr does not have rs1 or rs2
29        // values used only if relevant.
30        let rs1_val = gprs.read_rs1 (rs1);
31        let rs2_val = gprs.read_rs2 (rs2);
32
33        // Dispatch to one of the next-stage pipes
34        Result_Dispatch y <- fn_Dispatch (x, rs1_val, rs2_val, rg_flog);
35
36        // Update scoreboard for Rd
37        if (x.has_rd) begin
38            scoreboard [rd] = 1;
39            rg_scoreboard <= scoreboard;
40        end
41
42        // Direct to Retire
43        f_RR_to_Retire.enq (y.to_Retire);
44
45        // Dispatch
46        case (y.to_Retire.exec_tag)
47            EXEC_TAG_DIRECT:  noAction;
48            EXEC_TAG_CONTROL: f_RR_to_EX_Control.enq (y.to_EX_Control);
49            EXEC_TAG_INT:     f_RR_to_EX_Int.enq (y.to_EX);
50            EXEC_TAG_DMEM:    f_DMem_S_req.enq (y.to_EX_DMem);
51        endcase
52        ...
53
54    end
55 endrule
```

In line 6, we observe the first element in the `f_Decode_to_RR` FIFO. Note, the `.first` method is non-destructive, *i.e.*, it merely observes and *does not dequeue* the first element from the FIFO. This is because, if we must stall, it needs to be available again the next time the rule fires.

The next several lines compute the stall condition by checking the scoreboard for whether `rs1`, `rs2` or `rd` are busy (if the instruction has `rs1`, `rs2` or `rd`, respectively). Note that an instruction with an `rd` field, but where `rd==0`, will not stall because `Decode` will have set `has_rd` to false.

If we stall, the rule takes no action; everything will be retried the next time it fires.

If we do not stall, then we dequeue the `f_Decode_to_RR` FIFO. We read values of the `rs1` and `rs2` registers. Note, if the instruction does not have an `rs1` or `rs2`, here we will be reading some random registers according to the bits that happen to be in the `rs1` and `rs2` bit-positions in the instruction. This does not matter; in the Execute stage each instruction only *uses* these values if the instruction has an `rs1` and/or `rs2`.

Next, we apply the function `fn_Dispatch()` (discussed in Section 7.4; it is the same one we use in Drum) to the inputs. It computes `y`, which contains the structs to be sent on the direct path (`y.to_Retire`), to Execute Control (`y.to_EX_Control`), to Execute Int (`y.to_EX`), and to DMem (`y.to_EX_DMem`).

If the instruction has an `rd`, we reserve it on the scoreboard. Note that if an instruction has an `rd` field, but `rd==0`, `Decode` would have marked `has_rd` false, and so we never take a reservation on register `x0`.

We enqueue `y.to_Retire` on the direct flow (FIFO `f_RR_to_Retire`).

The remaining code is a `case` statement on the execution tag that sends information into the appropriate Execute pipe (Execute Control, Execute Integer, or DMem); for `EXEC_TAG_DIRECT` there is nothing more to send.

### Exercise 17.1:

Consider this hypothetical scenario: suppose the `stall` condition is false. Then, we need to dequeue `f_Decode_to_RR` and do one or more enqueues into `f_RR_to_Retire` and `f_RR_to_EX_Control`, `f_RR_to_EX_Int` or `f_DMem_S_req`. Is it possible that we perform the dequeue and then are unable to perform the enqueue(s) because the corresponding output FIFO happens to be full?

*Hint:* rule atomicity

### Exercise 17.2:

Write a boolean expression representing the overall firing condition for the rule. Briefly: all FIFO-modifying actions (dequeue, enqueue) have implicit conditions, but for each FIFO, that condition is only relevant if the conditions on the surrounding if-then-else's select that action.

### Exercise 17.3:

When debugging the implementation, it is useful to know if, due to some coding mistake, rule `rl_RR_Dispatch` is stalled forever. For example, for some instruction with an `rd`, if the Retire stage did not send back the `rd`'s scoreboard-release, that register will be forever “busy”.

Add a register to count consecutive stalls, initially 0. In the rule, whenever we successfully dispatch an instruction, reset the counter to 0. Whenever we stall, increment the stall counter, and if the stall counter reaches some chosen threshold value, print debugging messages and execute `$finish` to terminate simulation.

□

The third part of the `mkRR_RW` module, implementing its BEHAVIOR for the backward flow, is shown below:

```
src_Fife/S3_RR_RW.bsv: line 163 ...
// =====
// BEHAVIOR: Backward: reg write from retire

rule rl_RW_from_Retire;
   let x <- pop_o (to_FIFOF_O (f_RW_from_Retire));

   Scoreboard scoreboard = rg_scoreboard;
   scoreboard [x.rd] = 0;
   rg_scoreboard <= scoreboard;

   if (x.commit)
      gprs.write_rd (x.rd, x.data);
   ...
endrule
```

We pop the message `x` from the `f_RW_from_Retire` FIFO. We perform its specified scoreboard-release for register `rd`. If the `rd` value is to be committed, we write it into GPR [`rd`]. The fourth and final part of the `mkRR_RW` module, defining its INTERFACE, is shown below:

```
src_Fife/S3_RR_RW.bsv: line 182 ...
   // =====
   // INTERFACE

   method Action init (Initial_Params initial_params);
      rg_flog <= initial_params.flog;
   endmethod

   // Forward in
   interface fi_Decode_to_RR = to_FIFOF_I (f_Decode_to_RR);

   // Forward out
   interface fo_RR_to_Retire     = to_FIFOF_O (f_RR_to_Retire);
   interface fo_RR_to_EX_Control = to_FIFOF_O (f_RR_to_EX_Control);
   interface fo_RR_to_EX_Int     = to_FIFOF_O (f_RR_to_EX_Int);
   interface fo_DMem_S_req       = to_FIFOF_O (f_DMem_S_req);

   // Backward in
   interface fi_RW_from_Retire = to_FIFOF_I (f_RW_from_Retire);
endmodule
```

It simply lifts the FIFO interfaces to this module's interface.
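Both `rl_RR_Dispatch` and `rl_RW_from_Retire` perform a read-modify-write of `rg_scoreboard`. One reading of the explicit condition on `rl_RR_Dispatch` (`! f_RW_from_Retire.notEmpty`) is that it makes the two rules mutually exclusive and gives priority to the backward flow. The following is a minimal sketch of that pattern under this assumption; it is not code from Fife, all names are hypothetical, and a plain `mkFIFOF` stands in for `f_RW_from_Retire`:

```bsv
package Scoreboard_Priority_Sketch;

import Vector :: *;
import FIFOF  :: *;

typedef Vector #(32, Bit #(1)) Scoreboard;

(* synthesize *)
module mkScoreboard_Priority_Sketch (Empty);
   Reg #(Scoreboard) rg_scoreboard <- mkReg (replicate (0));
   // Stand-in for f_RW_from_Retire: carries the rd to be released
   FIFOF #(Bit #(5)) f_release <- mkFIFOF;
   Reg #(Bit #(5))   rg_rd     <- mkReg (1);

   // Forward flow: reserve rg_rd, but only when no release is pending
   rule rl_reserve (! f_release.notEmpty);
      Scoreboard sb = rg_scoreboard;
      sb [rg_rd] = 1;
      rg_scoreboard <= sb;
      f_release.enq (rg_rd);    // pretend the instruction retires at once
      rg_rd <= rg_rd + 1;
      $display ("reserve x%0d", rg_rd);
      if (rg_rd == 4) $finish (0);
   endrule

   // Backward flow: release the reservation
   rule rl_release;
      let rd = f_release.first;
      f_release.deq;
      Scoreboard sb = rg_scoreboard;
      sb [rd] = 0;
      rg_scoreboard <= sb;
      $display ("release x%0d", rd);
   endrule
endmodule

endpackage
```

Because each rule writes the whole scoreboard register, letting both fire in the same clock would lose one of the two updates; the guard avoids that at the cost of delaying dispatch by a cycle whenever a release is pending.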

## 17.7 The Execute Control stage

The interface for the Execute Control stage module is shown below. Apart from the `init` method, the remaining sub-interfaces correspond to the FIFO-labeled arrows in Figure 17.1: the incoming `FIFOF_I` interface from the Register-Read and Dispatch stage and the outgoing `FIFOF_O` interface to the Retire stage.

```
src_Fife/S4_EX_Control.bsv: line 32 ...
interface EX_Control_IFC;
   method Action init (Initial_Params initial_params);

   // Forward in
   interface FIFOF_I #(RR_to_EX_Control)     fi_RR_to_EX_Control;
   // Forward out
   interface FIFOF_O #(EX_Control_to_Retire) fo_EX_Control_to_Retire;
endinterface
```

The Execute Control stage module is shown below:

```
src_Fife/S4_EX_Control.bsv: line 43 ...
(* synthesize *)
module mkEX_Control (EX_Control_IFC);
   Reg #(File) rg_flog <- mkReg (InvalidFile);    // debugging

   // Forward in
   FIFOF #(RR_to_EX_Control)     f_RR_to_EX_Control     <- mkPipelineFIFO;
   // Forward out
   FIFOF #(EX_Control_to_Retire) f_EX_Control_to_Retire <- mkBypassFIFO;

   // =====
   // BEHAVIOR

   rule rl_EX_Control;
      let x <- pop_o (to_FIFOF_O (f_RR_to_EX_Control));
      let y <- fn_EX_Control (x, rg_flog);
      f_EX_Control_to_Retire.enq (y);
      ...
   endrule

   // =====
   // INTERFACE

   method Action init (Initial_Params initial_params);
      rg_flog <= initial_params.flog;
   endmethod

   // Forward in
   interface fi_RR_to_EX_Control     = to_FIFOF_I (f_RR_to_EX_Control);
   // Forward out
   interface fo_EX_Control_to_Retire = to_FIFOF_O (f_EX_Control_to_Retire);
endmodule
```

After instantiating the forward flow input and output FIFOs, the rule `rl_EX_Control` simply applies the function `fn_EX_Control` to each input and enqueues the output. This function was described in Section 7.5, and is the same one we use in Drum. Finally, the interface, after the `init` method, simply lifts the FIFO interfaces to the interface of this module.

## 17.8 The Execute Integer Ops stage

The interface for the Execute Integer stage module is shown below. Apart from the `init` method, the remaining sub-interfaces correspond to the FIFO-labeled arrows in Figure 17.1: the incoming `FIFOF_I` interface from the Register-Read and Dispatch stage and the outgoing `FIFOF_O` interface to the Retire stage.

```
src_Fife/S4_EX_Int.bsv: line 36 ...
interface EX_Int_IFC;
   method Action init (Initial_Params initial_params);

   // Forward in
   interface FIFOF_I #(RR_to_EX)     fi_RR_to_EX_Int;
   // Forward out
   interface FIFOF_O #(EX_to_Retire) fo_EX_Int_to_Retire;
endinterface
```

The Execute Integer stage module is shown below:

```
src_Fife/S4_EX_Int.bsv: line 47 ...
(* synthesize *)
module mkEX_Int (EX_Int_IFC);
   Reg #(File) rg_flog <- mkReg (InvalidFile);    // debugging

   // Forward in
   FIFOF #(RR_to_EX)     f_RR_to_EX_Int     <- mkPipelineFIFOF;
   // Forward out
   FIFOF #(EX_to_Retire) f_EX_Int_to_Retire <- mkBypassFIFOF;

   // =====
   // BEHAVIOR

   rule rl_EX_Int;
      let x <- pop_o (to_FIFOF_O (f_RR_to_EX_Int));
      let y <- fn_EX_Int (x, rg_flog);
      f_EX_Int_to_Retire.enq (y);
      ...
   endrule

   // =====
   // INTERFACE

   method Action init (Initial_Params initial_params);
      rg_flog <= initial_params.flog;
   endmethod

   // Forward in
   interface fi_RR_to_EX_Int = to_FIFOF_I (f_RR_to_EX_Int);

   // Forward out
   interface fo_EX_Int_to_Retire = to_FIFOF_O (f_EX_Int_to_Retire);
endmodule
```

After instantiating the forward flow input and output FIFOs, the rule `rl_EX_Int` simply applies the function `fn_EX_Int` to each input and enqueues the output. This function was described in Section 7.6, and is the same one we use in Drum. Finally, the interface, after the `init` method, simply lifts the FIFO interfaces to the interface of this module.
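Both Execute stages follow the same pattern: pop an input, apply a function, and enqueue the result. The sketch below isolates that pattern in a self-contained package. It is not code from Fife: the package, module, interface and function names are hypothetical, `fn_work` stands in for `fn_EX_Control` or `fn_EX_Int`, and plain `put`/`get` methods stand in for the `FIFOF_I`/`FIFOF_O` sub-interfaces. It assumes the `mkPipelineFIFOF` and `mkBypassFIFOF` constructors from the standard `SpecialFIFOs` library.

```bsv
package Stage_Sketch;

import FIFOF        :: *;
import SpecialFIFOs :: *;

// Hypothetical stand-in for the stage's pure "work" function
function Bit #(32) fn_work (Bit #(32) x);
   return x + 1;
endfunction

interface Stage_IFC;
   method Action put (Bit #(32) x);          // forward in
   method ActionValue #(Bit #(32)) get;      // forward out
endinterface

(* synthesize *)
module mkStage_Sketch (Stage_IFC);
   FIFOF #(Bit #(32)) f_in  <- mkPipelineFIFOF;
   FIFOF #(Bit #(32)) f_out <- mkBypassFIFOF;

   // Pop the input, apply the function, enqueue the output
   rule rl_stage;
      let x = f_in.first;
      f_in.deq;
      f_out.enq (fn_work (x));
   endrule

   method Action put (Bit #(32) x);
      f_in.enq (x);
   endmethod

   method ActionValue #(Bit #(32)) get;
      f_out.deq;
      return f_out.first;
   endmethod
endmodule

// Small testbench: feed 0..3 and print the results
(* synthesize *)
module mkTb_Stage_Sketch (Empty);
   Stage_IFC stage <- mkStage_Sketch;
   Reg #(Bit #(32)) rg_n <- mkReg (0);

   rule rl_feed (rg_n < 4);
      stage.put (rg_n);
      rg_n <= rg_n + 1;
   endrule

   rule rl_drain;
      let y <- stage.get;
      $display ("result %0d", y);
      if (y == 4) $finish (0);
   endrule
endmodule

endpackage
```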

## 17.9 The Execute Memory Ops stage (speculative DMem)

There is no explicit code for an Execute Memory Ops stage. The forward-path rule `rl_RR_Dispatch` in the Register-Read-and-Dispatch stage, described in Section 17.6, directly enqueues a memory request that goes out to memory. The Retire stage, to be discussed in Section 17.10, consumes the corresponding memory response.

## 17.10 Fife: the Retire stage

This module is longer than the others only because it takes care of many possible cases as outlined in Figure 17.3 (which repeats Figure 16.2 here for reference).



Figure 17.3: Actions in the “Retire” stage of Fife (same as Fig. 16.2)

Here is the interface definition for this module:

```
src_Fife/S5_Retire.bsv: line 33 ...
interface Retire_IFC;
   method Action init (Initial_Params initial_params);

   // Forward in
   interface FIFOF_I #(RR_to_Retire)          fi_RR_to_Retire;
   interface FIFOF_I #(EX_Control_to_Retire)  fi_EX_Control_to_Retire;
   interface FIFOF_I #(EX_to_Retire)          fi_EX_Int_to_Retire;

   // DMem, speculative
   interface FIFOF_I #(Mem_Rsp)               fi_DMem_S_rsp;
   interface FIFOF_O #(Retire_to_DMem_Commit) fo_DMem_S_commit;

   // DMem, non-speculative
   interface FIFOF_O #(Mem_Req) fo_DMem_req;
   interface FIFOF_I #(Mem_Rsp) fi_DMem_rsp;

   // Backward out
   interface FIFOF_O #(Fetch_from_Retire) fo_Fetch_from_Retire;
   interface FIFOF_O #(RW_from_Retire)    fo_RW_from_Retire;

   // Set TIME
   (* always_ready, always_enabled *)
   method Action set_TIME (Bit #(64) t);
endinterface
```

The first four `FIFOF_I` sub-interfaces correspond to the black arrows entering the Retire module from the execute pipes (left of the Retire module in the figure).

The next `FIFOF_O` interface, `fo_DMem_S_commit`, is the purple arrow sending commit/discard messages to the store-buffer.

The next two sub-interfaces are the red arrows connecting the second Exec Mem Ops box to memory (non-speculative DMem request and response).

The last two `FIFOF_O` sub-interfaces are the blue arrow at the top of the figure carrying redirections to the Fetch stage, and the green arrow carrying Register-Writes to the `RR_RW` module.

This module has a mixture of modes. Normally it is in pipeline mode, receiving instructions from the Execute stages and retiring them. In certain circumstances, it goes into FSM mode, similar to Drum:

- Execute non-speculative DMem ops (MMIO/non-memory-like), FSM-sequencing through issuing a request to memory and handling the response.
- Execute a CSRRxx instruction, which may raise an exception, which must then be handled.
- Execute traps (for exceptions or interrupts).

We define a “module mode” to reflect these modes:

```
src_Fife/S5_Retire.bsv: line 60 ...
typedef enum {MODE_PIPE,        // Normal pipeline operation
              MODE_DMEM_RSP,    // Handle Non-speculative DMem response
              MODE_EXCEPTION
} Module_Mode
deriving (Bits, Eq, FShow);
```
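The way a mode register switches a module between pipeline behavior and FSM behavior can be seen in a toy example. The sketch below is not code from Fife; all names are hypothetical, and "every fourth item raises an exception" is an arbitrary stand-in for a real exception condition. Each rule's explicit condition tests the mode register, so only the rules for the current mode can fire:

```bsv
package Mode_Sketch;

typedef enum {DEMO_PIPE,         // normal operation
              DEMO_EXCEPTION     // handling a saved exception
} Demo_Mode
deriving (Bits, Eq, FShow);

(* synthesize *)
module mkMode_Sketch (Empty);
   Reg #(Demo_Mode) rg_mode  <- mkReg (DEMO_PIPE);
   Reg #(Bit #(8))  rg_count <- mkReg (0);
   Reg #(Bit #(8))  rg_saved <- mkRegU;

   // Pipeline mode: process one item per firing.
   // Every fourth item pretends to raise an exception.
   rule rl_pipe (rg_mode == DEMO_PIPE);
      rg_count <= rg_count + 1;
      if (rg_count [1:0] == 3) begin
         rg_saved <= rg_count;           // save details for the handler
         rg_mode  <= DEMO_EXCEPTION;     // leave pipeline mode
      end
      if (rg_count == 10) $finish (0);
   endrule

   // FSM mode: handle the saved exception, then resume the pipeline
   rule rl_exception (rg_mode == DEMO_EXCEPTION);
      $display ("handling exception for item %0d", rg_saved);
      rg_mode <= DEMO_PIPE;
   endrule
endmodule

endpackage
```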

The first part of the Retire stage `mkRetire` module is shown below.

```
src_Fife/S5_Retire.bsv: line 68 ...
(* synthesize *)
module mkRetire (Retire_IFC);
   Reg #(File) rg_flog <- mkReg (InvalidFile);    // debugging

   // Control-and-Status Registers (CSRs)
   CSRs_IFC csrs <- mkCSRs;

   // For managing speculation, redirection, traps, etc.
   Reg #(Epoch) rg_epoch <- mkReg (0);

   // Forward in
   // Depth of f_RR_to_Retire should be > longest EX pipe
   FIFOF #(RR_to_Retire)          f_RR_to_Retire         <- mkSizedFIFO(8);
   FIFOF #(EX_Control_to_Retire)  f_EX_Control_to_Retire <- mkPipelineFIFO();
   FIFOF #(EX_to_Retire)          f_EX_Int_to_Retire     <- mkPipelineFIFO();
   FIFOF #(Mem_Rsp)               f_DMem_S_rsp           <- mkPipelineFIFO();

   // Forward out
   FIFOF #(Retire_to_DMem_Commit) f_DMem_S_commit        <- mkBypassFIFO();

   // Backward out
   FIFOF #(Fetch_from_Retire)     f_Fetch_from_Retire    <- mkBypassFIFO();
   FIFOF #(RW_from_Retire)        f_RW_from_Retire       <- mkBypassFIFO();

   // Non-speculative DMem reqs and rsp
   FIFOF #(Mem_Req)               f_DMem_req             <- mkBypassFIFO();
   FIFOF #(Mem_Rsp)               f_DMem_rsp             <- mkPipelineFIFO();

   Reg #(Module_Mode) rg_mode <- mkReg (MODE_PIPE);

   // Regs to set up exception handling
   Reg #(Bit #(XLEN)) rg_epc   <- mkRegU;
   Reg #(Bit #(4))    rg_cause <- mkRegU;
   Reg #(Bit #(XLEN)) rg_tval  <- mkRegU;
```

It instantiates the CSR registers (Section 9.3), and register `rg_epoch` to keep track of the epoch (Section 16.2), and then FIFOs for all the incoming and outgoing channels. Finally, it instantiates register `rg_mode` to hold the current module mode, and three registers (`rg_epc`, `rg_cause` and `rg_tval`) to hold exception information until we can perform the exception actions.

### 17.10.1 Common facilities used by many rules

Before we study the rules in the module, we first encapsulate, in Action functions, some common actions performed in many rules.

The following function captures the actions to be taken whenever we need to redirect the Fetch stage to start fetching from a different PC. This can be due to a misprediction of the successor of the current instruction, or due to an exception or MRET.

```
src_Fife/S5_Retire.bsv: line 113 ...
function Action fa_redirect_Fetch (Bool         mispredicted,
                                   RR_to_Retire x1,
                                   Bit #(XLEN)  next_pc);
   action
      if (mispredicted || rg_oiaat) begin
         let next_epoch = rg_epoch + 1;
         rg_epoch <= next_epoch;
         let y = Fetch_from_Retire {next_pc:    next_pc,
                                    next_epoch: next_epoch,
                                    ...
         f_Fetch_from_Retire.enq (y);
         ...
      end
   endaction
endfunction
```

The boolean `mispredicted` is true when the predicted successor PC was wrong. If the prediction was correct (and `rg_oiaat` is not set), no action is taken. Otherwise, we increment `rg_epoch` and send a redirection message to the Fetch stage with the correct PC and the new epoch. The consumption of this message in the Fetch stage was discussed in Section 17.4.

The following function captures the actions to be taken for updating an instruction's destination rd register. If the instruction has an rd register, it assembles a `RW_from_Retire` message and sends it to the `RR_RW` module. The consumption of this message in the Register-Read and Dispatch (and Register-Write) stage was discussed in Section 17.6.

```
src_Fife/S5_Retire.bsv: line 135 ...
// * unreserve the scoreboard
// * if commit, also write the rd_val
function Action fa_update_rd (RR_to_Retire x1,
                              Bool         commit,
                              Bit #(XLEN)  rd_val);
   action
      if (x1.has_rd) begin
         let y = RW_from_Retire {rd:     instr_rd (x1.instr),
                                 commit: commit,
                                 data:   rd_val,
                                 ...
         f_RW_from_Retire.enq (y);
         ...
      end
   endaction
endfunction
```

The following function sends a commit/discard message to the speculative DMem store-buffer, if the instruction writes to memory:

```
src_Fife/S5_Retire.bsv: line 161 ...
function Action fa_retire_store_buf (RR_to_Retire x1,
                                     Mem_Rsp      mem_rsp,
                                     Bool         commit);
   action
      if (x1.writes_mem && (mem_rsp.rsp_type == MEM_RSP_OK)) begin
         let y = Retire_to_DMem_Commit {commit: commit,
                                        ...
         f_DMem_S_commit.enq (y);
      end
   endaction
endfunction
```

The `f_RR_to_Retire` (direct path) FIFO contains an entry for *every* instruction. Information in the first element tells us how to retire it, including which of the execute pipes need to be examined. The following definitions capture some properties of this first element.

```
src_Fife/S5_Retire.bsv: line 178 ...
// mispredictions and in-order merging of all pipes.
RR_to_Retire x_rr_to_retire = f_RR_to_Retire.first;

Bool wrong_path = (x_rr_to_retire.epoch != rg_epoch);
Bool is_Direct  = (x_rr_to_retire.exec_tag == EXEC_TAG_DIRECT);
Bool is_Control = (x_rr_to_retire.exec_tag == EXEC_TAG_CONTROL);
Bool is_Int     = (x_rr_to_retire.exec_tag == EXEC_TAG_INT);
Bool is_DMem    = (x_rr_to_retire.exec_tag == EXEC_TAG_DMEM);
```

The incoming instruction is a wrong-path instruction if its accompanying epoch does not match our `rg_epoch` register.
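The interplay between redirection and the epoch check can be sketched end to end in a toy model. The following is not code from Fife: all names are hypothetical, the 2-bit `Epoch` width is an assumption, and "the instruction at PC 8 is a mispredicted branch to `'h100`" is an arbitrary scenario. A "Fetch" rule tags every item with its current epoch; a "Retire" rule discards any item whose epoch does not match its own register and, on a misprediction, increments its epoch and sends a redirection back:

```bsv
package Epoch_Sketch;

import FIFOF :: *;

typedef Bit #(2) Epoch;    // width is an assumption for this sketch

typedef struct {Epoch epoch; Bit #(32) pc;} Item
deriving (Bits, FShow);

typedef struct {Epoch next_epoch; Bit #(32) next_pc;} Redirect
deriving (Bits, FShow);

(* synthesize *)
module mkEpoch_Sketch (Empty);
   FIFOF #(Item)     f_fwd      <- mkSizedFIFOF (4);    // Fetch to Retire
   FIFOF #(Redirect) f_redirect <- mkFIFOF;             // Retire to Fetch

   Reg #(Epoch)     rg_fetch_epoch <- mkReg (0);
   Reg #(Bit #(32)) rg_pc          <- mkReg (0);
   Reg #(Epoch)     rg_epoch       <- mkReg (0);        // Retire's epoch

   // "Fetch": tag each item with the current epoch
   rule rl_fetch (! f_redirect.notEmpty);
      f_fwd.enq (Item {epoch: rg_fetch_epoch, pc: rg_pc});
      rg_pc <= rg_pc + 4;
   endrule

   // "Fetch": accept a redirection (new PC and new epoch)
   rule rl_fetch_redirect;
      let r = f_redirect.first;
      f_redirect.deq;
      rg_fetch_epoch <= r.next_epoch;
      rg_pc          <= r.next_pc;
   endrule

   // "Retire": discard wrong-path items; redirect on misprediction
   rule rl_retire;
      let x = f_fwd.first;
      f_fwd.deq;
      if (x.epoch != rg_epoch)
         $display ("discard wrong-path pc %0h", x.pc);
      else if (x.pc == 8) begin
         let next_epoch = rg_epoch + 1;
         rg_epoch <= next_epoch;
         f_redirect.enq (Redirect {next_epoch: next_epoch, next_pc: 'h100});
         $display ("retire pc %0h and redirect", x.pc);
      end
      else begin
         $display ("retire pc %0h", x.pc);
         if (x.pc == 'h108) $finish (0);
      end
   endrule
endmodule

endpackage
```

Items fetched after PC 8 but before the redirection arrives still carry the old epoch, so the Retire rule recognizes and discards them without needing to know how many there are.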

### 17.10.2 Rule to retire wrong-path instructions (all paths; discard)

For a wrong-path instruction, we dequeue it from `f_RR_to_Retire`.

If it has an rd, we send a discard-scoreboard-reservation message to the `RR_RW` module using `fa_update_rd()`.

If it was a Control, Int or DMem instruction, we dequeue it from the corresponding incoming FIFO. In the DMem case, if it did a speculative STORE, we also send a “discard” message to the head of the store-buffer using `fa_retire_store_buf()`.

```
src_Fife/S5_Retire.bsv: line 190 ...
rule rl_Retire_wrong_path ((rg_mode == MODE_PIPE)
                           && wrong_path);
   f_RR_to_Retire.deq;

   // Unreserve/commit rd if needed
   fa_update_rd (x_rr_to_retire, False, ?);

   // Discard related pipe
   if (is_Control) f_EX_Control_to_Retire.deq;
   if (is_Int)     f_EX_Int_to_Retire.deq;
   if (is_DMem) begin
      let mem_rsp <- pop_o (to_FIFOF_O (f_DMem_S_rsp));
      // Send 'discard' (False) to store-buf, if needed
      fa_retire_store_buf (x_rr_to_retire, mem_rsp, False);
   end
   ...
endrule
```

### 17.10.3 Rules to retire from direct path

The following subsections handle instruction-retirement for the cases where the direct-path has all the necessary information, i.e., where the instruction was not dispatched into the Execute Control, Execute Int or Execute DMem paths.

#### 17.10.3.1 Rule to retire CSRRxx instructions (direct path)

The following rule handles CSRRxx instructions, which arrive on the direct path:

```
src_Fife/S5_Retire.bsv: line 219 ...
rule rl_Retire_CSRRxx ((rg_mode == MODE_PIPE)
                       && (! wrong_path)
                       && is_Direct
                       && (! x_rr_to_retire.exception)
                       && is_legal_CSRRxx (x_rr_to_retire.instr));
   f_RR_to_Retire.deq;

   match { .exc, .rd_val } <- csrs.mav_csrrxx (x_rr_to_retire.instr,
                                               x_rr_to_retire.rs1_val);
   // Unreserve/commit rd if needed
   fa_update_rd (x_rr_to_retire, (! exc), rd_val);

   if (! exc) begin
      Bool mispredicted = (x_rr_to_retire.predicted_pc
                           != x_rr_to_retire.fallthru_pc);
      fa_redirect_Fetch (mispredicted,
                         x_rr_to_retire,
                         x_rr_to_retire.fallthru_pc);
   end
   else begin
      rg_epc   <= x_rr_to_retire.pc;
      rg_cause <= cause_ILLEGAL_INSTRUCTION;
      rg_tval  <= x_rr_to_retire.instr;
      rg_mode  <= MODE_EXCEPTION;
   end
   ...
endrule
```

The rule's explicit condition restricts it to fire only when we are in PIPE mode and when the next instruction is not a wrong-path instruction (epoch is correct), is a Direct-path instruction, is not an exception, and is indeed a CSRRxx instruction.

If so, it applies `csrs.mav_csrrxx()` (described in Section 9.3, and the same one we used in Drum), and updates the rd register with the result.

If the CSRRxx operation did not raise an exception, it redirects Fetch if this instruction's successor was mispredicted.

If the CSRRxx operation raised an exception, we save the details in some registers and switch from PIPE mode into EXCEPTION mode; this will enable rule `rl_exception` (Section 17.10.7) to handle it.

Why postpone exception-handling to the later rule? Why not perform that rule's action here? The reason is to avoid contention on the CSR registers. Rule `rl_Retire_CSRRxx` invokes method `csrs.mav_csrrxx()` and rule `rl_exception` invokes method `csrs.mav_exception()`. By separating these invocations into two separate rules, we avoid contention between these two CSR module methods.

#### Exercise 17.4:

The provided implementation of the CSR registers (Section 9.3) will in fact prevent the two methods `csrs.mav_csrrxx()` and `csrs.mav_exception()` from being invoked in a single action, *i.e.*, in a single rule, due to method-ordering constraints. Try this experiment: copy the body of `rl_exception` into `rl_Retire_CSRRxx`; try compiling it with `bsc`, and observe the compiler's complaint message.

#### Exercise 17.5:

Modify the CSRs module method `mav_csrrxx()` to combine the actions of the original `mav_csrrxx()` with the actions of `mav_exception()`. The result type will have to be extended (say, a 3-tuple) to also return the trap-vector PC in case of an exception. Modify `rl_Retire_CSRRxx` so that it immediately does a Fetch-redirect with the trap-vector PC in case of an exception. There is no longer any need to shift out of `MODE_PIPE` into `MODE_EXCEPTION`.

This will save a cycle in handling CSRRxx instructions. But what is the impact on hardware complexity in the `mkCSRs` module?

□

#### 17.10.3.2 Rule to retire MRET instructions (direct path)

The following rule handles MRET instructions, which arrive on the direct path:

```
src_Fife/S5_Retire.bsv: line 251 ...
rule rl_Retire_MRET ((rg_mode == MODE_PIPE)
                     && (! wrong_path)
                     && is_Direct
                     && (! x_rr_to_retire.exception)
                     && is_legal_MRET (x_rr_to_retire.instr));
   f_RR_to_Retire.deq;
   Bool mispredicted = True;
   fa_redirect_Fetch (mispredicted, x_rr_to_retire, csrs.read_epc);
   csrs.ma_incr_instret;
   ...
endrule
```

Again, pay careful attention to the rule's explicit condition which restricts it to fire only when we are in PIPE mode and when the next instruction is not a wrong-path instruction (epoch is correct), is a Direct-path instruction, is not an exception, and is indeed an MRET instruction.

The rule body is simple—always redirect to the PC found in the CSR MEPC (Section 2.7).

#### 17.10.3.3 Rule to retire ECALL and EBREAK instructions (direct path)

The following rule handles ECALL and EBREAK instructions, which arrive on the direct path:

```
src_Fife/S5_Retire.bsv: line 267 ...
rule rl_Retire_ECALL_EBREAK ((rg_mode == MODE_PIPE)
                             && (! wrong_path)
                             && is_Direct
                             && (! x_rr_to_retire.exception)
                             && (is_legal_ECALL (x_rr_to_retire.instr)
                                 || is_legal_EBREAK (x_rr_to_retire.instr)));
   rg_epc   <= x_rr_to_retire.pc;
   rg_cause <= ((x_rr_to_retire.instr [20] == 0)
                ? cause_ECALL_FROM_M
                : cause_BREAKPOINT);
   rg_tval  <= 0;
   csrs.ma_incr_instret;
   rg_mode  <= MODE_EXCEPTION;
   ...
endrule
```

The rule's explicit condition restricts it to fire only when we are in PIPE mode and when the next instruction is not a wrong-path instruction (epoch is correct), is a Direct-path instruction, is not an exception, and is indeed an ECALL or EBREAK instruction.

ECALL and EBREAK are just deliberately-invoked exceptions, the only difference being the exception-code we pass in the `mcause` CSR (Section 2.7.1). We save appropriate exception values and switch from `MODE_PIPE` to `MODE_EXCEPTION`, which will handle the exception.
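The cause selection in this rule relies on a single encoding bit: ECALL and EBREAK share the same SYSTEM encoding and differ only in instruction bit 20. The following Python sketch is not part of Fife; the function name `trap_cause` is invented for illustration. It mirrors the `rg_cause` expression using the standard RISC-V instruction encodings and the standard `mcause` exception codes.

```python
# Standard RISC-V encodings; ECALL and EBREAK differ only in bit 20.
ECALL = 0x00000073
EBREAK = 0x00100073

# Standard mcause exception codes.
CAUSE_BREAKPOINT = 3
CAUSE_ECALL_FROM_M = 11


def trap_cause(instr: int) -> int:
    """Mirror of the rg_cause expression in rl_Retire_ECALL_EBREAK.

    Assumes instr has already been checked to be a legal ECALL or EBREAK
    (the rule condition does this in the BSV code).
    """
    bit20 = (instr >> 20) & 1
    return CAUSE_ECALL_FROM_M if bit20 == 0 else CAUSE_BREAKPOINT


print(trap_cause(ECALL), trap_cause(EBREAK))
```

Because the rule condition has already established that the instruction is one of the two, testing one bit is sufficient and cheaper in hardware than a second full decode.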

#### Exercise 17.6:

In the exercises in Section 17.10.3.1, we discussed why, in `rl_Retire_CSRRxx`, we postponed exception handling to the separate rule `rl_exception`: both `rl_Retire_CSRRxx` and `rl_exception` access the CSRs, and we wanted to avoid contention.

Here, `rl_Retire_ECALL_EBREAK` does not otherwise access the CSRs, so there is no such contention, so why not handle the exception directly here (copy the body of `rl_exception` here) and save a cycle? Discuss the impact on hardware structure. Try it, examine the generated Verilog, and examine the performance impact (which will depend on the RISC-V program being run, of course).

□

#### 17.10.3.4 Rule to retire exceptions from the direct path

The following rule handles exceptions which arrive on the direct path:

```
src_Fife/S5_Retire.bsv: line 287 ...
rule rl_Retire_Direct_exception ((rg_mode == MODE_PIPE)
                                 && (! wrong_path)
                                 && is_Direct
                                 && x_rr_to_retire.exception);
   f_RR_to_Retire.deq;

   rg_epc   <= x_rr_to_retire.pc;
   rg_cause <= x_rr_to_retire.cause;
   rg_tval  <= x_rr_to_retire.tval;
   rg_mode  <= MODE_EXCEPTION;
   ...

endrule
```

The rule's explicit condition restricts it to fire only when we are in PIPE mode and when the next instruction is not a wrong-path instruction (epoch is correct), is a Direct-path instruction, and is an exception.

The rule does nothing more than save the exception information in registers and move from `MODE_PIPE` to `MODE_EXCEPTION` so that the exception will be handled by `rl_exception`.

#### 17.10.4 Rule to retire from the Execute Control path

The following rule handles instructions (BRANCH, JAL, JALR) that arrive on the Execute Control path. In Execute Control, the instruction could have succeeded, or it could have raised an exception (if the control-transfer target PC was misaligned).

```
src_Fife/S5_Retire.bsv: line 310 ...
rule rl_Retire_EX_Control ((rg_mode == MODE_PIPE)
                           && (! wrong_path)
                           && is_Control);
   f_RR_to_Retire.deq;
   let x2 <- pop_o (to_FIFOF_O (f_EX_Control_to_Retire));

   // Unreserve/commit rd if needed
   fa_update_rd (x_rr_to_retire, (! x2.exception), x2.data);

   if (! x2.exception) begin
      // Redirect Fetch PC if mispredicted
      Bool mispredicted = (x_rr_to_retire.predicted_pc != x2.next_pc);
      fa_redirect_Fetch (mispredicted, x_rr_to_retire, x2.next_pc);
      csrs.ma_incr_instret;
   end
   else begin
      rg_epc   <= x_rr_to_retire.pc;
      rg_cause <= x2.cause;
      rg_tval  <= x2.tval;
      rg_mode  <= MODE_EXCEPTION;
   end
   ...
endrule
```

We dequeue the direct-path information (`f_RR_to_Retire`) and we pop the Execute Control path information (`f_EX_Control_to_Retire`).

We invoke `fa_update_rd()` to update the destination register (JAL and JALR may save a “return address” in rd) and its scoreboard reservation. Recall from Section 17.10.1 that `fa_update_rd()` will not send any update message if the instruction does not have an rd (or if rd is 0). The “`(!x2.exception)`” argument to `fa_update_rd()` ensures that if there was an exception only the scoreboard reservation will be released, and no register value will be written.

If there was no Execute Control exception, we redirect the Fetch stage if mispredicted. At this point we know both what was predicted as the successor PC and what is the actual successor PC for this instruction, so we can compare these to determine if there was a misprediction.

If there was an Execute Control exception, we save relevant values in registers and move from `MODE_PIPE` to `MODE_EXCEPTION` which will enable `rl_exception` to perform the required actions.
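The misprediction check is the same comparison in every retire rule; only the source of the actual successor PC differs. The Python sketch below illustrates this under two assumptions that the text does not confirm: that `fa_redirect_Fetch()` only needs to redirect when its first argument is true, and that the record fields shown are a sufficient subset of `x_rr_to_retire`. The class and function names are invented for the sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RRToRetire:
    """Hypothetical subset of the fields carried by x_rr_to_retire."""
    pc: int
    predicted_pc: int
    fallthru_pc: int


def redirect_target(x: RRToRetire, actual_next_pc: int) -> Optional[int]:
    """Return the PC to send to Fetch, or None if the prediction was right."""
    mispredicted = (x.predicted_pc != actual_next_pc)
    return actual_next_pc if mispredicted else None


# A taken branch at 0x100 with target 0x180 that Fetch predicted as not taken.
x = RRToRetire(pc=0x100, predicted_pc=0x104, fallthru_pc=0x104)

# Control path: the actual successor is the next_pc computed by Execute Control.
print(redirect_target(x, 0x180))

# Non-control paths: the actual successor is always the fall-through PC.
print(redirect_target(x, x.fallthru_pc))
```

For a non-control instruction a redirect is needed only if Fetch predicted something other than the fall-through PC, which is why the Integer and DMem rules compare against `fallthru_pc`.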

### 17.10.5 Rule to retire from the Execute Integer path

The following rule handles LUI, AUIPC and Integer instructions that arrive on the Execute Integer path. Note that none of these standard RISC-V instructions raises an exception. However, this rule allows for the possibility of an exception in case, in future, we extend Fife’s supported ISA with new non-standard Integer instructions that could raise one.

```
src_Fife/S5_Retire.bsv: line 338 ...
rule rl_Retire_EX_Int ((rg_mode == MODE_PIPE)
                       && (! wrong_path)
                       && is_Int);
   f_RR_to_Retire.deq;
   EX_to_Retire x2 <- pop_o (to_FIFOF_O (f_EX_Int_to_Retire));

   // Unreserve/commit rd if needed
   fa_update_rd (x_rr_to_retire, (! x2.exception), x2.data);

   if (! x2.exception) begin
      // Redirect Fetch PC if mispredicted
      Bool mispredicted = (x_rr_to_retire.predicted_pc
                           != x_rr_to_retire.fallthru_pc);
      fa_redirect_Fetch (mispredicted,
                         x_rr_to_retire,
                         x_rr_to_retire.fallthru_pc);
      csrs.ma_incr_instret;
   end
   else begin
      rg_epc   <= x_rr_to_retire.pc;
      rg_cause <= x2.cause;
      rg_tval  <= x2.tval;
      rg_mode  <= MODE_EXCEPTION;
   end
   ...
endrule
```

We dequeue the direct-path information (`f_RR_to_Retire`) and we pop the Execute Int path information (`f_EX_Int_to_Retire`).

We invoke `fa_update_rd()` to update the destination register and its scoreboard reservation. Recall from Section 17.10.1 that `fa_update_rd()` will not send any update message if the instruction does not have an rd (or if rd is 0). The “`(!x2.exception)`” argument to `fa_update_rd()` ensures that if there was an exception only the scoreboard reservation will be released, and no register value will be written.

If there was no Execute Integer exception, we redirect the Fetch stage if mispredicted. We can compare the predicted successor PC for this instruction with its fall-through PC to determine if there was a misprediction.

If there was an Execute Integer exception, we save relevant values in registers and move from `MODE_PIPE` to `MODE_EXCEPTION` which will enable `rl_exception` to perform the required actions.

#### 17.10.6 Rules to retire from the Execute DMem path, or perform deferred DMem request

The following rules handle instructions that arrive on the Execute DMem path. This path:

- could have raised an exception (bad address, misaligned, memory-permission violation);
- or, it could have performed a speculative LOAD/STORE/AMO successfully, but if it wrote to memory, the value is still sitting in the store-buffer and needs to be committed;
- or, it could have been deferred because the memory address was a non-memory-like location (*e.g.*, MMIO) that did not allow speculation.

The first two cases are handled by the first rule below; the deferred case is handled by rules described subsequently.

### 17.10.6.1 Retire speculative and exception from DMem

The following rule handles incoming DMem exceptions and successful speculative results.

```
src_Fife/S5_Retire.bsv: line 369 ...
rule rl_Retire_EX_DMem ((rg_mode == MODE_PIPE)
                        && (! wrong_path)
                        && is_DMem
                        && (f_DMem_S_rsp.first.rsp_type != MEM_REQ_DEFERRED));

   f_RR_to_Retire.deq;
   let x2 <- pop_o (to_FIFOF_O (f_DMem_S_rsp));

   Bool exception = (x2.rsp_type != MEM_RSP_OK);

   // Unreserve/commit rd if needed
   fa_update_rd (x_rr_to_retire, (! exception), truncate (x2.data));

   if (! exception) begin
      // Send 'commit' (True) to store-buf, if needed
      fa_retire_store_buf (x_rr_to_retire, x2, True);

      // Redirect Fetch PC if mispredicted
      Bool mispredicted = (x_rr_to_retire.predicted_pc
                           != x_rr_to_retire.fallthru_pc);
      fa_redirect_Fetch (mispredicted,
                         x_rr_to_retire,
                         x_rr_to_retire.fallthru_pc);
      csrs.ma_incr_instret;
   end
   else begin
      rg_epc   <= x_rr_to_retire.pc;
      rg_cause <= ((x2.rsp_type == MEM_RSP_MISALIGNED)
                   ? (is_LOAD (x_rr_to_retire.instr)
                      ? cause_LOAD_ADDRESS_MISALIGNED
                      : cause_STORE_AMO_ADDRESS_MISALIGNED)
                   : (is_LOAD (x_rr_to_retire.instr)
                      ? cause_LOAD_ACCESS_FAULT
                      : cause_STORE_AMO_ACCESS_FAULT));
      rg_tval  <= truncate (x2.addr);
      rg_mode  <= MODE_EXCEPTION;
   end
   ...
endrule
```

Note that the rule condition excludes deferred requests (they are handled in another rule, described in the next section).

We dequeue the direct-path information (`f_RR_to_Retire`) and we pop the Execute DMem path information (`f_DMem_S_rsp`).

We invoke `fa_update_rd()` to update the destination register and its scoreboard reservation. Recall from Section 17.10.1 that `fa_update_rd()` will not send any update message if the instruction does not have an rd (or if rd is 0). The “`(! exception)`” argument to `fa_update_rd()` ensures that if there was an exception only the scoreboard reservation will be released, and no register value will be written.

If there was no Execute DMem exception, we send a “commit” message to the store buffer and redirect the Fetch stage if mispredicted. Note that “`fa_retire_store_buf()`” will send a commit message only if the instruction wrote memory (Section 17.10.1). We can compare the predicted successor PC for this instruction with its fall-through PC to determine if there was a misprediction.

If there was an Execute DMem exception, we save relevant values in registers and move from `MODE_PIPE` to `MODE_EXCEPTION` which will enable `rl_exception` to perform the required actions. The RISC-V “cause code” is computed based on the kind of DMem exception and the kind of instruction. We do not need to use “`fa_retire_store_buf()`” to send a “discard” message to the store-buffer because a DMem instruction, if it raises an exception, will not perform any speculative stores.
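The nested conditional that computes `rg_cause` is a two-way choice on the kind of fault followed by a two-way choice on the kind of instruction. As a quick cross-check, here is the same selection in Python, using the standard RISC-V `mcause` exception codes; the function name `dmem_cause` and its Boolean parameters are invented for this sketch.

```python
# Standard RISC-V mcause exception codes for data-memory faults.
CAUSE_LOAD_ADDRESS_MISALIGNED = 4
CAUSE_LOAD_ACCESS_FAULT = 5
CAUSE_STORE_AMO_ADDRESS_MISALIGNED = 6
CAUSE_STORE_AMO_ACCESS_FAULT = 7


def dmem_cause(misaligned: bool, is_load: bool) -> int:
    """Same selection as the rg_cause expression in rl_Retire_EX_DMem.

    misaligned: the response type was MEM_RSP_MISALIGNED
                (any other error is treated as an access fault).
    is_load:    the instruction is a LOAD (otherwise STORE or AMO).
    """
    if misaligned:
        return (CAUSE_LOAD_ADDRESS_MISALIGNED if is_load
                else CAUSE_STORE_AMO_ADDRESS_MISALIGNED)
    return (CAUSE_LOAD_ACCESS_FAULT if is_load
            else CAUSE_STORE_AMO_ACCESS_FAULT)


for misaligned in (True, False):
    for is_load in (True, False):
        print(misaligned, is_load, dmem_cause(misaligned, is_load))
```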

#### Exercise 17.7:

Inside `rl_Retire_EX_DMem` we invoke “`fa_update_rd()`” always, but we invoke “`fa_retire_store_buf()`” only when there was no exception. Justify this.

□

#### 17.10.6.2 Rule to handle deferred DMem requests (from Execute DMem path)

The following rule handles memory requests that were deferred by the Execute DMem stage because they could not be performed speculatively.

```
src_Fife/S5_Retire.bsv: line 413 ...
rule rl_Retire_DMem_deferred ((rg_mode == MODE_PIPE)
                              && (! wrong_path)
                              && is_DMem
                              && (f_DMem_S_rsp.first.rsp_type
                                  == MEM_REQ_DEFERRED));
   let x2 <- pop_o (to_FIFOF_O (f_DMem_S_rsp));

   // Issue DMem request
   let mem_req = Mem_Req {inum:     x2.inum,
                          pc:       x2.pc,
                          instr:    x2.instr,
                          req_type: x2.req_type,
                          size:     x2.size,
                          addr:     x2.addr,
                          data:     x2.data};
   f_DMem_req.enq (mem_req);
   rg_mode <= MODE_DMEM_RSP;    // go to await response

   ...
endrule
```

We pop the Execute DMem result (`f_DMem_S_rsp`). We do not dequeue the direct-path (`f_RR_to_Retire`) because we will need that information when we later handle the DMem response; we leave it at the head of the FIFO.

We merely construct the memory request now, and send it to memory (enqueue it on `f_DMem_req`). We also switch out of `MODE_PIPE` into `MODE_DMEM_RSP` which enables rule `rl_Retire_DMem_rsp` (described next) to await the memory response.
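The `rg_mode` register makes the Retire stage a small state machine. The Python sketch below collects the mode switches described for the retire rules, the deferred-request rules, and `rl_exception` into one transition table. It covers only the three modes named in this section; the event names are invented for the sketch and do not appear in the Fife source.

```python
from enum import Enum, auto


class Mode(Enum):
    PIPE = auto()
    DMEM_RSP = auto()
    EXCEPTION = auto()


# (current mode, event) -> next mode. Event names are illustrative only.
TRANSITIONS = {
    (Mode.PIPE, "retire_ok"): Mode.PIPE,            # normal retirement
    (Mode.PIPE, "exception"): Mode.EXCEPTION,       # a retire rule detects an exception
    (Mode.PIPE, "dmem_deferred"): Mode.DMEM_RSP,    # rl_Retire_DMem_deferred
    (Mode.DMEM_RSP, "rsp_ok"): Mode.PIPE,           # rl_Retire_DMem_rsp, OK response
    (Mode.DMEM_RSP, "rsp_error"): Mode.EXCEPTION,   # rl_Retire_DMem_rsp, error response
    (Mode.EXCEPTION, "trap_taken"): Mode.PIPE,      # rl_exception
}


def run(events, mode=Mode.PIPE):
    """Apply a sequence of events and return the final mode."""
    for ev in events:
        mode = TRANSITIONS[(mode, ev)]
    return mode


print(run(["retire_ok", "dmem_deferred", "rsp_ok"]))
print(run(["dmem_deferred", "rsp_error", "trap_taken"]))
```

Every path returns to `Mode.PIPE`; the other two modes exist only to serialize an action that cannot be completed in the rule that discovers the need for it.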

### 17.10.6.3 Rules to retire responses for deferred DMem requests

The following rule handles the corresponding responses from memory.

```
src_Fife/S5_Retire.bsv: line 437 ...
rule rl_Retire_DMem_rsp (rg_mode == MODE_DMEM_RSP);
   f_RR_to_Retire.deq;
   let x2 <- pop_o (to_FIFOF_O (f_DMem_rsp));

   Bool exception = ((x2.rsp_type == MEM_RSP_ERR)
                     || (x2.rsp_type == MEM_RSP_MISALIGNED));
   if (exception) begin
      rg_epc   <= x_rr_to_retire.pc;
      rg_cause <= ((x2.rsp_type == MEM_RSP_MISALIGNED)
                   ? (is_LOAD (x_rr_to_retire.instr)
                      ? cause_LOAD_ADDRESS_MISALIGNED
                      : cause_STORE_AMO_ADDRESS_MISALIGNED)
                   : (is_LOAD (x_rr_to_retire.instr)
                      ? cause_LOAD_ACCESS_FAULT
                      : cause_STORE_AMO_ACCESS_FAULT));
      rg_tval  <= truncate (x2.addr);
      rg_mode  <= MODE_EXCEPTION;
   end
   else if (x2.rsp_type == MEM_REQ_DEFERRED) begin
      ...
      // IMPOSSIBLE. Non-speculative requests cannot be deferred
      $finish (1);
   end
   else begin
      // Unreserve/commit rd if needed
      fa_update_rd (x_rr_to_retire, True, truncate (x2.data));

      // Redirect Fetch to correct mispredicted PC
      Bool mispredicted = (x_rr_to_retire.predicted_pc
                           != x_rr_to_retire.fallthru_pc);
      fa_redirect_Fetch (mispredicted,
                         x_rr_to_retire,
                         x_rr_to_retire.fallthru_pc);
      csrs.ma_incr_instret;

      // Resume pipeline behavior
      rg_mode <= MODE_PIPE;
   end
   ...
endrule
```

We can now dequeue the direct-path information (`f_RR_to_Retire`) and the memory response (`f_DMem_rsp`).

If the response returned an exception, we compute the RISC-V exception `cause` code, save values in registers, and move to `MODE_EXCEPTION` so that `rl_exception` can handle it subsequently.

The memory system should *never* return a `MEM_REQ_DEFERRED` for these requests through the non-speculative memory interface; responses can only be `OK` or an exception. The middle “`else-if`” clause is just a bit of defensive programming in case of a buggy memory system.

If the response was OK, we invoke `fa_update_rd()` to update the destination register and its scoreboard reservation. Recall from Section 17.10.1 that `fa_update_rd()` will not send any update message if the instruction does not have an rd (or if rd is 0). The second argument is True because this is not the exception case.

Finally, we redirect the Fetch stage if mispredicted, and return to `MODE_PIPE` to resume pipeline processing. We can compare the predicted successor PC for this instruction with its fall-through PC to determine if there was a misprediction.

#### Exercise 17.8:

This rule fields a memory response, but does not send any commit/discard message to the store-buffer. Why not?

□



#### 17.10.7 Common Rule to handle exceptions

This is a common rule used to handle exceptions.

```
src_Fife/S5_Retire.bsv: line 485 ...
rule rl_exception (rg_mode == MODE_EXCEPTION);
   Bool is_interrupt = False;
   Bit #(XLEN) tvec_pc <- csrs.mav_exception (rg_epc,
                                              is_interrupt,
                                              rg_cause,
                                              rg_tval);
   fa_redirect_Fetch (True, x_rr_to_retire, tvec_pc);
   rg_mode <= MODE_PIPE;
   log_Retire_exception (rg_flog, x_rr_to_retire, rg_epc, is_interrupt, rg_cause, rg_tval);
endrule
```

In all the previous rules, when an exception is detected, we save relevant values in registers `rg_epc`, `rg_cause` and `rg_tval` and switch the module mode to `MODE_EXCEPTION`, which enables this rule. It invokes `csrs.mav_exception()`, which updates the relevant CSRs and returns the value of CSR `mtvec` (trap-vector PC). Here, we simply redirect to that PC, and return to pipeline processing in `MODE_PIPE`.
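To make the CSR side of trap entry concrete, here is a simplified Python model of what a method like `csrs.mav_exception()` has to do according to the RISC-V privileged specification. It is only a sketch: the body of the actual method in `CSRs.bsv` is not shown in this section, the class below is invented for illustration, and the `mstatus` updates that a real trap entry also performs are deliberately omitted.

```python
XLEN = 32


class CSRs:
    """Minimal stand-in for the trap-related machine-mode CSRs."""

    def __init__(self, mtvec: int):
        self.mtvec = mtvec
        self.mepc = 0
        self.mcause = 0
        self.mtval = 0

    def mav_exception(self, epc: int, is_interrupt: bool,
                      cause: int, tval: int) -> int:
        """Record the trap in mepc/mcause/mtval and return the handler PC."""
        self.mepc = epc
        # The top bit of mcause distinguishes interrupts from exceptions.
        self.mcause = (int(is_interrupt) << (XLEN - 1)) | cause
        self.mtval = tval
        base = self.mtvec & ~0b11          # low two bits of mtvec hold MODE
        vectored = (self.mtvec & 0b11) == 1
        # Only interrupts are vectored; exceptions always go to BASE.
        if is_interrupt and vectored:
            return base + 4 * cause
        return base


csrs = CSRs(mtvec=0x80000100)
tvec_pc = csrs.mav_exception(epc=0x80000010, is_interrupt=False,
                             cause=11, tval=0)
print(hex(tvec_pc), hex(csrs.mepc), csrs.mcause)
```

In `rl_exception` the `is_interrupt` argument is always `False`, so in this model the returned PC is always the `mtvec` base address.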

#### 17.10.8 Fife module interface definition

The final INTERFACE section, as with other modules, has an `init` method, and then lifts the various FIFO interfaces into sub-interfaces for this module.

```
src_Fife/S5_Retire.bsv: line 496 ...
   // =====
   // INTERFACE

   method Action init (Initial_Params initial_params);
      csrs.init (initial_params);
      rg_flog <= initial_params.flog;
   endmethod

   // Forward in
   interface fi_RR_to_Retire         = to_FIFOF_I (f_RR_to_Retire);
   interface fi_EX_Control_to_Retire = to_FIFOF_I (f_EX_Control_to_Retire);
   interface fi_EX_Int_to_Retire     = to_FIFOF_I (f_EX_Int_to_Retire);

   // DMem, speculative
   interface fi_DMem_S_rsp    = to_FIFOF_I (f_DMem_S_rsp);
   interface fo_DMem_S_commit = to_FIFOF_O (f_DMem_S_commit);

   // DMem, non-speculative
   interface fo_DMem_req = to_FIFOF_O (f_DMem_req);
   interface fi_DMem_rsp = to_FIFOF_I (f_DMem_rsp);

   // Backward out
   interface fo_Fetch_from_Retire = to_FIFOF_O (f_Fetch_from_Retire);
   interface fo_RW_from_Retire    = to_FIFOF_O (f_RW_from_Retire);

   // Set TIME
   method Action set_TIME (Bit #(64) t) = csrs.set_TIME (t);
endmodule
```

## 17.11 Conclusion

And that is the complete Fife CPU! There is actually very little in this chapter that is RISC-V specific; these pipeline structures are needed for a pipelined CPU for *any* ISA. All our RISC-V-specific discussions were completed in the chapters leading up to Drum, and here we have simply reused all those RISC-V-specific common functions.

There is quite a lot of scope for performance optimization here, even keeping the same pipeline structure. For example:

- The sooner we can deliver a redirection from Retire to Fetch, the less time the CPU will waste fetching and discarding wrong-path instructions.
- In a message from Retire to the Register-Read and Dispatch stage to update rd and release the scoreboard, the sooner we can deliver that value to any waiting instruction, the less time wasted in waiting on the scoreboard.
- In Retire, the `rl_exception` actions can be folded into the rule where the exception is detected, avoiding another delay, except possibly in the CSRRxx case, where there may be contention with CSR access.

We will discuss how to do some of these optimizations in Chapter 18.

It is not straightforward to judge whether an optimization is worth it or not. Many optimizations add hardware cost (silicon area and energy consumption). Some optimizations, while reducing the number of clocks for a computation, may require slower clock speeds because of longer combinational paths, thus negating some of the speed advantage.

Because of the high variability in the mix of instructions across different programs, some optimizations may have more effect on some programs than on others.

Ultimately, the value of most optimizations cannot be judged in the abstract, only in specific contexts: What are the actual application programs that will be run? Does this optimization improve the performance of those programs? At what hardware and energy cost?



# Chapter 18

## BSV: Rules and Methods II: Improved performance with CRegs (Concurrent Registers)

### 18.1 Introduction

In Section 14.5.1 we discussed constraints on mapping rules into clocked digital hardware. In particular, executing two rules on the same clock edge may result in an ordering conflict, and therefore the compiler produces a Rule Controller to suppress simultaneous firing. This has some consequences on the performance of BSV programs, depending on the primitive modules they use. We illustrate this problem in Section 18.2, and then show a general-purpose solution in Section 18.3 using “*Concurrent Registers*” (or CRegs).

Finally, in Section 18.6 we show an older solution called RWires. We deprecate use of RWires in favor of CRegs, but they are still supported by the compiler (Drum and Fife codes do not use any RWires).

### 18.2 Example: Counter with .incr and .decr methods

Consider the following “Up-Down Counter” module interface:

```
interface Up_Down_Counter_IFC;
   method Action incr;
   method Action decr;
   method Bit #(4) val;
endinterface
```

The specification for any module implementing this interface is as follows: internally, the module should contain a register that can hold values in the range 0..15, and the methods do the following:

- `incr` increments the counter if it is < 15
- `decr` decrements the counter if it is > 0
- `val` returns the current value of the counter

Up-down counters are often used for “credit-based flow control”. For example, a networking application may send network packets and, concurrently, receive acknowledgments for previously sent packets, but the network may not accommodate more than 15 such packets. Suppose the register is initialized to 15. This represents the number of available “credits”, *i.e.*, the number of packets that can be sent without acknowledgement. We decrement it whenever a packet is sent, and increment it whenever an acknowledgement is received. Thus, it prevents us from sending more than 15 packets without acknowledgement.
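The credit scheme can be tried out in software before looking at the hardware. The Python class below is only a stand-in for the Up-Down Counter, with invented names. One difference from BSV should be kept in mind: here a method whose condition is false returns `False`, whereas in BSV the method condition prevents the invoking rule from firing at all.

```python
class CreditCounter:
    """Software stand-in for the Up-Down Counter used as a credit counter."""

    MAX = 15

    def __init__(self):
        self.value = self.MAX      # all credits are available initially

    def decr(self) -> bool:
        """Consume one credit to send a packet; False if none are left."""
        if self.value == 0:
            return False
        self.value -= 1
        return True

    def incr(self) -> bool:
        """Return one credit when an acknowledgement arrives."""
        if self.value == self.MAX:
            return False
        self.value += 1
        return True


credits = CreditCounter()

# Try to send 20 packets while no acknowledgements arrive.
sent = sum(1 for _ in range(20) if credits.decr())
print(sent, credits.value)

# One acknowledgement arrives, so exactly one more packet may be sent.
credits.incr()
print(credits.decr(), credits.decr())
```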

Here is a proposed module to implement the interface:

```
module mkUp_Down_Counter_I (Up_Down_Counter_IFC);
   // STATE
   Reg #(Bit #(4)) rg_counter <- mkReg (15);

   // -----
   // INTERFACE
   method Action incr () if (rg_counter != 15);
      rg_counter <= rg_counter + 1;
   endmethod

   method Action decr () if (rg_counter != 0);
      rg_counter <= rg_counter - 1;
   endmethod

   method Bit #(4) val;
      return rg_counter;
   endmethod
endmodule
```

### 18.2.1 Semantic and Performance Analysis of `mkUp_Down_Counter_I` when mapped to clocked hardware

Suppose we have three rules, each invoking one of the three methods on the same module. Which pairs of rules can fire on the same clock?

- `val` and `incr`?

Yes: Clocked execution is consistent with rule-at-a-time semantics where the rule with `val` fires before the rule with `incr`. (And similarly, a rule invoking `val` can fire on the same clock as a rule invoking `decr`.)

- `incr` and `decr`?

No, for both reasons discussed in Section 14.5.1. There would be an action/resource conflict because both rules cannot write the same register at the same instant. There would be an ordering conflict because rule-at-a-time semantics demands that one rule’s update is visible to the other (in whichever order they may fire).

The key takeaway is that `incr` and `decr` cannot be invoked on the same clock (same instant). To revisit our credit-based flow-control example: on each clock we can either send a packet or receive an acknowledgement, but not both.

### 18.3 Concurrent Registers CRegs, and a Faster Up-Down Counter

A Concurrent Register, or `CReg`, is a module provided by the `bsc` library. This is an excerpt from the library documentation:

```
module mkCReg #(parameter Integer n,
                parameter a_type resetval)
               (Reg#(a_type) ifc[])
```

The interface is an *array* of register interfaces, indicated by the square brackets. A `CReg` module is instantiated with syntax like this:

```
// parameter n is 2, resetval is 15
Array #(Reg #(Bit #(4))) crg_counter <- mkCReg (2, 15);
```

This instantiates a `CReg` whose interface is an array of 2 register interfaces, and whose reset value is 15. Each interface can be accessed by indexing: `crg_counter [0]` and `crg_counter [1]`.

The key properties of a `CReg` `x` with an array of `n` register interfaces are:

- All the methods can be invoked in the same clock.
- A read at the $j$'th register interface, *i.e.*, `x[j]._read`, returns the latest of:
  - the value in the register;
  - if `x[0]._write(v_0)` is being invoked, the value $v_0$;
  - if `x[1]._write(v_1)` is being invoked, the value $v_1$;
  - ...
  - if `x[j - 1]._write(v_{j-1})` is being invoked, the value $v_{j-1}$;
- The register value is updated with the latest of:
  - the current value in the register;
  - if `x[0]._write(v_0)` is being invoked, the value $v_0$;
  - if `x[1]._write(v_1)` is being invoked, the value $v_1$;
  - ...
  - if `x[n - 1]._write(v_{n-1})` is being invoked, the value $v_{n-1}$.
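These rules are easy to experiment with in a small behavioral model. The Python class below is a sketch written for this purpose, not the library implementation. It has one limitation that hardware does not have: because Python executes sequentially, the writes belonging to a clock must be registered (in any order) before the reads of that clock are evaluated.

```python
class CReg:
    """Behavioral model of an n-port CReg, one clock at a time."""

    def __init__(self, n, resetval):
        self.n = n
        self.value = resetval
        self.pending = {}          # port index -> value written in this clock

    def write(self, j, v):
        assert 0 <= j < self.n
        self.pending[j] = v

    def read(self, j):
        """Value seen at port j: stored value overridden by writes on ports < j."""
        assert 0 <= j < self.n
        return self._latest(j)

    def _latest(self, upto):
        v = self.value
        for k in range(upto):      # later (higher) ports override earlier ones
            if k in self.pending:
                v = self.pending[k]
        return v

    def tick(self):
        """Clock edge: the write on the highest port, if any, is stored."""
        self.value = self._latest(self.n)
        self.pending.clear()


x = CReg(3, 15)
x.write(0, 7)
x.write(1, 9)
print(x.read(0), x.read(1), x.read(2))   # 15 7 9
x.tick()
print(x.read(0))                         # 9
```

Port 0 never sees a write from the same clock, while port `j` sees every write on ports `0` to `j-1`. That ordering is what gives each port its place in the equivalent rule-at-a-time schedule.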

#### 18.3.1 A possible hardware implementation of a `CReg`

Figure 18.1 shows a possible hardware implementation of a `CReg`. On the left is a regular D flip-flop. To its right is a combinational circuit composed of a series of $n$ multiplexers, where $n$ is the interface array size for the `CReg`. The top and bottom depict the $n$ `._write` and `._read` methods, respectively.

The $j^{th}$ multiplexer selects either the output of the previous element (D flip-flop or multiplexer), or the data argument of the `[j]._write` method. The selection is controlled by the EN input of the `[j]._write` method. Thus, if the `[j]._write` method is currently being invoked, its data is fed onward; otherwise the data from the previous element is fed onward.

The `[j]._read` method returns the value just before the $j$'th multiplexer.

A little study of the diagram will show that it indeed implements the semantics described in the previous section.



Figure 18.1: A possible hardware implementation of a CReg

### 18.3.2 A Faster Up-Down Counter, using a CReg

Here is an alternative module implementing our Up-Down Counter interface, using a CReg:

```
module mkUp_Down_Counter_I (Up_Down_Counter_IFC);
   // STATE
   Array #(Reg #(Bit #(4))) rg_counter <- mkCReg (2, 15);

   // -----
   // INTERFACE
   method Action incr () if (rg_counter [0] != 15);
      rg_counter [0] <= rg_counter [0] + 1;
   endmethod

   method Action decr () if (rg_counter [1] != 0);
      rg_counter [1] <= rg_counter [1] - 1;
   endmethod

   method Bit #(4) val;
      return rg_counter [0];
   endmethod
endmodule
```

Now, three rules invoking the three methods can all fire on the same clock. Further, when they fire on the same clock, the choice of index answers the question, “What value does `val` return?”:

- The original value in the CReg?
- Or the value after the increment?
- Or the value after the decrement?
- Or the value after the increment and decrement?

Per the semantics of `CReg`, it returns `rg_counter[0]`, which is the original value in the CReg. If we were to return `rg_counter[1]`, it would return the value after the increment and before the decrement. If our CReg had an array of 3 register interfaces instead of 2, we could return `rg_counter[2]` which would be the value after both the increment and decrement.
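The three candidate return values can be tabulated with a few lines of Python. This is a sketch under the assumption that the condition of `incr` reads port 0 and the condition of `decr` reads port 1, matching the ports each method writes. The function name and the tuple it returns are invented for illustration; `port2` is the view a third interface would give if the CReg had one.

```python
def counter_clock(value, do_incr, do_decr):
    """Views of the counter during one clock in which incr and/or decr are requested.

    Returns (port0, port1, port2):
      port0: the original value (what val returns in the module above)
      port1: the value after incr, before decr
      port2: the value after both, which is also the value stored at the clock edge
    """
    port0 = value
    incr_fires = do_incr and port0 != 15    # incr condition, read at port 0
    port1 = port0 + 1 if incr_fires else port0
    decr_fires = do_decr and port1 != 0     # decr condition, read at port 1
    port2 = port1 - 1 if decr_fires else port1
    return port0, port1, port2


print(counter_clock(5, True, True))     # (5, 6, 5)
print(counter_clock(0, True, True))     # (0, 1, 0)
print(counter_clock(15, True, True))    # (15, 15, 14)
```

The second case shows the benefit for credit-based flow control: with zero credits, a packet can still be sent in the same clock in which an acknowledgement arrives, because `decr` sees the credit that `incr` has just returned.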

## 18.4 Example: Using a CReg for the RISC-V CSR `mcycle`

One of the RISC-V CSRs is `mcycle`. This CSR is incremented by the hardware automatically on every clock, to count total clock cycles. But this CSR can also be written from RISC-V code using a CSRRxx instruction (which was described in Section 2.7.2). Suppose the CSR contains the value  $n$ , and we are executing a CSRRxx instruction that wants to write  $m$  to the CSR. After the clock instant, should the value be  $n+1$ ? Or  $m$ ? Or  $m+1$ ? Or something else? The RISC-V specification says it should be  $m$ , the value written by the CSRRxx instruction.

This can be cleanly expressed using a CReg.

```
_____ from src_Common/CSRs.bsv _____
Array #(Reg #(Bit #(64))) csr_mcycle <- mkCReg (2, 0);
```

The automatic increment in every cycle is performed by this rule in the CSRs module:

```
_____ from src_Common/CSRs.bsv _____
rule rl_count_cycles;
   csr_mcycle [0] <= csr_mcycle [0] + 1;
endrule
```

To execute a CSRRxx instruction, a rule in the CPU invokes a method in the CSRs module which contains this code:

```
_____ from src_Common/CSRs.bsv _____
csr_mcycle [1] <= csr_val;
```

The use of the indexes [0] and [1] ensures that when a CSRRxx instruction is executed, its value will override the incremented value written by `rl_count_cycles`.
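The effect of the two indexes can be checked with a one-function Python model of a single clock. The function name and its optional argument are invented for this sketch; `None` stands for "no CSRRxx write to `mcycle` in this clock".

```python
MASK64 = (1 << 64) - 1


def mcycle_next(n, csrr_write=None):
    """Value of mcycle after one clock, given its current value n."""
    value = (n + 1) & MASK64       # rl_count_cycles writes through port [0]
    if csrr_write is not None:     # the CSRRxx write uses port [1], which is later
        value = csrr_write & MASK64
    return value


print(mcycle_next(100))            # 101: no CSRRxx write, plain increment
print(mcycle_next(100, 7))         # 7: the CSRRxx write wins
```

If the two indexes were swapped, the increment would be the later write. It would also read port 1 and so see the value written by CSRRxx, and the stored result would be $m+1$ rather than the $m$ the specification asks for.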

## 18.5 PipelineFIFOs and BypassFIFOs

Consider the following module implementing a 1-element FIFO with a FIFOF interface:

```
module mkFIFOF (FIFOF #(Bit #(32)));
   Reg #(Bit #(32)) rg_data <- mkRegU;
   Reg #(Bool)      rg_full <- mkReg (False);

   // -----
   // INTERFACE

   method Bool notEmpty ();
      return rg_full;
   endmethod

   method Bit #(32) first () if (rg_full);
      return rg_data;
   endmethod

   method Action deq () if (rg_full);
      rg_full <= False;
   endmethod

   method Bool notFull ();
      return (! rg_full);
   endmethod

   method Action enq (Bit #(32) x) if (! rg_full);
      rg_data <= x;
      rg_full <= True;
   endmethod

   method Action clear;
      rg_full <= False;
   endmethod
endmodule
```

Consider a producer rule `rl_P` that invokes `enq`, and a consumer rule `rl_C` that invokes `first` and `deq`, as shown in Figure 18.2.



Figure 18.2: A producer and consumer connected with a FIFO

These rules cannot fire at the same instant (on the same clock) because both of them read and write register `rg_full` (both `enq` and `deq` read the register in their method conditions and write it in their bodies).

Let us use a CReg to relax this constraint, in two different ways.

### 18.5.1 PipelineFIFO

```
module mkPipelineFIFO (FIFOF #(Bit #(32)));
   Reg #(Bit #(32))     rg_data  <- mkRegU;
   Array #(Reg #(Bool)) crg_full <- mkCReg (3, False);

   // -----
   // INTERFACE

   method Bool notEmpty ();
      return crg_full [0];
   endmethod

   method Bit #(32) first () if (crg_full [0]);
      return rg_data;
   endmethod

   method Action deq () if (crg_full [0]);
      crg_full [0] <= False;
   endmethod

   method Bool notFull ();
      return (! crg_full [1]);
   endmethod

   method Action enq (Bit #(32) x) if (! crg_full [1]);
      rg_data      <= x;
      crg_full [1] <= True;
   endmethod

   method Action clear;
      crg_full [2] <= False;
   endmethod
endmodule
```

Again, consider a producer rule `rl_P` that invokes `enq`, and a consumer rule `rl_C` that invokes `first` and `deq`. Now, these rules *can* fire together at the same instant (on the same clock).

Because of the choice of indexes we can see that, in the equivalent rule-at-a-time semantics, `rl_C` fires *before* `rl_P`. Thus, even if the FIFO was full at the start of the clock, `rl_P` can still `enq` into the FIFO, *provided* `rl_C` is firing on the same clock. Per the rule-at-a-time semantics, `rl_C` fires first, which empties the FIFO, *i.e.*, `rl_P` sees the FIFO as empty and is therefore able to `enq` into it.

As a separate observation, we can see that another rule invoking `clear`, too, can fire on the same clock as `rl_P` and `rl_C` and, because of its index, logically fires “last”, and so will leave the FIFO in a finally empty state.

If we examine the Verilog generated for `mkPipelineFIFO`, we will see that the READY signal for `enq` incorporates the ENABLE signal of the `deq` method (because, if the FIFO is full at the previous clock, in this clock `enq` can only be invoked if `deq` is also being invoked). Thus, there is a *combinational path* backward through the FIFO (a path that only involves wires and gates and no state-element) from the `deq` method to the `enq` method. This is illustrated in Figure 18.3.



Figure 18.3: Combinational path through `mkPipelineFIFO`

This kind of FIFO is called a “Pipeline FIFO” because it is an ideal candidate for the FIFO between stages of a pipeline. It allows the downstream stage (the consumer) and the upstream stage (the producer) to fire on the same clock, advancing data in a pipelined manner. It is so useful that it is available in the `bsc` library in the `SpecialFIFOs` package.
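The pipeline behavior can be reproduced in a small cycle-level Python model. This is a sketch with invented names, and `clear` is omitted. Each call to `clock()` represents one clock, and the three local variables `full0`, `full1` and `full2` play the role of the three views of the `crg_full` CReg: the value seen by `deq`, the value seen by `enq`, and the value stored at the clock edge.

```python
class PipelineFIFO1:
    """Cycle-level model of a 1-element pipeline FIFO (deq logically before enq)."""

    def __init__(self):
        self.full = False
        self.data = None

    def clock(self, want_deq, enq_value=None):
        """One clock. enq_value=None means no enq is attempted.

        Returns (dequeued value or None, whether the enq was accepted).
        """
        full0 = self.full                      # view at port 0 (first/deq)
        deq_fires = want_deq and full0
        out = self.data if deq_fires else None
        full1 = False if deq_fires else full0  # view at port 1 (enq)
        enq_fires = (enq_value is not None) and not full1
        if enq_fires:
            self.data = enq_value
        full2 = True if enq_fires else full1   # value stored at the clock edge
        self.full = full2
        return out, enq_fires


f = PipelineFIFO1()
print(f.clock(want_deq=False, enq_value=1))   # (None, True): fill the FIFO
print(f.clock(want_deq=True, enq_value=2))    # (1, True): full, yet deq and enq both fire
print(f.clock(want_deq=False, enq_value=3))   # (None, False): full and nobody dequeues
```

The second call is the pipeline case described above: the FIFO is full at the start of the clock, but `enq` succeeds because `deq` fires in the same clock. The dependence of `enq_fires` on `deq_fires` in the model corresponds to the backward combinational path in the hardware.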

### 18.5.2 BypassFIFO

In this version, we change the CReg indexes used by the methods.

```
module mkBypassFIFO (FIFOF #(Bit #(32)));
   Array #(Reg #(Bit #(32))) crg_data <- mkCRegU (2);
   Array #(Reg #(Bool))      crg_full <- mkCReg (3, False);

   // -----
   // INTERFACE

   method Bool notEmpty ();
      return crg_full [1];
   endmethod

   method Bit #(32) first () if (crg_full [1]);
      return crg_data [1];
   endmethod

   method Action deq () if (crg_full [1]);
      crg_full [1] <= False;
   endmethod

   method Bool notFull ();
      return (! crg_full [0]);
   endmethod

   method Action enq (Bit #(32) x) if (! crg_full [0]);
      crg_data [0] <= x;
      crg_full [0] <= True;
   endmethod

   method Action clear;
      crg_full [2] <= False;
   endmethod
endmodule
```

Again, consider a producer rule `rl_P` that invokes `enq`, and a consumer rule `rl_C` that invokes `first` and `deq`. These rules *can* fire together at the same instant (on the same clock).

Because of the choice of indexes we can see that, in the equivalent rule-at-a-time semantics, `rl_P` fires *before* `rl_C`. Thus, even if the FIFO was empty at the start of the clock, `rl_C` can still read `first`, and `deq` the FIFO, *provided* `rl_P` is firing on the same clock. Per the rule-at-a-time semantics, `rl_P` fires first, which fills the FIFO, *i.e.*, `rl_C` sees the FIFO as full and is therefore able to use `first` and `deq`.

As a separate observation, we can see that another rule invoking `clear`, too, can fire on the same clock as `rl_P` and `rl_C` and, because of its index, logically fires “last”, and so will leave the FIFO empty at the end of the clock.

If we examine the Verilog generated for `mkBypassFIFO`, we will see that the READY signal for `first` and `deq` incorporates the ENABLE signal of the `enq` method (because, if the FIFO is empty at the previous clock, in this clock `first` and `deq` can only be invoked if `enq` is also being invoked). Further, when `enq` and `first` are invoked in the same clock, the data argument to `enq` is passed straight through to the data result of `first`. Thus, there are several *combinational paths* forward through the FIFO (paths that only involve wires and gates and no state-element) from the `enq` method to the `first` and `deq` methods. This is illustrated in Figure 18.4.



Figure 18.4: Combinational paths through `mkBypassFIFO`

This kind of FIFO is called a “Bypass FIFO” because of the way data can “bypass” the FIFO from `enq` to `first`. It is so useful that it is available in the `bsc` library in the `SpecialFIFOs` package.
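To make the “bypass” behavior concrete, here is a minimal Python sketch of the same CReg port discipline (`enq` on port 0, `first`/`deq` on port 1). The class and function names are invented for this illustration and are not part of the `bsc` library; the model captures only the logical ordering of methods within a clock.

```python
class CReg:
    """Toy model of a concurrent register (illustrative only)."""

    def __init__(self, init):
        self.value = init
        self.writes = {}       # port index -> value written during this clock

    def read(self, port):
        # A read on port i sees the latest write on a lower port, else the stored value.
        lower = [p for p in self.writes if p < port]
        return self.writes[max(lower)] if lower else self.value

    def write(self, port, v):
        self.writes[port] = v

    def tick(self):
        if self.writes:
            self.value = self.writes[max(self.writes)]
        self.writes = {}


class BypassFIFOModel:
    """One-element FIFO with enq on port 0 and first/deq on port 1."""

    def __init__(self):
        self.data = CReg(None)
        self.full = CReg(False)

    def can_enq(self):
        return not self.full.read(0)

    def enq(self, x):
        self.data.write(0, x)
        self.full.write(0, True)

    def can_deq(self):
        return self.full.read(1)

    def first(self):
        return self.data.read(1)

    def deq(self):
        self.full.write(1, False)

    def tick(self):
        self.data.tick()
        self.full.tick()


def one_clock(fifo, new_item, consumer_ready=True):
    """Fire rl_P and then rl_C (their logical order) within one clock."""
    produced = fifo.can_enq()
    if produced:                                # rl_P: enq
        fifo.enq(new_item)
    consumed = None
    if consumer_ready and fifo.can_deq():       # rl_C: first + deq
        consumed = fifo.first()
        fifo.deq()
    fifo.tick()
    return produced, consumed


fifo = BypassFIFOModel()
print(one_clock(fifo, 10))                         # (True, 10): datum bypasses an empty FIFO
print(one_clock(fifo, 11, consumer_ready=False))   # (True, None): datum is buffered
print(one_clock(fifo, 12, consumer_ready=False))   # (False, None): FIFO is full, enq blocked
print(one_clock(fifo, 13))                         # (False, 11): buffered datum is consumed
```

Note the last clock: unlike the Pipeline FIFO, a full Bypass FIFO cannot accept an `enq` on the same clock as a `deq`, because `enq` reads port 0 and does not see the `deq`.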

### 18.5.3 Back-to-back compositions of BypassFIFO and PipelineFIFO

An interesting component is a back-to-back composition of the two FIFOs discussed in the previous sections. Consider this code fragment:

```
module mk... (...);
    FIFO #(Bit #(32)) f_bypass   <- mkBypassFIFO ();
    FIFO #(Bit #(32)) f_pipeline <- mkPipelineFIFO ();

    // Producer rule (into f_bypass's enq side)
    rule rl_P;
        ... f_bypass.enq (x);
    endrule

    // Connect f_bypass's first/deq side to f_pipeline's enq side
    mkConnection (to_FIFO_O (f_bypass), to_FIFO_I (f_pipeline));

    // Consumer rule (from f_pipeline's first/deq side)
    rule rl_C;
        let y = f_pipeline.first;
        f_pipeline.deq;
    endrule
endmodule
```

This is illustrated in Figure 18.5.



Figure 18.5: A producer and consumer connected with a composed BypassFIFO-PipelineFIFO.

This composition has some pleasant properties:

- Normally, if we connect two FIFOs back-to-back like this, it will always take a minimum of two ticks for a datum to traverse through (from `enq` to `first`/`deq`): one tick in the first FIFO and one tick in the second. With this composition, on the other hand, because of the Bypass FIFO, it can traverse in a single tick.
- Like the ordinary `bsc` library `mkFIFO`, and unlike `mkPipelineFIFO` and `mkBypassFIFO` by themselves, the producer rule `rl_P` and the consumer rule `rl_C` have no ordering constraints (they are conflict free). They can fire in the same clock and can go in either logical order.
- There are no combinational paths through the pair of FIFOs! One can verify this by generating the Verilog for the composed FIFOs and analyzing it (which, fortunately, does not have to be done by hand, because when the `bsc` compiler generates Verilog for a module, it helpfully reports combinational paths, if any, through the module).

These properties make it very attractive to use this composition to connect stages in a pipeline, enabling a *modular* separation of stages. We can place `mkBypassFIFO` in the module for the upstream stage, place `mkPipelineFIFO` in the module for the downstream stage, and connect them in the parent module with `mkConnection`. This is illustrated in Figure 18.6. We use this structure extensively and exclusively in Fife to connect its stages.



Figure 18.6: Pipeline-stage modularity enabled by composing a BypassFIFO and a PipelineFIFO

The only downside is that the composition has two data registers, one each in `mkBypassFIFO` and `mkPipelineFIFO`. In many applications, this does not matter much, *i.e.*, the overall state size is dominated by other components, and this addition is in the noise.

#### Exercise 18.1:

What happens if we reverse the composition—use `mkPipelineFIFO` upstream and `mkBypassFIFO` downstream?

- How many ticks will it take for a datum to traverse the pair?
- Will `rl_P` and `rl_C` still be conflict-free (and therefore can fire in either logical order)?
- Are there any combinational paths through the pair?
- Consider combinational paths in rule `rl_P` ending at the `enq` method of the first FIFO, and combinational paths in rule `rl_C` starting at the `first` method of the second FIFO. What is the difference between these when we compose `mkBypassFIFO` → `mkPipelineFIFO` vs. `mkPipelineFIFO` → `mkBypassFIFO`?

□

## 18.6 Alternatives to CReg: RWires and their variants (deprecated)

In digital hardware,

- a register communicates a value across clocks (from one clock to succeeding clocks), and
- a wire communicates a value within a clock (from state element to gate, from gate to gate, and from gate to state element).

A BSV register is a standard digital hardware register. A BSV CReg plays the role of both register and wires, communicating values flexibly within a clock and across clocks, depending on whether its methods are invoked within a clock or across clocks. The concept of CRegs was first proposed by Daniel Rosenband and Arvind at MIT [20, 21] (they were originally called “Ephemeral History Registers” or EHRs).

Before the invention of CRegs, BSV had a facility to manage intra-clock communication, called the “RWire”. These are still available in BSV, along with several specializations: “Wire”, “BypassWire”, “DWire”, and “PulseWire”. Please see the *bsc* library documentation for detailed information.

CRegs are semantically preferable to RWires because they fit cleanly into rule-at-a-time semantics, where a CReg can be treated as an ordinary register. The semantics of a BSV program with CRegs can be explained purely at the rule-at-a-time level, with no mention of clocks. RWires, on the other hand, are only meaningful with clocks, and are therefore semantically messier.

Since the inclusion of CRegs into the *bsc* library, the need for `RWire` and its variants has mostly disappeared.



# Chapter 19

## RISC-V: Optimizing Drum and Fife

## 19.1 Introduction

There are three physical dimensions on which we may want to optimize a CPU design:

- Time (performance): The speed (wall-clock time) at which a desired application is executed by the CPU.
- Space (area/resources): The silicon area or resources occupied by the design. For ASICs this may be measured in square millimeters or in “number of gates”. For FPGAs this may be measured in LUTs (Lookup Tables), BRAMs, DSPs and so on. Smaller is better, for many reasons (better silicon yields, lower power consumption, *etc.*).
- Energy: The amount of energy consumed to execute a desired application. Smaller is better.

These are not independent dimensions; improving one dimension often costs something in another dimension. Ultimately, a competitive product specification for a particular target market will define the acceptable boundaries for each dimension.

We will not say much in this book about energy optimization, even though it is of course increasingly important in modern times, both in the macro sense (reduce energy consumption for a greener planet) and in the micro sense (mobile devices and IoT (Internet of Things) devices work on battery power or energy harvested from the environment). Modern designs have many techniques to slow down or switch off clocks, and reduce circuit-supply voltage for less critical or idle circuits, both of which reduce energy consumption.

Regarding performance, it is important to keep in mind that the ultimate number that matters is *application performance*, *i.e.*, how long it takes to execute an application of interest. One will often hear other numbers cited, such as clock speed, instructions per clock (IPC) or its inverse, clocks per instruction (CPI), or total number of instructions. All these contribute to application performance; none of them individually determines it. Conceptually, the total execution time for an application is:

$$\langle \text{exec time} \rangle = \langle \text{total number of instrs} \rangle \times 1/\langle \text{clock-speed} \rangle \times \text{CPI}$$

the units being:

$$\text{seconds} = \text{instructions} \times \text{seconds/cycle} \times \text{cycles/instruction}$$
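As a quick sanity check of the formula, the following Python sketch evaluates it for two hypothetical designs. All numbers are invented for illustration; they are not measurements of Drum, Fife or any real CPU.

```python
def exec_time_seconds(num_instrs, clock_hz, cpi):
    """instructions x (seconds/cycle) x (cycles/instruction), with seconds/cycle = 1/clock_hz."""
    return num_instrs * cpi / clock_hz

# Hypothetical design A: slower clock, lower average CPI
design_a = exec_time_seconds(1_000_000, 100e6, 1.5)
# Hypothetical design B: faster clock, higher average CPI
design_b = exec_time_seconds(1_000_000, 150e6, 2.5)

print(design_a)              # 0.015
print(design_a < design_b)   # True
```

Here design B has the higher clock speed but still takes longer to run the same program, which is why no single term can stand in for application performance.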

For a given application and a given set of input data, total number of instructions is a function of the ISA and compiler quality. ISAs like x86 or the older DEC PDP-11, DEC Vax, and Motorola 68000 were *Complex Instruction Sets*, where a single “CISC” instruction could express more work, such as a memory read, an integer op, and a memory write. RISC-V is a *Reduced Instruction Set*, where that same work needs separate “RISC” instructions for LOAD, Integer and STORE. And, of course, a better compiler may produce a smaller program (fewer instructions) from the same source code.

CPI is not a constant. First, it may vary inherently across different kinds of instructions: an integer ALU instruction may take fewer cycles than a memory-access instruction or a floating-point instruction. Second, even for a particular kind of instruction, CPI may vary; for example, the number of cycles for a memory-access instruction may depend on hits/misses in caches, hits/misses in virtual memory TLBs (Translation Lookaside Buffers), *etc.* Third, an instruction that reads a register may stall for a number of cycles because a preceding instruction has not yet written the register. Thus, in the above formula, CPI is just an average CPI across the application.
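One way to picture “average CPI” is as a weighted average over the dynamic instruction mix. The sketch below uses a made-up mix (counts and per-class cycle costs are assumptions, not data from Drum or Fife).

```python
def average_cpi(mix):
    """mix: list of (instruction_count, cycles_per_instruction) pairs."""
    total_instrs = sum(n for n, _ in mix)
    total_cycles = sum(n * c for n, c in mix)
    return total_cycles / total_instrs

# Hypothetical mix: ALU ops, cache-hit memory ops, cache-miss memory ops
mix = [(700, 1), (200, 3), (100, 10)]
print(average_cpi(mix))   # 2.3
```

Even though 70% of the instructions in this mix take a single cycle, the small fraction of slow instructions pulls the average well above 1.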

Clock speed may not be constant; in some implementations, clocks are slowed down, or even switched off for non-performance-critical components during intervals that do not demand high performance (*e.g.*, “idling”).

The terms in the formula above need to be optimized simultaneously for the best product. They are not independent: higher clock speeds restrict how much “circuit work” can be done in a single clock which, in turn, affects microarchitecture (*e.g.*, we may need to split an operation into multiple pipeline stages) which, in turn, can affect instructions/cycle. A CISC ISA may get more work done with fewer instructions, but a RISC ISA may be implementable with much higher clock speeds and more pipeline parallelism.

Instructions/cycle depends on microarchitecture. The more parallelism we can exploit, the higher the instructions/cycle that can be achieved. A common term to refer to this measure is “ILP” (Instruction-Level Parallelism).

For example, in both Drum and Fife, any particular instruction takes five or more cycles, from Fetch to Retire. Drum, having an FSM microarchitecture, therefore retires one instruction every five or more cycles. Fife, which executes many instructions in parallel in its pipelined microarchitecture, can often retire one instruction per cycle.

An advanced microarchitecture with more parallelism is the *superscalar* microarchitecture, which may fetch and execute 2 or 4 (or more) instructions at a time.

Another advanced microarchitecture with more parallelism is the *out-of-order* microarchitecture which may, for example, have multiple integer execution units, and allow multiple integer instructions to execute at a time, or even out of order, depending on when their input data is available. The cache in its DMem may be “non-blocking” or support “hits-under-misses”, *i.e.*, for two memory accesses $a_1$ and $a_2$ that arrive in that order, it permits $a_2$ to be serviced while $a_1$ may be awaiting a cache-refill due to a cache-miss.

Superscalarity and Out-of-order microarchitectures are beyond the scope of this book. In the rest of this chapter we will discuss improving the resources and performance of Drum and Fife.

## 19.2 Pipeline traces and visualization to analyze performance

Before trying to optimize a CPU, we must first analyze and understand the existing implementation thoroughly so that we can then identify opportunities for improvement.

The first step is to produce a *pipeline trace*, which has more fine-grain detail than a mere *instruction trace* (which only records the trace of instructions retired). We record, for every instruction, its transit through *each* step/stage of the FSM (Drum) or pipeline (Fife).

For example, if we examine the Fife code for rule `rl_Decode` we see an invocation of a chain of functions:

```
src_Fife/S2_Decode.bsv
log_Decode (rg_flog, y, rsp_IMem);
```

→

```
src_Common/Fn_Decode.bsv
function Action log_Decode (File flog, Decode_to_RR y, Mem_Rsp rsp_IMem);
...
ftrace (flog, y.inum, y.pc, y.instr, "D", $format(""));
...
```

→

```
src_Common/Utils.bsv
function Action ftrace (File      flog, ...)
...
$fdisplay (flog, "Trace %0d %0d %0h %0h %s", cur_cycle,
           inum, pc, instr, label, adhoc);
```

This writes a line like this into the logfile produced by Drum and Fife when they are simulated:

```
log.txt
...
Trace 6 2 80000008 fe012e23 D
...
```

which records the fact that on clock tick 6 (cycle 6), the 2<sup>nd</sup> instruction, whose PC is 0x8000\_0008, and whose 32 bits are 0xfe01\_2e23, was processed in the Decode stage. Similarly, we write a trace item for every interesting microarchitectural event for each instruction.

By studying this trace, manually or with analysis software, we can get an idea of exactly how many cycles each instruction takes, and where we may be “losing cycles”, if any.
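As a sketch of what such analysis software might look like, the following Python fragment groups trace lines by instruction number and reports how many ticks an instruction spent in flight. It assumes the field order shown in the `$fdisplay` call above (cycle, inum, PC, instruction, label). The function names are invented here, and only the “D” line in the sample input comes from the text; the other sample lines are made up.

```python
from collections import defaultdict

def parse_trace(lines):
    """Return {inum: [(cycle, label), ...]} built from 'Trace ...' lines."""
    events = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < 6 or fields[0] != "Trace":
            continue                       # skip anything that is not a trace line
        cycle, inum, label = int(fields[1]), int(fields[2]), fields[5]
        events[inum].append((cycle, label))
    return events

def latency(events, inum):
    """Ticks from the first to the last recorded event of an instruction, inclusive."""
    cycles = [c for c, _ in events[inum]]
    return max(cycles) - min(cycles) + 1

log = [
    "Trace 5 2 80000008 fe012e23 F",       # made-up line
    "Trace 6 2 80000008 fe012e23 D",       # the example line from the text
    "Trace 9 2 80000008 fe012e23 RET.D",   # made-up line
    "some unrelated log line",
]
events = parse_trace(log)
print(events[2])            # [(5, 'F'), (6, 'D'), (9, 'RET.D')]
print(latency(events, 2))   # 5
```

Gaps between consecutive events of one instruction (here, between ticks 6 and 9) are the “lost cycles” worth investigating.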

This trace can also be processed and displayed in a “pipeline visualization” tool. If we take the pipeline trace `log.txt` produced by Drum or Fife, we can run it through the Python program `Log_to_CSV.py` provided with Drum and Fife:

```
log.txt
$ Tools/Log_to_CSV/Log_to_CSV.py  log.txt  0 100
```

This selects instruction 0 through instruction 100 from the pipeline trace file, and produces a file `log.txt.csv` in the standard “comma-separated values” (CSV) data format that is accepted by most spreadsheet programs (OpenOffice, Microsoft Excel, Mac Numbers, Google Sheets).

Figure 19.1 shows the display, in the Mac Numbers spreadsheet application, of the pipeline trace for the first few instructions of the “Hello World!” C program running on Drum and Fife, respectively.

In both displays, the vertical axis, going downwards, shows instruction number 1, 2, 3, ... The horizontal axis, going to the right, shows the PC and disassembled instruction in the first two columns, followed by clock ticks. For each instruction, we see its instruction number, PC and disassembly, and then the Drum step/Fife stage events at the clock tick where the event happened.

| inum | PC       | Instr                                                         | 4 | 5 | 10 | 5 | 20   | 5        | 30       | 5        | 40       | 5    |
|------|----------|---------------------------------------------------------------|---|---|----|---|------|----------|----------|----------|----------|------|
| 1    | 80000004 | SW MEM [sp(x2) + 0] := zero(x0) (class_STORE)                 |   |   | F  | D | RR.D | RET.Dreq | RET.Drsp |          |          |      |
| 2    | 80000008 | SW MEM [sp(x2) + fe0] := zero(x0) (class_STORE)               |   |   |    |   | F    | D        | RR.D     | RET.Dreq | RET.Drsp |      |
| 3    | 8000000C | LUI s0fp(x8) := 0000 (class_LUI)                              |   |   |    |   |      |          | F        | D        | RR.I     | EX.I |
| 4    | 80000010 | ADDI s0fp(x8) := s0fp(x8), a (class_ALU_I)                    |   |   |    |   |      |          |          | F        | D        | RR.I |
| 5    | 80000014 | CSRWR zero(x0) := mstatus (csr 300) := s0fp(x8) (class_CSRRx) |   |   |    |   |      |          |          |          | F        | D    |
| 6    | 80000018 | LUI s0fp(x8) := 8001_0000 (class_LUI)                         |   |   |    |   |      |          |          |          |          | F    |
| 7    | 8000001C | SW MEM [s0fp(x8) + c0] := zero(x0) (class_STORE)              |   |   |    |   |      |          |          |          |          |      |
| 8    | 80000020 | CSRWR zero(x0) := mtval (csr 305) := s0fp(x8) (class_CSRRx)   |   |   |    |   |      |          |          |          |          |      |
| 9    | 80000024 | AUIPC t1(x8) := PC+8000 (class_AUIPC)                         |   |   |    |   |      |          |          |          |          |      |
| 10   | 80000028 | ADDI t1(x8) := t1(x8), e4c (class_ALU_I)                      |   |   |    |   |      |          |          |          |          |      |

  

| inum | PC       | Instr                                                         | 3 | 5 | 10   | 5    | 20   | 5     | 30    | 5     |                   |                   |       |       |      |       |    |
|------|----------|---------------------------------------------------------------|---|---|------|------|------|-------|-------|-------|-------------------|-------------------|-------|-------|------|-------|----|
| 1    | 80000004 | SW MEM [sp(x2) + 0] := zero(x0) (class_STORE)                 | F | D | RR.S | RR.S | RR.D | RET.D |       |       |                   |                   |       |       |      |       |    |
| 2    | 80000008 | SW MEM [sp(x2) + fe0] := zero(x0) (class_STORE)               | F | D |      |      | RR.D | RET.D |       |       |                   |                   |       |       |      |       |    |
| 3    | 8000000C | LUI s0fp(x8) := 6000 (class_LUI)                              |   | F |      | D    | RR.I | EX.I  | RET.I | RW    |                   |                   |       |       |      |       |    |
| 4    | 80000010 | ADDI s0fp(x8) := s0fp(x8), a (class_ALU_I)                    |   |   | F    | D    | RR.S | RR.S  | RR.I  | RET.I | RW                |                   |       |       |      |       |    |
| 5    | 80000014 | CSRWR zero(x0) := mstatus (csr 300) := s0fp(x8) (class_CSRRx) |   |   |      | F    | D    |       | RR.S  | RR.S  | RR.dir RET.CSRRxx |                   |       |       |      |       |    |
| 6    | 80000018 | LUI s0fp(x8) := 8001_0000 (class_LUI)                         |   |   |      |      | F    | D     |       | RR.I  | EX.I              | RET.I             | RW    |       |      |       |    |
| 7    | 8000001C | SW MEM [s0fp(x8) + c0] := zero(x0) (class_STORE)              |   |   |      |      |      | F     | D     | RR.S  | RR.S              | RR.D              | RET.D |       |      |       |    |
| 8    | 80000020 | CSRWR zero(x0) := mtval (csr 305) := s0fp(x8) (class_CSRRx)   |   |   |      |      |      |       | F     | D     |                   | RR.dir RET.CSRRxx |       |       |      |       |    |
| 9    | 80000024 | AUIPC t1(x8) := PC+8000 (class_AUIPC)                         |   |   |      |      |      |       |       | F     | D                 | RR.I              | EX.I  | RET.I | RW   |       |    |
| 10   | 80000028 | ADDI t1(x8) := t1(x8), e4c (class_ALU_I)                      |   |   |      |      |      |       |       |       | D                 | RR.S              | RR.S  | RR.I  | EX.I | RET.I | RW |

Figure 19.1: Visualization of the per-instruction step/stage events in Drum and Fife

We can clearly see the difference between FSM sequenced (Drum) and pipelined (Fife) behavior. In Drum, after the first instruction has completed (“RET.Drsp” = Retire DMem Response) in tick 6, the Fetch (“F”) for the second instruction occurs in tick 7. In Fife, after the Fetch for the first instruction on tick 4, the Fetch for the second instruction occurs in tick 5, the very next tick.

Considering IPC, we can see that in Drum, we have barely completed 6 instructions at tick 46, whereas Fife has finished 10 instructions by tick 33. The tradeoff, as suggested earlier, is that Drum takes far fewer resources (LUTs, gates) in silicon compared to Fife.

For each instruction, after its Fetch, horizontal gaps indicate lost cycles, *e.g.*, an instruction stalling at Register Read because it is waiting for an earlier instruction to write that register value, or a memory access waiting for the memory response. In Fife, we can also see which instructions were discarded (more lost cycles) due to PC misprediction or traps.

## 19.3 Optimization opportunities in Drum and Fife

The following sections discuss a number of optimization opportunities for Drum and Fife. Whether they actually produce improvements or not is usually difficult to predict just from observation or analysis, because an optimization’s improvement in one dimension may come at a cost in another and, further, optimizations may interact positively or negatively. Ultimately, one must implement the optimizations and *measure* the actual impacts on clock speed, IPC, application performance, resources, energy consumption, *etc.*, and then make a deployment choice based on actual product priorities.

Each of these opportunities is a potential project for the reader of this book, starting with the existing Drum or Fife code, for more practice with RISC-V and BSV. Almost all of them require no new BSV concepts beyond what has already been covered.

### 19.3.1 Drum and Fife: Fusing Decode and RR-Dispatch

More FSM steps in Drum or more stages in Fife (longer pipelines) will, in general, enable higher clock speeds (because the circuits in each step have less to do, and so have less delay). But more steps/stages increase the latency (number of ticks) of each instruction, and the resources (more register or FIFO buffering between stages).

In some cases it is worthwhile going in the opposite direction, fusing what used to be two steps/stages into one. A functionally easy pair to fuse is Decode and RR-Dispatch. In Drum, the Decode action looks like this:

```
src_Drum/CPU.bsv
Action a_Decode =
action
    let mem_rsp <- pop_o (to_FIFOF_O (f_IMem_rsp));
    let y       <- fn_Decode (rg_Fetch_to_Decode, mem_rsp, rg_flog);
    rg_Decode_to_RR <= y;
    ...
endaction;
```

It writes its output into the `rg_Decode_to_RR` register. The Register-Read-and-Dispatch action looks like this and reads from the register:

```
src_Drum/CPU.bsv
Action a_Register_Read_and_Dispatch =
action
    let x = rg_Decode_to_RR;
    ...
endaction;
```

We can fuse these into a single action, like this:

```
Action a_Decode_and_Register_Read_and_Dispatch =
action
    // Decode part
    let mem_rsp <- pop_o (to_FIFOF_O (f_IMem_rsp));
    let y       <- fn_Decode (rg_Fetch_to_Decode, mem_rsp, rg_flog);
    // RR-and-dispatch part (y is used directly; no intermediate register)
    let x = y;
    ...
endaction;
```

directly passing the output of `Decode` (`y`) as `x` in Register-Read-and-Dispatch. This allows us to eliminate the `rg_Decode_to_RR` register. In the main Drum FSM, we replace the original two sequenced actions:

```
a_Decode;
a_Register_Read_and_Dispatch;
```

with a single action:

```
a_Decode_and_Register_Read_and_Dispatch;
```

The same fusion can be done in Fife as well: we fuse `rl_Decode` (in package `S2_Decode`) with `rl_RR_Dispatch` (in package `S3_RR_RW`) into a single rule, and eliminate the FIFOs connecting the two modules (through which we were passing the `Decode_to_RR` intermediate value).

In both cases, we reduce, by 1 tick, the transit through `Decode` and `Register-Read-and-Dispatch`, and we eliminate the intermediate buffer (register or FIFOs), both of which appear to be improvements.

On the other hand, the depth of the combined step/stage combinational circuit is now the sum of the depths in the original two steps/stages; this may reduce the clock speed at which we can run it.

Fusion may not reduce the achievable clock speed if some other step/stage already limits the achievable speed. In this case we would say that the pipeline was originally “unbalanced” and that fusion has brought it closer to balance (all stages having approximately the same combinational delay). Fusion has merely utilized some timing “slack” that was already available in the two stages.
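The balance argument can be illustrated with a little arithmetic. In this Python sketch, the clock period is taken to be the largest combinational delay of any stage, and fusing two stages adds their delays. The delay values are hypothetical (not measured from Fife), and register setup and clock-to-output overheads are ignored.

```python
def clock_period(stage_delays_ns):
    """The slowest stage sets the clock period."""
    return max(stage_delays_ns)

def fuse(stage_delays_ns, i):
    """Fuse stage i with stage i+1; their combinational delays add."""
    d = list(stage_delays_ns)
    return d[:i] + [d[i] + d[i + 1]] + d[i + 2:]

# Hypothetical delays (ns) for Fetch, Decode, RR-Dispatch, Execute, Retire
delays = [4.0, 1.5, 2.0, 4.0, 3.0]

fused_d_rr = fuse(delays, 1)    # fuse Decode with RR-Dispatch
print(fused_d_rr)                                        # [4.0, 3.5, 4.0, 3.0]
print(clock_period(delays), clock_period(fused_d_rr))    # 4.0 4.0

fused_rr_ex = fuse(delays, 2)   # fuse RR-Dispatch with Execute instead
print(clock_period(fused_rr_ex))                         # 6.0
```

With these numbers, fusing Decode and RR-Dispatch only uses up slack (the period stays at 4.0 ns), whereas fusing RR-Dispatch and Execute would create a new critical stage.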

### 19.3.2 Drum and Fife: Fusing some Retire actions

In both Drum and Fife, exception handling is done by itself in a separate rule. In Drum, it is the last, separate action in the FSM, `a_exception`, in module `mkCPU` in `src_Drum/CPU.bsv`. In Fife, it is in rule `rl_exception` in module `mkRetire` in `src_Fife/S5_Retire.bsv`.

In both cases, the exception is actually detected in earlier rules, which save values in three registers `rg_epc`, `rg_cause` and `rg_tval` which are then used by the exception handling action.

Similar to the fusion idea of Section 19.3.1, the exception-handling action can be fused directly into the earlier actions where the exception was detected, and the three registers can be eliminated.

This transformation to reduce ticks for exception-handling may be less important because, almost by definition, exceptions should be rare, and therefore saving this tick may not affect application performance very much.

### 19.3.3 Drum (rules version): short-circuiting steps

In the “rules” version of Drum (in `src_Drum/Drum_Rules.bsv`, described in Chapter 15), the “retire” rules (`rl_Retire_direct`, `rl_Retire_Control`, `rl_Retire_Int` and `rl_Retire_DMem`) set the next action to `A_EXCEPTION`.

This, in turn, enables the next rule `rl_exception` which tests whether there actually was an exception, and if not, sets the next action to `A_FETCH`.

This structure emerged from blindly and mechanically mimicking the structure of the `StmtFSM` version of Drum.

Instead, in each of the “retire” rules, we can replace:

```
rg_action <= A_EXCEPTION;
```

with:

```
rg_action <= (rg_exception ? A_EXCEPTION : A_FETCH);
```

allowing it to “jump” back to `FETCH` immediately, instead of spending another tick in rule `rl_exception`.

This kind of “jump” is not possible in `StmtFSM`, which only allows structured compositions of sequencing and if-then-else.
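The effect on tick counts can be estimated with a toy state machine in which one rule fires per tick. In the Python sketch below, only `A_FETCH` and `A_EXCEPTION` are names taken from the text; the other state names and the four-step sequence are simplifying assumptions, not the actual Drum FSM.

```python
def run_fsm(n_instrs, short_circuit, excepting=frozenset()):
    """Count ticks to complete n_instrs; instructions in 'excepting' raise an exception."""
    steps = ["A_FETCH", "A_DECODE", "A_RR_DISPATCH", "A_RETIRE"]   # simplified, hypothetical
    state, inum, ticks = "A_FETCH", 0, 0
    while inum < n_instrs:
        ticks += 1
        if state == "A_EXCEPTION":
            state = "A_FETCH"              # rl_exception, then on to the next instruction
            inum += 1
        elif state == "A_RETIRE":
            if short_circuit and inum not in excepting:
                state = "A_FETCH"          # jump straight back to Fetch
                inum += 1
            else:
                state = "A_EXCEPTION"      # spend a tick in rl_exception
        else:
            state = steps[steps.index(state) + 1]
    return ticks

print(run_fsm(100, short_circuit=False))                 # 500
print(run_fsm(100, short_circuit=True))                  # 400
print(run_fsm(100, short_circuit=True, excepting={3}))   # 401
```

In this model the short-circuit saves one tick for every instruction that does not raise an exception.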

### 19.3.4 Drum: using narrower inter-step/stage buffers

Consider the output of the `fn_Dispatch` in `src_Common/Fn_Dispatch.bsv`, with the following type:

```
src_Common/Fn_Dispatch.bsv: line 30 ...
typedef struct {
    RR_to_Retire     to_Retire;
    RR_to_EX_Control to_EX_Control;
    RR_to_EX         to_EX;
    Mem_Req          to_EX_DMem;
} Result_Dispatch
deriving (Bits, FShow);
```

In Drum, this result is stored in register `rg_Dispatch`. The width of this buffer is the sum of the widths of each of the four component types:

$$\begin{aligned} & \text{Number of bits in } \text{to\_Retire} \\ + & \text{Number of bits in } \text{to\_EX\_Control} \\ + & \text{Number of bits in } \text{to\_EX} \\ + & \text{Number of bits in } \text{to\_EX\_DMem} \end{aligned}$$

The `to_Retire` field has type `RR_to_Retire`, which contains an `exec_tag` field, whose type is:

```
src_Common/Inter_Stage.bsv: line 75 ...
typedef enum {EXEC_TAG_DIRECT,
              EXEC_TAG_CONTROL,
              EXEC_TAG_INT,
              EXEC_TAG_DMEM
} Exec_Tag
deriving (Bits, Eq, FShow);
```

Observe that `exec_tag` determines which of the fields in the result of `fn_Dispatch`, other than `to_Retire`, are relevant:

- When `exec_tag` is `EXEC_TAG_CONTROL` then only the `to_EX_Control` field is relevant.
- When `exec_tag` is `EXEC_TAG_INT` then only the `to_EX` field is relevant.
- When `exec_tag` is `EXEC_TAG_DMEM` then only the `to_EX_DMem` field is relevant.

With this observation, we can see that, in `rg_Dispatch`, we can have a common set of bits shared by these three cases. Then, the number of bits in `rg_Dispatch` could be reduced to:

$$\begin{aligned} & \text{Number of bits in } \text{to\_Retire} \\ + & \max( \text{Number of bits in } \text{to\_EX\_Control}, \\ & \quad \text{Number of bits in } \text{to\_EX}, \\ & \quad \text{Number of bits in } \text{to\_EX\_DMem} ) \end{aligned}$$

BSV has a convenient type-notation to express this “overlaying of alternatives”. In this case, it would look like this:

```
typedef struct {
    RR_to_Retire to_Retire;      // without the exec_tag field

    union tagged {
        void             EXEC_TAG_DIRECT;
        RR_to_EX_Control EXEC_TAG_CONTROL;
        RR_to_EX         EXEC_TAG_INT;
        Mem_Req          EXEC_TAG_DMEM;
    } exec_tag;
} Result_Dispatch
deriving (Bits, FShow);
```

The new nested `exec_tag` field combines the previous “enum” type and the 3 struct fields into a single “union” indicating the alternatives for each tag.
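To get a feel for the savings, the following Python sketch compares the two layouts. The field widths are hypothetical placeholders (the real widths depend on the Drum type definitions), and the tagged-union layout of “tag bits plus the widest member” is stated here as an assumption about how the derived `Bits` representation is sized.

```python
def struct_width(field_bits):
    """A struct's width is the sum of its field widths."""
    return sum(field_bits)

def tagged_union_width(member_bits):
    """Assumed layout: a tag wide enough to number the members, plus the widest member."""
    tag_bits = max(1, (len(member_bits) - 1).bit_length())
    return tag_bits + max(member_bits)

# Hypothetical widths, for illustration only (not measured from Drum)
RETIRE_BITS = 200                  # to_Retire, including its 2-bit exec_tag
CONTROL, INT, DMEM = 140, 110, 100

original = struct_width([RETIRE_BITS, CONTROL, INT, DMEM])
overlaid = struct_width([RETIRE_BITS - 2,      # exec_tag moves into the union's tag
                         tagged_union_width([0, CONTROL, INT, DMEM])])
print(original, overlaid)   # 550 340
```

With these placeholder numbers, the overlaid register is narrower by the combined width of the two smaller alternatives.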

### 19.3.5 Fife: saving FIFO resources

In Section 17.3 with Figure 17.2 and in Section 18.5.3 with Figure 18.6 we described how we connect Fife stages in a modular way, using a pair of FIFOs—a BypassFIFO in one stage (module) and a PipelineFIFO in the other, connected using `mkConnection` in the parent CPU module. We also explained how, while improving modularity, separate compilation and stage independence, it also costs us in resources: inside the pair of FIFOs are two data registers for what is essentially used as a 1-element FIFO.

We may now selectively optimize these FIFO pairs, replacing such a pair by a single FIFO. The upside is that the buffer now has just one data register. The downsides are (a) we now have a scheduling constraint across the buffer, where the rule on one side has to be scheduled before the rule on the other side, and (b) we now have a combinational path spanning both stages. The longer combinational path may reduce achievable clock speed.

Whether we use a PipelineFIFO or a BypassFIFO for this purpose depends on where it is used, because of the scheduling constraint. Usually we will need a PipelineFIFO in a forward path and a BypassFIFO in a reverse path because, for pipeline behavior, we need to schedule a downstream rule before an upstream rule.

### 19.3.6 Fife: Reducing the misprediction penalty

In Fife, the Fetch stage issues an IMem read-request for the instruction at the address in `rg_pc` and then “predicts” the next instruction address to be `rg_pc+4`. That prediction could of course be wrong, if the previous instruction is a BRANCH or JUMP that takes it somewhere else, or if it traps, which takes it to the instruction address of the trap-handling code.

As described in Section 16.2, we manage these “mispredictions” using “epochs”. At any given time, we are in some particular prediction epoch. In Fetch, we attach the current epoch and the current next-instruction prediction to the instruction. In Retire, we can definitively check the next-instruction prediction. If mispredicted, we increment the epoch and redirect Fetch with the correct PC and new epoch. Retire starts discarding all subsequent instructions that have the old epoch, because they are all in the “misprediction shadow”, and resumes normal operation only when it sees the first instruction carrying the new epoch.

The time wasted on these discarded instructions is called the “*misprediction penalty*”. Clearly, we would like to reduce this wasted time.
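A toy model helps show what determines the size of the penalty. The Python sketch below is a deliberately simplified abstraction, not a model of the actual Fife pipeline: Fetch issues one instruction per tick tagged with its epoch, each instruction reaches Retire `depth` ticks later, and a redirect from Retire takes `redirect_latency` ticks to take effect in Fetch. Both parameters and the function name are assumptions made for this illustration.

```python
from collections import deque

def discarded_after_mispredict(depth, redirect_latency, n_ticks=50):
    """Count instructions discarded at Retire after a single misprediction."""
    pipe = deque()                   # (tick at which the instruction reaches Retire, epoch)
    fetch_epoch = retire_epoch = 0
    redirect_at = None               # tick at which Fetch starts using the new epoch
    mispredict_seen = False
    discarded = 0
    for tick in range(n_ticks):
        if redirect_at is not None and tick >= redirect_at:
            fetch_epoch, redirect_at = retire_epoch, None
        pipe.append((tick + depth, fetch_epoch))        # Fetch: one instruction per tick
        if pipe[0][0] == tick:                          # Retire: oldest instruction arrives
            _, epoch = pipe.popleft()
            if epoch != retire_epoch:
                discarded += 1                          # in the misprediction shadow
            elif not mispredict_seen:
                mispredict_seen = True                  # model one misprediction only
                retire_epoch += 1
                redirect_at = tick + redirect_latency
    return discarded

print(discarded_after_mispredict(depth=4, redirect_latency=2))   # 5
print(discarded_after_mispredict(depth=4, redirect_latency=1))   # 4
```

In this model, every tick removed from the redirect path removes one discarded instruction from the misprediction shadow, which is the motivation for the tick-saving ideas that follow.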

#### 19.3.6.1 Save a tick for Fetch redirection using CRegs

In package `S1_Fetch`, rule `rl_Fetch_req` reads `rg_pc` for the instruction address used for IMem requests, and reads `rg_epoch` to accompany the instruction down the pipe (A).

Concurrently, `rl_Fetch_from_Retire` updates `rg_pc` and `rg_epoch` with a redirected PC and new epoch (B).

When `rg_pc` and `rg_epoch` are standard registers, the updates from (B) are visible to (A) only on the next tick.

Suppose:

- we replace `rg_pc` and `rg_epoch` by CRegs `crg_pc` and `crg_epoch` (Section 18.3);
- we replace the register-writes (B) with writes to `crg_pc[0]` and `crg_epoch[0]`; and
- we replace the register-reads (A) with reads from `crg_pc[1]` and `crg_epoch[1]`.

Then, the updates in (B) are visible to (A) in the *same* tick.

Caution: we now have combinational paths (through the CReg) from (B) into (A), with a potential negative impact on clock speed.
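A minimal sketch of this change follows. The names `crg_pc`, `crg_epoch`, `rl_Fetch_req` and `rl_Fetch_from_Retire` come from the text above; the types `Addr` and `Epoch`, the FIFO `f_to_IMem`, the message layout and the module name are assumptions made only so that the fragment is self-contained. Fife's actual Fetch code differs.

```bsv
import FIFO :: *;

typedef Bit #(32) Addr;    // assumed address type
typedef Bit #(1)  Epoch;   // assumed epoch type

// Sketch only: a cut-down Fetch stage, not Fife's mkFetch.
module mkFetch_CReg_Sketch (Empty);
   Array #(Reg #(Addr))  crg_pc    <- mkCReg (2, 0);
   Array #(Reg #(Epoch)) crg_epoch <- mkCReg (2, 0);

   // In Fife this FIFO is fed by Retire; here nothing enqueues into it.
   FIFO #(Tuple2 #(Addr, Epoch)) f_Fetch_from_Retire <- mkFIFO;
   // Hypothetical stand-in for the IMem request channel
   FIFO #(Tuple2 #(Addr, Epoch)) f_to_IMem <- mkFIFO;

   // (B) Redirection: writes port [0]
   rule rl_Fetch_from_Retire;
      match { .new_pc, .new_epoch } = f_Fetch_from_Retire.first;
      f_Fetch_from_Retire.deq;
      crg_pc [0]    <= new_pc;
      crg_epoch [0] <= new_epoch;
   endrule

   // (A) Fetch request: reads port [1], so it sees a port [0] write
   // made in the same tick
   rule rl_Fetch_req;
      let pc = crg_pc [1];
      f_to_IMem.enq (tuple2 (pc, crg_epoch [1]));
      crg_pc [1] <= pc + 4;
   endrule
endmodule
```

Because port [0] is ordered before port [1], the compiler schedules (B) before (A) within a tick. When both rules fire together, the request goes out with the redirected PC and new epoch.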

#### 19.3.6.2 Save a tick for Fetch redirection by eliminating backward FIFO

Currently, the redirection from Retire back to Fetch spends a tick in a FIFO (see FIFOs `f_Fetch_from_Retire` in Retire and in Fetch stages).

We can eliminate this tick by having Retire directly write into registers `rg_pc` and `rg_epoch` for the redirection update.

Caution: we have reduced the isolation of the two stages Retire and Fetch, resulting in longer combinational paths, with a potential negative impact on clock speed.

#### 19.3.6.3 Quicker reaction to redirection by Register-Read and Dispatch

Currently, mispredicted instructions go through all their usual processing before being discarded at Retire. This may include several delays and possibly use resources:

- Stall at Register-Read due to read/write hazards.
- Reserve an Rd register, which then has to be un-reserved by Retire when the instruction is discarded.
- Spend multiple cycles in some execution stages (memory-access, integer multiply/divide, floating-point, ...).
- For DMem ops, “pollute” the cache with unnecessary accesses.
- For DMem STORE ops, use up an entry in the store-buffer which then has to be discarded by Retire when the instruction is discarded.

If, in addition to the Retire→Fetch redirection message, we also feed the message to the Decode stage and/or the Register-Read-and-Dispatch stage, they can immediately start discarding wrong-epoch instructions, or convert them into no-ops. In either case, they no longer encounter the delays or use resources as listed above.

#### 19.3.6.4 Better next-PC prediction

The Fetch stage currently performs a fixed next-PC prediction like this:

```
src_Fife/S1_Fetch.bsv
let pred_pc = rg_pc + 4;
```

This is not bad, since in most codes there are medium to long stretches of “straight-line” code between BRANCH and JUMP instructions. Applications in the “scientific” domain have a lot of well-structured loops operating on vectors and rectangular matrices; in these applications it is often possible to “unroll” loops significantly, producing *very* long sequences of straight-line code (hundreds of instructions).

Still, prediction can be improved further; in modern processors, prediction success rates can be in the high 90% range, close to 100%.

Let us restructure the code slightly. Suppose, in this module `mkFetch`, we have instantiated a “`pc_predictor`” module with a “`pc_predict`” ActionValue method:

```
src_Fife/S1_Fetch.bsv
let pred_pc <- pc_predictor.pc_predict (rg_pc);
```

If the `pc_predict` method simply increments its argument by 4 and returns it, we are at status quo, *i.e.*, equivalent to the original code. But now we can replace the `pc_predictor` module with more sophisticated versions for better prediction.
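As a sketch of the status-quo version, the predictor could be packaged as below. Only the method name `pc_predict` comes from the text; the interface name `PC_Predictor_IFC`, the module name `mkPC_Predictor_Plus4` and the 32-bit `Addr` type are assumptions.

```bsv
typedef Bit #(32) Addr;   // assumed address type

// Hypothetical interface for the predictor sub-module
interface PC_Predictor_IFC;
   method ActionValue #(Addr) pc_predict (Addr pc);
endinterface

// Status quo: always predict PC+4
module mkPC_Predictor_Plus4 (PC_Predictor_IFC);
   method ActionValue #(Addr) pc_predict (Addr pc);
      return pc + 4;
   endmethod
endmodule
```

In `mkFetch` this would be instantiated with something like `PC_Predictor_IFC pc_predictor <- mkPC_Predictor_Plus4;`. A better predictor then only has to provide the same interface.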

**Branch Target Buffers** One technique is to add a “*Branch-Target Buffer*” (BTB), which is a table associating the PC of a BRANCH instruction with the next-PC when the BRANCH was recently executed.

Consider a simple loop like this:

```
initialize loop index and limit
B1: conditional BRANCH to B3 if loop index > limit
...
loop body and increment loop index
...
B2: unconditional BRANCH to B1
B3:
```

Suppose the loop iterates 100 times. Our simple PC+4 prediction will fail on every loop iteration at B2, and once at B1 at the end of the loop, *i.e.*, there will be 101 mispredictions.

When the branch at B1 or B2 reaches Retire, we know whether it was mispredicted or not. If mispredicted, then in the redirection message sent back to Fetch, we also indicate that this was a BRANCH instruction. In Fetch, in rule `rl_Fetch_from_Retire` where we update the PC and epoch, we also invoke a method `branch_train()` in the `pc_predictor` module informing it about the BRANCH’s PC and the actual next-PC, and this pair is stored in the BTB table.

In the `pc_predict(pc)` method, we consult the BTB for the argument PC. If we find a match, we return the associated next-PC; else, we return the default PC+4 prediction.

Now let us consider again our loop example. The first time we arrive at B2 we will mispredict PC+4, because B2 is not yet in the BTB, but we will also enter (B2,B1) into the BTB. In subsequent arrivals at B2, we will find B2 in the BTB and correctly predict that the next-PC is B1. Regarding B1, for the first 100 arrivals we will predict PC+4 (the default prediction) which is correct, but at the 101st (last) arrival this will be a misprediction.

The net outcome is that for the whole loop, where previously we had 101 mispredictions, the BTB reduces it to just 2 mispredictions. Also remember that “101” was based on the number of loop iterations (we assumed 100)—for a 1000-iteration loop, we would have 1001 mispredictions. The simple BTB has reduced it to a constant 2 mispredictions, no matter how many iterations.
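The simple BTB described above might be sketched as follows. The method names `pc_predict` and `branch_train` come from the text. The rest is assumed for illustration: a 16-entry direct-mapped table, indexing by PC bits [5:2], the `BTB_Entry` layout, and the interface and module names.

```bsv
import Vector :: *;

typedef Bit #(32) Addr;        // assumed address type
typedef Bit #(4)  BTB_Index;   // assumed: 16-entry, direct-mapped table

typedef struct {
   Bool valid;
   Addr pc;         // PC of the BRANCH (used as the tag)
   Addr next_pc;    // next-PC seen when it was last trained
} BTB_Entry
deriving (Bits);

// Hypothetical index function: drop the two low bits, keep the next four
function BTB_Index fn_btb_index (Addr pc) = pc [5:2];

interface PC_Predictor_IFC;
   method ActionValue #(Addr) pc_predict (Addr pc);
   method Action branch_train (Addr pc, Addr next_pc);
endinterface

module mkPC_Predictor_BTB (PC_Predictor_IFC);
   Vector #(16, Reg #(BTB_Entry)) btb
      <- replicateM (mkReg (BTB_Entry {valid: False, pc: 0, next_pc: 0}));

   // On a BTB match return the recorded next-PC, else the default PC+4
   method ActionValue #(Addr) pc_predict (Addr pc);
      BTB_Entry e = btb [fn_btb_index (pc)];
      return ((e.valid && (e.pc == pc)) ? e.next_pc : (pc + 4));
   endmethod

   // Record (BRANCH PC, actual next-PC), replacing whatever was in the slot
   method Action branch_train (Addr pc, Addr next_pc);
      btb [fn_btb_index (pc)] <= BTB_Entry {valid: True, pc: pc, next_pc: next_pc};
   endmethod
endmodule
```

In the loop example, the first misprediction at B2 leads to `branch_train (B2, B1)`; later calls to `pc_predict (B2)` then return B1. Two BRANCHes that map to the same slot simply evict each other in this sketch.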

**Adding hysteresis to the BTB** Consider executing our example loop a second time (perhaps it is nested inside an outer loop, or is in a function that has been called twice). Now, B2 will have no mispredictions, because the BTB already has a (B2,B1) entry. The first time we encounter B1 it will be mispredicted, because the last time we executed the loop, at loop exit we would have entered (B1,B3) into the BTB, whereas now we need PC+4. This misprediction will restore B1’s prediction back to PC+4, and so for subsequent iterations, B1 will predict correctly. On the last iteration, B1 will again mispredict as PC+4, instead of B3. Thus, we have two mispredictions for the loop.

One technique is to add a little “hysteresis” or “delay” into the BTB. From Retire, we were sending a redirection message back to Fetch only on mispredictions. Now, suppose Retire *always* sends a message to Fetch for BRANCH instructions; in the case of a correct prediction, this is not a redirection but merely a confirmation of correct prediction. For each entry in the BTB, suppose we associate a 1-bit counter. When we create a new entry in the BTB, we enter (pc1,pc2,0). When we subsequently receive a confirmation for this prediction, we increment it to (pc1,pc2,1). The first time we receive a misprediction/redirection for pc1, we decrement it to (pc1,pc2,0), *i.e.*, we don’t change the BTB prediction, we merely reduce the “confidence level” of the prediction. The second time we receive a misprediction/redirection for pc1, we change the prediction to the redirected pc. Thus the 1-bit counter “delays” the change in the BTB.

Now consider again our example loop. In the first execution of the loop (with no entry for B1 in the BTB), from the first iteration we get (B1,B1+4,0) in the BTB, and from the second iteration onwards we have (B1,B1+4,1). After the 100th iteration, B1 will mispredict (it branches to B3). The BTB entry will decrement to (B1,B1+4,0), but we do not change the prediction. Now, in the second execution of the loop, in the first iteration, we will predict correctly, with the BTB entry now incrementing again to (B1,B1+4,1). We will mispredict only after the final iteration, on the loop exit. Thus, in repeated executions of this loop, we have reduced the mispredictions per execution from 2 to 1. If this loop is executed 1000 times (because it is nested inside another loop), the total number of mispredictions drops from 2000 to 1000.
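The update rule for a BTB entry with a 1-bit confidence counter can be written as a pure function, sketched below. The struct layout and the names `BTB_Entry` and `fn_btb_train` are assumptions. The function covers only the training of an existing entry; creating a new entry with the counter at 0 is left out.

```bsv
typedef Bit #(32) Addr;   // assumed address type

typedef struct {
   Addr     pc;        // PC of the BRANCH
   Addr     next_pc;   // current prediction
   Bit #(1) conf;      // 1-bit confidence counter
} BTB_Entry
deriving (Bits, Eq);

// New entry contents after Retire reports the actual next-PC
// for the BRANCH at e.pc (hypothetical helper)
function BTB_Entry fn_btb_train (BTB_Entry e, Addr actual_next_pc);
   BTB_Entry e1 = e;
   if (actual_next_pc == e.next_pc)
      e1.conf = 1;                     // confirmation: raise confidence
   else if (e.conf == 1)
      e1.conf = 0;                     // first misprediction: keep prediction
   else
      e1.next_pc = actual_next_pc;     // second misprediction: change prediction
   return e1;
endfunction
```

Applied to the example: (B1,B1+4,1) becomes (B1,B1+4,0) at loop exit, and the next confirmation takes it back to (B1,B1+4,1). The prediction itself is never replaced by B3.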

**Control-flow-indexed BTBs** In many RISC-V programs we will find that we can more accurately predict the outcome of a BRANCH if we know how we arrived at that BRANCH, *i.e.*, the instruction trace that preceded that BRANCH. We don’t need every instruction in the preceding trace, just a trace of recent BRANCHes, *i.e.*, something that summarizes the *control flow* leading up to the BRANCH. Some BTBs are indexed by a short control-flow trace ending at the BRANCH PC, instead of just the BRANCH PC by itself.

**Return Address Stacks** JAL instructions are mostly used for subroutine calls, and JALR instructions are mostly used for subroutine returns (also subroutine calls to “distant” addresses or computed addresses). By convention, compilers use RISC-V register x1 (also known as **ra**) for the saved “return address” (saved in a call and used in a return). Thus, when we see a JAL or JALR whose Rd register field is x1, it is probably a subroutine call. When we see a JALR instruction whose rs1 field is x1 and whose 12-bit immediate field is 0, it is probably a subroutine return. These conditions (that a JAL or JALR is likely a call or return) can be detected in the Decode stage; when the instruction reaches Retire, it can pass this information to Fetch on the redirect path, and there it can be recorded in the PC predictor sub-module.

In the PC predictor module, we can maintain a small local stack called the “Return Address Stack” (RAS). When asked to predict for a PC that is known to be a subroutine call, in addition to the prediction we can push PC+4 on the RAS. When asked to predict for a PC that is known to be a subroutine return, we can pop the prediction from the RAS. In effect, the RAS is “shadowing” or “mimicking” what happens on the actual program stack for return-addresses.
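A small RAS could be sketched as a circular buffer, as below. Everything here is an assumption made for illustration: the 8-entry depth, the `RAS_IFC` interface, the module name, and the choice to wrap around silently on overflow and underflow. That choice is tolerable only because the result is a prediction that Retire checks anyway.

```bsv
import Vector :: *;

typedef Bit #(32) Addr;   // assumed address type

// Hypothetical interface for the Return Address Stack
interface RAS_IFC;
   method Action push (Addr return_addr);   // on a predicted call: push PC+4
   method ActionValue #(Addr) pop ();       // on a predicted return
endinterface

module mkRAS (RAS_IFC);
   Vector #(8, Reg #(Addr)) stack  <- replicateM (mkReg (0));
   Reg #(Bit #(3))          rg_top <- mkReg (0);   // next free slot; wraps around

   method Action push (Addr return_addr);
      stack [rg_top] <= return_addr;
      rg_top <= rg_top + 1;    // overflow overwrites the oldest entry
   endmethod

   method ActionValue #(Addr) pop ();
      let  top = rg_top - 1;   // underflow yields a stale entry
      Addr ra  = stack [top];
      rg_top <= top;
      return ra;
   endmethod
endmodule
```

Both methods write `rg_top`, so they cannot fire in the same tick. That is acceptable here because a given PC is treated as either a call or a return, not both.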

**Final remarks on PC prediction** The PC-predictor module requires careful engineering. First, the `pc_predict` method must be fast: it is invoked on every cycle by `rl_Fetch_req` in the Fetch stage. Thus, BTBs and RASs cannot be too large, otherwise it will not be possible to perform lookups at speed. Second, updates to the BTB and RAS (due to fresh information from Retire) must occur concurrently with lookups, but not interfere with lookups.

There is also a potential security problem with PC predictors. PC predictors, in their BTBs and RASs, are recording some abstracted information about the behavior of a program. Suppose we switch to another (malicious) process without clearing this information. That program may be able to execute code that measures whether certain branches are predicted or not, and thereby glean some information about the original program. These kinds of “side-channel” attacks have become quite famous in recent years. One solution is to clear the PC predictor state completely on a process-switch. Another is to partition the PC predictor state and to reserve and use different partitions for different processes.

In general next-PC prediction can be viewed as an “online machine-learning” problem. It is a “machine-learning” problem because we are “training” the PC predictor with a data set (past trace of the program), and then we are making predictions based on that training. It is “online” because training is interleaved with use (as opposed to “offline”, where training and use are completely separate phases). It is also “online” because training is happening on live data, not a previously collected data set. These observations suggest some ideas for exploration:

- Training does not have to be online. One can run a program on a simulator and save precise prediction data to be pre-loaded into the online predictor. This could be done on a shorter run of the program (smaller input data) if the prediction data remains stable.
- Modern machine-learning techniques could be applied to the prediction problem.

### 19.3.7 Fife: Reducing the register-hazard penalty

The register-hazard penalty is the number of cycles an instruction I2 may stall (idle wait) in the Register-Read stage because one or more of its input or output registers are “busy”—there is an earlier instruction I1 ahead of it in the pipeline that will be writing to one of those registers. This “stall” situation is detected and managed using the scoreboard (Section 16.3).

#### 19.3.7.1 Fife Bypassing: Save a tick in GPRs and Scoreboard using CRegs

In package `S3_RR_RW`, rule `rl_RR_Dispatch` reads `rg_scoreboard` to detect whether it should stall, and reads `gprs` when it can proceed (A).

Concurrently, `rl_RW_from_Retire` updates `rg_scoreboard` and `gprs` with new information from Retire (B).

When `rg_scoreboard` is a standard register, and `gprs` does ordinary reads/writes into its underlying `RegFile`, the updates from (B) are visible to (A) only on the next tick.

Suppose we replace `rg_scoreboard` with a 2-port CReg `crg_scoreboard` (Section 18.3). In `rl_RW_from_Retire` we write port [0] of the scoreboard, and in `rl_RR_Dispatch` we read port [1] of the scoreboard. Then, reading the scoreboard (A) can receive the updated value from (B) in the same cycle.

We need to do something similar for the GPRs. In module `mkGPRs` we add a 2-port CReg:

```
src_Common/GPRs.bsv
Array #(Reg #(Tuple2 #(Bit #(5), Bit #(xlen)))) crg <- mkCReg (2, tuple2 (0, 0));
```

and we modify the methods as follows:

```
src_Common/GPRs.bsv
method Bit #(xlen) read_rs1 (Bit #(5) rs1);
    // crg [1] sees a write_rd made in this same cycle
    match { .rd, .rd_val } = crg [1];
    return ((rs1 == rd) ? rd_val : regfile.sub (rs1));
endmethod

... similarly read_rs2 ...

method Action write_rd (Bit #(5) rd, Bit #(xlen) rd_val);
    // Register x0 always holds zero
    let v = ((rd == 0) ? 0 : rd_val);
    crg [0] <= tuple2 (rd, v);
endmethod
```

Thus, when `write_rd` and `read_rs1` are invoked in the same cycle on the same register (`rd` and `rs1`, respectively), the write-value `rd_val` is passed directly to the read-method in the same cycle (otherwise the read method gets it from the register file, as usual).

With these two changes, we save one tick in the register-hazard delay.

Caution: we now have combinational paths (through the CRegs) from (B) into (A), with a potential negative impact on clock speed.

#### 19.3.7.2 Fife: Dispatching multiple instructions that write to the same Rd

Suppose we have two instructions, I1 followed by I2, that both write to the same register Rd. In the current code, I1 reserves Rd in the scoreboard, and I2 stalls if I1 has not released the reservation (not yet performed its write into the GPRs). But I2 does not need the value written to the GPR; so why do we stall? Suppose there is a third instruction I3 whose `Rs1 == Rd`. If we allowed I2 to proceed, then when I1 finishes, it would release the Rd reservation, which would allow I3 to proceed. But this is wrong: I3 needs to wait for I2 to write a value into the register, not take I1's value.

Conceptually, we can think of the reservation bit for a register as a 1-bit counter indicating how many instructions (0 or 1) downstream expect to write Rd. We can generalize this idea. If we use 2 bits for the reservation in the scoreboard, we can allow 0 to 3 instructions downstream that expect to write Rd. For I2, instead of stalling if the reservation bit is 1, we now stall only if the reservation count is 3. If the count is < 3, we increment it and allow I2 to proceed, *i.e.*, we have eliminated a stall. When I1 and I2 retire, each decrements the count. I3 stalls as long as the count is non-zero, *i.e.*, until both I1 and I2 have written Rd.
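The counting scheme can be sketched as pure functions over a scoreboard value. All names here are assumptions (`Scoreboard`, `Resv_Count` and the four `fn_*` functions); Fife's actual scoreboard type may differ. Special handling of register x0 and the scheduling of concurrent updates from Dispatch and Retire are not modeled.

```bsv
import Vector :: *;

typedef Bit #(5) RegName;
typedef Bit #(2) Resv_Count;                  // 0 to 3 pending writers
typedef Vector #(32, Resv_Count) Scoreboard;  // assumed scoreboard layout

// A reader of rs (like I3) stalls while any writer is outstanding
function Bool fn_reader_must_stall (Scoreboard sb, RegName rs) = (sb [rs] != 0);

// A writer of rd (like I2) stalls only when the counter is saturated
function Bool fn_writer_must_stall (Scoreboard sb, RegName rd) = (sb [rd] == 3);

// At dispatch: one more instruction expects to write rd
function Scoreboard fn_reserve (Scoreboard sb, RegName rd);
   Scoreboard sb1 = sb;
   sb1 [rd] = sb [rd] + 1;
   return sb1;
endfunction

// At retire: one writer of rd has completed
function Scoreboard fn_release (Scoreboard sb, RegName rd);
   Scoreboard sb1 = sb;
   sb1 [rd] = sb [rd] - 1;
   return sb1;
endfunction
```

With I1 and I2 both dispatched, the counter for Rd is 2; it returns to 0 only after both have retired, which is when `fn_reader_must_stall` lets I3 proceed.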

#### 19.3.7.3 Save a tick for register update by eliminating backward FIFO

Currently, the register update from Retire back to Register-Read-and-Dispatch spends a tick in a FIFO (see FIFOs `f_RW_from_Retire` in Retire and in Register-Read-and-Dispatch stages).

We can eliminate this tick by having Retire directly write into registers `rg_scoreboard` and `gprs`.

Caution: we have reduced the isolation of the two stages Retire and Register-Read-and-Dispatch, resulting in longer combinational paths, with a potential negative impact on clock speed.

### 19.3.8 Drum and Fife: Reducing memory system delays

Memory systems are outside the scope of this book; this section is mostly informational, to familiarize the reader with memory features of interest. We recommend [9] for a more thorough discussion of these topics.

A pipelined memory is one where the path through the memory system can itself sustain one memory request per tick. This allows the CPU to pump several memory requests into the memory request channel while concurrently, and independently, draining memory responses for previous requests from the memory response channel. Ideally, the memory system has a *throughput* or *bandwidth* of one request per tick.

Memory *latency* is the delay from when the CPU issues an IMem or DMem request until it receives the corresponding IMem or DMem response.

#### 19.3.8.1 TCMs (Tightly Coupled Memories)

Small computer systems (embedded, IoT) often use a small TCM (“Tightly Coupled Memory”) for their memory system. TCMs are usually implemented in SRAM (static RAM, BRAMs in FPGAs) which can be accessed in 1 tick, so in these memory systems it is fairly straightforward to have a 1-tick latency and also sustain a 1-request-per-tick throughput.

#### 19.3.8.2 Caches

However, most modern systems have large memories based on DRAM (Dynamic RAM), which can take many cycles to access both because DRAMs are slower than SRAMs, and because requests and responses may have to traverse many layers of interconnect between the CPU and the DRAM. To alleviate this delay, most systems use *caches* close to the CPU. Caches are often implemented in SRAMs so that, when a request “hits” in the cache (*i.e.*, the addressed word is already in the cache), it is usually possible to perform the access in 1 tick. On a cache “miss”, on the other hand, the cache system has to “refill” the missing data by fetching it from DRAM, and this can take many cycles. The delay incurred here is called the “cache miss penalty”.

A non-blocking cache is able to process another memory request R2 while it is servicing a cache miss for an earlier request R1, *i.e.*, while it is busy fetching a cache line from DRAM for R1. These caches are more complex; simple caches process one request at a time.

#### 19.3.8.3 Virtual Memory

Computer systems that run operating systems (including Linux, Windows, MacOS, ...) typically also support *virtual memory* (RISC-V “S” and “U” privilege levels with a virtual memory scheme such as “Sv32” or “Sv39”).

When virtual memory is active, every address sent in an IMem or DMem request is a “virtual address” (VA). The memory system first maps the VA into a “physical address” (PA) by traversing a mapping data structure called a “Page Table” (PT) using a procedure called a “Page Table Walk” (PTW). The PA is then processed as before, *i.e.*, looked up in a cache.

#### 19.3.8.4 TLBs (Translation Lookaside Buffers)

A single VA-to-PA translation, involving a single PTW, involves multiple memory references, and so can take many cycles. This is an unacceptable penalty to pay for each IMem and DMem memory access, and so all such systems employ a “Translation Lookaside Buffer” (TLB) which can be regarded as a cache of previous translations. TLBs are typically small and implemented in discrete logic or SRAMs, and so can also be accessed in 1 tick, and sustain throughput of 1 translation per tick, provided we have a “hit” in the TLB. On a TLB miss, we have to perform a PTW to refill the TLB with the cached result; this takes several cycles and is the “TLB miss penalty”.

#### 19.3.8.5 TLBs and Caches

Some memory systems compose the TLB and Cache sequentially, *i.e.*, look up a VA in the TLB, and use the resulting PA for cache access. In these systems, the minimum overall memory access latency is 2 ticks, one for each of those two steps.

Some systems access the TLB and the cache in parallel, reducing memory-access latency to 1 tick. But how can the cache be accessed without the TLB output PA? They exploit the fact that certain bits remain the same between the VA and its PA. Virtual memory systems operate at the granularity of “*pages*”, consecutive sequences of typically 4K or 8K bytes, aligned to a 4KB/8KB page address boundary. Thus, the “intra-page” bits remain the same between the VA and the PA, and can be used to index the cache lookup before knowing the TLB result. However, this limits the number of bits that can be used to index the cache, *i.e.*, the number of “sets” in the cache. To increase the size of the cache, one may have to increase the size of each set (more “ways”, or more “set-associative”).
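A worked example may help to show how tight this limit is. Assume 4 KB pages and, purely as an illustrative assumption not taken from the text, 64-byte cache lines:

$$
\begin{aligned}
\text{intra-page bits} &= \log_2 4096 = 12 \\
\text{line-offset bits} &= \log_2 64 = 6 \\
\text{maximum index bits} &= 12 - 6 = 6 \;\Rightarrow\; \text{at most } 2^{6} = 64 \text{ sets} \\
\text{capacity per way} &= 64 \text{ sets} \times 64 \text{ bytes} = 4\ \text{KB} \\
\text{ways needed for a 32 KB cache} &= 32\ \text{KB} \,/\, 4\ \text{KB} = 8
\end{aligned}
$$

Under these assumptions each way can hold at most one page's worth of data, so any further growth in cache capacity has to come from additional ways.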



# Chapter 20

## BSV: Suggested further study



### 20.1 Introduction

### 20.2 Alternative simulators: Bluesim, other Verilog sims

### 20.3 First-class modules

### 20.4 Polymorphism

### 20.5 Typeclasses. Conversion of Integer literals. Bits/pack/unpack

### 20.6 Bluecheck

### 20.7 Tagged unions, pattern-matching

### 20.8 Multiple Clock Domains

### 20.9 Importing RTL

### 20.10 BH alternative syntax



# Chapter 21

## RISC-V: Suggested further study



### 21.1 Introduction

In this chapter we provide suggestions for further study of RISC-V topics. Each of these can also be seen as suggested exercises for the interested student.

### 21.2 Implementing RV64I instead of RV32I

### 21.3 M Extension

Integer Multiply and Divide.

### 21.4 F and D Extensions

IEEE single- and double-precision floating point.

### 21.5 C Extension

Compressed instructions.

Implementation: wholly in Decode stage. Can be shared between Fife and Drum.

Effect of compressed instructions on Fetch.

Effect of compressed instructions on PC-prediction.

Effect of non-32-bit alignment of compressed instructions.

### 21.6 Advanced extension

AMO operations

LR, SC, AMOxxx instructions and their implementation in the memory system.

### 21.7 Advanced branch prediction

Is a form of online machine-learning (past history is the “training data”).

Branch instruction taken/not-taken hints.

Hysteresis in prediction.

Branch-Target buffers (BTBs)

Return-Address Stacks (RASs)

### 21.8 Register renaming: towards out-of-order processing

An alternative to the counter in the scoreboard.

Have more than 32 physical registers. Maintain a table that dynamically maps logical register numbers to physical register numbers. This is “virtualizing” the registers.

In S3, for each instruction, allocate a new physical register number for its Rd, use that physical number in the instruction going forward, and update the logical-to-physical map.

In case of a trap or misprediction, we have to restore the logical-to-physical map.

### 21.9 Advanced bypassing: towards dataflow and out-of-order processing

In S3, do not stall any instruction for register-hazards, just forward each instruction into its appropriate S4 pipe, with each input register either having its value or a marker saying “not yet available”.

In each S4 pipe, stall if the head of the queue does not yet have its input values.

In each S4 pipe, as soon as an output register value is ready, broadcast it to the other S4 pipes, which update any entries awaiting that register value. This may release the stalling of the head entry, allowing it to execute.

Without register renaming, we have to keep the “counter” from the scoreboard. With register renaming, there is no need, since each physical register will have only one outstanding writer.

In each execution unit, we can treat the pending instructions as a set, not a FIFO, *i.e.*, execute any instruction that is “ready”. Then we have out-of-order (OOO) dataflow processing.

### 21.10 Memory systems: TCMs, Caches, PMPs

Separate I- and D-caches.

FENCE.I: for “manual” I- and D-Mem coherence.

FENCE: flushing caches for devices.  
Multi-level caches and cache hierarchies.  
Non-blocking caches.  
Cache-coherence.  
Memory Protection with PMPs.

### 21.11 Memory Systems: Virtual Memory

Page Tables, TLBs, Virtual Memory.

### 21.12 Performance measurement

CSRs TIME, MCYCLE.  
Other “hpmcounter” CSRs for other events. Counter enables.

### 21.13 Testing

ISA tests.  
Tandem Verification.  
Sail formal model.  
ACTs (Architectural Compatibility Tests).

### 21.14 Interrupts

- General concepts: CSRs MIP and MIE; minimal MSTATUS with interrupt-enable bits
- Interrupts are initially disabled using the MSTATUS.interrupt-enable bit immediately; CSRxx can be used to re-enable.
- MMIO addresses MTIME, MTIMECMP.
- Interrupts are handled just like traps; the only question is: when to check for interrupts and respond.
- How does MIE bit return to 0?

PLIC, CLIC

Interrupt/trap delegation.

### 21.15 Linux and server-class capability

Multiple privilege levels: Machine, Supervisor, User.

#### 21.15.1 Hypervisor support

#### 21.15.2 RISC-V ISA Formal Specification

Sail model.



# Appendix A

## Resources: Documents and Tools

This appendix describes all the resources relevant to this course.

### A.1 GitHub

We will be using GitHub extensively. Course materials will be provided in a public GitHub repository, and GitHub’s “discussion” facilities can be used for questions and answers, visible to all.

For students who do not already know how to use GitHub, we will teach the basics.

More detailed documentation can be found starting at: <https://docs.github.com/en/get-started/quickstart>

### A.2 RISC-V ISA (Instruction Set Architecture) Specifications

We will refer to the Unprivileged ISA very frequently, so you may wish to download a copy of the PDF for your laptop, and/or print a copy. The Privileged ISA document is not needed until later.

- “The RISC-V Instruction Set Manual Volume I: Unprivileged ISA”.  
Bibliography entry [26] contains a link to [riscv.org](http://riscv.org) from which to download a PDF.
- “The RISC-V Instruction Set Manual Volume II: Privileged Architecture”  
Bibliography entry [27] contains a link to [riscv.org](http://riscv.org) from which to download a PDF.

The *formal specification* of the RISC-V ISA is written in the Sail formal-specification language, and can be found at <https://github.com/riscv/sail-riscv>.


### A.3 RISC-V Trusted Simulators and Reference Programs for Testing Implementations

The most well-known trusted simulator for RISC-V is the Spike simulator, a free and open-source simulator that is written in C++ and very carefully maintained by the RISC-V community as the standard “reference model” for RISC-V execution. The Spike simulator can be found at: <https://github.com/riscv-software-src/riscv-isa-sim>

Another trusted simulator is the Sail model (the Sail model is the official “formal specification” for RISC-V). The Sail model and simulator can be found at: <https://github.com/riscv/sail-riscv>.

Spike is usually more up-to-date with the latest ratified ISA extensions, compared to the Sail model.

RISC-V International maintains a set of standardized tests that are useful in testing new CPU implementations. The following repository—<https://github.com/riscv-software-src/riscv-tests>—contains several hundred small test programs, written in RISC-V Assembly Language, organized by ISA extension: RV32I, RV64I, A, M, F, D and C extensions, Machine/Supervisor/User mode, etc.

### A.4 RISC-V Assembly Language Manuals

We will not do very much assembly language programming, and we will teach whatever notation we need during the course.

There are several RISC-V Assembly Language manuals available online, and some in bookstores; download them only if you prefer a local copy:

- “RISC-V Assembly Programmer’s Manual”, Palmer Dabbelt, Michael Clark and Alex Bradbury.  
Bibliography entry [5] contains a link to online manual.
- “RISC-V ASSEMBLY LANGUAGE Programmer Manual Part I”, Shakti RISC-V Team, Indian Institute of Technology, Madras, India. Please see bibliography entry [22] for link from which to download a PDF.
- “An Introduction to Assembly Programming with RISC-V”, Edson Borin.  
Bibliography entry [3] contains a link from which to download a PDF.
- “RISC-V Assembly Language”, Anthony J. Dos Reis.  
Bibliography entry [8]. Available in bookstores.

### A.5 RISC-V GNU tools, including riscv-gcc compiler

We will be using the GNU tool chain, specifically the *gcc* compiler and linker, and the *objdump* tool for disassembling an ELF file.

During the course we will show you how to install and use these tools.

The use of these tools is mostly the same as when targeting any other architecture, including well-known architectures like x86 and ARM; the student can find voluminous tutorial materials on the GNU tool chain on the web and in books.

*gcc* has some specific options for RISC-V; these are documented here:

- <https://gcc.gnu.org/onlinedocs/gcc/RISC-V-Options.html>

- <https://gcc.gnu.org/onlinedocs/gcc/gcc-command-options/machine-dependent-options/risc-v-options.html>

It is also useful to know how to use the GNU debugger tool, *gdb*. Again, the student can find voluminous tutorial materials on the web and in books.

## A.6 BSV

In this course, we design the hardware of our RISC-V pipelined CPU using the High Level Hardware Description Language **BSV**. The reasons for our choice (instead of using Verilog, SystemVerilog or VHDL) are discussed in more detail in Appendix B of this document, as well as in the Introduction of the “BSV by Example” book described below.

No advance knowledge of **BSV** is needed for this course; we will teach all necessary **BSV** concepts during the course as we go along.

However, for those who would like to study **BSV** on their own, or wish to view additional **BSV** materials, the following sections provide some resources.

### A.6.1 “BSV By Example” book (free downloadable PDF)

This book takes the student through a series of small, targeted **BSV** examples:

*BSV by Example*, by Rishiyur S. Nikhil and Kathy R. Czeck, 2010.

Quoting from the Introduction:

“This book is intended to be a gentle introduction to BSV.”

“This book tries to take you into the BSV language one small step at a time. Each section includes a complete, executable (and synthesizable) BSV program, and tries to focus on just one feature of the language”

A bound copy of the book can be purchased on Amazon, but a PDF copy of the book and a tar file containing all the BSV program examples in the book can be downloaded for free from the GitHub BSVLang repository at:

<https://github.com/BSVLang/Main/tree/master/Tutorials/BSV_Training>

- Book (PDF):  
*repository/Tutorials/BSV\_by\_Example\_Book/bsv\_by\_example.pdf*
- Machine-readable version of all examples in the book:  
*repository/Tutorials/BSV\_by\_Example\_Book/bsv\_by\_example\_appendix.tar.gz*

### A.6.2 BSV Tutorial

A **BSV** self-paced tutorial is available in the GitHub BSVLang repository:

<https://github.com/BSVLang/Main/tree/master/Tutorials/BSV_Training>

in the directory *repository/Tutorials/BSV\_Training/* which looks like this:

```
BSV_Training/
    Build/
    Example_Programs/
        Common
        Eg02a_HelloWorld
        ...
        Eg03a_Bubblesort
        ...
        Eg04a_MicroArchs
        ...
        Eg05a_CRegs_Greater_Concurrency
        ...
        Eg06a_Mergesort
        ...
        Eg09a_AXI4_Stream
    Reference
```

Each of the `Eg*` directories contains a complete example, along with documentation explaining the example, and instructions on how to compile and Verilog-simulate it. The `Reference` directory contains a collection of lecture slide decks explaining the **BSV** language.

### A.6.3 MIT Course Material

Massachusetts Institute of Technology (MIT) periodically teaches courses on using **BSV** for digital hardware design. The following link:

<http://csg.csail.mit.edu/6.375/6_375_2013-www/handouts.html>

contains downloadable material:

- PDFs of slide decks for 12 lectures
- PDFs of slide decks for 4 tutorial classes
- PDFs and code for 6 laboratories

### A.6.4 University of Cambridge Examples

Prof. Simon Moore of the University of Cambridge, UK, uses **BSV** in his teaching and research. Several of his **BSV** examples can be found here:

<https://www.cl.cam.ac.uk/~swm11/examples/bluespec/>

These examples are somewhat more advanced than the ones in the previous sections.

### A.6.5 *bsc* download and installation; *bsc* and **BSV** manuals

*bsc* is free and open-source, and can be downloaded and installed as described on its GitHub site, <https://github.com/B-Lang-org/bsc>.

On the main page of that repository you will find links to the following documents (same links also given here):

- The “**BSV** Language Reference Guide”. This document describes the syntax and semantics of **BSV**.  
PDF: <https://github.com/B-Lang-org/bsc/releases/latest/download/BSV_lang_ref_guide.pdf>

- The “BSC Libraries Reference Guide”. This document describes the extensive set of libraries and IP (Intellectual Property blocks) available to the **BSV** user.  
PDF: <https://github.com/B-Lang-org/bsc/releases/latest/download/bsc_libraries_ref_guide.pdf>
- The “BSC User Guide”. This document describes how to use the *bsc* compiler, which compiles our hardware descriptions written in **BSV** into Verilog (which can then be simulated or synthesized using standard Verilog tools).  
PDF: <https://github.com/B-Lang-org/bsc/releases/latest/download/bsc_user_guide.pdf>

We will be using the Language Reference Guide and Libraries Reference Guide extensively, so you may wish to download a copy for your laptop.

## A.7 Verilator (or other Verilog simulator)

We will be doing Verilog simulations extensively during this course. For low cost (free), and uniformity, we will be using Verilator.

During the course, we will show you how to install Verilator and use it.

The Verilator web site, <https://www.veripool.org/verilator/>, contains instructions on how to install Verilator, and also links to PDF and HTML manuals for Verilator. Version 5.004, or any more recent version, will be suitable.

You can use other Verilog simulators if you prefer, but you should independently know how to use them because we cannot offer support during the course. Some possibilities:

- Icarus Verilog, also known as “iverilog”. This is a very good, free and open-source, easy-to-use Verilog simulator, but is quite slow compared to other Verilog simulators and so may be less useful for large designs.  
<https://steveicarus.github.io/iverilog/usage/getting_started.html>
- Commercial simulators from Synopsys, Cadence, Siemens (Mentor Graphics), Aldec, and others. Each of these needs a paid license.

## A.8 Amazon AWS

All hands-on work in this course will be run on the Amazon AWS cloud. This way, everyone in the course has a common, stable, predictable environment and we do not have to waste any time dealing with the countless variations in environments found on different laptops and servers.

During the course, we will explain all necessary concepts as we go along, including how to set up AWS instances and use them.

The Amazon AWS cloud offers, on the “AWS Marketplace”, a vast variety of choices for virtual machines or, to use AWS terminology, *instances*. We expect to use the following kinds of instances:

- A: An instance running the latest version of Ubuntu (Linux).
- B: A so-called “F1 instance”, also running Ubuntu. F1 instances have attached FPGAs.
- C: An instance running the so-called “AWS FPGA Developer AMI” available in the AWS Marketplace. This runs CentOS (Linux) and comes pre-installed with Xilinx Vivado tools, which we will use for creating FPGA bitfile images during the course.

In Amazon's pricing, (B) is the most expensive, and so we will use that only when we actually run on FPGA. For general development and simulation activities, we'll use (A) which is much cheaper. We will use (C) whenever we're creating a new FPGA bitfile image.

The standard Amazon documentation can be found here:

- “Set up to use Amazon EC2”  
<https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/get-set-up-for-amazon-ec2.html>
- “Tutorial: Get started with Amazon EC2 Linux instances”  
<https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EC2_GetStarted.html>

## A.9 Xilinx Vivado

The FPGAs on Amazon AWS F1 instances are Xilinx Ultrascale FPGAs. Thus, when we build bitfiles on Amazon AWS, we will be using Xilinx Vivado tools (which are provided by AWS for zero incremental cost on AWS FPGA Developer AMI instances, see [A.8](#)).

During this course, we will explain all necessary concepts as we go along.

When building a bitfile, it is particularly useful to understand how to interpret the Vivado timing and resource reports. The timing report indicates:

- whether or not our design has successfully met our desired frequency target (MHz), and
- if it did not, which part of our circuit is the likely culprit, which needs to be fixed.
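As a rough illustration of the first point, the sketch below estimates the achievable clock frequency from the requested frequency and the worst negative slack (WNS) figure that appears in a Vivado timing summary. The function names and the example numbers are made up for illustration, and the formula is only the usual back-of-the-envelope estimate, not something taken from the Vivado documentation.

```python
def timing_met(wns_ns: float) -> bool:
    """Timing is met when the worst slack is not negative."""
    return wns_ns >= 0.0


def estimated_fmax_mhz(target_mhz: float, wns_ns: float) -> float:
    """Estimate the achievable clock frequency in MHz.

    target_mhz: the requested clock frequency.
    wns_ns:     worst negative slack in nanoseconds; a negative value
                means the slowest path does not fit in one clock period.
    """
    target_period_ns = 1000.0 / target_mhz
    # The slowest path needs (period - slack) nanoseconds.
    critical_path_ns = target_period_ns - wns_ns
    return 1000.0 / critical_path_ns


# Hypothetical numbers: 250 MHz target (4 ns period), WNS of -1 ns.
print(timing_met(-1.0))                  # False
print(estimated_fmax_mhz(250.0, -1.0))   # 200.0
```

In this made-up case the design would be expected to run at about 200 MHz unless the offending path is shortened.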

The resource report indicates the “size” of our design (how many LUTs, flip-flops, BRAMs, DSPs, etc.).

For more details, Xilinx has extensive documentation for which a good starting point is the “Vivado Design Suite Overview” at

<https://docs.xilinx.com/r/en-US/ug910-vivado-getting-started/Vivado-Design-Suite-Overview>.

## A.10 RISC-V textbooks

This course is self-contained, and it is not necessary to acquire any textbooks.

The following list is provided only as a courtesy and convenience. All these books are written using the RISC-V instruction set as examples, and are available in bookstores.

- “The RISC-V Reader: An Open Architecture Atlas”, David Patterson and Andrew Waterman, Strawberry Canyon, 2017. Available in bookstores.  
 Bibliography entry [\[6\]](#).
- “Computer Organization and Design RISC-V Edition (2nd Edition): The Hardware Software Interface”, David A. Patterson and John L. Hennessy, Morgan Kaufmann, 2020. Available in bookstores.  
 Bibliography entry [\[18\]](#).
- “Computer Architecture: A Quantitative Approach, 6th Edition”, John L. Hennessy and David A. Patterson, Morgan Kaufmann, 2017. Available in bookstores.  
 Bibliography entry [\[9\]](#). This is the “classic” textbook on computer architecture, a more advanced textbook.

# Appendix B

## Why BSV?

The BSV language is a modern, high-level, hardware description language with a strong formal semantic basis. It is *fully synthesizable*, *i.e.*, the *bsc* compiler can also compile your source code into Verilog [11], which we regard as the “assembly language” of hardware design. That Verilog code can then be further compiled by ASIC synthesis tools such as Synopsys’ Design Compiler or FPGA synthesis tools such as Xilinx Vivado or Altera Quartus, for implementation in ASICs and FPGAs, respectively.

BSV is very suitable for describing architectures precisely and succinctly, and has all the conveniences of modern advanced programming languages such as expressive user-defined types, strong type checking, polymorphism, object orientation and even higher order functions during static elaboration.

All computation in BSV is expressed using “Rules”. For many people, this takes a little acclimatization because it is *very* different from traditional programming models (such as in C++ or Java) which are based on sequential processes. But over time, it becomes *the* natural way to think about hardware computation, which is based on massive, fine-grained, heterogeneous parallelism. Complex and high-speed hardware designs are full of very subtle issues of concurrency and ordering, and BSV’s computational model is one of the best vehicles with which to study and understand this.

Modern hardware systems-on-a-chip (SoCs) have so much hardware on a single chip that it is useful to conceptualize them, analyze them and design them as *distributed systems* rather than as globally synchronous systems (the traditional view), *i.e.*, where architectural components are loosely coupled and communicate with messages, instead of attempting instantaneous access to global state. This is because the delay in communicating a signal across a chip is now comparable to the clock periods of individual modules. Again, BSV’s computational model is well suited to this style of design.

A key to architecting complex systems and reusable modules, whether in software or in hardware, is powerful *interfaces*. Module interfaces in BSV are object-oriented (based on methods), polymorphic/generic, and capture certain computational protocols. This facilitates creating highly reusable modules, enables quick experimentation with alternative structures, and allows designs to be changed gracefully over time as the requirements and specifications evolve.

Architectural models written in BSV are fully executable. They can be simulated in the Bluesim<sup>TM</sup> simulator; they can be synthesized to Verilog and simulated on a Verilog simulator; and they can be further synthesized to run on FPGAs or be etched into ASIC silicon, as illustrated in Fig. 1.1. Even when the final target is an ASIC, the ability to run on FPGAs enables early architectural exploration, early development of the software that will later run on the ASIC, and much more extensive and early verification of the correctness of the design. Students are also very excited to see their designs actually running on real FPGA hardware.

In this book, we teach the use of BSV for the design of complex hardware modules and systems by going in detail through a series of examples, and exploring basic concepts as needed along the way, such as combinational circuits, pipelines, data types, modularity and complex concurrency. At every stage the student is encouraged to run the designs at least in simulation, but preferably also on FPGAs.

By the end of the course we will have seen all the source code for a complete simple, pipelined RISC-V CPU along with a small “SoC” (System-on-a-Chip) including an interconnect, a connection to memory, and a few devices such as a UART.

## Who uses BSV?

BSV has been used in teaching and research at major universities, including MIT (Massachusetts Institute of Technology, USA), University of Cambridge (UK), Indian Institute of Technology, Madras (India), Indian Institute of Technology, Mumbai (India), Seoul National University (South Korea), University of Texas at Austin (USA), Carnegie Mellon University (USA), Georgia Institute of Technology (USA), Cornell University (USA) and Technical University of Darmstadt (Germany).

BSV has been used to design major IP components in commercial ASICs from Texas Instruments, ST Microelectronics and Google. It has been used for FPGA-based modeling at IBM, Intel, Qualcomm, Microsoft Research, several DARPA projects, and others. It is being used for commercial RISC-V processors from InCore Semiconductors (Shakti line of RISC-V processors, India) and C-DAC (Centre for Development of Advanced Computing) (Vega line of RISC-V processors, India).

## B.1 Why BSV instead of some other Hardware Design Language?

*The rest of this chapter is intended for those interested in comparing BSV’s approach to other approaches (Verilog, SystemVerilog, VHDL, and SystemC), and can be safely skipped by others who just want to get on with learning BSV.*

One may be curious why the material in this book could not have been covered using one of the more widely known languages for hardware design: Verilog [11], VHDL [10], SystemVerilog [13], SystemC [12]. There are several reasons, outlined below.

In the following paragraphs, we will refer to all the above languages, or at least their synthesizable subsets, as “RTL” (Register Transfer Level languages).

### B.1.1 A better computational model

Paradoxically, the formal definitions of the semantics of traditional hardware design languages (HDLs)—Verilog, SystemVerilog, VHDL and SystemC—are not in terms of hardware concepts, but in terms of *software simulation* on conventional computers. Like traditional software programming languages, they are defined in terms of sequential statement execution, with traditional conditionals, loops, and procedure calls and returns, reading and writing conventional variables. Programs can have multiple concurrent *processes* (e.g., “always blocks” in Verilog), but each of them is defined with traditional sequential programming semantics.

Digital hardware, on the other hand, has a quite different computation model. It consists of hundreds, if not thousands of concurrent “state machines” that transform the current state of the hardware, implemented using registers, memories and FIFOs. By and large, there is no sequencing of these state machines based on program counters or statement sequences. Rather, these state machines are independent and “reactive”, *i.e.*, each one performs an action whenever certain conditions hold, e.g., when a register holds a particular value, or a value is available in a FIFO, etc.

To bridge this rather large gap from conventional sequential processes to concurrent reactive state machines requires a major mental shift. One must severely restrict code to only a much smaller so-called *synthesizable subset* of a conventional HDL. Processes are restricted to simple clocked loops: “`always @(posedge CLK) ...`”, also known as an “always-block”. Even more draconian is a transposition away from the natural concurrent state machine view to a *state element-centric* view: even though a state element may be read and written by multiple state machines, all updates to that state element must be concentrated in a single always-block, usually in a large conditional construct (if-then-else, case, ...) that describes all the different contributions of different state machines. This transposition, from the natural state machine-centric view to the rather unnatural state element-centric view, is necessary because in the synthesizable subset there is no synchronization between always-blocks; the programmer has to plan every detail of how to resolve (arbitrate) competing updates to each state element.

In other words, in conventional HDLs, neither the simulation view (sequential processes) nor the synthesizable view (state element-centric always-blocks) is a natural way to model hardware behavior.

BSV programs, instead, directly express the natural model of hardware: concurrent state machines. Each “rule” in BSV is a reactive state transition that awaits some condition on the hardware state and then takes an action to transform the state. Further, each rule is an *atomic transaction*, *i.e.*, the details of how one arbitrates competing accesses from multiple rules to common shared state are left to the compiler. This kind of arbitration logic, which is hand-written in other HDLs, is a major source of bugs.
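To make the idea of guarded, atomic rules concrete, here is a small software model written in Python rather than BSV. The `Rule` class, the `run` function and the two example rules are all invented for this illustration. The model executes exactly one enabled rule at a time, which mirrors the atomicity described above; it says nothing about how *bsc* actually schedules rules in hardware, where many non-conflicting rules can fire in the same clock cycle.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

State = Dict[str, int]


@dataclass
class Rule:
    name: str
    guard: Callable[[State], bool]      # condition on the current state
    action: Callable[[State], State]    # returns the complete next state


def run(rules: List[Rule], state: State, max_steps: int = 100) -> State:
    """Fire one enabled rule at a time until no rule is enabled."""
    for _ in range(max_steps):
        enabled = [r for r in rules if r.guard(state)]
        if not enabled:
            break
        # The chosen rule sees a consistent state and replaces it in one
        # step, so no other rule can observe a half-finished update.
        # Picking the first enabled rule is an arbitrary priority choice.
        state = enabled[0].action(state)
    return state


# Two rules that both update the shared register "count".
rules = [
    Rule("drain_a", lambda s: s["a"] > 0,
         lambda s: {**s, "a": s["a"] - 1, "count": s["count"] + 1}),
    Rule("drain_b", lambda s: s["b"] > 0,
         lambda s: {**s, "b": s["b"] - 1, "count": s["count"] + 10}),
]

final = run(rules, {"a": 2, "b": 3, "count": 0})
print(final)   # {'a': 0, 'b': 0, 'count': 32}
```

Neither rule contains any arbitration code for `count`; the competing updates are resolved entirely by the scheduler, which is the role the compiler plays for BSV rules.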

In BSV, unlike in other HDLs, the semantics are identical whether you execute in simulation or in hardware—there is no mental gear shift necessary, and simulation behavior is always identical to synthesized hardware behavior.

Finally, the Rules computation model uniquely encourages *refinement*, a powerful design methodology. We initially create a high-level, approximate model of a target design, using a few large rules. Both the level of micro-architectural detail and the range of functionality are approximated (abstracted). Often such a model can be written in less than a day, and it can immediately be executed to verify functional correctness. Then, over time, we incrementally add architectural detail—for example, pipeline registers and state machines with more, smaller steps—and the original rules (large step state transitions) are replaced by more, smaller rules (small step state transitions). The atomic semantics of rules makes this a robust methodology, *i.e.*, a refinement does not have a large ripple effect. This is in quite dramatic contrast to the difficulty in changing RTL, which is notoriously brittle and unforgiving.

Refinement allows early and continuous confidence in functional correctness and completeness, since we execute the code very frequently. Refinement allows mid-course corrections in functionality, after observing execution on real data. Refinement allows separating *functionality* from *performance*, achieving functionality early and holding it constant while we improve performance to meet a performance target (by target performance we mean some desired targets for speed, area, and power).

### B.1.2 Modern language features

The field of programming languages has seen tremendous progress since the early days (1950s). Modern high-level languages have advanced type systems (polymorphism, typeclasses and overloading, functional types, and so on). Modern high-level languages have strong mechanisms for encapsulation and abstraction (such as object-orientation) which promote the separation of concerns between externally visible behavior and internal representation choices. Modern high-level languages make frequent use of higher-order functions—functions whose arguments and results can themselves be functions and data structures whose components can be functions.
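As a small illustration of a higher-order function, the Python sketch below takes a two-input operator as an argument and returns a new function that combines a list level by level, in the shape of a balanced tree. The names are invented for this example; the point is only that one definition serves for addition, maximum, or any other two-input operator.

```python
from typing import Callable, List

BinOp = Callable[[int, int], int]


def make_tree_reducer(op: BinOp) -> Callable[[List[int]], int]:
    """Take a two-input operator and return a function that combines a
    non-empty list pairwise, level by level, like a balanced tree."""
    def reducer(xs: List[int]) -> int:
        level = list(xs)
        while len(level) > 1:
            nxt = [op(level[i], level[i + 1])
                   for i in range(0, len(level) - 1, 2)]
            if len(level) % 2 == 1:
                nxt.append(level[-1])    # odd element passes through
            level = nxt
        return level[0]
    return reducer


sum_tree = make_tree_reducer(lambda a, b: a + b)
max_tree = make_tree_reducer(max)

print(sum_tree([1, 2, 3, 4, 5]))   # 15
print(max_tree([7, 3, 9, 2]))      # 9
```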

Unfortunately, practically none of these powerful features are present in the synthesizable subsets of conventional HDLs<sup>1</sup>. BSV, on the other hand, adopts the full power of the Haskell functional programming language [19]: algebraic types, functional types, polymorphic types, typeclasses, higher-order functions, and recursive and monadic static elaboration. This delivers unprecedented expressive power, type safety and type flexibility in a hardware design language.

---

<sup>1</sup>SystemVerilog and SystemC have object-orientation, polymorphism, and overloading, but these are typically used only in simulation for verification environments of hardware designs, not for actual hardware design itself.

### B.1.3 Comparison with C++-based High Level Synthesis

Recently, some tools have become available under the rubric of “High Level Synthesis” (HLS) that claim to shield you from this mental gear shift from simulation to hardware. Designs are written in a traditional sequential programming language (typically C++), and an HLS tool automatically compiles this into a hardware implementation. While beautiful in concept, there are many serious limitations in practice, which are discussed below.

#### B.1.3.1 C++ codes need significant rewriting

C++ HLS tools will rarely accept arbitrary, off-the-shelf C++ codes and produce good hardware implementations. C++ codes often require significant restructuring to achieve good results.

First, the tools only accept a limited subset of C++ syntax. In particular, these tools are very averse to any kind of pointer-based argument passing or data structures, unless all the pointers can be resolved by the compiler (*i.e.*, the compiler statically knows the addresses represented by the pointers). This is because, while C++ normally executes on machines that provide the abstraction of a single large memory with a single address space (so a pointer is fundamentally an address, and dynamic allocation and relocation are easy), hardware designs typically use hundreds or thousands of individual memory units, from registers to register files to SRAMs, DRAMs, Flash memories, and ROMs, each with its own address space.

Second, most C++ codes written for conventional execution rely deeply on sequential execution. For example, they may re-use a variable (multiple reads and writes in different phases of the code). Many of these programming techniques, often a good idea for higher performance and smaller memory footprints in conventional execution, are exactly the opposite of what is needed for hardware implementation, which is highly parallel.

Overall, for good results, one must develop a keen sense of the hardware implementation impact of various “styles” of writing C++ code. Small changes in style can mean the difference between a terrible implementation and an acceptable one. One vendor insists that any team adopting their tool should not consist solely of C++ experts, but must also include hardware engineers.

#### B.1.3.2 Narrow range of applicability due to automatic parallelization

C++ is, by official definition, a completely sequential language. Hardware, on the other hand, relies on massive, fine-grain parallelism. It is the HLS tool that has to pull off this magical transformation.

C++ HLS tools rely on a body of knowledge called CDFG Analysis (Control and Data Flow Graph Analysis). After parsing and typechecking, the C++ program is represented internally in a data structure called the CDFG. This CDFG, initially directly reflecting the sequential nature of the source, is analyzed and transformed into a parallel representation from which, eventually, hardware is generated.

It turns out that this transformation only works well for a narrow range of program structures: cleanly nested `for`-loops with fixed iteration bounds, operating on dense rectangular arrays. Of course, many signal-processing and image-processing applications do have this structure, and C++ HLS tools have found their greatest success in this arena.

But the moment we step outside this sweet spot, towards sparse arrays or programs that are highly control-dominated, these tools fall off a cliff. Most hardware design in fact involves components that don't fall into the C++ HLS sweet spot: CPUs, cache systems, switched interconnects, flash memory and disk controllers, high-speed I/O controllers for Ethernet, PCIe, USB, and so on. For example, we are unaware of any project using C++ HLS for CPU design, whereas there are over a dozen such projects using BSV.
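The contrast between the sweet spot and what lies outside it can be seen in the shape of the loops. The sketch below uses Python as a stand-in for C++ and is purely illustrative; the function names and the tiny example matrix are made up. Both functions compute the same matrix-vector product, but in the dense version every loop bound follows from the array shape, while in the sparse version (compressed sparse row format) the inner trip count and the index into `x` are only known once the data is read.

```python
from typing import List


def dense_matvec(a: List[List[float]], x: List[float]) -> List[float]:
    """Dense case: both loop bounds are known from the array shape alone."""
    rows, cols = len(a), len(x)
    y = [0.0] * rows
    for i in range(rows):
        for j in range(cols):
            y[i] += a[i][j] * x[j]
    return y


def sparse_matvec(row_ptr: List[int], col_idx: List[int],
                  vals: List[float], x: List[float]) -> List[float]:
    """Sparse (CSR) case: the inner trip count and the index into x
    both depend on the data."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += vals[k] * x[col_idx[k]]
    return y


# The same 3x3 matrix in both representations.
dense = [[2.0, 0.0, 0.0],
         [0.0, 0.0, 3.0],
         [0.0, 4.0, 0.0]]
row_ptr, col_idx, vals = [0, 1, 2, 3], [0, 2, 1], [2.0, 3.0, 4.0]
x = [1.0, 2.0, 3.0]

print(dense_matvec(dense, x))                       # [2.0, 9.0, 8.0]
print(sparse_matvec(row_ptr, col_idx, vals, x))     # [2.0, 9.0, 8.0]
```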

#### B.1.3.3 Lack of “Algitecture”: Architectural transparency and predictability

Most people with some training in Computer Science are familiar with the idea that Algorithms are Job One—when writing performance-critical software, the first-order concern is to design a good algorithm. Further, creating a good algorithm is a creative act; compilers don't automatically create good algorithms for you<sup>2</sup>.

Unfortunately, because most of our codes run on classical von Neumann machines, many people forget that, when the execution platform changes, our old algorithms may no longer be any good: the assumptions about the cost of fundamental operations may no longer be valid and in fact may be wildly different, requiring a complete re-think of the algorithm.

This brings up a fundamental difference between software design and hardware design. In software, you are given a particular target architecture (CPU, GPU, cluster, vector machine, ...), and the designer's job is to design a good algorithm for that fixed architecture. In hardware, on the other hand, the designer's job is to design the algorithm and the architecture *jointly*. In other words, for hardware designers, algorithm and architecture are joined at the hip; it is meaningless to separate these activities. We thus use the term *Algitecture* to describe this integrated activity.

Unfortunately, most C++ HLS tools provide very narrow visibility and control into architecture. For example, directives for loop unrolling and loop fusion may allow you to express some variation in iterative *vs.* parallel *vs.* pipelined structures. But, basically, it's the tool that chooses the architecture, and you have some weak knobs to guide its choices. A common syndrome with C++ HLS tools is that one quickly produces an implementation, but it is terrible in area or performance, and this is followed by a *long tail* of activity in which the designer tweaks the knobs every which way in an effort to beat it down into the desired performance envelope.

In contrast, with BSV, architectural choices (like algorithmic choices) are in the hands of the designer, where they should be. There are no surprises with respect to architecture; performance is never a mystery, and the designer can quickly improve it and converge to an acceptable solution.

#### B.1.3.4 Summary

In summary, it is our experience that BSV is a much better language for complex hardware design, whether control or data oriented, whether for modeling or architectural exploration or final implementation, or for synthesizable on-FPGA verification transactors. Following the philosophy of DSLs (Domain Specific Languages), BSV is very much an expressive DSL beautifully suited for hardware design, whereas sequential C++ is certainly not (it was never intended to be!).

---

<sup>2</sup>Of course, there is research in this area, but this starts entering the realm of Artificial Intelligence.



# Appendix C

## Glossary

**2's Complement** See entry for “Two’s Complement”.

**ACTs** Architecture Compatibility Tests. A test suite under development by RVI to verify that a given implementation complies with a particular subset of the RISC-V ISA. A candidate implementation must run the relevant ACTs and produce correct “expected” output signatures.

**API** Application Programming Interface. Term commonly used in many programming languages, methodologies and protocols to describe the set of functions/procedures/methods used to interact with a module/object by external entities (from outside the module/object). The API clearly separates external concerns from internal concerns. External concerns are about “what” a method does or sequence of methods do: what are their argument and result types, and what do they (abstractly) achieve. Internal concerns are about “how” methods do what they are supposed to do. This separation of concerns also allows transparently substituting a module implementation with an alternate implementation (*e.g.*, for greater efficiency) without disturbing the external context.

**ASIC** Application-Specific Integrated Circuit. A kind of electronic device that represents a desired digital circuit directly in silicon and has been fabricated for that purpose (not customizable and general-purpose like an FPGA).

**BSV, BH** An open-source, modern, High-Level HDL. Two optional syntaxes (choose to one’s taste): BSV has traditional Verilog-like syntax, BH has traditional Haskell-like syntax.

**BTB** Branch Target Buffer. A component of a PC-predictor module. See Section [19.3.6.4](#).

**CISC** Complex Instruction Set Computer (see also, for contrast, “RISC”). Refers to an ISA where a single instruction may express quite a bit of computation, such as memory accesses, ALU ops, and even iteration.

**CPI** Cycles per Instruction. Please see entry for its inverse, IPC.

**CPU** Central Processing Unit. The computational element of a computer.

**CReg** Concurrent Register. See Section [18.3](#). Also known as EHRs (Ephemeral History Registers).

**CSRs** Control and Status Registers. These are special registers in the ISA, most of which are accessible only while executing at higher privilege levels (Machine and Supervisor). Certain key CSRs play a central role in disciplined transition between privilege levels, in virtual memory, and in memory protection.

**DRAM** Dynamic Random Access Memory. A kind of silicon chip that implements memory. Compared to SRAM, is larger (number of bits), denser (bits per silicon area), cheaper (\$ per bit), uses less power (watts per bit) and is more complex to operate (needing regular refreshing *etc.*). Usually off-chip (not part of an ASIC or an FPGA).

**DUT** Design Under Test. Term commonly used by hardware designers to indicate the artefact being designed, to contrast in particular with the “testbench” or “test harness” which surrounds the design during testing. DUT is often articulated as a word in its own right (pronounced “Dutt”, as in the famous name in Bombay cinema <https://en.wikipedia.org/wiki/Nargis_Dutt>).

**EHR** Ephemeral History Register. The original name for BSV’s CRegs, when developed by Daniel Rosenband and Arvind at MIT [20, 21].

**FPGA** Field Programmable Gate Array. A kind of electronic device that has configurable circuits that can be customized to represent any desired digital design. These are catalog parts available from several vendors.

**FPGA Board** A circuit board containing one or more FPGAs, a power supply, and DRAM memories. Often contains other facilities such as GPIO, UARTs, JTAG, PCIe bus connections, Ethernet connection, USB connection, Flash memory, and so on.

**FSM** Finite State Machine. A sequential process that moves (“transitions”) from one state to another in a fixed repertoire of states. Transitions may loop back to earlier states, and may conditionally select one of a set of alternative next-states.



FSMs are named after the Greek town of Ephesos, the capital of the Kingdom of Ephesus, whose most famous resident was a gentleman named Sisyphus (<https://en.wikipedia.org/wiki/Sisyphus>), who repeatedly rolled a boulder up a hill only for it to roll back down again. This was widely reported in a Greek media outlet of the time called the Drudge Report.

**GPIO** General Purpose Input Output. An electronic device attached to a computer system. When the CPU stores a byte/word to a GPIO address, the bits of the word appear as electronic signals from the device, and can be used as an *actuator* (to switch on/off a bank of LED lamps, a relay, a motor, *etc.*). When the CPU loads a byte/word from a GPIO address, it can read the state of a *sensor* (switches, photocells, motor speed, temperature, *etc.*).

**GPR** General Purpose Register. For RISC-V, just a synonym for the basic register set holding integers. They are “general purpose” in the sense that software is free to use them in any way (in contrast with some earlier ISAs that restricted certain registers to certain roles, such as holding addresses).

**HDL** Hardware Design Language. A language in which one can represent circuits, and for which there are tools that can render a program into actual circuits for FPGAs and ASICs. Examples include: BSV, BH, Chisel, Verilog, SystemVerilog, VHDL.

**HLHDL** High-Level Hardware Design Language. An HDL with higher-levels of abstraction and more powerful constructs and semantics compared to the traditional HDLs Verilog, SystemVerilog and VHDL, in the same sense that modern software programming languages (Java, Python, Javascript, Haskell, OCaml, ...) have higher-levels of abstraction than C/C++ which, in turn, have higher levels of abstraction than Assembly Language. Examples include BSV, BH (the Haskell-syntax variant of BSV), Chisel, and HLS.

**HLS** High Level Synthesis. The term typically used for tools and methodology that compile C/C++/SystemC programs into hardware. HLS can be fragile in that it works best only on certain subsets of C/C++ (“simple rectangular loop and array” algorithms), and requires certain coding styles and directives.

**ILP** Instruction-Level Parallelism. A measure of how many instructions can be executed in parallel without violating the canonical sequential (one-instruction-at-a-time) semantics of an ISA.

**IPC** Instructions Per Clock (or its inverse, CPI, or clocks per instruction). A component, together with clock speed (cycles per second), of a CPU’s application performance. Application performance depends on clock speed, IPC, and total number of instructions.
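The relationship between these three quantities can be written out as a short calculation. The sketch below is illustrative only; the function names and the numbers are made up.

```python
def cpi_from_ipc(ipc: float) -> float:
    """CPI is simply the inverse of IPC."""
    return 1.0 / ipc


def execution_time_s(instructions: int, ipc: float, clock_hz: float) -> float:
    """Run time = (instructions / IPC) cycles, divided by cycles per second."""
    cycles = instructions / ipc
    return cycles / clock_hz


# Hypothetical program: one million instructions at IPC 0.5 on a 100 MHz clock.
print(cpi_from_ipc(0.5))                          # 2.0
print(execution_time_s(1_000_000, 0.5, 100e6))    # 0.02
```

Doubling the IPC or doubling the clock speed would each halve the run time, as would halving the number of instructions executed.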

**ISA** Instruction Set Architecture. A specification of instructions: how an instruction is coded in bits; “architectural state” (PC, registers *etc.*); what it means to execute an instruction; assembly language syntax. The specification is described independently of any particular implementation, traditionally in a manual with text and diagrams, occasionally and recently also in a formal-specification language.

An ISA can (and typically does) have many possible implementations, varying widely in speed, size, power, cost, technology (ASIC, FPGA), *etc.* Examples of famous ISAs and vendors who supply implementations include RISC-V (diverse vendors), x86 (Intel and AMD), ARM (Arm, Apple, Samsung, others), Sparc (Sun, Oracle, Fujitsu, others), MIPS (MIPS, Inc.), Power and PowerPC (IBM, others), ...

**Microarchitecture** The structural and behavioral details of an ISA implementation that are *below* the level of abstraction of the ISA, *i.e.*, not demanded by the ISA but chosen by the implementor for practical reasons (speed, power, area, cost, ...). Examples: pipelines, branch prediction, scoreboards, register renaming, out-of-order execution, superscalarity, instruction fission and fusion, replicated execution units, store-buffers, ...

**MMIO** Memory-Mapped Input-Output. In RISC-V, the CPU reads and writes registers in a device using ordinary LOAD and STORE instructions. The memory system interprets the addresses to direct such requests to a device. Using LOADs and STOREs, the CPU can control the device, send data to the device and retrieve data from the device.
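From the software side, MMIO typically looks like ordinary pointer accesses. The sketch below assumes a hypothetical device with two registers (`tx_data` and `status`) and a hypothetical "ready" bit; none of these names or layouts come from the book's UART model. The device is passed in as a pointer so that the same code can be exercised against ordinary memory.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical device register block. On real hardware it would sit at a
   fixed address chosen by the system's memory map. */
typedef struct {
    volatile uint32_t tx_data;   /* write a byte here to transmit it       */
    volatile uint32_t status;    /* bit 0 = 1 when the device accepts data */
} toy_uart_regs_t;

#define TOY_UART_STATUS_READY  0x1u

/* Send one byte: poll the status register (LOADs), then write data (STORE). */
void toy_uart_putc(toy_uart_regs_t *dev, uint8_t ch)
{
    assert(dev != 0);
    while ((dev->status & TOY_UART_STATUS_READY) == 0) {
        /* busy-wait; 'volatile' forces a fresh LOAD on every iteration */
    }
    dev->tx_data = ch;
}
```

The `volatile` qualifier matters here: without it, a C compiler is free to read `status` once and reuse the value, which would defeat the polling loop.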

**OOO** Out-Of-Order. Refers to advanced CPU microarchitectures which replicate functional units (computational and memory access units) and allow each functional unit to execute immediately any instruction that is “ready” (because its inputs are available), even if that is not in program order. Often combined with superscalarity, to achieve vastly more instruction-level parallelism (ILP) compared to a simple in-order pipeline.

**OS** Operating System. Can vary from small, embedded, real-time OSs such as FreeRTOS, to more capable embedded OSs like Zephyr, to secure micro-kernels like seL4, to full-featured OSs like Linux, Windows, MacOS, Solaris, AIX, *etc.*

**PTW** Page Table Walk. A function in systems supporting virtual memory, to translate virtual memory addresses into physical memory addresses (not described in this book). Requires multiple memory references to descend a “tree” data structure called the Page Table.

**RAS** Return Address Stack. A component of a PC-predictor module. See Section 19.3.6.4.

**RISC** Reduced Instruction Set Computer (see also, for contrast, “CISC”). Refers to an ISA that separates memory-access instructions from computational instructions, and where each instruction is amenable to fast pipeline implementations.

**RISC-V** A particular standard ISA. Originated circa 2008-2010 in research at University of California, Berkeley, and subsequently spun out (2010s) into an international non-profit consortium “RISC-V International” (RVI) headquartered in Switzerland (<https://riscv.org>).

Unlike other well-known ISAs, the RISC-V ISA is an *open* standard, *i.e.*, implementors do not need to pay any license fee in order to use the ISA, which is one of the factors behind its wide adoption by hundreds of vendors.

**RTOS** Real-Time Operating System. A typically small operating system for small embedded systems. Compared to, say, Linux,

- It may not support multiple privilege levels (Machine, Supervisor, User).
- It may not support multiple processes, or a variable number of processes.
- It may not support virtual memory.
- It may not support memory protection across processes.
- It may only support a limited repertoire of devices.

**RTL** Register-Transfer Level/Language. This is a level of abstraction for describing hardware that assumes that the available primitive components are clocked registers and combinational circuits for multiplexers and basic arithmetic and logic functions (adders, subtractors, boolean operations, shifters, *etc.*).

This is a higher level of abstraction than AND/OR/XOR/NOT gates which, in turn, are a higher level of abstraction than transistors which, in turn, are a higher level of abstraction than silicon regions. Each layer of abstraction is automatically compiled to a lower layer using various tools.

**RVI** RISC-V International. See entry for RISC-V.

**Slack** A measure in digital circuits of the margin between the delay of a combinational circuit and the delay allowed by the target clock speed. For example, if the target clock speed is 250 MHz, *i.e.*, has a 4 ns period, and a combinational circuit between registers takes only 3.2 ns, then it is said to have a “positive slack” of 0.8 ns. A circuit that took longer than 4 ns would have negative slack and would not meet timing.

**SoC** System-on-a-chip. Refers to a complete computing system on a chip, including one or more CPUs (with MMUs and caches), shared caches, interconnects, DRAM interface, JTAG, accelerators and devices, *etc.*

**SRAM** Static Random Access Memory. A kind of silicon chip that implements memory. See DRAM above for comparison. Usually on-chip in an ASIC or an FPGA.

**BRAM** Block RAM. A memory component in an FPGA, usually implemented with SRAM.

**Superscalar** Refers to advanced CPU microarchitectures which fetch and execute multiple instructions at a time (typically 2, 4, 8). Often combined with OOO (Out-of-order), to achieve vastly more instruction-level parallelism (ILP) compared to a simple in-order pipeline.

**SystemVerilog** One of the major HDLs. Originally created in the 2000s as a proper superset of Verilog (and thereby subsuming Verilog), and incorporating many features from VHDL; incorporated some modern features from object-oriented software programming languages (principally used in verification testbenches in simulation only); then an IEEE standard that has gone through several versions. Can be used for both analog and digital circuits. Some features can only be used in simulation (a “synthesizable subset” can be rendered into hardware).

**TCM** Tightly Coupled Memory. Usually an SRAM (Static RAM) used directly as the memory of the CPU, with no cache in front of it.

**TLB** Translation Look-aside Buffer. A component used in systems supporting virtual memory, to speed up translation of virtual memory addresses into physical memory addresses (not described in this book).

**Two's Complement** A particular representation of positive and negative integers in bits (binary) that makes it possible to perform both addition and subtraction using the same hardware. Wikipedia has a good discussion: <https://en.wikipedia.org/wiki/Two%27s_complement>
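As a small illustration of "subtraction using the adder", the sketch below computes `a - b` on 8-bit values by adding the two's complement of `b` (invert the bits, add 1). The function names are illustrative only.

```c
#include <assert.h>
#include <stdint.h>

/* a - b computed with an adder only: add the two's complement of b.
   Arithmetic wraps modulo 2^8. */
uint8_t sub_via_add(uint8_t a, uint8_t b)
{
    uint8_t neg_b = (uint8_t)(~b + 1u);   /* two's complement of b */
    return (uint8_t)(a + neg_b);
}

/* Reinterpret an 8-bit pattern as a signed value in the range -128..127. */
int as_signed8(uint8_t bits)
{
    return (bits & 0x80u) ? (int) bits - 256 : (int) bits;
}

/* Usage: both results agree with ordinary subtraction. */
void twos_complement_demo(void)
{
    assert(sub_via_add(5, 3) == 2);
    assert(as_signed8(sub_via_add(3, 5)) == -2);
}
```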

**UART** Universal Asynchronous Receiver/Transmitter. An electronic device attached to a computer system through which the CPU can read ASCII characters from a keyboard and send ASCII characters to a display screen. Typically used for the main console of a computer system.

**UVM** Universal Verification Methodology. A standard methodology and technology in SystemVerilog for testbenches for verification. Exploits the “object-oriented programming” aspects of SystemVerilog for reusability. Wikipedia has an introduction (<https://en.wikipedia.org/wiki/Universal_Verification_Methodology>). There are dozens of tutorials, textbooks and UVM IP providers, both open-source and proprietary.

**Verilog** One of the two grand old HDLs (the other is VHDL). Originally created in the 1980s; then an IEEE standard that has gone through several versions; then subsumed by SystemVerilog. Can be used for both analog and digital circuits. Some features can only be used in simulation (a “synthesizable subset” can be rendered into hardware).

**VCD** Value Change Dump. A file written out during simulation (Verilog simulation or Bluesim simulation) that contains a clock-time-stamped record of every change on every bus (bundle of wires) in the hardware design. This file can then be viewed graphically in any waveform viewer. Waveform viewers are bundled into most commercial simulators. Alternatively, *gtkwave* is a popular free and open-source waveform viewer.

**VHDL** One of the two grand old HDLs (the other is Verilog). Originally created in the 1980s; then an IEEE standard that has gone through several versions. Many features were adopted by SystemVerilog. Can be used for both analog and digital circuits. Some features can only be used in simulation (a “synthesizable subset” can be rendered into hardware).



# Appendix D

## BSV: Importing C/C++ functions into BSV simulations

## D.1 Introduction

This facility is only used in BSV simulation (compiling C code to hardware is very difficult even in limited contexts, see Section [B.1.3](#)) and it is principally used in *testbenches* for designs.

There are many reasons to import C functions into BSV (whether simulating in Bluesim or in Verilog simulation):

- Many applications begin life as a C/C++ model, such as a C/C++ algorithm that we are trying to accelerate in hardware, or an algorithm that we are still prototyping in C/C++ while developing the rest of the system.
- C/C++ is useful for testbench components: They will run much faster than Verilog simulation, and so will not be a performance burden on simulating the Verilog of interest. They have full-service access to data files and operating-system services.
- Even for an actual hardware component for which we have BSV or Verilog code, we may temporarily substitute it with a much faster C model while our focus is on testing other parts of the hardware.

In Drum and Fife code, the whole CPU design is written in BSV. However, to test the CPU we need to connect it to a memory system, a UART, *etc.* and for these we use models written in C. Modeling the memory system in C makes it easy to include code for pre-loading ELF and memhex files.

A C function can be imported into BSV with the simple steps described in the next sections.

BSV can only import C functions directly. If you need to use a C++ function, write a C wrapper function that is invoked from BSV; that C function can then invoke your C++ function.

An imported C function invoked in BSV is semantically instantaneous, never a temporal process. It is imported with `Action` or `ActionValue` type or a pure (combinational) type.<sup>1</sup>

---

<sup>1</sup>Of course an invoked C function, before it returns, could start a C “pthread” which then runs concurrently with the BSV simulation.

## D.2 In BSV code, declare a BSV version of the C function

In the BSV code, the imported C function will be invoked exactly like a normal BSV function. We declare the header of this “BSV” function (*i.e.*, just the result type, function name and arguments with their types, and not the function body), preceded by the phrase `import "BDPI"`:

```
import "BDPI"
function BSV-type function-name ( BSV-type arg
... arg );
```

The argument- and result-passing conventions are simple, adjusting for C’s limitation that arguments and results can be at most 64 bits wide:

- BSV values of sizes up to 8, 16, 32 or 64 bits are passed as C type `uint8_t`, `uint16_t`, `uint32_t` and `uint64_t`, respectively, for both arguments and results, in the same corresponding positions.
- For a BSV argument value of size greater than 64 bits, it is passed to C as a pointer to memory containing the BSV value.
- For a BSV *result value* of size greater than 64 bits, the corresponding C function prototype changes slightly:
  - It gains an extra first argument which gets a pointer to memory which should be filled by the C function with the BSV value to be returned.
  - Because of this, it has a `void` return type.
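The C side of these conventions can be sketched as follows, for two hypothetical imported functions: one with 32-bit arguments and result, and one with 128-bit arguments and result. The function names are invented for illustration, and the pointer type `uint8_t *` is borrowed from the prototype in Section D.6; please check the *bsc* documentation for the exact C types it expects.

```c
#include <assert.h>
#include <stdint.h>

/* Narrow case: 32-bit arguments and result are passed by value. */
uint32_t c_add32(uint32_t a, uint32_t b)
{
    return a + b;   /* wraps modulo 2^32 */
}

#define WIDE_BYTES 16   /* 128 bits */

/* Wide case: the 128-bit result is returned through the extra first
   argument 'result_p', so the C return type is void. The 128-bit
   arguments arrive as pointers to memory holding the BSV values. */
void c_xor128(uint8_t *result_p, const uint8_t *a_p, const uint8_t *b_p)
{
    assert(result_p != 0 && a_p != 0 && b_p != 0);
    for (int i = 0; i < WIDE_BYTES; i++) {
        result_p[i] = (uint8_t)(a_p[i] ^ b_p[i]);
    }
}
```

A bytewise XOR is used for the wide case because it does not depend on the order in which the bytes of the BSV value are laid out in memory.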

## D.3 Compile the BSV code with the *bsc* compiler

If you are compiling for Bluesim, there is no change to how you invoke *bsc*.

If you are compiling to Verilog, provide the additional flag “`-use-dpi`” on the command line. This will generate standard SystemVerilog `import "DPI-C"` declarations in the generated Verilog.

## D.4 Linking

For Bluesim linking, simply provide the C file(s) implementing imported C functions as additional command-line arguments when you again invoke *bsc* for linking.

For Verilog linking, in the imported C code we recommend adding the following for each of the imported C functions:

```
#ifdef __cplusplus
// 'C' linkage is necessary for linking with Verilator object files
extern "C" {
    ... imported C function's prototype ...
}
#endif
```

This directs the C++ compiler to give the function C linkage (an unmangled symbol name and C calling conventions) instead of C++ linkage, which is necessary for SystemVerilog DPI-C. Verilog simulator tools like Verilator compile the C code with a C++ compiler which, by default, would use C++ linkage.

For Verilator linking, simply provide the C/C++ file(s) implementing the imported C functions as additional arguments to the `verilator` command.

For other Verilog simulators, the linking details may vary; please consult their respective manuals or experts on how to link-in C code under the SystemVerilog DPI-C standard.

## D.5 Recommendations for arguments and results of imported C/C++ functions

The following recommendations circumvent potential complications when using imported C code in BSV.

### D.5.1 Only use BSV types corresponding to C types

The data passed between BSV and C have the standard BSV packed representations in bits. It can be tricky in C to deal with, say, a BSV 13-bit value that, in C, is a `uint16_t`. We recommend only using BSV arguments/results whose sizes are multiples of 8-bits so that they map exactly to bytes in C.

Even for BSV structs and vectors, only use structs and vectors whose elements are 8-bit byte-sized. Accessing/updating the components in C is vastly easier with these constraints.

If an original BSV type $T$ does not have “byte-aligned” components, we often define a new type $T2$ with byte-aligned components and copy values from $T$ to $T2$ before the call (for arguments) or from $T2$ to $T$ after the call (for results). This is extra, technically unnecessary “copying” of data, but the resulting simplification of the C code is well worth it.
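To see why non-byte-sized values are awkward, consider the 13-bit example. The sketch below assumes the 13-bit value sits in the low-order bits of the `uint16_t` and makes no assumption about the three upper bits; both assumptions, and the function names, are ours for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define FIELD_BITS 13
#define FIELD_MASK ((uint16_t)((1u << FIELD_BITS) - 1u))   /* 0x1FFF */

/* Unsigned view: discard the 3 unused upper bits of the uint16_t. */
uint16_t get_u13(uint16_t raw)
{
    return (uint16_t)(raw & FIELD_MASK);
}

/* Signed view: bit 12 is the sign bit and must be extended by hand. */
int16_t get_s13(uint16_t raw)
{
    int v = raw & FIELD_MASK;
    if (v & (1 << (FIELD_BITS - 1))) {
        v -= (1 << FIELD_BITS);
    }
    return (int16_t) v;
}

/* Usage: the same bit pattern read as unsigned and as signed. */
void field13_demo(void)
{
    assert(get_u13(0x1FFF) == 8191);
    assert(get_s13(0x1FFF) == -1);
}
```

With a 16-bit BSV value, neither the masking nor the manual sign extension is needed: the value maps directly onto `uint16_t` or `int16_t`.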

For structs, be careful that C structs may have “gaps” or “padding” between fields to improve word-alignment, whereas BSV structs are tightly packed. Thus a BSV struct and a C struct, though they may look identical, may have different data representations.
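The mismatch can be made concrete with a hypothetical two-field request. A tightly packed 8-bit tag plus 32-bit address occupies 5 bytes, whereas the equivalent C struct is commonly 8 bytes because of padding. The sketch unpacks field by field rather than casting the buffer to a struct pointer; the byte order it uses is an assumption for illustration, not a statement about how *bsc* lays out data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical request: 8-bit tag and 32-bit address (40 bits packed). */
typedef struct {
    uint8_t  tag;
    uint32_t addr;
} toy_req_t;

#define TOY_REQ_PACKED_BYTES 5

/* Unpack from the packed bytes. ASSUMED byte order: addr in bytes 0..3,
   least-significant byte first; tag in byte 4. */
toy_req_t toy_req_unpack(const uint8_t *packed)
{
    toy_req_t r;
    assert(packed != 0);
    r.addr = (uint32_t) packed[0]
           | ((uint32_t) packed[1] << 8)
           | ((uint32_t) packed[2] << 16)
           | ((uint32_t) packed[3] << 24);
    r.tag  = packed[4];
    return r;
}
```

Comparing `sizeof(toy_req_t)` and `offsetof(toy_req_t, addr)` against the packed size on your own platform is a quick way to see whether the C compiler has inserted padding.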

### D.5.2 Use `ActionValue#(t)` for imported C function’s result

In an `import "BDPI"` declaration the result type of the function may be `Action`, `ActionValue#(t)`, or some type $t$. As discussed in Section 5.6.1, in BSV the last case (not an action or actionvalue) is taken as a strong guarantee of mathematical purity (and absence of side-effects); the *bsc* compiler may merge multiple invocations that have the same arguments, and it may move a function elsewhere in the code with different control conditions. These optimizations can lead to nasty surprises when the C function is not really pure (including core-dumps if it relies on prior initializations).

We recommend normally using `ActionValue#(t)` for the result of any imported C function (or `Action` if the C result is `void`) to avoid surprises.

Use a non-action/actionvalue return type *only* if you are absolutely sure that the C function is pure and does not need any other initializations before invocation (*e.g.*, the C library `toupper` character transformer or `cos()` trigonometric function). Even for these, be safe and use an actionvalue function.
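The difference between a pure and an impure C function can be seen in a small sketch. Both functions below are hypothetical; the second keeps hidden state in a `static` variable, so merging two of its invocations into one would change what the program does.

```c
#include <assert.h>
#include <stdint.h>

/* Pure: the result depends only on the argument. */
uint32_t c_double(uint32_t x)
{
    return 2u * x;
}

/* NOT pure: hidden state makes every call return a different value. */
uint32_t c_next_ticket(uint32_t base)
{
    static uint32_t counter = 0;
    counter++;
    return base + counter;
}

void purity_demo(void)
{
    /* Equal arguments give equal results, so sharing one call is harmless. */
    assert(c_double(21) == c_double(21));

    /* Equal arguments give different results; replacing these two calls
       by a single shared call would change the program's behavior. */
    uint32_t first  = c_next_ticket(100);
    uint32_t second = c_next_ticket(100);
    assert(second == first + 1);
}
```

A function like `c_next_ticket` is the kind that should be imported with an `ActionValue#(t)` result type.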

## D.6 Example: Memory Model for Drum and Fife

For Fife and Drum, because our focus is on the CPU module, we implement the memory system in C for convenience. In addition to the speed reason (not slowing down simulation), C is also more convenient for reading in memhex32 and ELF files.

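The corresponding C function prototype, in a C file, is given next. Following the convention in Section D.2 for results wider than 64 bits, its first parameter `result_p` is a pointer to memory that the C function fills with the result, and its return type is `void`: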

```
#ifdef __cplusplus
// 'C' linkage is necessary for linking with Verilator object files
extern "C" {
void c_mems_devices_req_rsp (uint8_t          *result_p,
                             const uint64_t   inum,
                             const uint32_t   req_type,
                             const uint32_t   req_size_code,
                             const uint64_t   addr,
                             const uint32_t   client,
                             uint8_t          *wdata_p);
}
#endif
```
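To give a feel for what a function of this shape might do internally, here is a deliberately tiny, hypothetical memory model. It is not the Drum/Fife implementation: the function name, the request-type encoding, the meaning of the size code (2^n bytes) and the layout of the result (one status byte followed by eight data bytes) are all assumptions made for this sketch, and the `inum` and `client` parameters are omitted.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_MEM_BYTES   1024u
#define TOY_REQ_LOAD    0u     /* hypothetical request-type encoding */
#define TOY_REQ_STORE   1u

static uint8_t toy_mem[TOY_MEM_BYTES];

/* Hypothetical result layout: byte 0 = status (0 ok, 1 error),
   bytes 1..8 = load data. Size code n means 2^n bytes (n = 0..3). */
void toy_mem_req_rsp(uint8_t        *result_p,
                     const uint32_t  req_type,
                     const uint32_t  req_size_code,
                     const uint64_t  addr,
                     const uint8_t  *wdata_p)
{
    assert(result_p != 0);
    memset(result_p, 0, 9);

    const uint64_t n_bytes = (req_size_code <= 3u) ? (1ull << req_size_code) : 0;
    if (n_bytes == 0 || addr > TOY_MEM_BYTES - n_bytes) {
        result_p[0] = 1;   /* bad size code or address out of range */
        return;
    }
    if (req_type == TOY_REQ_LOAD) {
        memcpy(&result_p[1], &toy_mem[addr], (size_t) n_bytes);
    } else if (req_type == TOY_REQ_STORE && wdata_p != 0) {
        memcpy(&toy_mem[addr], wdata_p, (size_t) n_bytes);
    } else {
        result_p[0] = 1;   /* unknown request type or missing store data */
    }
}
```

Because the memory is just a C array, pre-loading it from a file before simulation starts is ordinary C file handling, which is one of the conveniences mentioned above.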

# Index of BSV topics

/\*, start of block comment (until-\*), 5-13  
//, start of comment-to-end-of-line, 5-13  
? (don't care literal value), 6-4

AAAA\_AAAA, the default don't care value, 6-4  
**Action**  
    as a first-class type, 11-5  
**Action** type of expression with side-effects, 5-7  
    **Action**: primitive type of actions, 10-4  
    **Action**: type of pure side-effect expressions, 8-5  
    action-endaction blocks, 10-4  
    actions, 10-4  
    **ActionValue** type of expression with side-effects, 5-7  
    assertions for debugging, 12-4

Binary notation for integer literals, 5-4  
**Bit**  
    Integer type, 5-3  
    Bit Vectors, 5-2  
        **extend**, 5-3  
        **truncate**, 5-3  
        **zeroExtend**, 5-3  
        operators on, 5-2  
        slices of, 5-2  
**Bool**, 5-4  
    operators on, 5-4  
bus (hardware, bundle of wires), 5-6

**case** expression or statement, 5-17  
Combinational circuits, 5-6  
    data types, 5-8  
    purity, 5-6  
Combinational primitives, 5-6  
Comments  
    block, from /\* to matching \*/, 5-13  
    to end-of-line, starting with //, 5-13  
Conditional compilation, 5-19  
Connecting FIFOs, 8-11

deriving  
    Bits, 6-3  
    FShow, 6-3  
**deriving Bits**, 5-12  
**deriving Eq**, 5-12  
**deriving FShow**, 5-12  
\$display has Action type, 10-4  
Don't care literal value ?, 6-4  
Drum  
    as a set of concurrent FSMs, 10-2  
    as an FSM, 10-2  
**DUT**, 12-1

**endpackage**, 4-1  
enum types, 5-11  
**export** statement, 4-3  
Exporting types abstractly, 4-4

field  
    of a **struct**, 6-1

**Fife**  
    as a set of concurrent FSMs, 10-3

**FIFO**, 8-8  
    **mkBypassFIFO**  
        module (constructor), 8-13  
    **mkFIFO**  
        instantiation, 8-9  
        module (constructor), 8-9  
        reset value, 8-9  
    **mkPipelineFIFO**  
        module (constructor), 8-13  
    **mkSizedFIFO**  
        module (constructor), 8-9  
        strongly-typed, 8-10

**FIFO**  
    **pop** method, 8-8  
    type of stored value, 8-8

**FIFO** interface, 8-8  
**FIFO** interface methods, 8-8  
**FIFO\_O**  
    interface transformer from FIFO, 8-11  
    **pop\_o** method, 8-10  
**FIFO\_I** semi-fifo interface, 8-10  
**FIFO\_O**: semi-fifo interface, 8-10  
Finite State Machines, 10-1

**Fmt**  
 formatted object, 12-3  
 Formatted output using `Fmt` objects, 12-3

**FShow**  
 Standard Typeclass containing the `fshow()` function, 12-3

**fshow**  
 Standard overloaded function producing `Fmt` objects for various types, 12-3

**FSMs**, 10-1  
 concurrent *vs.* sequential, 10-2  
 sequential *vs.* concurrent, 10-2  
`Stmt`: type of argument to FSM module constructors, 10-6

Fully-qualified imported names, 4-4

**functions**  
 application, 5-7  
 definition, 5-7

Haskell, monadic types similarity, 5-8  
 Hexadecimal notation for integer literals, 5-4

Identifier syntax, 5-12

**Identifiers**  
 Enum constant: initial upper-case letter, 5-12  
 First letter lower- or upper-case, 5-12  
 Ordinary: initial lower-case letter, 5-12  
 Type: initial upper-case letter, 5-12

if-then-else, 5-13  
 nested, 5-14

if-then-else: ordinary expression *vs* `StmtFSM` process, 10-6

**export** statement, 4-3

Importing C and C++ functions, D-1

**Int**  
 Integer type, 5-3

Integer types `Bit`, `Int`, `UInt`, `Integer`, 5-3

**Interface**  
 FIFO FIFO interface, 8-8  
`Reg` register interface, 8-5  
`RegFile` register file interface, 8-7  
 type, 8-2

interface  
 declaration, 4-4

**interface**  
 declaration (a module's API)  
 typical components in, 4-4

interface definition  
 typical components in, 4-6

Interface transformer functions, 8-11

internal behavior: rules, 8-2

**let**  
 binding an identifier with implicit type declaration, 6-4

**let-bindings** in Action blocks, 10-5

**Literals**  
 Binary integer notation, 5-4  
 Hexadecimal integer notation, 5-4

**member**  
 of a `struct`, 6-1

**Method**  
 invocation of module method, 8-4

**mkAutoFSM** module in `StmtFSM` library package, 10-7

**mkBypassFIFO**  
 module (constructor), 8-13

**mkConnection** for connecting compatible interfaces, 8-11

**mkFIFO**  
 instantiation, 8-9  
 module (constructor), 8-9  
 reset value, 8-9

**mkPipelineFIFO**  
 module (constructor), 8-13

**mkReg**  
 instantiation, 8-5  
 module (constructor), 8-5  
 reset value, 8-5

**mkRegU**  
 module (constructor), 8-6

**mkSizedFIFO**  
 module (constructor), 8-9

**Module**, 8-1  
 (persistent) state, 8-1  
 behavior, 8-4  
 constructor, 8-4  
 instance, 8-4  
 instantiation, 8-4  
 interface, 8-1, 8-4  
 method invocation, 8-4

**mkBypassFIFO** module (constructor), 8-13

**mkFIFO** module (constructor), 8-9

**mkPipelineFIFO** module (constructor), 8-13

**mkReg** module (constructor), 8-5

**mkRegFileFull** module (constructor), 8-7  
 state, 8-4

**module**  
 declaration, 4-4  
 interaction via methods, 4-8

**module**  
 declaration  
 typical components in, 4-5

module instance hierarchy, 4-7  
 Monadic types  
     Haskell similarity, 5-8  
 Monomorphic Types (types without type-variables), 8-13  
 multiplexers, 5-13  
     cascaded/serial/priority, 5-13  
     parallel, 5-15  
 MUX, 5-13  
 Operators on Bit Vectors, 5-2  
 Overloading: Typeclasses and typeclass instances, 8-12  
 package, 4-1  
 packing of struct fields and vector elements, 6-3  
 parameterization, 5-17  
 Pattern matching  
     match statement for tuples, 6-6  
 Polymorphic Types, 8-13  
     Type variables (identifiers beginning with lower-case letter), 8-13  
 pop  
     FIFO method, 8-8  
 pop\_o  
     FIFO\_O method, 8-10  
 printf-style debugging, 12-2  
 Propagation delay, 5-7  
 \_read: register method, 8-5  
 RegFile, 8-7  
 Register, 8-5  
     <= register assignment, 8-6  
     implicit register read, 8-6  
     mkReg  
         instantiation, 8-5  
         module (constructor), 8-5  
         reset value, 8-5  
     mkRegU  
         module (constructor), 8-6  
     \_read method, 8-5  
 Reg register interface, 8-5  
 strongly-typed, 8-5  
 \_write method, 8-5  
 Register file, 8-7  
     methods, 8-7  
     mkRegFileFull  
         instantiation, 8-7  
         module (constructor), 8-7  
         reset value, 8-7  
 RegFile interface, 8-7  
 RegFile register file interface, 8-7  
 type of index, 8-7  
 type of stored value, 8-7  
 replicate vector library function to create a vector value, 17-11  
 rule  
     typical components in, 4-6  
 rule: the fundamental behavioral construct in BSV, 8-12  
 Rules, 10-3  
     vs. StmtFSM, 14-11  
 Body, of type Action, 14-1  
 Explicit condition, 14-1  
 Implicit/READY condition, 14-1  
 prioritizing explicitly, 17-7  
 Semantics of collection of rules, 14-6  
 Semantics of individual rule, 14-2  
 Syntax and types, 14-1  
 rules  
     internal behavior, internal processes, 8-2  
 Semi-FIFO  
     FIFO\_I semi-fifo interface, 8-10  
     FIFO\_O semi-fifo interface, 8-10  
 static elaboration, 4-7  
 StmtFSM  
     for-loop repetition, 10-8  
     mkAutoFSM module, 10-7  
     par blocks (fork-join concurrency), 10-8  
     vs. rules, 14-11  
     an abstraction of rules, 10-3  
     await: pausing until some condition, 10-7  
     if-then-else: process conditional, 10-6  
     in testbenches, 10-7  
     seq .. endseq: sequences of actions, 10-6  
     structured process, 10-3  
     translation into rules, 14-10  
     while-loop repetition, 10-6  
 struct  
     entire struct values, 6-4  
     field assignment/update, 6-5  
     field selection, 6-5  
     heterogeneous collection of values, 6-1  
     nested, 7-2  
 struct  
     type declaration, 6-3  
 structs  
     packing of fields, 6-3  
 (\* synthesize \*) attribute, 8-4  
 synthesize  
     attribute on modules for Verilog generation, 8-4, 8-14  
 Testbench, 12-1  
 Testbenches, 5-9  
     FSMs, 5-9

**truncate**, operation to shrink bit-width, 7-6

Tuples, 6-5

Type variables

- Identifiers beginning with lower-case letter, for polymorphism, 8-13

Typeclass

- instance of, 5-12, 8-12

Typeclasses, 5-12

- BSV's "overloading" mechanism, 8-12

Types

- `ActionValue`, 5-7
- `Action`, 5-7
- `Bit#(n)`, 5-2
- `Bool`, 5-4
- interface, 8-2
- numeric, 5-18
- of combinational circuits, 5-8
- synonyms, 5-18
- `valueOf`: value of a numeric type, 5-18

**Integer**

- Unbounded (non-synthesizable) Integer type, 5-3

**UInt**

- Unsigned integer type, 5-3

Value Change Dump (VCD) waveform dumping, 12-5

`valueOf`: value of a numeric type, 5-18

VCD waveform dumping, 12-5

**vector**

- library data type, 17-10
- library `replicate` function, 17-11
- of $n$ bits *vs.* `Bit#(n)`, 17-11
- representation in bits, 17-11

vectors

- packing of fields, 6-3

Waveform dumping (VCDs), 12-5

Wraparound arithmetic for fixed-width integer types, 5-4

`_write`: register method, 8-5

X values

- Notation for unassigned values in Verilog (no such concept in BSV), 6-5

# Index of RISC-V topics

- Address alignment, 6-8
- Architectural state, 3-3
- Balancing concurrent paths in a pipeline, 17-9
- BRAM (Block RAMs in FPGAs, which are SRAMs), 19-14
- Branch prediction, 16-2
- Bubble in a pipeline, 16-5
- Bypassing, 16-6
- Cache, 19-14
  - miss penalty, 19-14
  - non-blocking, 19-14
- cache
  - hit-under-miss, 19-2
  - non-blocking, 19-2
- Commit the side-effects of an instruction, 16-4
- CPI: cycles per instruction, 2-16
- Decode function (`fn_Decode`), 7-3
- Dispatch function (`fn_Dispatch`), 7-7
- DMem, Data Memory, 6-6
- DMem (data memory), 6-6
- DRAM (dynamic RAM), 19-14
- Drum
  - CPU interface, 11-1
  - CPU module actions, 11-5
  - CPU module behavior, 11-12
  - Skeleton module, 11-2
- Execute Control function (`fn_EX_Control`), 7-11
- Execute Integer function (`fn_EX_Int`), 7-13
- Fetch function (`fn_Fetch`), 7-1
- Fife
  - Skeleton module, 11-2
- `fn_Decode` (Decode function), 7-3
- `fn_Dispatch` (Dispatch function), 7-7
- `fn_EX_Control` (Execute Control function), 7-11
- `fn_EX_Int` (Execute Integer function), 7-13
- `fn_Fetch` (Fetch function), 7-1
- Formal verification, 13-4
- Golden Reference Model, 13-2
- GPRs: general purpose registers, 9-1
- GPRs\_IFC interface for `mkGPRs`, 9-1
- Harvard architecture, 6-7
  - Self-modifying code, 6-7
- Hazards
  - Scoreboard for managing, 16-4
- IMem, Instruction Memory, 6-6
- IMem (instruction memory), 6-6
- In-flight memory transactions, 17-9
- Instruction ordering
  - tags for, 16-7
- IPC: instructions per cycle, 2-16
- ISA
  - Architectural State, 2-2
  - Assembly Language, 2-2
  - Contract between software and hardware implementations, 2-2
  - Enables software portability, 2-2
  - Evolution of, 2-18
  - Extensions to, 2-18
  - Formal Specification in Sail, 2-2
  - Instruction semantics, 2-2
  - Instruction Set, 2-2
  - Instruction Set Architecture, 2-1
  - Instruction Set encoding in bits, 2-2
  - PC, Program Counter, 2-2
  - Program Counter, PC, 2-2
  - Register File, 2-2
  - RV32I base integer instructions, 2-7
  - Sail Formal Specification language, 2-2
- JIT (Just-in-time compiling), 6-7
- Just-in-time compiling (JIT), 6-7
- Machine code, 2-2
- Memory
  - Address alignment, 6-8
  - BRAM (Block RAMs in FPGAs, which are SRAMs), 19-14
  - Cache, 19-14
    - non-blocking, 19-14
  - DRAM (dynamic RAM), 19-14
  - latency, 17-9
  - Physical address, 19-14
  - pipelined, 17-9
  - Request, 6-7
  - Response, 6-9
  - SRAM (static RAM), 19-14
  - Virtual address, 19-14
  - Virtual Memory, 19-14
- Memory access
  - cache hit/miss, 3-4
  - latency, 3-4
  - one or more clock ticks, 3-4
  - viewed as a pipeline/queue, 3-4
- Micro-architecture, 2-2
  - Multi-core, 2-2
  - Multi-threaded, 2-2
  - Out-of-order, 2-2
  - Pipelining, 2-2
  - Speculation, 2-2
  - Superscalar, 2-2
- microarchitecture
  - out-of-order, 19-2
  - superscalar, 19-2
- misprediction
  - manage with epochs, 19-8
  - penalty, 19-8
  - shadow, 19-8
- Misprediction (wrong path instructions), 16-2
- `mkGPRs` a module wrapper around library `RegFile`, 9-2
  - `mkGPRs_synth` a module wrapper for `mkGPRs` for synthesizability, 9-3
- MMIO (Memory-Mapped Input Output), 16-8
- out-of-order microarchitecture, 19-2
- PC prediction, 16-2
  - BTB with hysteresis, 19-10
  - BTB, control-flow-indexed, 19-11
  - BTBs (Branch Target Buffers), 19-10
  - RAS (Return Address Stack), 19-11
  - redirection, 16-3
- PC-prediction
  - misprediction, 19-8
- Physical address, 19-14
- Pipeline
  - balanced and unbalanced, 19-5
  - Bubble, 16-5
  - slack, 19-5
- Pipeline trace, 19-2
- Prediction, 16-2
  - redirection on misprediction, 16-3
- `rg_epoch` register for managing mispredictions, 16-3
- Scoreboard, to manage register read/write hazards, 16-4
- Short-circuiting (bypassing), 16-6
- Simulator
  - trusted, functional, 13-2
- Speculation of instructions, 16-4
- Speculative instruction, 16-4
- Split-phase memory transaction, 3-4
- SRAM (static RAM), 19-14
- Store Buffer, 16-7
- superscalar microarchitecture, 19-2
- Tags
  - `EXEC_TAG_CONTROL`, 16-7
  - `EXEC_TAG_DIRECT`, 16-7
  - `EXEC_TAG_DMEM`, 16-7
  - `EXEC_TAG_IALU`, 16-7
  - for proper instruction ordering, 16-7
- Tandem verification
  - symmetric, 13-7
- TCM (tightly coupled memory), 19-14
- Timing
  - slack, 19-5
- TLB
  - miss penalty, 19-14
- TLB translation lookaside buffer, 19-14
- Trusted functional simulator, 13-2
- Virtual address, 19-14
- Virtual Memory, 19-14
  - Page Table, 19-14
  - Page Table Walk, 19-14
  - Pages, 19-14
  - PTW (Page Table Walk), 19-14
  - TLB translation lookaside buffer, 19-14
- Wrong-path instructions (due to misprediction), 16-2
- x0: Special “always zero” register, 9-1

# Bibliography

- [1] F. Baader and T. Nipkow. *Term Rewriting and All That*. Cambridge University Press, 1998. ISBN 0 521 45520 0.
- [2] Bluespec, Inc. BSV Guide, 2022 (first version 2000).
- [3] E. Borin. *An Introduction to Assembly Programming with RISC-V*. 2021 (Revised: May 9, 2022). PDF online: <https://riscv-programming.org/book.html>.
- [4] K. Chandy and J. Misra. *Parallel Program Design: A Foundation*. Addison Wesley, 1988.
- [5] P. Dabbelt, M. Clark, and A. Bradbury. RISC-V Assembly Programmer's Manual, Recent update: Jun 29, 2023. Online: <https://github.com/riscv-non-isa/riscv-asm-manual/blob/master/riscv-asm.md>.
- [6] D. Patterson and A. Waterman. *The RISC-V Reader: An Open Architecture Atlas*. Strawberry Canyon, 2017. ISBN-10: 0999249118, ISBN-13: 978-0999249116. Available in bookstores.
- [7] E. W. Dijkstra. *A Discipline of Programming*. Prentice Hall, 1976.
- [8] A. J. Dos Reis. *RISC-V Assembly Language*. 2019. ISBN-10: 1088462006, ISBN-13: 978-1088462003. Available on Amazon.com.
- [9] J. L. Hennessy and D. A. Patterson. *Computer Architecture: A Quantitative Approach, 6th Edition*. Morgan Kaufmann, November 23 2017. The Morgan Kaufmann Series in Computer Architecture and Design. eBook ISBN 9780128119068. paperback: 978-0128119051.
- [10] IEEE. IEEE Standard VHDL Language Reference Manual, IEEE Std 1076-1993, 2002.
- [11] IEEE. IEEE Standard Verilog (R) Hardware Description Language, 2005. IEEE Std 1364-2005.
- [12] IEEE. IEEE Standard for Standard SystemC Language Reference Manual, January 9 2012. IEEE Std 1666-2011.
- [13] IEEE. IEEE Standard for System Verilog—Unified Hardware Design, Specification and Verification Language, 21 February 2013. IEEE Std 1800-2012.
- [14] J. Kamperman. *Compilation of Term Rewriting Systems*. PhD thesis, University of Amsterdam, 1996.
- [15] J. Klop. *Term Rewriting Systems*, volume 2, pages 1–116. Oxford University Press, 1992.
- [16] L. Lamport. *Specifying Systems: The TLA+ Language and Tools for Hardware and Software Engineers*. Addison-Wesley Professional (Pearson Education), 2002.
- [17] C. Metayer, J.-R. Abrial, and L. Voisin. The Event-B Language, May 31 2005. <http://rodin.cs.ncl.ac.uk/deliverables.htm>.
- [18] D. A. Patterson and J. L. Hennessy. *Computer Organization and Design RISC-V Edition (2nd Edition): The Hardware Software Interface*. Morgan Kaufmann, 2020. The Morgan Kaufmann Series in Computer Architecture and Design. ISBN-10: 0128203315, ISBN-13: 978-0128203316. Available in bookstores.
- [19] S. Peyton Jones (Editor). *Haskell 98 Language and Libraries: The Revised Report*. Cambridge University Press, May 5 2003. [haskell.org](http://haskell.org).
- [20] D. L. Rosenband. The Ephemeral History Register: Flexible Scheduling for Rule-Based Designs. In *Proc. MEMOCODE'04*, June 2004.
- [21] D. L. Rosenband. *A Performance Driven Approach for Hardware Synthesis of Guarded Atomic Actions*. PhD thesis, Dept. of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, August 2005.
- [22] SHAKTI Development Team, Indian Institute of Technology, Madras, India ([shakti.org.in](http://shakti.org.in)). *RISC-V Assembly Language, Programmer Manual Part I*, 2020. PDF online: <https://shakti.org.in/docs/risc-v-asm-manual.pdf>.
- [23] Terese. *Term Rewriting Systems*. Cambridge University Press, 2003. Cambridge Tracts in Theoretical Computer Science 55.
- [24] Various authors. BSV Language Reference Guide, 2024 (revised frequently).
- [25] Various authors. BSV Libraries Reference Guide, 2024 (revised frequently).
- [26] A. Waterman and K. Asanović. The RISC-V Instruction Set Manual Volume I: Unprivileged ISA, December 13 2019. Document Version 20191213. PDF online (and newer versions, if any): <https://riscv.org/technical/specifications/>.
- [27] A. Waterman, K. Asanović, and J. Hauser. The RISC-V Instruction Set Manual Volume II: Privileged Architecture, December 4 2021. Document Version 20211203. PDF online (and newer versions, if any): <https://riscv.org/technical/specifications/>.