

---

# **ION -- free, open source MIPS32r2 compatible CPU core**

---

Revision 1 - March 11, 2014

Core Design Notes

---

© ION Dev Team 2014

## **OVERVIEW**

This document contains design notes meant for the CPU maintainer or for anyone interested in the design of the ION core.

None of the information in this document is necessary to use the core itself.

In time, this document will explain the design of the CPU, its implementation and the rationale behind them.

This document has been hastily put together and is very far from complete. It must be considered a work in progress.

## **CAVEATS**

This document is part description and part wishlist, since the core is in the middle of a major refactor.

Some of the specifications laid out in this document may change before the core is completed - some haven't even been agreed on.

## Table of Contents

- 1.- ION Internal Memory Bus
  - Read Cycles
  - Write Cycles
  - Pending refactors for the new version of the CPU
- 2.- Structural Description
  - PC logic
- 3.- Instruction Execution
  - BFC02A0: 00641020 ADD r2, r3, r4

## **1.- ION Internal Memory Bus**

This is the memory bus used internally in the core; it's not used outside the **ion\_core** entity or in its external interfaces so its details will only interest you if you want to debug or develop the core itself. For lack of a better name, this document will call it the "ION Bus".

It is a plain, point-to-point, pipelined 32-bit bus meant to be connected to synchronous memories of the usual FPGA BRAM kind. Its operation reflects some of the quirks of the pipeline, which makes it unsuitable for general purpose use.

The bus is used to interface entity **ion\_cpu** with the caches. The CPU is always the bus master.

It is formally encoded in two record data types, one for the master outputs (**MOSI**) and another for the master inputs (**MISO**). The signals are described in the following table.

*Table 1: ION Internal Memory Bus Signals*

| Signal               | Width | Description                                    |
|----------------------|-------|------------------------------------------------|
| <b>t_cpumem_mosi</b> |       | <b>Master output, slave input</b>              |
| addr                 | 32    | Address bus.                                   |
| rd_en                | 1     | Read Enable.                                   |
| wr_be                | 4     | Write Byte Enable. Bit 0 is for wr_data[7..0]. |
| wr_data              | 32    | Data bus, write.                               |
| <b>t_cpumem_miso</b> |       | <b>Master input, slave output</b>              |
| rd_data              | 32    | Data bus, read.                                |
| mwait                | 1     | Asserted to stall a read or write cycle.       |

The bus only supports simple read and write cycles as described in the following sections.
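
As an illustration of the two record types, here is a minimal Python sketch that mirrors the signals of Table 1. The field names come from the table; the class names and the `cycle_kind` helper are hypothetical and exist only for this example. It is not the VHDL declaration used by the core.

```python
from dataclasses import dataclass


@dataclass
class CpuMemMosi:
    """Master outputs; field names follow Table 1 (t_cpumem_mosi)."""
    addr: int = 0         # 32-bit address
    rd_en: bool = False   # read enable
    wr_be: int = 0        # 4-bit write byte enable; bit 0 covers wr_data[7..0]
    wr_data: int = 0      # 32-bit write data


@dataclass
class CpuMemMiso:
    """Master inputs; field names follow Table 1 (t_cpumem_miso)."""
    rd_data: int = 0      # 32-bit read data
    mwait: bool = False   # asserted by the slave to stall a cycle


def cycle_kind(mosi: CpuMemMosi) -> str:
    """Classify what the master is requesting (hypothetical helper)."""
    if mosi.rd_en:
        return "read"
    if mosi.wr_be & 0xF:
        return "write"
    return "idle"
```

For instance, `cycle_kind(CpuMemMosi(addr=0x100, rd_en=True))` classifies the request as a read, while a record with all fields at their defaults is an idle bus.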

### Read Cycles

For a read cycle without any wait states, this is what happens on the bus on active (positive) clock edges:

- Edge 1: Master drives **MOSI.addr** and asserts **MOSI.rd\_en**.
- Edge 2: Master deasserts all **MOSI** signals.
- Edge 2: Slave drives **MISO.rd\_data** and deasserts **MISO.mwait**.
- Edge 3: Slave must drive **MISO.rd\_data** until this edge (for two clock cycles).

For a read cycle with wait states, the edge-by-edge sequence is this:

- Edge 1: Master drives **MOSI.addr** and asserts **MOSI.rd\_en**.
- Edge 2: Master deasserts all **MOSI** signals.
- Edge 2: Slave asserts **MISO.mwait** for as long as needed (say K cycles).
- Edge 2+K: Slave drives **MISO.rd\_data** and deasserts **MISO.mwait**.
- Edge 2+K+1: Slave must drive **MISO.rd\_data** until this edge (for two clock cycles).



Figure 2: **ION Bus Read Cycles**

Note that the bus slave must hold **MISO.rd\_data** valid for two clock cycles, otherwise the CPU might miss it.
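
The two edge-by-edge sequences above are the same sequence with a different number of wait states. The following Python sketch simply parametrizes the list by K; the function name and the tuple layout are assumptions made for this example, and the event texts are paraphrased from the lists above.

```python
def read_cycle_events(wait_states):
    """Return (edge, actor, action) tuples for one read cycle.

    `wait_states` is K, the number of cycles the slave keeps mwait asserted.
    """
    k = wait_states
    events = [
        (1, "master", "drive MOSI.addr, assert MOSI.rd_en"),
        (2, "master", "deassert all MOSI signals"),
    ]
    if k > 0:
        # Only present when the slave needs wait states.
        events.append((2, "slave", f"assert MISO.mwait for {k} cycle(s)"))
    events.append((2 + k, "slave", "drive MISO.rd_data, deassert MISO.mwait"))
    events.append((2 + k + 1, "slave", "MISO.rd_data must be driven until this edge"))
    return events


if __name__ == "__main__":
    for edge, actor, action in read_cycle_events(3):
        print(f"Edge {edge}: {actor}: {action}")
```

With `wait_states=0` the function reproduces the first list, ending at edge 3.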

When the CPU detects a load hazard (a load to register $R_x$ in instruction N followed by a reference to register $R_x$ in instruction N+1), the pipeline is stalled for one cycle so that instruction N+1 uses the correct, loaded value of register $R_x$; this is a *load interlock*. When that happens, the bus slave must hold **MISO.rd\_data** valid for an extra cycle.

Since there is no way to know the pipeline is stalled for a load hazard, the slave must do this for all read transactions. In effect, the pipeline has been simplified at the expense of the bus slave.
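
To make the hazard condition concrete, here is a hedged Python sketch of a load-use check on two consecutive MIPS32 instruction words. It is not the core's **load\_interlock** logic. The function names are hypothetical, and the check is deliberately conservative: it assumes instruction N+1 may read both its rs and rt fields, which is not true of every instruction format.

```python
# Standard MIPS32 load opcodes: LB, LH, LWL, LW, LBU, LHU, LWR.
LOAD_OPCODES = {0x20, 0x21, 0x22, 0x23, 0x24, 0x25, 0x26}


def fields(word):
    """Extract the opcode, rs and rt fields of a 32-bit instruction word."""
    return {
        "opcode": (word >> 26) & 0x3F,
        "rs": (word >> 21) & 0x1F,
        "rt": (word >> 16) & 0x1F,
    }


def load_hazard(instr_n, instr_n1):
    """True if instr_n loads a register that instr_n1 may reference."""
    n = fields(instr_n)
    if n["opcode"] not in LOAD_OPCODES:
        return False
    target = n["rt"]          # loads write the register named by rt
    if target == 0:
        return False          # register 0 is hardwired to zero
    m = fields(instr_n1)
    # Conservative assumption: treat both rs and rt of N+1 as sources.
    return target in (m["rs"], m["rt"])
```

For example, `LW r2, 0(r3)` (`0x8C620000`) followed by `ADD r4, r2, r5` (`0x00452020`) is flagged as a hazard, whereas the same load followed by `ADD r2, r3, r4` (`0x00641020`) is not, because r2 is only a destination there.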

Both CPU buses use the same cycles, but the code bus does not have interlocks, so it only needs its rd\_data input valid for one cycle rather than two. For generality, though, the above specs should be used for both buses until a refactor removes this undue complication.

In the present implementation of the CPU, the hazard detection logic (signal **load\_interlock**) has been commented out so the pipeline is interlocked for all load instructions. The logic is there but it has not been tested, which is why it is commented out.

### Write Cycles

Write cycles are more straightforward:

- Edge 1: Master drives **MOSI.addr** and **MOSI.wr\_data** and asserts **MOSI.wr\_be**.
- Edge 2: Master deasserts all **MOSI** signals.
- Edge 2: Slave asserts **MISO.mwait** if necessary for K cycles.
- Edge 2+K: Master remains stalled until this edge.



Figure 3: **ION Bus Write Cycles**

The master will remain stalled for as long as **MISO.mwait** is asserted.
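
On the slave side, completing a write amounts to merging **MOSI.wr\_data** into the addressed word under control of **MOSI.wr\_be**. The Python sketch below shows one way a behavioral memory model might do this, assuming bit N of wr\_be covers bits 8N+7..8N of the word (Table 1 states this for bit 0). The function name is hypothetical.

```python
def apply_write(old_word, wr_data, wr_be):
    """Merge wr_data into old_word on the byte lanes enabled by wr_be."""
    result = old_word & 0xFFFFFFFF
    for lane in range(4):
        if wr_be & (1 << lane):
            mask = 0xFF << (8 * lane)
            # Replace only this byte lane with the incoming data.
            result = (result & ~mask) | (wr_data & mask)
    return result & 0xFFFFFFFF
```

With `wr_be = 0` the stored word is left unchanged, which matches the idle state the master returns to at edge 2.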

The bus supports four 'byte lanes' as described in table 4.

*Table 4: ION BUS Byte Lanes*

| MOSI.wr_be | Byte Lane | Description                   |
|------------|-----------|-------------------------------|
| 1111       | 31 .. 0   | SW on address ending in 0100. |
| 1100       | 31 .. 16  | SH on address ending in 0010. |
| 0011       | 15 .. 0   | SH on address ending in 0000. |
| 1000       | 31 .. 24  | SB on address ending in 0011. |
| 0100       | 23 .. 16  | SB on address ending in 0010. |
| 0010       | 15 .. 8   | SB on address ending in 0001. |
| 0001       | 7 .. 0    | SB on address ending in 0000. |

Note that the CPU in its present state does not support unaligned word writes or reads – it relies on traps for those operations. These operations are to be implemented in the new version of the core.
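
The byte lane table can be summarized as a shift of a size-dependent mask by the two low address bits. The Python sketch below reproduces the rows of the table under that reading. The function name is hypothetical, and raising an exception on unaligned accesses is only a stand-in for the trap mentioned above.

```python
def store_byte_enables(size_bytes, addr):
    """Return the 4-bit wr_be value for a store of 1, 2 or 4 bytes."""
    if size_bytes not in (1, 2, 4):
        raise ValueError("size must be 1 (SB), 2 (SH) or 4 (SW)")
    offset = addr & 0x3
    if offset % size_bytes != 0:
        # Stand-in for the trap taken on unaligned accesses.
        raise ValueError("unaligned store")
    lanes = (1 << size_bytes) - 1   # 0b0001, 0b0011 or 0b1111
    return lanes << offset


if __name__ == "__main__":
    for size, addr in [(4, 0x4), (2, 0x2), (2, 0x0), (1, 0x3), (1, 0x0)]:
        print(size, hex(addr), format(store_byte_enables(size, addr), "04b"))
```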

### **Pending refactors for the new version of the CPU**

The read cycle interlock requirement unduly complicates the bus and should be fixed before going on with the new version of the CPU.

Specifically, the following refactors are necessary before starting the development of a new cache:

1. Add a control signal indicating an interlock so the bus slave knows it has to drive **MISO.rd\_data** for an extra cycle.
2. Alternatively, implement a fix for the interlock problem transparent to the bus slave.
3. Implement a properly decoded internal interlock signal (uncommenting the interlock logic), with a minimal test in the *opcodes* program.

Other refactors that would be desirable and need no radical changes to the core:

1. Unaligned loads and stores should be implemented.
2. The CPU need not stall in write cycles with wait states: it could go on with its business while the write operation completes, and only stall if it has to access data memory while **MISO.mwait** is asserted.

## 2.- Structural Description

What follows is a description of the implementation of some selected parts of the core. This section will cover the entire circuit in a later revision of this document.

### Core Structure

The **ion\_core** entity wraps the CPU itself (ion\_cpu) together with its caches and a TCM (Tightly Coupled Memory) block each for data and code:



Figure 5: **Core Interconnect**

Figure 5 depicts in detail the structure of the interconnect within the **ion\_core** entity.

All the named signals are ION buses unless otherwise noted, and are implemented as a **x\_miso/x\_mosi** pair. The arrow head points from the master to the slave.

There is an optional TCM block for data and another for code. Note that the code TCM is reachable from the data bus through an arbiter which imposes a 1 clock cycle delay penalty for the data bus and no delay for the code bus. The code TCM can also be initialized with constant data at FPGA loadup but in the current state of the code it can't be write protected.

The location of the TCM blocks in the address map is meant to be programmable through a CP0 register but in the current version it is hardwired to a fixed location useful only for testing.

The dashed boxes wrap the caches, which are meant to be optional – each cache can be omitted by setting its size to zero in the module generic list. In the current state of the code the caches are not implemented but the interconnect is.

Note that the various bus multiplexors and arbiters have been labelled with the entity name and color coded.

The interconnect interfaces the CPU to its caches and TCM memories using ION buses for simplicity, but the external interfaces are implemented as Wishbone for easier integration with existing IP.

### **Cache Wishbone Ports**

Each of the caches is the master of a Wishbone bus that it uses for refills. The precise behavior of this WB port is still to be defined: we might implement burst reads or some other features beyond the basic pipelined cycles. What seems certain is that the WB buses need only implement pipelined accesses, since they are meant for interfacing to internal peripherals such as memory controllers, and those are expected to be synchronous.

At any rate, the WB master functionality is supposed to be implemented within the cache itself so no need for a WB bridge is anticipated here.

The feature set supported by this port will be fully described in the core datasheet, since it is part of its external interface. No need to do so here.

### **Direct Wishbone Port**

There is a third WB port through which the CPU can access non-cached devices bypassing the caches. The core uses an ION-to-Wishbone bridge to provide this port.

The CPU does not support bursts in the ION bus so the WB bridge does not need to support them either. On the other hand, it will have to support cycle retries, if the standard mandates them. The features of this port are left to be specified in a later revision of this document.

The datasheet will describe the functionality of this port, since it is part of the core external interface.

**The above description is sketchy and incomplete, to be fleshed out in a later revision.**

### PC logic

Figure 6 shows a simplified diagram of the PC update logic. This logic handles the PC increments and the regular (non-exception) jumps and branches.

In the diagram, rectangles represent registers whereas green and yellow pointed shapes represent connections to other parts of the core.



Figure 6: **PC Update Logic**

As was to be expected, the PC logic is a plain loadable counter. The low 2 bits of PC are not stored in the 30-bit **p0\_pc\_reg** signal, so the counter increments by 1, not 4.

When no exception is going on, signal **p0\_pc\_reg** is unconditionally loaded with the new value; but the increment itself is conditional – the value added will be zero if the pipeline is stalled or an exception is pending. This is a slightly counterintuitive implementation detail that should be simplified.
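
The conditional increment can be sketched in a few lines of Python. This models only the increment path described above: the jump, branch and exception load paths are left out, and the function names are hypothetical. The register name **p0\_pc\_reg** and its 30-bit width are taken from the text.

```python
PC_MASK = (1 << 30) - 1   # p0_pc_reg holds PC[31..2]


def next_pc_reg(p0_pc_reg, stalled, exception_pending):
    """Next value of the 30-bit PC register on the increment path."""
    # The register is always reloaded; the increment is what is gated.
    increment = 0 if (stalled or exception_pending) else 1
    return (p0_pc_reg + increment) & PC_MASK


def byte_address(p0_pc_reg):
    """Rebuild the 32-bit byte address by restoring the two zero LSBs."""
    return (p0_pc_reg << 2) & 0xFFFFFFFF
```

Starting from the byte address `0xBFC00000`, one unstalled step yields `0xBFC00004`, that is, an increment of 1 in the register corresponds to 4 in the byte address. A stalled step leaves the register unchanged.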

The only omission in this diagram is that it does not include the logic that generates the EPC value – the value stored in EPC upon exceptions. This logic is slightly different from that of **CODE\_MOSI\_0.addr** and will be included in a final version of this document.

Also omitted is the logic that loads **p0\_pc\_reg** with the exception vector or the exception return value; again, this remains to be done.

## **3.- Instruction Execution**

---

In order to illustrate the internal working of the CPU, we will now explain the execution of a few representative instructions step by step as they traverse the pipeline.

All the instructions used as examples have been taken from a run of the opcodes test with the ion\_cpu\_tb.do simulation script. The execution context can be found in the program listing file.

Note that the listing uses the conventional register mnemonics, whereas in the examples the register index is used instead, for clarity.

### **BFC02A0: 00641020 ADD r2, r3, r4**

This instruction is one of the simplest ones; all it does is read two registers from the register file, add them and write the result into a third register. No exceptions will be

In waveform diagram 7, you can see that the simulation has been run with 3 wait cycles per code space memory access. Therefore, each pipeline stage takes 4 clock cycles instead of one.

Execution commences when the address for the instruction is placed in CODE\_MOSI.addr (1). As you see, the address is put on the bus one cycle before the PC register p0\_pc\_reg is updated at edge (3). Note that the PC register is “shifted left” by 2 in the chronogram, because it’s missing the two LSBs.

The code memory puts the opcode on CODE\_MISO.rd\_data after CODE\_MISO.mwait is deasserted, at (9).

At this point, the register bank read address signals for ports Rs and Rt, p0\_rs\_num and p0\_rt\_num, take their values directly from CODE\_MISO.rd\_data – they do not take their values from the IR register p1\_ir\_reg.
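
As a worked example, the opcode word `00641020` of this instruction can be split into its standard MIPS32 R-type fields. The Python sketch below does just that; the function name is hypothetical, and the comments tying the rs and rt fields to p0\_rs\_num and p0\_rt\_num follow the paragraph above.

```python
def decode_rtype(word):
    """Split a 32-bit MIPS32 R-type instruction word into its fields."""
    return {
        "opcode": (word >> 26) & 0x3F,
        "rs": (word >> 21) & 0x1F,     # what p0_rs_num would carry
        "rt": (word >> 16) & 0x1F,     # what p0_rt_num would carry
        "rd": (word >> 11) & 0x1F,
        "shamt": (word >> 6) & 0x1F,
        "funct": word & 0x3F,
    }


if __name__ == "__main__":
    print(decode_rtype(0x00641020))
```

For `0x00641020` this gives opcode 0 (SPECIAL), rs = 3, rt = 4, rd = 2 and funct = 0x20, which is `ADD r2, r3, r4`.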

This means that the register bank, implemented with synchronous memory, can be read at the same time the IR is loaded at (11), which saves one pipeline stage.

So from edge (1) to edge (11) the instruction is in its first pipeline stage.

At edge (11) the register outputs for Rs and Rt, p1\_rs and p1\_rt, take their correct values. At the same time, the ALU control signal p1\_ac, and other datapath control signals, take their values from the IR. The ALU does its work between edges (11) and (13). It must complete its work in a single cycle even if the pipeline is stalled, since the core does not have any multicycle paths.

The second stage of the pipeline for this instruction takes from edge (11) to edge (17); as explained, it takes 4 clock cycles rather than one only because of the 3 wait cycles of every code fetch in this simulation.

Signals stall\_pipeline and its delayed version pipeline\_stalled control the stalling of the pipeline, as can be imagined. Each one controls a different part of all the stages of the pipeline and their behavior is arguably the trickiest part of the system. They will be explained carefully in a later version of this document.

Finally, at edge (17), the instruction completes its third pipeline stage by updating the Rd register – as enabled by the register bank WE p1\_rbank\_we. The new value is available for the next instruction because the register bank implements “data forwarding” when necessary.
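
The idea behind data forwarding can be illustrated with a generic Python sketch: when a read targets the register that is being written in the same cycle, the reader is given the incoming write data instead of the stale stored value. This is only the general technique, not a description of how the ION register bank implements it. The class and method names are hypothetical.

```python
class ForwardingRegisterBank:
    """Generic 32-entry register bank with write-to-read forwarding."""

    def __init__(self):
        self.regs = [0] * 32

    def read(self, num, we=False, wr_num=0, wr_data=0):
        """Value seen by a reader while a write may be in flight."""
        if we and wr_num == num and num != 0:
            return wr_data & 0xFFFFFFFF   # forward the pending write
        return self.regs[num]

    def write(self, num, data):
        """Commit a write; register 0 always reads as zero."""
        if num != 0:
            self.regs[num] = data & 0xFFFFFFFF
```

In this model, reading r2 while a write of `0x1234` to r2 is pending returns `0x1234` even though the stored value has not been updated yet.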

Waveform Diagram 7: **Execution of “ADD” instruction**