

# IPA Project report

shaantanu kulkarni  
2019102031

The SEQ implementation of our processor was too slow and so we modified the design of our processor by adding registers between each stage and by using techniques such as forwarding , stalling and addition of bubbles to rectify the predicted PC.

The diagram below shows how each stage in connected with other using pipelining registers.



## Pipeline control logic

The figure below shows how the Pipeline control logic is connected with the pipelining registers.



My processor handles these four case with the help of this pipeline logic :

- Load/use hazards. The pipeline must stall for one cycle between an instruction that reads a value from memory and an instruction that uses this value.
- Processing ret. The pipeline must stall until the ret instruction reaches the write-back stage.
- Mispredicted branches. By the time the branch logic detects that a jump should not have been taken, several instructions at the branch target will have started down the pipeline. These instructions must be canceled, and fetching should begin at the instruction following the jump instruction.
- Exceptions. When an instruction causes an exception, we want to disable the updating of the programmer-visible state by later instructions and halt execution once the excepting instruction reaches the write-back stage.

# Individual stages of the pipelined architecture

## 1) Fetch and PC selection:



This stage, depending on `f_pc`, will read 10 bytes and decide how many bytes it actually has to read. On the basis of this decision, the Predict PC block predicts the location of the next instruction in the instruction memory.

The instruction memory is implemented by using a 8 X 2048 memory array and the values in the memory are initialized at the beginning of the first cycle of the processor.

```

module InstructionMemory(f_pc,f_ibyte,f_ibytes,imem_error);
    input [63:0] f_pc;
    output reg[7:0] f_ibyte;
    output reg[71:0] f_ibytes;
    output reg imem_error;

    reg [7:0] instruction_mem [2047:0];

    initial
        begin
            $readmemh("./rom.mem", instruction_mem);
            $display("init");
        end
    always @(f_pc)
    begin
        $display(instruction_mem[f_pc]);
        f_ibyte <= instruction_mem[f_pc];
        f_ibytes[71:64] <= instruction_mem[f_pc+1];
        f_ibytes[63:56] <= instruction_mem[f_pc+2];
        f_ibytes[55:48] <= instruction_mem[f_pc+3];
        f_ibytes[47:40] <= instruction_mem[f_pc+4];
        f_ibytes[39:32] <= instruction_mem[f_pc+5];
        f_ibytes[31:24] <= instruction_mem[f_pc+6];
        f_ibytes[23:16] <= instruction_mem[f_pc+7];
        f_ibytes[15:8] <= instruction_mem[f_pc+8];
        f_ibytes[7:0] <= instruction_mem[f_pc+9];

        imem_error <= (f_pc < 64'd0 || f_pc > 64'd2047 ) ? 1'b1:1'b0;
    end
endmodule

```

## 2) Decode Stage and Write back:



This stage, on the basis of the D pipelining registers, decides what value to pass to the E pipelined registers, ie,

- d\_ValA
- d\_ValB
- dstE
- dstM
- srcA
- srcB

Also, since there can be load/use hazards we need to add two more blocks namely

- Sel + Fwd A
- Fwd B

The block labeled “Sel+Fwd A” serves two roles. It merges the valP signal into the valA signal for later stages in order to reduce the amount of state in the pipeline register. It also implements the forwarding logic for source operand valA. Similar functionality is of Fwd B.

The merging of signals valA and valP exploits the fact that only the call and jump instructions need the value of valP.

The priority order for Sel + Fwd A is

| Data word | Register ID | Source description                          |
|-----------|-------------|---------------------------------------------|
| e_valE    | e_dstE      | ALU output                                  |
| m_valM    | M_dstM      | Memory output                               |
| M_valE    | M_dstE      | Pending write to port E in memory stage     |
| W_valM    | W_dstM      | Pending write to port M in write-back stage |
| W_valE    | W_dstE      | Pending write to port E in write-back stage |

### 3) Execute Stage:



The hardware units and the logic blocks are identical to those in SEQ architecture. We can see the signals  $e\_valE$  and  $e\_dstE$  directed toward the decode stage as one of the forwarding sources. One difference is that the logic labeled “Set CC,” which determines whether or not to update the condition codes, has signals  $m\_stat$  and  $W\_stat$  as inputs.

The “Cond” block decides whether  $e\_Cnd$  should be ‘1’ or ‘0’ on the basis of  $E\_ifun$  and  $CC$ .

### 4) Memory Stage:



This stage has the main component that is Data memory where data can be stored . The read and write wires determines which operation has to be followed and mem\_addr is the address of the memory location where data has to be written ie “M\_ValA” or read ,ie, to m\_ValM. It also passes the necessary data to the next pipeline registers.

# Set of Instructions supported by this processor

- rrmovq
- irmovq
- rmmovq
- mrmovq
- Opq
- jXX
- cmovXX
- pushq
- popq
- halt
- nop
- call
- ret

All the instruction in the Y86 instruction set are supported in the processor.  
Each of the stages have been tested on individual testbench before integrating  
all the modules together into one main module namely ‘cpu.v’ .  
The figure below shows the OPCODES of each of these instructions.

|                  |             |
|------------------|-------------|
| halt             | 0 0         |
| nop              | 1 0         |
| rrmovq rA, rB    | 2 0 rA rB   |
| irmovq V, rB     | 3 0 F rB V  |
| rmmovq rA, D(rB) | 4 0 rA rB D |
| mrmovq D(rB), rA | 5 0 rA rB D |
| Opq rA, rB       | 6 fn rA rB  |
| jXX Dest         | 7 fn Dest   |
| cmovXX rA, rB    | 2 fn rA rB  |
| call Dest        | 8 0 Dest    |
| ret              | 9 0         |
| pushq rA         | A 0 rA F    |
| popq rA          | B 0 rA F    |

# Testing the processor on standard algorithms

The processor has been tested on the following algorithms for calculating:

- HCF of two numbers
- Bubble sort

The following table shows the C++ code for the two algorithms

| Algorithm | HCF of two numbers                                                                                                                                                                                                                                                | Bubble Sort                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
|-----------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Code      | #include <iostream><br>using namespace std;<br>int main()<br>{<br>int n1, n2;<br>cout << "Enter two numbers:<br>";<br>cin >> n1 >> n2;<br>while(n1 != n2)<br>{<br>if(n1 > n2)<br>n1 -= n2;<br>else<br>n2 -= n1;<br>}<br>cout << "HCF = " << n1;<br>return 0;<br>} | #include<iostream><br>using namespace std;<br>int main ()<br>{<br>int i, j,temp,pass=0;<br>int a[10] = {10,2,0,14,43,25,18,1,5,45};<br>cout <<"Input list ...\\n";<br>for(i = 0; i<10; i++) {<br>cout <<a[i]<<"\\t";<br>}<br>cout<<endl;<br>for(i = 0; i<10; i++) {<br>for(j = i+1; j<10; j++) {<br>if(a[j] < a[i]) {<br>temp = a[i];<br>a[i] = a[j];<br>a[j] = temp;<br>}<br>}<br>pass++;<br>}<br>cout <<"Sorted Element List ...\\n";<br>for(i = 0; i<10; i++) {<br>cout <<a[i]<<"\\t";<br>}<br>cout <<"\\nNumber of passes taken to sort the<br>list:"<<pass<<endl;<br>return 0;<br>} |

## Simulation specific note:

- The output of each registers are displayed on the terminal after each clock cycle. This technique proved to be a good debugging tool.
- The Final state of the Data memory is outputted in ‘data\_out.mem’
- The initial state of the Date memory can be provided in ‘data.mem’ before the simulation is executed
- The instruction memory is loaded with instructions stored in ‘rom.mem’.

```
initial
begin
    $readmemh("./rom.mem", instruction_mem);
    $display("init");
end
```

# Assembly code and simulations

## HCF of two numbers:

The following assembly program calculates the HCF of two numbers , for the current test case we take it to be 56 and 76.

$$\text{HCF}(56,76) = 4.$$

```
█ HCF_assembly.txt
1    irmovq $56 %rax
2    irmovq $76 %rbx
3    check:
4        rrmovq %rbx %rcx
5        subq %rax %rbx
6        jg swap
7        jl subtract
8        halt
9    subtract:
10       subq %rax %rbx
11       jmp check
12    swap:
13       rrmovq %rax %rcx
14       rrmovq %rbx %rax
15       rrmovq %rcx %rbx
16       jmp subtract
```

The encoded HEX values

|                          |
|--------------------------|
| 30 F0 000000000000000038 |
| 30 F300000000000000062   |
| 20 31                    |
| 61 01                    |
| 72 000000000000000036    |
| 76 00000000000000002B    |
| 00                       |
| 61 03                    |
| 70 000000000000000014    |
| 20 01                    |
| 20 30                    |
| 20 13                    |
| 70 00000000000000002B    |

The complete hex codes is in 'HCF.mem'

- The corresponding HEX code can be found in 'HCF.mem'. During conversion to HEX code , the labels were removed and exact memory locations were hardcoded for the conditional/nonconditional jump instructions

As expected , the HCF value will be stored in both the registers rax and rbx (thats the condition for halt ,ie, when rax == rbx)

The figure below shows the values in rax and rbx

```
Current register status:
rax :          4   rsp:          0   r8 :          0   r12 :          0
rcx :          0   rbp:          0   r9 :          0   r13 :          0
rdx :          0   rsi:          0   r10 :          0   r14 :          0
rbx :          4   rdi:          0   r11 :          0   F   : XX
```

## *Outputs of GTKWAVE:*



The gtkwave output for complete simulation time ,ie, about 1500 ns.  
The clock value toggles after every 10ns and thus the timeperiod is 20ns.

We know that the conditional jump happens even before checking if the conditions were true. If it is found that the jump we took was a misprediction for the PC value, we generate bubble signal ,ie, basically NOPs in decode and execution because by the time the processor realises that it took wrong jump, two operations would have already been fetched by the fetch stage .

*In the assembly code, we are using four jump instructions, two conditional and two non-conditional.*

If the prediction was correct and the jump had to be taken, none of *E\_bubble* or *D\_bubble* signal is set ‘1’. and the processor continues with normal execution.

If the prediction was false,  $E_{bubble}$  and  $D_{bubble}$  are set true and the processor fixes its PC value and normal operation starts thereafter.

In our assembly code, the first jump condition for the very first time when is fetched, it compares %rax and %rbx and jumps if %rax > %rbx

Since this is not true for the first time,  $E_{\text{bubble}}$  and  $D_{\text{bubble}}$  will be generated.

The second jmp checks if  $\%rax > \$rbx$ , and since its true E\_bubble and D\_bubble wont be generated. This can be seen from the figure below.



Also , we can see the following NOP caused by the bubbles in following instruction.

We also see that the simulation is terminated when  $W_{stat} = SHLT$  (010 it is three bit register in this processor) and '0' in  $W_{icode}$ .



## Some other Test cases:

$$HCF(579,8764) = 1$$

```
Current register status:
rax :          1  rsp:          0  r8 :          0  r12 :
          0
rcx :          0  rbp:          0  r9 :          0  r13 :
          0
rdx :          0  rsi:          0  r10 :         0  r14 :
          0
rbx :          1  rdi:          0  r11 :         0  F   : XX
```

$$HCF(150,225) = 75$$

```
Current register status:
rax :          75  rsp:          0  r8 :          0  r12 :
          0
rcx :          0  rbp:          0  r9 :          0  r13 :
          0
rdx :          0  rsi:          0  r10 :         0  r14 :
          0
rbx :          75  rdi:          0  r11 :         0  F   : XX
```

## **Sorting of an array:**

The following program takes the number of elements in array and stores in %rcx and then uses bubble sort to sort the array stored in the Data memory initially and store it back in data memory.

|    | The encoded HEX values                 |
|----|----------------------------------------|
| 1  | 30 FB 00000000000000000000000000000000 |
| 2  | 30 F0 00000000000000000000000000000000 |
| 3  | 30 F3 00000000000000000000000000000000 |
| 4  | 50 13 00000000000000000000000000000000 |
| 5  | 30 F2 00000000000000000000000000000000 |
| 6  | 61 21                                  |
| 7  | rrmovq %rcx %rdx                       |
| 8  | rrmovq %rax %rbx                       |
| 9  | 20 12                                  |
| 10 | outloop:                               |
| 11 | subq %rdx %rcx                         |
| 12 | rrmovq %rcx %rdx                       |
| 13 | rrmovq %rax %rbx                       |
| 14 | 20 03                                  |
| 15 | inloop:                                |
| 16 | mrmovq 0(%rbx) %r8                     |
| 17 | mrmovq 1(%rbx) %r9                     |
| 18 | rrmovq %r9 %r10                        |
| 19 | subq %r8 %r10 jl .swap                 |
| 20 | 61 8A                                  |
| 21 | restore:                               |
| 22 | rrmovq %r8 0(rbx)                      |
| 23 | rrmovq %r9 0(rbx)                      |
| 24 | 72 00000000000000000000000000000000    |
| 25 | addq %r11 %rbx                         |
| 26 | subq %r11 %rdx                         |
| 27 | je .inloop                             |
| 28 | 61 B2                                  |
| 29 | subq %r11 %rcx                         |
| 30 | je .outloop                            |
| 31 | 61 72                                  |
| 32 | halt                                   |
| 33 | 61 B1                                  |
| 34 | swap:                                  |
| 35 | 61 71                                  |
| 36 | rrmovq %r8 %r10                        |
| 37 | rrmovq %r9 %r8                         |
| 38 | 76 00000000000000000000000000000000    |
| 39 | rrmovq %r10 %r9                        |
| 40 | 60 B3                                  |
| 41 | jmp .restore                           |
| 42 | 20 8A                                  |
| 43 | 20 98                                  |
| 44 | 20 A9                                  |
| 45 | 70 00000000000000000000000000000000    |
| 46 | 00                                     |
| 47 | 20 9A                                  |
| 48 | 76 00000000000000000000000000000000    |
| 49 | 059                                    |

- The corresponding HEX code can be found in ‘sorting.mem’. During conversion to HEX code , the labels were removed and exact memory locations were hardcoded for the conditional/nonconditional jump instructions

The input array that is loaded in the Data memory is stored in `data.mem` and the final state of the memory is stored in `data_out.mem`

### *INPUT array*

### *OUTPUT array*

|   | data.mem |
|---|----------|
| 1 | 00000006 |
| 2 | 00000004 |
| 3 | 00000003 |
| 4 | 00000002 |
| 5 | 00000001 |
| 6 | 0000000A |
| 7 | 00000000 |

```
data_out.mem
1 // 0x00000000
2 0000000000000006
3 0000000000000000
4 0000000000000001
5 0000000000000002
6 0000000000000003
7 0000000000000004
8 000000000000000a
9 xxxxxxxxxxxxxxxxxx
```

From the above screenshots we can see that the array is sorted properly and stored in memory back.

- The number of elements in array is stored at location 0x00000000 say n and the following n locations store the unsorted elements of the array
- The simulation executes ‘bubble sort’ to compare each consecutive element of this array and the swap them accordingly
- Time complexity  $n*(n-1) \sim O(n^2)$



The above figure is not the complete simulation time as it wasn't fitting in one frame. It shows first few cycles from the start.

- The M\_ValA stores the data that has to be written in the Data memory
  - mem\_addr has the address from where the data has to be read or where it has to be written
  - M\_dstM shows the register address for write back
  - We can see that mem\_addr ranges from 1-6 because our data is stored from 0x00000001 to 0x00000006
  - M\_dstM has values 8,9 because we are using %r8 and %r9 for comparing during sorting
- %rcx stores the number of elements and acts as counter for outer loop  
- %rdx stored inner loop counter  
- %r11 , %rax are used to store constant values  
- %rbx is the memory location pointer for the array



This is the later part of the same simulation showing E\_bubble,D\_bubble,CC(outf), W\_stat, read and write enables



*Some intermediate cycles of the same simulation*

### Some other test cases:

1)



*gtkwave output of intermediate cycles*



2)



*For further testing, the data input can be changed dynamically in data.mem and on running the simulation , the output of sorted array can be found in data\_out.mem.*

### ***Instructions to run the top module:***

- *The code that has to be tested needs to be written in rom.mem*
- *The HCF hexcodes are written in HCF.mem and that for sorting are in sorting.mem*
- *Copy paste the code into rom.mem for testing*
- *The commands to compile verilog module and show output are written in cpu.sh*
- *make sure you give necessary permissions to cpu.sh*
- *run ./cpu.sh*
- *You can see the outputs on gtkwave clearly*
- *For sorting algorithm , initialize the data.mem with input array in the following format*

*0x00000000 : n (number of elements)*

*0x00000001 : a1*

*0x00000002 : a2*

*.*

*.*

*0x(n)<sub>2</sub> : an*

*Store a\_is in hexadecimal format.*

### ***NOTE:***

*The Sequential architechture of the processor had also been implemented and can be found in ‘sequential folder’.*

*All the modules in sequential architechture were tested and then it was extended to pipelined architechture making the necessary changes.*

*Thank you!*