

# Architecture Report

---

## Team (6)

### Team members:

|                        |        |        |
|------------------------|--------|--------|
| Ahmed Waleed Ahmed     | Sec: 1 | BN: 8  |
| Ali Khaled Ali Mohamed | Sec: 1 | BN: 35 |
| Omar Ahmed Desoky      | Sec: 1 | BN: 37 |
| Mostafa Waleed         | Sec: 2 | BN: 24 |



# Full Design:

## Design overview :



1. **Prediction stage** : responsible for prefetching jump address to update PC in fetch stage with and taking prediction decision (taken/not taken) according to FSM
2. **Fetch stage** : responsible for fetching an instruction from instruction memory and write it in the buffer [IF/ID] .. also responsible for selecting which pc to read
3. **Decode stage** : responsible for decoding instruction fetched and producing corresponding control signals to later stages .. also responsible for reading from decoded registers and writing to register specified at last buffer [MEM/WB]
4. **Data hazard detection unit** : responsible for stalling the pipe in load use case data hazard also used to stall pipe in case of RET/RTI/INT/two-words instructions as LDM...
5. **Execution stage** : contains ALU and responsible for the main operation of the instruction according to control signals from decode
6. **Forwarding unit** : responsible for choosing ALU inputs (data in register directly or some data forwarded from a further stage) .. providing full forwarding to prevent normal data hazards
7. **Memory stage** : responsible for reading or writing to data memory and controlling stack pointer “SP” according to control signals of the current instruction
8. **WB stage** : is just a mux to choose between memory output and ALU output to write into the required register or PC

Now we will explain each stage in detail...

## 1. Prediction Stage :



Contains 4 internal blocks :

### 1.1 Mini decode :

Decodes the instruction to 3 decisions :

1. Not jump instruction → jump address will not be taken
2. “Jmp” instruction → jump address will be taken always
3. “Jz” instruction → jump address will be taken or not taken depending on FSM decision

Also responsible for enabling hazard detection unit in case of jump instructions and passing the code of src register to register selector and hazard unit

## **1.2 Hazard detection unit :**

Stalls the pipe in case of not ready data “will be available after execution or after memory” to take the jump as predicted with data consistency correct

## **1.3 Jump forwarding unit :**

Forwards data in src1 of jump instruction if it is available at further stage “not ready in register file to take”

Forwards data in case of RTI/Call/Ret directly from memory to PC

## **1.4 FSM block :**

Contains a standard 2 bits FSM with update input coming from decode state (comparison result) and input as “taken/not taken” decision

The state of FSM “the 2 bits” are distinct for every jump instruction and defined by the address of the instruction (address fetched & address in decode) and by having the address of the instruction FSM can recall the instruction 2 bits to make a decision or update its state

\*\*N.B. jump prediction hazards are enhanced with software solution from assembler by adding NOPs (2 or 3 depending on case) before jump instruction to solve hazards  
“software stall”

## 2. Fetch Stage :



Contains 4 internal blocks :

### 2.1 Instruction memory :

Contains compiled instructions as specified by assembler .. has 16-bit width and is implemented as an asynchronous read-only memory

## 2.2 PC selector :

Selects PC to enter the PC according to the following 1-to-1 mapping :

RST → memory location {0,1}

INT (2nd cycle) → memory location {2,3}

Take jump address → Calculated jump address

Take corrected jump address → jump corrected PC

PC wb → PC from WB

Otherwise → updated PC “from adder”

## 2.3 INT handler :

Responsible for handling interrupt correctly in the pipe with two cycles...

1st cycle →

- pass INT signal to decode to create control signals that let the memory save current pc and flags
- Stop pc from fetching new instruction (PC enable =0)

2nd cycle →

- PC will take memLocation {2,3} and fetch it
- PC wb will be disabled and keep like that for two more cycles preventing previous instructions before INT from writing in PC

## 2.4 PC save determine :

Contains a 64-bit shift register that keeps track for current PC alongside with the former one giving us the ability to choose which PC to be saved “propagate in the pipe till it reaches memory” .. in all cases PC saved should be the following PC to the current instruction “PC updated” except :

1. if we are in the middle of a two-words instruction → second half is useless without the first one .. so I will save the old PC “the first part of the instruction” to refetch again at RTI
2. If we are in the middle of any jump instruction → because we can't risk saving wrongly-predicted PC before its correction

### 3. Decode Stage :



Contains 3 internal blocks :

### **3.1 Register file :**

Contains 8 general-purpose 32-bit registers + 1 flag 3-bit register

Responsible for writing to correct register at rising edge of the clock due to a swap or write instruction

Responsible for reading the correct register at the falling edge of the clock “after being written to” according to src 1,2 in the IR of the instruction decoded

### **3.2 Control unit :**

The core of the decode stage and the mastermind of the whole processor...

Responsible for decoding the instruction to a bunch of control signals used in this stage and propagates to following stages as well

### **3.3 Comparator :**

Compares between the zero flag and the prediction result in case of Jz instruction

If they are equal → prediction is correct and nothing to do

If one is different than the other → flush next instruction “wrongly fetched” and fetch stage take the corrected address in the next cycle

#### 4. Data hazard detection unit :



Stalls the pipe through 3 tools and for 3 reasons :

Tools :

1. PC enb : freezes PC  $\rightarrow$  don't take a new value .. refetch the same PC again
2. IF/ID enb : freeze buffer and don't take the new instruction  $\rightarrow$  same instruction is kept in the buffer
3. Insert bubble : vanishes the current decoded instruction and flushes all control unit signals with zeros instead  $\rightarrow$  same as NOP

Reasons :

1. Stall for INT is risen → used to stall the pipe to let INT and RTI uses the memory for two cycles without losing an instruction (for PC and flags) .. also used in cases of Call/Ret to stall the pipe until data is ready to be written back in pc
2. Stall from jump prediction is raised and the jump was taken → data to jump into was not ready so we need to stall
3. Load use case in any instruction (except two-words instructions) → DST of previous instruction equals any of the sources of the current one + instruction executed require memory read .. so the data won't be available until next cycle

\*\*N.B. : two-words instructions don't need stalling as they are already done in two cycles if it has load use or hasn't

e.g. : pop R1 "memory operation"

LDI R1,5 "load use case"

↳ Don't cause hazards

## 5. Execution stage :



Contains 2 internal blocks :

### 5.1 ALU :

Do the main functionality of the operation through its 11 operations :

|      |     |            |     |
|------|-----|------------|-----|
| Add  | Sub | And        | OR  |
| Swap | Inc | Dec        | SHL |
| SHR  | Not | Do nothing |     |

And outputs 2 outputs depending on the operation and 3 flags (neg, zero, carry) where they are forwarded directly to the register file from the ALU

### 5.2 Buffer Holder 1 :

Used in case of INT/RTI to make the memory stage hold the control signals of the operation for two cycles .. enough to push/pop both PC and flags to/from memory

## 6. Forwarding unit :



Provides full forwarding for all ordinary and swap operations .. block provides forwarding from memory or ALU through two multiplexers in the execution stage to determine ALU source 1 and 2 locations and forward them or take the usual data from the register file

## 7. Memory stage :



Contains 4 internal blocks :

### 7.1 INT controller :

The gateway for memory to decide address and data

For Address : “ decided by a mux outside the block”

INT/RTI/Call/Ret → always SP

Normal instructions → EA/IMM

For Data :

INT → ALU output “flags” then PC

Call → PC

Normal instructions → ALU output

Otherwise → data is redundant as memory doesn't write

### 7.2 Data memory :

The core of the memory stage ..

Has length 32-bit and writes on the rising edge of the clock and is read asynchronously .. stack part starts at address `X"00000FA0"` in memory

### 7.3 Buffer holder 2 :

Used in case of RTI to manipulate WB signals to write in PC in the first cycle and write to flags in the second cycle but doing nothing to the signals otherwise

### 7.4 SP incrementer/decrementer :

The register has its own incrementer/decrementer that works according to 2bit signal SP enb which also controls which result to take

11 → as in POP/RTI/Ret .. decrement SP, take SP value after decrement, neglect EA/IMM as input for INT controller

10 → as in Push/INT/Call .. increment SP, take SP value before increment, neglect EA/IMM as input for INT controller

0X → for other instructions .. incrementer/decrementer do nothing, SP value is neglected as address memory operations

## 8. WB stage :



A combinational stage “not a real stage” .. passes output data and signals to all other stages and stages handle them by themselves

Output data to be written - in PC or register file - is either ALU output or memory output chosen according to WB signal “MEM OR REG” from the control unit

**Not Implemented:**

- Memory Cache system

**Not Working properly:**

Everything is working properly.

## Test Cases Analysis:

In General, in order to get the right value, The value must be in WB stage, while the current instruction is at most in the decode stage.

Our Design fetches any immediate or effective values in **2 clocks**, the first clock we fetch (half instruction [fetching Registers]), and the 2<sup>nd</sup> clock we forward the value (2<sup>nd</sup> half instruction) from fetch to decode.

Our Data Memory, Its width is 32bit so it takes **1 clock** to save in Memory, While Instruction Memory is still 16bit takes **2 clocks** to fetch any Imm or effective address.

Stack pointers, Starts at 0000FA0.

To make it easier to understand the solution easily, we made an excel sheet which provides steps for each clock, so by running the do file along with the excel sheet you get the expected value cycle by cycle.

## 1. One Operand:

- V1 (No Forwarding/Hazard Detection/Flushing):



Hazards:

1. Inc R1: R1 takes the initial 0 value instead of FFFFF FFFF as NOT R1 still didn't write back its value to make it 1 which is not right.



2. NOT R2: R2 takes the initial 0 value instead of 10 as in R2 still didn't write back its value to make it FFFF FFFF which is not right.



3. Dec R2: R2 takes the value of 10 which is not the correct output of NOT R2 to make it 9 which is not right.



4. Out R1: R1 takes the value of 5 which is before Incrementing it ( 6 )



5. Out R2: R2 takes the value of 5 which is before Incrementing it ( 6 )



These Hazards can be solved by Adding No-operation before each Data Hazard to give it time to write back and then take the correct result in the hazardous operation.

## Solution:

| Current values | Instructions | F D E M W   | True Values              |
|----------------|--------------|-------------|--------------------------|
| R1 = FFFF FFFF | NOT R1       | F D E M W   | R1 = FFFF FFFF           |
|                | NOP          | F D E M W   |                          |
|                | NOP          | F D E M W   |                          |
| R1 = 0         | Inc R1       | F D E M W   | R1 = 0                   |
| R1 = 5         | in R1        | F D E M W   | R1 = 5                   |
| R2 = 10        | in R2        | F D E M W   | R2 = 10                  |
|                | NOP          | F D E M W   |                          |
|                | NOP          | F D E M W   |                          |
|                | NOT R2       | F D E M W   | R2 = FFFF FFEE           |
| R1 = 6         | Inc R1       | F D E M W   | R1 = 6                   |
|                | NOP          | F D E M W   |                          |
| R2 = FFFF FFEE | Dec R2       | F D E M W   | R2 = FFFF FFEE           |
|                | NOP          | F D E M W   |                          |
| R1 = 6         | OUT R1       | F F D E M W | R1 = 6                   |
| R2 = FFFF FFEE | OUT R2       | F F D E M W | R2 = FFFF FFEE           |
|                |              |             |                          |
|                |              |             | Data Inconsistency       |
|                |              |             | NOP & Data is consistent |

### 1. Inc R1:



### 2. NOT R2:



### 3. Dec R2:



### 4. Out R1:



### 5. Out R2:



- **V2 (With Forwarding):**

No Hazards appeared in the Test Case, and all outputs are right.

| Current values | Instructions |           | True Result    |
|----------------|--------------|-----------|----------------|
| R1 = FFFF FFFF | NOT R1       | F D E M W | R1 = FFFF FFFF |
|                | NOP          | F D E M W | --             |
| R1 = 0         | inc R1       | F D E M W | R1 = 0         |
| R1 = 5         | in R1        | F D E M W | R1 = 5         |
| R2 = 10        | in R2        | F D E M W | R2 = 10        |
| R2 = FFFF FFEE | NOT R2       | F D E M W | R2 = FFFF FFEE |
| R1 = 6         | inc R1       | F D E M W | R1 = 6         |
| R2 = FFFF FFEE | Dec R2       | F D E M W | R2 = FFFFFFFE  |
| R1 = 6         | out R1       | F D E M W | R1 = 6         |
| R2 = FFFF FFEE | out R2       | F D E M W | R2 = FFFFFFFE  |

### 1. Inc R1:



## 2. NOT R2:



### 3. DEC R2:



### 4. OUT R1:



### 5. OUT R2:



- V3 (With Forwarding/Hazard Detection):

Similar results as V2

- V4 (With Forwarding/Hazard Detection/Flushing):

Similar results as V2

## 2. Two Operands:

- V1 (No Forwarding/Hazard Detection/Flushing):

## Hazards:

1. As we can see in the screen shot we found that the instruction **SUB R5,R4,R6** , we need to get the right value of R4 and in order to do that **Add R1,R4,R4** must be in WB stage, so we read a wrong value
  2. Instruction **SHL R2,3**, data is read wrong as we can see in the screenshot, the decode comes earlier than WB
  3. The last 2 instructions (**SWAP R2,R5** and **ADD R5,R2,R2** ) we also read wrong values and both happen as decode comes earlier than WB.



## The Corresponding VHDL Sim:

- Executed Instructions: (in R1      ->      OR R2,R1,R1)
- Executed Instructions: (SHL R2,2    ->    end of instructions )





Solution:



## The Corresponding VHDL Sim:

- Executed Instructions: (in R1      ->      in R4)
- Executed Instructions: (ladd R3,R5,2 ->      SUB R5,R4,R6 )
- Executed Instructions: ( NOP      ->      SWAP R2,R5)
- Executed Instructions: ( NOP      -> end of instructions )







- **V2 (With Forwarding):**

In the given test case, there are no Load use cases, so we will never need to stall as our forwarding unit forward from both ALU and Memory.

So this test case will work without any problem.



## The Corresponding VHDL Sim:

- Executed Instructions: (in R1      ->      in R4)
- Executed Instructions: (IADD R3,R5,2      ->      SHL R2,2 )
- Executed Instructions: (SHR R2,3    ->    end of instructions )





- **V3 (With Forwarding/Hazard Detection):**

Similar results as V2

- **V4 (With Forwarding/Hazard Detection/Flushing):**

Similar results as V2

### 3. Memory:

- **V1 (No Forwarding/Hazard Detection/Flushing):**

As we can see in the screen shot we found that, **PUSH R1** instruction needs value of R1 to be F5 which is loaded in R1 from LDM R1,F5 . and this happens as decode comes earlier than WB.

Also **STD R2, 200** instruction needs to get value from POP R2.

| Current values | Instructions | Memory State        | Register State                               |
|----------------|--------------|---------------------|----------------------------------------------|
|                | In R2        | F D E M W           | R2 = 0CDAFE19                                |
|                | In R3        | F D E M W           | R3 = FFFF                                    |
|                | In R4        | F D E M W           | R4 = F320                                    |
| R1 = 0         | LDM R1,F5    | F D D E M W         | R1 = F5                                      |
| R2 = 0CDAFE19  | PUSH R1      | F F D E M W         | M[F9F] = 0                                   |
|                | PUSH R2      | F F D E M W         | M[F9F] = 0CDAFE19                            |
|                | POP R1       | F F D E M W         | R1 = M[F9F] = 0CDAFE19                       |
|                | POP R2       | F F D E M W         | R2 = M[FA0]=0                                |
| R2 = 0CDAFE19  | STD R2, 200  | F F D D E M W       | M[200] = R2 = 0CDAFE19                       |
| R1 = 0CDAFE19  | STD R1 , 202 | F F F D D E M W     | M[202] = R1 = 0CDAFE19                       |
|                | LDI R3 , 202 | F F F F D D E M W   | R3 = M[202] = 0CDAFE19                       |
|                | LDI R4 , 200 | F F F F F D D E M W | R4 = M[200] = 0CDAFE19                       |
|                |              |                     | SP = F9F<br>SP = F9E<br>SP = F9F<br>SP = FA0 |

Data Inconsistency  
 Data Dependency  
 Final value In WB (W)  
 Final value In MEM (M)

Memory View:

## Solution:

| Current values | Instructions | F D E M W           | R2 = 0CDAFE19     | R3 = FFFF              | R4 = F320      | R1 = F5 |                        |  |
|----------------|--------------|---------------------|-------------------|------------------------|----------------|---------|------------------------|--|
|                | In R2        | F D E M W           |                   |                        |                |         |                        |  |
|                | In R3        | F D E M W           |                   | R3 = FFFF              |                |         |                        |  |
|                | In R4        | F D E M W           |                   |                        | R4 = F320      |         |                        |  |
|                | LDM R1,F5    | F D D E M W         |                   |                        |                | R1 = F5 |                        |  |
|                | NOP          | F F D E M W         |                   |                        |                |         |                        |  |
|                | NOP          | F F D E M W         |                   |                        |                |         |                        |  |
| R1 = F5        | PUSH R1      | F F D E M W         | M[FA0] = F5       |                        |                |         |                        |  |
| R2 = 0CDAFE19  | PUSH R2      | F F D E M W         | M[F9F] = 0CDAFE19 |                        |                |         |                        |  |
|                | POP R1       | F F D E M W         |                   | R1 = M[F9F] = 0CDAFE19 |                |         |                        |  |
|                | POP R2       | F F D E M W         |                   |                        | R2 = M[FA0]=F5 |         |                        |  |
|                | NOP          | F F D E M W         |                   |                        |                |         |                        |  |
|                | NOP          | F F D E M W         |                   |                        |                |         |                        |  |
| R2 = F5        | STD R2 ,200  | F F D D E M W       |                   |                        |                |         | M[200] = R2 =F5        |  |
| R1 = 0CDAFE19  | STD R1 ,202  | F F F D D E M W     |                   |                        |                |         | M[202] = R1 = 0CDAFE19 |  |
|                | LDI R3 ,202  | F F F F D D E M W   |                   |                        |                |         | R3 = M[202] = 0CDAFE19 |  |
|                | LDI R4 ,200  | F F F F F D D E M W |                   |                        |                |         | R4 = M[200] = F5       |  |

## Memory View:

|          |          |
|----------|----------|
| 00000f89 | XXXXXXXX |
| 00000f8a | XXXXXXXX |
| 00000f8b | XXXXXXXX |
| 00000f8c | XXXXXXXX |
| 00000f8d | XXXXXXXX |
| 00000f8e | XXXXXXXX |
| 00000f8f | XXXXXXXX |
| 00000f90 | XXXXXXXX |
| 00000f91 | XXXXXXXX |
| 00000f92 | XXXXXXXX |
| 00000f93 | XXXXXXXX |
| 00000f94 | XXXXXXXX |
| 00000f95 | XXXXXXXX |
| 00000f96 | XXXXXXXX |
| 00000f97 | XXXXXXXX |
| 00000f98 | XXXXXXXX |
| 00000f99 | XXXXXXXX |
| 00000f9a | XXXXXXXX |
| 00000f9b | XXXXXXXX |
| 00000f9c | XXXXXXXX |
| 00000f9d | XXXXXXXX |
| 00000f9e | XXXXXXXX |
| 00000f9f | 0CDAFE19 |
| 00000fa0 | 000000F5 |
| 00000fa1 | XXXXXXXX |
| 00000fa2 | XXXXXXXX |
| 00000fa3 | XXXXXXXX |
| 00000fa4 | XXXXXXXX |
| 00000fa5 | XXXXXXXX |
| 00000fa6 | XXXXXXXX |
| 00000fa7 | XXXXXXXX |
| 00000fa8 | XXXXXXXX |
| 00000fa9 | vvvvvvvv |

| Memory Data - /processor/memory/DATA_MEMORY/ram - Default |          |
|-----------------------------------------------------------|----------|
| 000001f7                                                  | XXXXXXXX |
| 000001f8                                                  | XXXXXXXX |
| 000001f9                                                  | XXXXXXXX |
| 000001fa                                                  | XXXXXXXX |
| 000001fb                                                  | XXXXXXXX |
| 000001fc                                                  | XXXXXXXX |
| 000001fd                                                  | XXXXXXXX |
| 000001fe                                                  | XXXXXXXX |
| 000001ff                                                  | XXXXXXXX |
| 00000200                                                  | 000000F5 |
| 00000201                                                  | XXXXXXXX |
| 00000202                                                  | 0CDAFE19 |
| 00000203                                                  | XXXXXXXX |
| 00000204                                                  | XXXXXXXX |
| 00000205                                                  | XXXXXXXX |
| 00000206                                                  | XXXXXXXX |
| 00000207                                                  | XXXXXXXX |
| 00000208                                                  | XXXXXXXX |
| 00000209                                                  | XXXXXXXX |
| 0000020a                                                  | XXXXXXXX |
| 0000020b                                                  | XXXXXXXX |
| 0000020c                                                  | XXXXXXXX |
| 0000020d                                                  | XXXXXXXX |
| 0000020e                                                  | XXXXXXXX |
| 0000020f                                                  | XXXXXXXX |
| 00000210                                                  | XXXXXXXX |
| 00000211                                                  | XXXXXXXX |
| 00000212                                                  | XXXXXXXX |
| 00000213                                                  | XXXXXXXX |
| 00000214                                                  | XXXXXXXX |
| 00000215                                                  | XXXXXXXX |
| 00000216                                                  | XXXXXXXX |
| 00000217                                                  | XXXXXXXX |

- **V2 (With Forwarding):**

For this Test case No hazard occurs as for Pop R2 and STD R2 , 200 because we fetch instruction in 2 cycles so we stall for 1 cycle in decode, and hazard detection unit is not tested but if the test case was POP R2 then Inc R2, then hazard detection will work, you can strong test cases in test case folder.

So this test case will work without any problem.

| Current values | Instructions | F D E M W              | R2 = 0CDAFE19          |  |          |
|----------------|--------------|------------------------|------------------------|--|----------|
|                | In R2        | F D E M W              | R2 = 0CDAFE19          |  |          |
|                | In R3        | F D E M W              | R3 = FFFF              |  |          |
|                | In R4        | F D E M W              | R4 = F320              |  |          |
|                | LDM R1,F5    | F D D E M W            | R1 = F5                |  |          |
| R1 = 0         | PUSH R1      | F F D E M W            | M[F90] = F5            |  |          |
| R2 = 0CDAFE19  | PUSH R2      | F F D E M W            | M[F9F] = 0CDAFE19      |  | SP = F9F |
|                | POP R1       | F F D E M W            | R1 = M[F9F] = 0CDAFE19 |  | SP = F9E |
|                | POP R2       | F F D E M W            | R2 = M[FA0] = F5       |  | SP = F9F |
| R2 = F5        | STD R2, 200  | F F D D E M W          | M[200] = R2 = F5       |  | SP = FA0 |
| R1 = 0CDAFE19  | STD R1, 202  | F F F D D E M W        | M[202] = R1 = 0CDAFE19 |  |          |
|                | LDI R3 , 202 | F F F F D D E M W      | R3 = M[202] = 0CDAFE19 |  |          |
|                | LDI R4 , 200 | F F F F F D D E M W    | R4 = M[200] = F5       |  |          |
|                |              |                        |                        |  |          |
|                |              | Final value In WB (W)  |                        |  |          |
|                |              | Final value In MEM (M) |                        |  |          |

## Memory View:

|          |            |
|----------|------------|
| 00000f93 | XXXXXXXXXX |
| 00000f94 | XXXXXXXXXX |
| 00000f95 | XXXXXXXXXX |
| 00000f96 | XXXXXXXXXX |
| 00000f97 | XXXXXXXXXX |
| 00000f98 | XXXXXXXXXX |
| 00000f99 | XXXXXXXXXX |
| 00000f9a | XXXXXXXXXX |
| 00000f9b | XXXXXXXXXX |
| 00000f9c | XXXXXXXXXX |
| 00000f9d | XXXXXXXXXX |
| 00000f9e | XXXXXXXXXX |
| 00000f9f | 0CDAFE19   |
| 00000fa0 | 000000F5   |
| 00000fa1 | XXXXXXXXXX |
| 00000fa2 | XXXXXXXXXX |
| 00000fa3 | XXXXXXXXXX |
| 00000fa4 | XXXXXXXXXX |
| 00000fa5 | XXXXXXXXXX |
| 00000fa6 | XXXXXXXXXX |
| 00000fa7 | XXXXXXXXXX |
| 00000fa8 | XXXXXXXXXX |
| 00000fa9 | XXXXXXXXXX |
| 00000faa | XXXXXXXXXX |
| 00000fab | XXXXXXXXXX |
| 00000fac | XXXXXXXXXX |
| 00000fad | XXXXXXXXXX |
| 00000fae | XXXXXXXXXX |
| 00000faf | XXXXXXXXXX |
| 00000fb0 | XXXXXXXXXX |
| 00000fb1 | XXXXXXXXXX |
| 00000fb2 | XXXXXXXXXX |
| 00000fb3 | VVVVVVVV   |

|          |            |
|----------|------------|
| 000001f3 | XXXXXXXXXX |
| 000001f4 | XXXXXXXXXX |
| 000001f5 | XXXXXXXXXX |
| 000001f6 | XXXXXXXXXX |
| 000001f7 | XXXXXXXXXX |
| 000001f8 | XXXXXXXXXX |
| 000001f9 | XXXXXXXXXX |
| 000001fa | XXXXXXXXXX |
| 000001fb | XXXXXXXXXX |
| 000001fc | XXXXXXXXXX |
| 000001fd | XXXXXXXXXX |
| 000001fe | XXXXXXXXXX |
| 000001ff | XXXXXXXXXX |
| 00000200 | 000000F5   |
| 00000201 | XXXXXXXXXX |
| 00000202 | 0CDAFE19   |
| 00000203 | XXXXXXXXXX |
| 00000204 | XXXXXXXXXX |
| 00000205 | XXXXXXXXXX |
| 00000206 | XXXXXXXXXX |
| 00000207 | XXXXXXXXXX |
| 00000208 | XXXXXXXXXX |
| 00000209 | XXXXXXXXXX |
| 0000020a | XXXXXXXXXX |
| 0000020b | XXXXXXXXXX |
| 0000020c | XXXXXXXXXX |
| 0000020d | XXXXXXXXXX |
| 0000020e | XXXXXXXXXX |
| 0000020f | XXXXXXXXXX |
| 00000210 | XXXXXXXXXX |
| 00000211 | XXXXXXXXXX |
| 00000212 | XXXXXXXXXX |
| ~~~~~13  | ~~~~~13    |

### The Corresponding VHDL Sim:

- Executed Instructions: (in R2      ->      LDM R1,F5)
- Executed Instructions: (PUSH R1    ->    STD R1,202 )
- Executed Instructions: (LDD R3,202 ->    end of instructions )





- **V3 (With Forwarding/Hazard Detection):**

Similar results as V2

- **V4 (With Forwarding/Hazard Detection/Flushing):**

Similar results as V2

#### 4. Branching:

It has been tested on the final Project (Project-V4 Folder) and here is the Results:

I will make the test in several ways and will try to analyze the behavior as much as possible.

- Without trying any interrupt inside the code.



Here we can see inputs saved in the registers and the push instruction is executed also and the results will be given in the next image



## We start our stack from FA0

Then we will jump to R1 = 30



In this image we can see that pc\_out\_memory\_in (executed in fetch stage) = 30 right now which means jmp R1 executed then we will make add to set Zero flag =1. And pc will 100 as a wrong prediction then will be 101 but all these instructions are flushed and then will get the right one in the execution stage 51 to continue work



In the previous images we can see NOT R5 and INC R5 and then we received the value of R6 from in port then we used software solution with the data hazards detection unit because this unit in the decode we make all jz related work in fetch so willn't work but we implement an another in the fetch but we notice that it can't cover all cases so we add the software solution. Jump to 200 where we will pop R6 as we can see then call wil store pc in stack in FA0.

| Instance      | Value             |
|---------------|-------------------|
| /processor/fi | XXXXXXXXXX        |
| /processor/d  | XXXXXXXXXX        |
| /processor/d  | XXXXXXXXXX        |
| /processor/d  | XXXXXXXXXX        |
| /processor/lv | XXXXXXXXXX        |
| /processor/p  | XXXXXXXXXX        |
|               | 00000E90          |
|               | 00000E91          |
|               | 00000E92          |
|               | 00000E93          |
|               | 00000E94          |
|               | 00000E95          |
|               | 00000E96          |
|               | 00000E97          |
|               | 00000E98          |
|               | 00000E99          |
|               | 00000E9a          |
|               | 00000E9b          |
|               | 00000E9c          |
|               | 00000E9d          |
|               | 00000E9e          |
|               | 00000E9f          |
|               | 00000fa0 00000202 |
|               | 00000fa1          |
|               | 00000fa2          |
|               | 00000fa3          |
|               | XXXXXXXXXX        |



We can see that when we RET from .ORG 300 we make a pop and all the operations in the return is down and R6 increment.

- This time we will try the 1\_st INT only.



We can see that we INT is raised we start from next cycle to fetch him (1000 -> 1100) and then we make the value M[3,2] enter the pipe as PC to got executed then we make stall one cycle to be able to push pc and flags in the stack

| Instance     |                   |
|--------------|-------------------|
| /processor/f | 00000f95 XXXXXXXX |
| /processor/d | 00000f96 XXXXXXXX |
| /processor/d | 00000f97 XXXXXXXX |
| /processor/d | 00000f98 XXXXXXXX |
| /processor/h | 00000f99 XXXXXXXX |
| /processor/p | 00000f9a XXXXXXXX |
|              | 00000f9b XXXXXXXX |
|              | 00000f9c XXXXXXXX |
|              | 00000f9d XXXXXXXX |
|              | 00000f9e 00000031 |
|              | 00000f9f 00000002 |
|              | 00000fa0 00000300 |
|              | 00000fa1 XXXXXXXX |
|              | 00000fa2 XXXXXXXX |
|              | 00000fa3 XXXXXXXX |
|              | 00000fa4 XXXXXXXX |
|              | 00000fa5 XXXXXXXX |
|              | 00000fa6 XXXXXXXX |

Note bit 1 = zero flag.



ADD, OUT, RTI are done and then we return back JZ R2 and will jump directly to 50 then JZ R3 will jump as taken but we will know in the execution stage that it's not and correct the prediction. **Note we start with weak taken**



| Memory Data - /processor/memory/DATA_Mem |                   |
|------------------------------------------|-------------------|
| Instance                                 | XXXXXXXX          |
| /processor/f                             | 00000f95 XXXXXXXX |
| /processor/d                             | 00000f96 XXXXXXXX |
| /processor/d                             | 00000f97 XXXXXXXX |
| /processor/N                             | 00000f98 XXXXXXXX |
| /processor/p                             | 00000f99 XXXXXXXX |
|                                          | 00000f9a XXXXXXXX |
|                                          | 00000f9b XXXXXXXX |
|                                          | 00000f9c XXXXXXXX |
|                                          | 00000f9d XXXXXXXX |
|                                          | 00000f9e 00000031 |
|                                          | 00000f9f 00000002 |
|                                          | 00000fa0 00000202 |
|                                          | 00000fa1 XXXXXXXX |
|                                          | 00000fa2 XXXXXXXX |
|                                          | 00000fa3 XXXXXXXX |
|                                          | 00000fa4 XXXXXXXX |
|                                          | 00000fa5 XXXXXXXX |
|                                          | 00000fa6 XXXXXXXX |
|                                          | 00000fa7 XXXXXXXX |

We leave old values as the sp pointer will make over write on them.



- With 2 INT in the same time.



| Instance      |            |
|---------------|------------|
| /processor/fi | 00000f97   |
| /processor/d  | XXXXXXXXXX |
| /processor/d  | 00000f99   |
| /processor/n  | XXXXXXXXXX |
| /processor/p  | 00000f9b   |
|               | XXXXXXXXXX |
|               | 00000f9c   |
|               | XXXXXXXXXX |
|               | 00000f9d   |
|               | XXXXXXXXXX |
|               | 00000f9e   |
|               | 00000031   |
|               | 00000f9f   |
|               | 00000002   |
|               | 00000fa0   |
|               | 00000300   |
|               | XXXXXXXXXX |



Here second INT start right after call so we ignore call and can't keep it because we keep the instruction called with the INT so CALL willn't be executed here and we will do the int and then rti and then inc R6 so it will be 301 instead of 401 and we can see that we didn't store any thing for call in stack to there no unneeded values here.



## 5. Branch Prediction:

We took the incremental way to build our pipe so we make the main pipe as a first step then add forwarding then data hazards detection units then we add the modules implemented to the Branch and Prediction at the same time so we add the branch on the pipe that contains the forwarding and data hazard detection unit so it's hard to merge these parts again with the main pipe and the pipe with forwarding and the pipe with data hazard detection unit as it will consume us a lot of time. So in this part, I'm gonna write a detailed analysis of the cases when we will need a forwarding and when we will need a data hazard detection unit as I can, and then I will show you the final solution with the prediction units integrated with the pipe.

- V1 (No Forwarding/Hazard Detection/Flushing):

In this part =>

```
.ORG 10
LDM R2,0A #R2=0A
LDM R0,0 #R0=0
LDM R1,50 #R1=50
LDM R3,20 #R3=20
LDM R4,2 #R4=2
JMP R3 #Jump to 20
```

Every LDM will be separated into two instructions one for the instruction itself and the other one will be for the data and we execute this in code as we make the instruction enter to the decode stage and fetch the value then hold the pipe and make the ID/EX freeze some of its registers and let the others writable to update then with the write IMM/EA. Then we make the control unit out the proposed signal for LDM and continue in the pipe

|   |   |   |   |   |    |
|---|---|---|---|---|----|
| F | D | D | E | M | WB |
|---|---|---|---|---|----|

So we can guarantee that we have the write data when we are entering the execution. And here we don't have any hazards here. So it will work properly.

```
.ORG 20
SUB R0,R2,R5 #check if R0 = R2
JZ R1 #jump if R0=R2 to 50
ADD R4,R4,R4 #R4 = R4*2
OUT R4
INC R0
JMP R3 #jump to 20
```

In this part of the code, we will need a forwarding unit for OUT R4 because it is out DST of the previous instruction. So it will work properly by the forwarding unit or we can use NOP.

The view in case of using NOP

|   |   |   |   |    |                         |
|---|---|---|---|----|-------------------------|
| F | D | E | M | WB | ADD R4,R4,R4 #R4 = R4*2 |
|   |   |   |   |    | NOP                     |
|   |   |   |   |    | NOP                     |
|   |   |   |   |    | OUT R4                  |

But in case of forwarding unit and this is the case that we will run on it the test file will be like



But before we go to the next part from the code I want to explain how JZ is working in our design to make the results more obvious.

As we can see in the following Figure after we fetch the JZ it will enter the decode stage only if it's SRC is available because of the hazard detection unit and we also implement another forwarding unit for the branch to get the data as fast as possible. When JZ enters the decode we will assume that this JZ weak taken so will fetch the instruction that is pointed to by the SRC and it will be PC = 50 and then this PC will enter decode and the PC = 51 will be fetched and now the JZ in executing and determine the PC will change to PC = 22 instruction after JZ or will continue to 52 if we will not continue we will flush the pipe.

In the next cycle (decode stage) the prediction will know that it will not be taken ,and start to move it from weak taken to weak not taken. After that loop will continue and make it move to Strong not taken then remain until loop end and prediction unit finds out that this will be taken in this time so will move it from strong not taken to weak not taken.

```
.ORG 50
LDM R0,0 #R0=0
```



LDM R2,8 #R2=8

LDM R3,60 #R3=60

LDM R4,3 #R4=3

JMP R3 #jump to 60

In this part of code, We have no hazards and the JMP will be executed properly without any problem.

.ORG 60

ADD R4,R4,R4 #R4 = R4\*2

OUT R4

INC R0

AND R0,R2,R5 #when R0 < R2(8) answer will be zero

JZ R3 #jump if R0 < R2 to 60

INC R4

OUT R4

In this part, we will have 2 hazards

1- ADD R4,R4,R4 #R4 = R4\*2

OUT R4

2- INC R4

OUT R4

In both, we will handle them as we discussed earlier, And JZ also will work as shown previously.

- **V2 (With Forwarding):**

Already we illustrate this case in version 1.

- **V3 (With Forwarding/Hazard Detection):**

No need for a data hazard detection unit here as there is no load use case in this test case.

- **V4 (With Forwarding/Hazard Detection/Flushing):**

In this case, we will be like V1 but we should illustrate that in v1 we still don't have a jump module but here is the current version that contains jump and prediction.

In the following part, we will show values in the register file and the waveform.

In the following figure, we can see values in registers from LDM instructions and JMP R3 was done and we got the wrong decision to take it as taken JZ R1 but in the next figure you will see that PC will be corrected.



|          |   |
|----------|---|
| 00000015 | 2 |
| 00000016 | 2 |
| 00000017 | 2 |
| 00000018 | 2 |
| 00000019 | 2 |
| 0000001a | 2 |
| 0000001b | 2 |
| 0000001c | 2 |
| 0000001d | 2 |
| 0000001e | 2 |
| 0000001f | 2 |
| 00000020 | 2 |
| 00000021 | 1 |
| 00000022 | 2 |
| 00000023 | 2 |
| 00000024 | 2 |
| 00000025 | 2 |
| 00000026 | 2 |
| 00000027 | 2 |

We initially make all instructions have a weak taken value. Here prediction detected that it will not taken so move it from 2 (weak taken) to 1 (weak not taken).



Here prediction predicts that it will not be taken so we go to 22 instead of going to 50 then 51 then return back to 22 and will move it from 1 (weak not taken) to 0 (strong not taken). And in the previous figure also SUB R0,R2,R5 is done and R4 also updated and out value in The Interval from 1900ps to 2000ps and R0 incremented and the loop will continue till R2 == R0 and then will jump to 50.

|          | Msgs |
|----------|------|
| 00000015 | 2    |
| 00000016 | 2    |
| 00000017 | 2    |
| 00000018 | 2    |
| 00000019 | 2    |
| 0000001a | 2    |
| 0000001b | 2    |
| 0000001c | 2    |
| 0000001d | 2    |
| 0000001e | 2    |
| 0000001f | 2    |
| 00000020 | 2    |
| 00000021 | 0    |
| 00000022 | 3    |









In the previous figure we can notice that we predict the JZ as not taken and then correct our prediction to go to 50 and move this jump from 0 (Strong not taken) to 1 (weak not taken)

|          | 2 |
|----------|---|
| 00000017 | 2 |
| 00000018 | 2 |
| 00000019 | 2 |
| 0000001a | 2 |
| 0000001b | 2 |
| 0000001c | 2 |
| 0000001d | 2 |
| 0000001e | 2 |
| 0000001f | 2 |
| 00000020 | 2 |
| 00000021 | 1 |
| 00000022 | 2 |
| 00000023 | 2 |
| 00000024 | 2 |
| 00000025 | 2 |



Here we predicted this jump as weak taken and our prediction is true so we update its state from 2 (weak taken) to 3 (strong taken).

|               |          |   |
|---------------|----------|---|
| /processor/de | 0000005b | 2 |
| /processor/de | 0000005c | 2 |
| /processor/Mr | 0000005d | 2 |
| /processor/pr | 0000005e | 2 |
|               | 0000005f | 2 |
|               | 00000060 | 3 |
|               | 00000061 | 2 |
|               | 00000062 | 2 |
|               | 00000063 | 2 |
|               | 00000064 | 3 |
|               | 00000065 | 2 |
|               | 00000066 | 2 |
|               | 00000067 | 2 |
|               | 00000068 | 2 |
|               | 00000069 | 2 |
|               | 0000006a | 2 |
|               | 0000006b | 2 |
|               | 0000006c | 2 |
|               | 0000006d | 2 |
|               | 0000006e | 2 |
|               | 0000006f | 2 |





|                 |          |   |
|-----------------|----------|---|
|                 | 0000000b | 2 |
| ◆ /processor/de | 0000005c | 2 |
| ◆ /processor/de | 0000005d | 2 |
| ◆ /processor/M  | 0000005e | 2 |
| ◆ /processor/pf | 0000005f | 2 |
|                 | 00000060 | 3 |
|                 | 00000061 | 2 |
|                 | 00000062 | 2 |
|                 | 00000063 | 2 |
|                 | 00000064 | 3 |
|                 | 00000065 | 2 |
|                 | 00000066 | 2 |
|                 | 00000067 | 2 |
|                 | 00000068 | 2 |
|                 | 00000069 | 2 |
|                 | 0000006a | 2 |
|                 | 0000006b | 2 |
|                 | 0000006c | 2 |
|                 | 0000006d | 2 |
|                 | 0000006e | 2 |
|                 | 000000ff | 2 |



Here we can see out value before it be written in the register file which mean forwarding unit works properly and the loop will continue as we will see in the figures until  $R0 < R2(8)$



Here we end the second loop and correct the last prediction from strong taken to weak taken



Here we are finishing our program as the data inside R4 incremented by 1 and appears in the out and data saved in the register file.