

# Assignment 1

EE599 – Accelerated Computing Using FPGA

Spring 2020

GitHub Repo Link: [https://github.com/Aditya-Pharande/EE599_Pharande_2739814954](https://github.com/Aditya-Pharande/EE599_Pharande_2739814954)

By

Aditya Rajendra Pharande

2739814954

[pharande@usc.edu](mailto:pharande@usc.edu)

## Task 1: Odd-Even Transposition Sort

Odd-even transposition sort is a parallel sorting algorithm. It sorts n elements in n clock cycles (assuming n is even), each of which requires at most n/2 compare-exchange operations. The algorithm alternates between two phases, called the odd phase and the even phase.

Let  $\langle a_1, a_2, \dots, a_n \rangle$  be the sequence to be sorted. During the odd phase, elements with odd indices are compared with their right neighbors, and if they are out of sequence they are exchanged; thus, the pairs  $(a_1, a_2), (a_3, a_4), \dots, (a_{n-1}, a_n)$  are compare-exchanged (assuming n is even). Similarly, during the even phase, elements with even indices are compared with their right neighbors, and if they are out of sequence they are exchanged; thus, the pairs  $(a_2, a_3), (a_4, a_5), \dots, (a_{n-2}, a_{n-1})$  are compare-exchanged. After n phases of odd-even exchanges, the sequence is sorted.
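The phases described above can be sketched in software as follows. This is a minimal sequential model, not the hardware design from this report; the function name `odd_even_transposition_sort` and the sample input are assumptions made for illustration.

```python
def odd_even_transposition_sort(values):
    """Sort a sequence in ascending order using n odd/even phases."""
    a = list(values)
    n = len(a)
    for phase in range(n):
        # Odd phase: 1-based pairs (a1,a2), (a3,a4), ... start at 0-based index 0.
        # Even phase: 1-based pairs (a2,a3), (a4,a5), ... start at 0-based index 1.
        start = 0 if phase % 2 == 0 else 1
        for i in range(start, n - 1, 2):
            # Compare-exchange: swap neighbors that are out of sequence.
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a


print(odd_even_transposition_sort([3, 2, 3, 8, 5, 6, 4, 1]))
# [1, 2, 3, 3, 4, 5, 6, 8]
```

Within one phase the compared pairs do not overlap, which is why all compare-exchange operations of a phase can run in parallel in hardware.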

Example:



Figure 1: Example of odd-even transposition sort

Question & Example taken from homework description.

## I. 16 Elements Odd Even Sort Unit:

### Simulation Result:

To check the worst case, an array sorted in descending order is given as input, since the Odd-Even Sort designed in this project arranges elements in ascending order:

Given Input = [ 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1]

Expected Output = [1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16] after 16 clock cycles.

The done signal is set once sorting has finished. The design uses 8 compare units and recirculates the data every clock, so the same hardware is reused in the next clock.
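A clock-level software model of this recirculating arrangement might look like the sketch below. It is an illustration under stated assumptions, not the RTL from the repository: the names `compare_swap`, `clock_tick`, and `simulate` are hypothetical, and the done flag is assumed to come from a simple phase counter that reaches n.

```python
def compare_swap(x, y):
    """Model of one compare-swap unit: returns (smaller, larger)."""
    return (x, y) if x <= y else (y, x)


def clock_tick(regs, odd_phase):
    """One clock: compare units process disjoint neighbor pairs and the
    results are written back into the registers (recirculated)."""
    nxt = list(regs)
    start = 0 if odd_phase else 1
    for i in range(start, len(regs) - 1, 2):
        nxt[i], nxt[i + 1] = compare_swap(regs[i], regs[i + 1])
    return nxt


def simulate(data):
    """Run n clocks, alternating odd and even phases; return (registers, done)."""
    regs = list(data)
    n = len(regs)
    clocks = 0
    while clocks < n:
        regs = clock_tick(regs, odd_phase=(clocks % 2 == 0))
        clocks += 1
    done = clocks == n  # assumed: done is raised by a phase counter reaching n
    return regs, done


regs, done = simulate([16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1])
print(regs, done)
```

For 16 elements the odd phase uses all 8 compare units, while the even phase compares only 7 pairs, so one unit is idle in that clock in this model.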



## Elaborated Schematics:



Figure 1: Basic Compare Swap Unit Schematics



Figure 2: Schematics of 8 Compare Swap units to sort 16 elements.



Figure 1: Schematics of 16 element odd even sort unit

## Synthesized Schematics:



Figure 2: Compare Sort Unit Synthesized Schematics



Figure 3: Synthesized Schematics of Compare Sort Array to sort 16 elements



Figure 4: Synthesized Schematics of Odd Even Sort Module for 16 elements

## Used Resource Estimation:

### 1. Slice Logic

| Site Type             | Used | Fixed | Available | Util% |
|-----------------------|------|-------|-----------|-------|
| Slice LUTs*           | 308  | 0     | 63400     | 0.49  |
| LUT as Logic          | 308  | 0     | 63400     | 0.49  |
| LUT as Memory         | 0    | 0     | 19000     | 0.00  |
| Slice Registers       | 263  | 0     | 126800    | 0.21  |
| Register as Flip Flop | 263  | 0     | 126800    | 0.21  |
| Register as Latch     | 0    | 0     | 126800    | 0.00  |
| F7 Muxes              | 0    | 0     | 31700     | 0.00  |
| F8 Muxes              | 0    | 0     | 15850     | 0.00  |

### 4. IO and GT Specific

| Site Type                   | Used | Fixed | Available | Util%  |
|-----------------------------|------|-------|-----------|--------|
| Bonded IOB                  | 258  | 0     | 210       | 122.86 |
| Bonded IPADs                | 0    | 0     | 2         | 0.00   |
| PHY_CONTROL                 | 0    | 0     | 6         | 0.00   |
| PHASER_REF                  | 0    | 0     | 6         | 0.00   |
| OUT_FIFO                    | 0    | 0     | 24        | 0.00   |
| IN_FIFO                     | 0    | 0     | 24        | 0.00   |
| IDELAYCTRL                  | 0    | 0     | 6         | 0.00   |
| IBUFDS                      | 0    | 0     | 202       | 0.00   |
| PHASER_OUT/PHASER_OUT_PHY   | 0    | 0     | 24        | 0.00   |
| PHASER_IN/PHASER_IN_PHY     | 0    | 0     | 24        | 0.00   |
| IDELAYE2/IDELAYE2_FINEDELAY | 0    | 0     | 300       | 0.00   |
| ILOGIC                      | 0    | 0     | 210       | 0.00   |
| OLOGIC                      | 0    | 0     | 210       | 0.00   |

### 7. Primitives

| Ref Name | Used | Functional Category |
|----------|------|---------------------|
| FDRE     | 263  | Flop & Latch        |
| LUT5     | 182  | LUT                 |
| LUT3     | 137  | LUT                 |
| OBUF     | 129  | IO                  |
| IBUF     | 129  | IO                  |
| LUT4     | 65   | LUT                 |
| LUT6     | 56   | LUT                 |
| CARRY4   | 8    | CarryLogic          |
| LUT1     | 3    | LUT                 |
| BUFG     | 1    | Clock               |

## Timing Estimation:

### Design Timing Summary

| Setup                                | Hold                             | Pulse Width                                       |
|--------------------------------------|----------------------------------|---------------------------------------------------|
| Worst Negative Slack (WNS): 6.162 ns | Worst Hold Slack (WHS): 0.176 ns | Worst Pulse Width Slack (WPWS): 4.500 ns          |
| Total Negative Slack (TNS): 0.000 ns | Total Hold Slack (THS): 0.000 ns | Total Pulse Width Negative Slack (TPWS): 0.000 ns |
| Number of Failing Endpoints: 0       | Number of Failing Endpoints: 0   | Number of Failing Endpoints: 0                    |
| Total Number of Endpoints: 519       | Total Number of Endpoints: 519   | Total Number of Endpoints: 264                    |

All user specified timing constraints are met.

## Power Estimation:

Power estimation from Synthesized netlist. Activity derived from constraints files, simulation files or vectorless analysis. Note: these early estimates can change after implementation.

**Total On-Chip Power:** 0.225 W  
**Design Power Budget:** Not Specified  
**Power Budget Margin:** N/A  
**Junction Temperature:** 26.0°C  
 Thermal Margin: 59.0°C (12.8 W)  
 Effective θJA: 4.6°C/W  
 Power supplied to off-chip devices: 0 W  
 Confidence level: Low  



## II. 32 Elements Odd Even Sort Unit:

## Elaborated Schematics:



Figure 1: Basic Compare Swap Unit Schematics



Figure 2: Schematics of 16 Compare Swap units to sort 32 elements.



Figure 5: Schematics of 32 element odd even sort unit

## Synthesized Schematics:



Figure 6: Compare Sort Unit Synthesized Schematics



Figure 7: Synthesized Schematics of Compare Sort Array to sort 32 elements



*Figure 8: Synthesized Schematics of Odd Even Sort Module for 32 elements*

## Used Resource Estimation:

### 1. Slice Logic

| Site Type             | Used | Fixed | Available | Util% |
|-----------------------|------|-------|-----------|-------|
| Slice LUTs*           | 633  | 0     | 63400     | 1.00  |
| LUT as Logic          | 633  | 0     | 63400     | 1.00  |
| LUT as Memory         | 0    | 0     | 19000     | 0.00  |
| Slice Registers       | 519  | 0     | 126800    | 0.41  |
| Register as Flip Flop | 519  | 0     | 126800    | 0.41  |
| Register as Latch     | 0    | 0     | 126800    | 0.00  |
| F7 Muxes              | 0    | 0     | 31700     | 0.00  |
| F8 Muxes              | 0    | 0     | 15850     | 0.00  |

### 4. IO and GT Specific

| Site Type                   | Used | Fixed | Available | Util%  |
|-----------------------------|------|-------|-----------|--------|
| Bonded IOB                  | 514  | 0     | 210       | 244.76 |
| Bonded IPADs                | 0    | 0     | 2         | 0.00   |
| PHY_CONTROL                 | 0    | 0     | 6         | 0.00   |
| PHASER_REF                  | 0    | 0     | 6         | 0.00   |
| OUT_FIFO                    | 0    | 0     | 24        | 0.00   |
| IN_FIFO                     | 0    | 0     | 24        | 0.00   |
| IDELAYCTRL                  | 0    | 0     | 6         | 0.00   |
| IBUFDS                      | 0    | 0     | 202       | 0.00   |
| PHASER_OUT/PHASER_OUT_PHY   | 0    | 0     | 24        | 0.00   |
| PHASER_IN/PHASER_IN_PHY     | 0    | 0     | 24        | 0.00   |
| IDELAYE2/IDELAYE2_FINEDELAY | 0    | 0     | 300       | 0.00   |
| ILOGIC                      | 0    | 0     | 210       | 0.00   |
| OLOGIC                      | 0    | 0     | 210       | 0.00   |

### 7. Primitives

| Ref Name | Used | Functional Category |
|----------|------|---------------------|
| FDRE     | 519  | Flop & Latch        |
| LUT5     | 369  | LUT                 |
| OBUF     | 257  | IO                  |
| LUT3     | 257  | LUT                 |
| IBUF     | 257  | IO                  |
| LUT4     | 145  | LUT                 |
| LUT6     | 126  | LUT                 |
| CARRY4   | 16   | CarryLogic          |
| LUT2     | 1    | LUT                 |
| LUT1     | 1    | LUT                 |
| BUFG     | 1    | Clock               |

## Timing Estimation:

### Design Timing Summary

| Setup                                | Hold                             | Pulse Width                                       |
|--------------------------------------|----------------------------------|---------------------------------------------------|
| Worst Negative Slack (WNS): 6.160 ns | Worst Hold Slack (WHS): 0.146 ns | Worst Pulse Width Slack (WPWS): 4.500 ns          |
| Total Negative Slack (TNS): 0.000 ns | Total Hold Slack (THS): 0.000 ns | Total Pulse Width Negative Slack (TPWS): 0.000 ns |
| Number of Failing Endpoints: 0       | Number of Failing Endpoints: 0   | Number of Failing Endpoints: 0                    |
| Total Number of Endpoints: 1036      | Total Number of Endpoints: 1036  | Total Number of Endpoints: 520                    |

All user specified timing constraints are met.

## Power Estimation:

### Summary

Power estimation from Synthesized netlist. Activity derived from constraints files, simulation files or vectorless analysis. Note: these early estimates can change after implementation.

|                                     |                      |
|-------------------------------------|----------------------|
| <b>Total On-Chip Power:</b>         | <b>0.366 W</b>       |
| <b>Design Power Budget:</b>         | <b>Not Specified</b> |
| <b>Power Budget Margin:</b>         | <b>N/A</b>           |
| <b>Junction Temperature:</b>        | <b>26.7°C</b>        |
| Thermal Margin:                     | 58.3°C (12.7 W)      |
| Effective θJA:                      | 4.6°C/W              |
| Power supplied to off-chip devices: | 0 W                  |
| Confidence level:                   | Low                  |



## III. 64 Elements Odd Even Sort Unit:

## Elaborated Schematics:



Figure 1: Basic Compare Swap Unit Schematics

Figure 2: Schematics of 32 Compare Swap units to sort 64 elements.



Figure 9: Schematics of 64 element odd even sort unit



## Synthesized Schematics:



*Figure 10: Compare Sort Unit Synthesized Schematics*



*Figure 11: Synthesized Schematics of Compare Sort Array to sort 64 elements*



*Figure 12: Part of Synthesized Schematics of Odd Even Sort Module for 64 elements*

## Used Resource Estimation:

### 1. Slice Logic

| Site Type             | Used | Fixed | Available | Util% |
|-----------------------|------|-------|-----------|-------|
| Slice LUTs*           | 1270 | 0     | 63400     | 2.00  |
| LUT as Logic          | 1270 | 0     | 63400     | 2.00  |
| LUT as Memory         | 0    | 0     | 19000     | 0.00  |
| Slice Registers       | 1040 | 0     | 126800    | 0.82  |
| Register as Flip Flop | 1040 | 0     | 126800    | 0.82  |
| Register as Latch     | 0    | 0     | 126800    | 0.00  |
| F7 Muxes              | 0    | 0     | 31700     | 0.00  |
| F8 Muxes              | 0    | 0     | 15850     | 0.00  |

### 4. IO and GT Specific

| Site Type                   | Used | Fixed | Available | Util%  |
|-----------------------------|------|-------|-----------|--------|
| Bonded IOB                  | 1026 | 0     | 210       | 488.57 |
| Bonded IPADs                | 0    | 0     | 2         | 0.00   |
| PHY_CONTROL                 | 0    | 0     | 6         | 0.00   |
| PHASER_REF                  | 0    | 0     | 6         | 0.00   |
| OUT_FIFO                    | 0    | 0     | 24        | 0.00   |
| IN_FIFO                     | 0    | 0     | 24        | 0.00   |
| IDELAYCTRL                  | 0    | 0     | 6         | 0.00   |
| IBUFDS                      | 0    | 0     | 202       | 0.00   |
| PHASER_OUT/PHASER_OUT_PHY   | 0    | 0     | 24        | 0.00   |
| PHASER_IN/PHASER_IN_PHY     | 0    | 0     | 24        | 0.00   |
| IDELAYE2/IDELAYE2_FINEDELAY | 0    | 0     | 300       | 0.00   |
| ILOGIC                      | 0    | 0     | 210       | 0.00   |
| OLOGIC                      | 0    | 0     | 210       | 0.00   |

### 7. Primitives

| Ref Name | Used | Functional Category |
|----------|------|---------------------|
| FDRE     | 1040 | Flop & Latch        |
| LUT5     | 756  | LUT                 |
| LUT3     | 525  | LUT                 |
| OBUF     | 513  | IO                  |
| IBUF     | 513  | IO                  |
| LUT4     | 257  | LUT                 |
| LUT6     | 250  | LUT                 |
| CARRY4   | 32   | CarryLogic          |
| LUT1     | 1    | LUT                 |
| BUFG     | 1    | Clock               |

## Timing Estimation:

### Design Timing Summary

| Setup                                | Hold                             | Pulse Width                                       |
|--------------------------------------|----------------------------------|---------------------------------------------------|
| Worst Negative Slack (WNS): 6.162 ns | Worst Hold Slack (WHS): 0.142 ns | Worst Pulse Width Slack (WPWS): 4.500 ns          |
| Total Negative Slack (TNS): 0.000 ns | Total Hold Slack (THS): 0.000 ns | Total Pulse Width Negative Slack (TPWS): 0.000 ns |
| Number of Failing Endpoints: 0       | Number of Failing Endpoints: 0   | Number of Failing Endpoints: 0                    |
| Total Number of Endpoints: 2064      | Total Number of Endpoints: 2064  | Total Number of Endpoints: 1041                   |

All user specified timing constraints are met.

## Power Estimation:

### Summary

Power estimation from Synthesized netlist. Activity derived from constraints files, simulation files or vectorless analysis. Note: these early estimates can change after implementation.

|                                     |                      |
|-------------------------------------|----------------------|
| <b>Total On-Chip Power:</b>         | <b>0.657 W</b>       |
| <b>Design Power Budget:</b>         | <b>Not Specified</b> |
| <b>Power Budget Margin:</b>         | <b>N/A</b>           |
| <b>Junction Temperature:</b>        | <b>28.0°C</b>        |
| Thermal Margin:                     | 57.0°C (12.4 W)      |
| Effective θJA:                      | 4.6°C/W              |
| Power supplied to off-chip devices: | 0 W                  |
| Confidence level:                   | Low                  |




## IV. 128 Elements Odd Even Sort Unit:

## Elaborated Schematics:



*Figure 1: Basic Compare Swap Unit Schematics*

*Figure 2: Schematics of 64 Compare Swap units to sort 128 elements.*

*Figure 13: Schematics of 128 element odd even sort unit*



## Synthesized Schematics:



*Figure 14: Compare Sort Unit Synthesized Schematics*



*Figure 15: Synthesized Schematics of Compare Sort Array to sort 128 elements*



*Figure 16: Parts of Synthesized Schematics of Odd Even Sort Module for 128 elements*

## Used Resource Estimation:

### 1. Slice Logic

| Site Type             | Used | Fixed | Available | Util% |
|-----------------------|------|-------|-----------|-------|
| Slice LUTs*           | 2574 | 0     | 63400     | 4.06  |
| LUT as Logic          | 2574 | 0     | 63400     | 4.06  |
| LUT as Memory         | 0    | 0     | 19000     | 0.00  |
| Slice Registers       | 2074 | 0     | 126800    | 1.64  |
| Register as Flip Flop | 2074 | 0     | 126800    | 1.64  |
| Register as Latch     | 0    | 0     | 126800    | 0.00  |
| F7 Muxes              | 0    | 0     | 31700     | 0.00  |
| F8 Muxes              | 0    | 0     | 15850     | 0.00  |

### 4. IO and GT Specific

| Site Type                   | Used | Fixed | Available | Util%  |
|-----------------------------|------|-------|-----------|--------|
| Bonded IOB                  | 2050 | 0     | 210       | 976.19 |
| Bonded IPADs                | 0    | 0     | 2         | 0.00   |
| PHY_CONTROL                 | 0    | 0     | 6         | 0.00   |
| PHASER_REF                  | 0    | 0     | 6         | 0.00   |
| OUT_FIFO                    | 0    | 0     | 24        | 0.00   |
| IN_FIFO                     | 0    | 0     | 24        | 0.00   |
| IDELAYCTRL                  | 0    | 0     | 6         | 0.00   |
| IBUFDS                      | 0    | 0     | 202       | 0.00   |
| PHASER_OUT/PHASER_OUT_PHY   | 0    | 0     | 24        | 0.00   |
| PHASER_IN/PHASER_IN_PHY     | 0    | 0     | 24        | 0.00   |
| IDELAYE2/IDELAYE2_FINEDELAY | 0    | 0     | 300       | 0.00   |
| ILOGIC                      | 0    | 0     | 210       | 0.00   |
| OLOGIC                      | 0    | 0     | 210       | 0.00   |

### 7. Primitives

| Ref Name | Used | Functional Category |
|----------|------|---------------------|
| FDRE     | 2074 | Flop & Latch        |
| LUT5     | 1530 | LUT                 |
| LUT3     | 1046 | LUT                 |
| OBUF     | 1025 | IO                  |
| IBUF     | 1025 | IO                  |
| LUT4     | 522  | LUT                 |
| LUT6     | 507  | LUT                 |
| CARRY4   | 64   | CarryLogic          |
| LUT1     | 1    | LUT                 |
| BUFG     | 1    | Clock               |

## Timing Estimation:

### Design Timing Summary

| Setup                                | Hold                             | Pulse Width                                       |
|--------------------------------------|----------------------------------|---------------------------------------------------|
| Worst Negative Slack (WNS): 6.002 ns | Worst Hold Slack (WHS): 0.144 ns | Worst Pulse Width Slack (WPWS): 4.500 ns          |
| Total Negative Slack (TNS): 0.000 ns | Total Hold Slack (THS): 0.000 ns | Total Pulse Width Negative Slack (TPWS): 0.000 ns |
| Number of Failing Endpoints: 0       | Number of Failing Endpoints: 0   | Number of Failing Endpoints: 0                    |
| Total Number of Endpoints: 4122      | Total Number of Endpoints: 4122  | Total Number of Endpoints: 2075                   |

All user specified timing constraints are met.

## Power Estimation:

Power estimation from Synthesized netlist. Activity derived from constraints files, simulation files or vectorless analysis. Note: these early estimates can change after implementation.

|                                     |                      |
|-------------------------------------|----------------------|
| <b>Total On-Chip Power:</b>         | <b>1.226 W</b>       |
| <b>Design Power Budget:</b>         | <b>Not Specified</b> |
| <b>Power Budget Margin:</b>         | <b>N/A</b>           |
| <b>Junction Temperature:</b>        | <b>30.6°C</b>        |
| Thermal Margin:                     | 54.4°C (11.8 W)      |
| Effective θJA:                      | 4.6°C/W              |
| Power supplied to off-chip devices: | 0 W                  |
| Confidence level:                   | Low                  |




## Task 2: Dense Matrix-Matrix Multiplication

### 2.1 Scalable Multiply and adder tree

Consider two matrices A and B, each having the size of  $n \times n$  where  $n = 2^r$ .

The figure below shows an example design of a multiply and adder tree. The tree consists of a multiplication step followed by adder steps. Given that the matrices are of size  $n \times n$ , there are  $n$  multipliers in the first stage. Assume that matrix A is stored in row order and matrix B is stored in column order in memory. At the start, the first row of A and the first column of B are loaded and multiplied element by element. Then, in each adder step, partial sums are added together until the final result, corresponding to one element of the output matrix, is produced. The adder steps consist of 2-input adders, as shown in Figure 2. Note that the multiply and adder tree is pipelined: each time a row of A and a column of B pass through the pipe, one element of the output matrix is produced.
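The data flow through the tree can be sketched in software as shown here. This is a minimal functional model that assumes n is a power of two; the function name `multiply_adder_tree` and the sample vectors are illustrative and do not come from the homework description.

```python
def multiply_adder_tree(row, col):
    """Dot product of one row of A and one column of B, computed as a
    multiplication step followed by log2(n) adder steps.
    Returns (result, number_of_adder_levels)."""
    n = len(row)
    if n != len(col) or n == 0 or n & (n - 1) != 0:
        raise ValueError("both vectors must have the same power-of-two length")

    # Multiplication step: n multipliers work in parallel.
    level = [a * b for a, b in zip(row, col)]

    # Adder steps: each level of 2-input adders halves the number of partial sums.
    adder_levels = 0
    while len(level) > 1:
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
        adder_levels += 1
    return level[0], adder_levels


print(multiply_adder_tree([1, 2, 3, 4], [5, 6, 7, 8]))
# (70, 2)
```

With this structure, 4 multipliers need 2 adder levels, 8 need 3, 16 need 4, and 32 need 5, which matches the tree depths of the designs reported in the sections that follow.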

Example:



Figure 3: Example multiply and adder tree design

Question & Example taken from homework description.

## I. 4\*4 Matrix Multiplication

### Simulation Result:

To check the design, two 4\*4 matrices of randomized inputs were generated in the testbench (TB). The done signal is set once the module starts producing results. The generated matrices and the resulting output matrix are as follows.

| Matrix 1 |     |    |    | Matrix 2 |     |     |     | Output Matrix |       |       |       |
|----------|-----|----|----|----------|-----|-----|-----|---------------|-------|-------|-------|
| 103      | 42  | 31 | 97 | 98       | 123 | 55  | 16  | 15126         | 21078 | 19901 | 17223 |
| 30       | 100 | 49 | 30 | 35       | 86  | 107 | 54  | 11207         | 16296 | 16033 | 11375 |
| 47       | 12  | 5  | 90 | 93       | 64  | 17  | 35  | 6121          | 9743  | 12504 | 12915 |
| 110      | 21  | 14 | 63 | 7        | 29  | 95  | 126 | 13258         | 18059 | 14520 | 11322 |
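The simulated output can be cross-checked against a plain software matrix product. The sketch below copies the values from the table above; the helper name `matmul` is an assumption for illustration.

```python
A = [
    [103, 42, 31, 97],
    [30, 100, 49, 30],
    [47, 12, 5, 90],
    [110, 21, 14, 63],
]
B = [
    [98, 123, 55, 16],
    [35, 86, 107, 54],
    [93, 64, 17, 35],
    [7, 29, 95, 126],
]
expected = [
    [15126, 21078, 19901, 17223],
    [11207, 16296, 16033, 11375],
    [6121, 9743, 12504, 12915],
    [13258, 18059, 14520, 11322],
]


def matmul(a, b):
    """Reference product: each output element is a row of a times a column of b."""
    columns = list(zip(*b))  # column-order view of b
    return [[sum(x * y for x, y in zip(row, col)) for col in columns] for row in a]


result = matmul(A, B)
print(result == expected)
# True
```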



The done signal goes high when the output starts to appear.
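The timing of the first output can be illustrated with a small pipeline model. It assumes one register stage after the multipliers and one after each adder level, which may differ from the actual RTL; the generator name `pipelined_mac` and the input vectors are hypothetical.

```python
def pipelined_mac(pairs, n):
    """Feed one (row, col) pair per clock into a multiply and adder tree
    pipeline; yield (clock, value) for every result leaving the last stage."""
    adder_levels = n.bit_length() - 1      # log2(n) for a power-of-two n
    depth = 1 + adder_levels               # multiply stage + adder stages
    pipe = [None] * depth                  # pipe[k]: register after stage k
    stream = list(pairs) + [None] * (depth - 1)  # extra clocks drain the pipe

    for clock, item in enumerate(stream, start=1):
        # Update from the last stage backwards so each stage reads the
        # value its predecessor held during the previous clock.
        for k in range(depth - 1, 0, -1):
            prev = pipe[k - 1]
            pipe[k] = None if prev is None else [
                prev[i] + prev[i + 1] for i in range(0, len(prev), 2)
            ]
        pipe[0] = None if item is None else [a * b for a, b in zip(*item)]
        if pipe[-1] is not None:
            yield clock, pipe[-1][0]


pairs = [([1, 2, 3, 4], [5, 6, 7, 8]), ([1, 1, 1, 1], [2, 2, 2, 2])]
print(list(pipelined_mac(pairs, 4)))
# [(3, 70), (4, 8)]
```

In this model the first result appears after 1 + log2(n) clocks, which is where a done signal would go high, and one further result follows on every clock after that.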

## Elaborated Schematics:



Figure 1: Basic MAC Unit



Figure 2: Stage 1 Array of 4 multiplier units.



Figure 17: 2 level Adder Tree

## Synthesized Schematics:



**Figure 19: Adder Unit Synthesized Schematics**



**Figure 18: Synthesized Schematics multiplier unit**

**Figure 20: 2 level Adder Tree Synthesized Schematics**





**Figure 21: First Stage Multiplier Array Synthesized Schematics**



**Figure 22: Synthesized Schematics of 4\*4 MAC unit**

## Used Resource Estimation:

### 1. Slice Logic

| Site Type             | Used | Fixed | Available | Util% |
|-----------------------|------|-------|-----------|-------|
| Slice LUTs*           | 308  | 0     | 63400     | 0.49  |
| LUT as Logic          | 308  | 0     | 63400     | 0.49  |
| LUT as Memory         | 0    | 0     | 19000     | 0.00  |
| Slice Registers       | 263  | 0     | 126800    | 0.21  |
| Register as Flip Flop | 263  | 0     | 126800    | 0.21  |
| Register as Latch     | 0    | 0     | 126800    | 0.00  |
| F7 Muxes              | 0    | 0     | 31700     | 0.00  |
| F8 Muxes              | 0    | 0     | 15850     | 0.00  |

### 4. IO and GT Specific

| Site Type                   | Used | Fixed | Available | Util% |
|-----------------------------|------|-------|-----------|-------|
| Bonded IOB                  | 81   | 0     | 210       | 38.57 |
| Bonded IPADs                | 0    | 0     | 2         | 0.00  |
| PHY_CONTROL                 | 0    | 0     | 6         | 0.00  |
| PHASER_REF                  | 0    | 0     | 6         | 0.00  |
| OUT_FIFO                    | 0    | 0     | 24        | 0.00  |
| IN_FIFO                     | 0    | 0     | 24        | 0.00  |
| IDELAYCTRL                  | 0    | 0     | 6         | 0.00  |
| IBUFDS                      | 0    | 0     | 202       | 0.00  |
| PHASER_OUT/PHASER_OUT_PHY   | 0    | 0     | 24        | 0.00  |
| PHASER_IN/PHASER_IN_PHY     | 0    | 0     | 24        | 0.00  |
| IDELAYE2/IDELAYE2_FINEDELAY | 0    | 0     | 300       | 0.00  |
| ILOGIC                      | 0    | 0     | 210       | 0.00  |
| OLOGIC                      | 0    | 0     | 210       | 0.00  |

### 7. Primitives

| Ref Name | Used | Functional Category |
|----------|------|---------------------|
| FDRE     | 192  | Flop & Latch        |
| LUT2     | 152  | LUT                 |
| LUT6     | 148  | LUT                 |
| LUT4     | 84   | LUT                 |
| IBUF     | 65   | IO                  |
| CARRY4   | 52   | CarryLogic          |
| OBUF     | 16   | IO                  |
| LUT5     | 12   | LUT                 |
| LUT3     | 12   | LUT                 |
| BUFG     | 1    | Clock               |

## Timing Estimation:

### Design Timing Summary

| Setup                                | Hold                             | Pulse Width                                       |
|--------------------------------------|----------------------------------|---------------------------------------------------|
| Worst Negative Slack (WNS): 7.538 ns | Worst Hold Slack (WHS): 0.152 ns | Worst Pulse Width Slack (WPWS): 4.500 ns          |
| Total Negative Slack (TNS): 0.000 ns | Total Hold Slack (THS): 0.000 ns | Total Pulse Width Negative Slack (TPWS): 0.000 ns |
| Number of Failing Endpoints: 0       | Number of Failing Endpoints: 0   | Number of Failing Endpoints: 0                    |
| Total Number of Endpoints: 128       | Total Number of Endpoints: 128   | Total Number of Endpoints: 193                    |

All user specified timing constraints are met.

## Power Estimation:

### Summary

Power estimation from Synthesized netlist. Activity derived from constraints files, simulation files or vectorless analysis. Note: these early estimates can change after implementation.

|                                     |                      |
|-------------------------------------|----------------------|
| <b>Total On-Chip Power:</b>         | <b>0.114 W</b>       |
| <b>Design Power Budget:</b>         | <b>Not Specified</b> |
| <b>Power Budget Margin:</b>         | <b>N/A</b>           |
| <b>Junction Temperature:</b>        | <b>25.5°C</b>        |
| Thermal Margin:                     | 59.5°C (12.9 W)      |
| Effective θJA:                      | 4.6°C/W              |
| Power supplied to off-chip devices: | 0 W                  |
| Confidence level:                   | Low                  |




## II. 8\*8 Matrix Multiplication

### Elaborated Schematics:



Figure 1: MAC Unit with 8 multiplier units and three layers of adder tree

## Synthesized Schematics:



**Figure 24:** Adder Unit Synthesized Schematics



**Figure 23:** Synthesized Schematics multiplier unit

**Figure 25:** Three level Adder Tree Synthesized Schematics





**Figure 26: First Stage Multiplier Array Synthesized Schematics**



**Figure 27: Synthesized Schematics of 8x8 MAC unit**

## Used Resource Estimation:

### 1. Slice Logic

| Site Type             | Used | Fixed | Available | Util% |
|-----------------------|------|-------|-----------|-------|
| Slice LUTs*           | 680  | 0     | 63400     | 1.07  |
| LUT as Logic          | 680  | 0     | 63400     | 1.07  |
| LUT as Memory         | 0    | 0     | 19000     | 0.00  |
| Slice Registers       | 384  | 0     | 126800    | 0.30  |
| Register as Flip Flop | 384  | 0     | 126800    | 0.30  |
| Register as Latch     | 0    | 0     | 126800    | 0.00  |
| F7 Muxes              | 0    | 0     | 31700     | 0.00  |
| F8 Muxes              | 0    | 0     | 15850     | 0.00  |

### 4. IO and GT Specific

| Site Type                   | Used | Fixed | Available | Util% |
|-----------------------------|------|-------|-----------|-------|
| Bonded IOB                  | 145  | 0     | 210       | 69.05 |
| Bonded IPADs                | 0    | 0     | 2         | 0.00  |
| PHY_CONTROL                 | 0    | 0     | 6         | 0.00  |
| PHASER_REF                  | 0    | 0     | 6         | 0.00  |
| OUT_FIFO                    | 0    | 0     | 24        | 0.00  |
| IN_FIFO                     | 0    | 0     | 24        | 0.00  |
| IDELAYCTRL                  | 0    | 0     | 6         | 0.00  |
| IBUFDS                      | 0    | 0     | 202       | 0.00  |
| PHASER_OUT/PHASER_OUT_PHY   | 0    | 0     | 24        | 0.00  |
| PHASER_IN/PHASER_IN_PHY     | 0    | 0     | 24        | 0.00  |
| IDELAYE2/IDELAYE2_FINEDELAY | 0    | 0     | 300       | 0.00  |
| ILOGIC                      | 0    | 0     | 210       | 0.00  |
| OLOGIC                      | 0    | 0     | 210       | 0.00  |

### 7. Primitives

| Ref Name | Used | Functional Category |
|----------|------|---------------------|
| FDRE     | 384  | Flop & Latch        |
| LUT2     | 320  | LUT                 |
| LUT6     | 296  | LUT                 |
| LUT4     | 168  | LUT                 |
| IBUF     | 129  | IO                  |
| CARRY4   | 108  | CarryLogic          |
| LUT5     | 24   | LUT                 |
| LUT3     | 24   | LUT                 |
| OBUF     | 16   | IO                  |
| BUFG     | 1    | Clock               |

## Timing Estimation:

### Design Timing Summary

| Setup                                | Hold                             | Pulse Width                                       |
|--------------------------------------|----------------------------------|---------------------------------------------------|
| Worst Negative Slack (WNS): 7.538 ns | Worst Hold Slack (WHS): 0.152 ns | Worst Pulse Width Slack (WPWS): 4.500 ns          |
| Total Negative Slack (TNS): 0.000 ns | Total Hold Slack (THS): 0.000 ns | Total Pulse Width Negative Slack (TPWS): 0.000 ns |
| Number of Failing Endpoints: 0       | Number of Failing Endpoints: 0   | Number of Failing Endpoints: 0                    |
| Total Number of Endpoints: 256       | Total Number of Endpoints: 256   | Total Number of Endpoints: 385                    |

All user specified timing constraints are met.

## Power Estimation:

### Summary

Power estimation from Synthesized netlist. Activity derived from constraints files, simulation files or vectorless analysis. Note: these early estimates can change after implementation.

**Total On-Chip Power:** 0.121 W  
**Design Power Budget:** Not Specified  
**Power Budget Margin:** N/A  
**Junction Temperature:** 25.6°C  
Thermal Margin: 59.4°C (12.9 W)  
Effective θJA: 4.6°C/W  
Power supplied to off-chip devices: 0 W  
Confidence level: Low




## III. 16\*16 Matrix Multiplication

### Elaborated Schematics:



**Figure 1: MAC Unit with 16 multiplier units and Four layers of adder tree**

## Synthesized Schematics:



**Figure 29: Adder Unit Synthesized Schematics**

**Figure 30: Four level Adder Tree Synthesized Schematics**



**Figure 28: Synthesized Schematics multiplier unit**



**Figure 31: First Stage Multiplier Array Synthesized Schematics**

**Figure 32: Synthesized Schematics of 16\*16 MAC unit**

## Used Resource Estimation:

### 1. Slice Logic

| Site Type             | Used | Fixed | Available | Util% |
|-----------------------|------|-------|-----------|-------|
| Slice LUTs*           | 1376 | 0     | 63400     | 2.17  |
| LUT as Logic          | 1376 | 0     | 63400     | 2.17  |
| LUT as Memory         | 0    | 0     | 19000     | 0.00  |
| Slice Registers       | 768  | 0     | 126800    | 0.61  |
| Register as Flip Flop | 768  | 0     | 126800    | 0.61  |
| Register as Latch     | 0    | 0     | 126800    | 0.00  |
| F7 Muxes              | 0    | 0     | 31700     | 0.00  |
| F8 Muxes              | 0    | 0     | 15850     | 0.00  |

### 4. IO and GT Specific

| Site Type                   | Used | Fixed | Available | Util%  |
|-----------------------------|------|-------|-----------|--------|
| Bonded IOB                  | 273  | 0     | 210       | 130.00 |
| Bonded IPADs                | 0    | 0     | 2         | 0.00   |
| PHY_CONTROL                 | 0    | 0     | 6         | 0.00   |
| PHASER_REF                  | 0    | 0     | 6         | 0.00   |
| OUT_FIFO                    | 0    | 0     | 24        | 0.00   |
| IN_FIFO                     | 0    | 0     | 24        | 0.00   |
| IDELAYCTRL                  | 0    | 0     | 6         | 0.00   |
| IBUFDS                      | 0    | 0     | 202       | 0.00   |
| PHASER_OUT/PHASER_OUT_PHY   | 0    | 0     | 24        | 0.00   |
| PHASER_IN/PHASER_IN_PHY     | 0    | 0     | 24        | 0.00   |
| IDELAYE2/IDELAYE2_FINEDELAY | 0    | 0     | 300       | 0.00   |
| ILOGIC                      | 0    | 0     | 210       | 0.00   |
| OLOGIC                      | 0    | 0     | 210       | 0.00   |

7. Primitives

| Ref Name | Used | Functional Category |
|----------|------|---------------------|
| FDRE     | 768  | Flop & Latch        |
| LUT2     | 656  | LUT                 |
| LUT6     | 592  | LUT                 |
| LUT4     | 336  | LUT                 |
| IBUF     | 257  | IO                  |
| CARRY4   | 220  | CarryLogic          |
| LUT5     | 48   | LUT                 |
| LUT3     | 48   | LUT                 |
| OBUF     | 16   | IO                  |
| BUFG     | 1    | Clock               |
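The I/O figures in the tables above can be cross-checked with a few lines of arithmetic. The sketch below, in Python, sums the IBUF and OBUF counts from the primitives table and compares the total with the 210 bonded IOBs available; the variable names are illustrative only.

```python
# Values copied from the utilization tables above.
ibuf_count = 257        # input buffers (Primitives table)
obuf_count = 16         # output buffers (Primitives table)
available_iob = 210     # bonded IOBs available on the device

used_iob = ibuf_count + obuf_count
utilization = 100.0 * used_iob / available_iob

print(used_iob)                  # 273
print(f"{utilization:.2f}")      # 130.00
print(used_iob > available_iob)  # True: more pins requested than the device has
```

A utilization above 100% means the top-level ports as synthesized cannot all be mapped to package pins on this part. The logic and timing estimates remain informative, but the design would likely need narrower or serialized I/O before it could be implemented on the device.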

## Timing Estimation:

Design Timing Summary

| Setup                                | Hold                             | Pulse Width                                       |
|--------------------------------------|----------------------------------|---------------------------------------------------|
| Worst Negative Slack (WNS): 7.538 ns | Worst Hold Slack (WHS): 0.152 ns | Worst Pulse Width Slack (WPWS): 4.500 ns          |
| Total Negative Slack (TNS): 0.000 ns | Total Hold Slack (THS): 0.000 ns | Total Pulse Width Negative Slack (TPWS): 0.000 ns |
| Number of Failing Endpoints: 0       | Number of Failing Endpoints: 0   | Number of Failing Endpoints: 0                    |
| Total Number of Endpoints: 512       | Total Number of Endpoints: 512   | Total Number of Endpoints: 769                    |

All user specified timing constraints are met.
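A positive worst setup slack leaves headroom that can be turned into a rough maximum-frequency estimate. The sketch below assumes a 10 ns clock constraint; the period is not stated in this section, so both the period and the resulting frequency should be treated as illustrative. The helper name `fmax_mhz` is hypothetical.

```python
def fmax_mhz(clock_period_ns, wns_ns):
    """Estimate maximum clock frequency from the constrained period and setup slack."""
    min_period_ns = clock_period_ns - wns_ns   # shortest period that still meets setup
    return 1000.0 / min_period_ns              # period in ns -> frequency in MHz


ASSUMED_PERIOD_NS = 10.0   # assumption: constraint value is not given in the report
WNS_NS = 7.538             # worst setup slack from the timing summary

print(round(fmax_mhz(ASSUMED_PERIOD_NS, WNS_NS), 1))
```

Because the slack comes from a synthesis-stage estimate, the achievable frequency after place and route may well be lower.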

## Power Estimation:



## IV. 32\*32 Matrix Multiplication

### Elaborated Schematic



Figure 1: MAC Unit with 32 multiplier units and five layers of adder tree

## Synthesized Schematics:



Figure 34: Adder Unit Synthesized Schematics



Figure 33: Synthesized Schematics multiplier unit

Figure 35: Five level Adder Tree Synthesized Schematics





**Figure 36: First Stage Multiplier Array Synthesized Schematics**

**Figure 37: Synthesized Schematics of 32\*32 MAC unit**

## Used Resource Estimation:

### 1. Slice Logic

| Site Type             | Used | Fixed | Available | Util% |
|-----------------------|------|-------|-----------|-------|
| Slice LUTs*           | 2768 | 0     | 63400     | 4.37  |
| LUT as Logic          | 2768 | 0     | 63400     | 4.37  |
| LUT as Memory         | 0    | 0     | 19000     | 0.00  |
| Slice Registers       | 1536 | 0     | 126800    | 1.21  |
| Register as Flip Flop | 1536 | 0     | 126800    | 1.21  |
| Register as Latch     | 0    | 0     | 126800    | 0.00  |
| F7 Muxes              | 0    | 0     | 31700     | 0.00  |
| F8 Muxes              | 0    | 0     | 15850     | 0.00  |

### 4. IO and GT Specific

| Site Type                   | Used | Fixed | Available | Util%  |
|-----------------------------|------|-------|-----------|--------|
| Bonded IOB                  | 529  | 0     | 210       | 251.90 |
| Bonded IPADs                | 0    | 0     | 2         | 0.00   |
| PHY_CONTROL                 | 0    | 0     | 6         | 0.00   |
| PHASER_REF                  | 0    | 0     | 6         | 0.00   |
| OUT_FIFO                    | 0    | 0     | 24        | 0.00   |
| IN_FIFO                     | 0    | 0     | 24        | 0.00   |
| IDELAYCTRL                  | 0    | 0     | 6         | 0.00   |
| IBUFDS                      | 0    | 0     | 202       | 0.00   |
| PHASER_OUT/PHASER_OUT_PHY   | 0    | 0     | 24        | 0.00   |
| PHASER_IN/PHASER_IN_PHY     | 0    | 0     | 24        | 0.00   |
| IDELAYE2/IDELAYE2_FINEDELAY | 0    | 0     | 300       | 0.00   |
| ILOGIC                      | 0    | 0     | 210       | 0.00   |
| OLOGIC                      | 0    | 0     | 210       | 0.00   |

### 7. Primitives

| Ref Name | Used | Functional Category |
|----------|------|---------------------|
| FDRE     | 1536 | Flop & Latch        |
| LUT2     | 1328 | LUT                 |
| LUT6     | 1184 | LUT                 |
| LUT4     | 672  | LUT                 |
| IBUF     | 513  | IO                  |
| CARRY4   | 444  | CarryLogic          |
| LUT5     | 96   | LUT                 |
| LUT3     | 96   | LUT                 |
| OBUF     | 16   | IO                  |
| BUFG     | 1    | Clock               |
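Comparing these counts with the 16\*16 results shows how the design scales when the vector length doubles. The short Python sketch below computes the ratios from the numbers reported in both sets of resource tables; the dictionary layout is just for illustration.

```python
# Resource counts copied from the 16*16 and 32*32 utilization tables.
usage_16 = {"Slice LUTs": 1376, "Slice Registers": 768, "CARRY4": 220, "Bonded IOB": 273}
usage_32 = {"Slice LUTs": 2768, "Slice Registers": 1536, "CARRY4": 444, "Bonded IOB": 529}

# Ratio of 32*32 usage to 16*16 usage for each resource type.
ratios = {name: usage_32[name] / usage_16[name] for name in usage_16}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
```

LUTs, registers, and carry chains all grow by roughly a factor of two, while the IOB ratio is slightly below two, which is consistent with the 16 output buffers staying unchanged between the two designs.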

## Timing Estimation:

| Setup                                | Hold                             | Pulse Width                                       |
|--------------------------------------|----------------------------------|---------------------------------------------------|
| Worst Negative Slack (WNS): 7.538 ns | Worst Hold Slack (WHS): 0.152 ns | Worst Pulse Width Slack (WPWS): 4.500 ns          |
| Total Negative Slack (TNS): 0.000 ns | Total Hold Slack (THS): 0.000 ns | Total Pulse Width Negative Slack (TPWS): 0.000 ns |
| Number of Failing Endpoints: 0       | Number of Failing Endpoints: 0   | Number of Failing Endpoints: 0                    |
| Total Number of Endpoints: 1024      | Total Number of Endpoints: 1024  | Total Number of Endpoints: 1537                   |

All user specified timing constraints are met.

## Power Estimation:

Power estimation from Synthesized netlist. Activity derived from constraints files, simulation files or vectorless analysis. Note: these early estimates can change after implementation.

**Total On-Chip Power:** **0.162 W**  
**Design Power Budget:** **Not Specified**  
**Power Budget Margin:** **N/A**  
**Junction Temperature:** **25.7°C**  
Thermal Margin: 59.3°C (12.9 W)  
Effective θJA: 4.6°C/W  
Power supplied to off-chip devices: 0 W  
Confidence level: **Low**


