

# **Mathematical Computations**

## **DSP Blocks**

**fabien.vannel@hesge.ch**


# Essence of a DSP Processor

The algorithm must be implemented within the constraints of a predefined, fixed architecture



# Sequential Processing Limits System Performance



# Multiply Accumulate Single Engine

- Sequential processing limits data throughput
  - Time-shared MAC unit
  - The high clock frequency required creates difficult system-level challenges
- 256-tap FIR filter
  - 256 multiply and accumulate (MAC) operations per data sample
  - One output every 256 clock cycles
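The cost of time-sharing a single MAC unit can be sketched in a few lines of Python. This is only a behavioural model under simple assumptions (one MAC operation per clock cycle, integer data); the function name `fir_single_mac` and the all-ones coefficients are hypothetical and not taken from the slides.

```python
def fir_single_mac(samples, coeffs):
    """FIR filter computed with one time-shared MAC unit.

    Returns (outputs, mac_cycles), where mac_cycles counts one modelled
    clock cycle per multiply-accumulate operation.
    """
    delay_line = [0] * len(coeffs)      # newest sample first
    outputs = []
    mac_cycles = 0
    for x in samples:
        delay_line = [x] + delay_line[:-1]
        acc = 0
        for h, d in zip(coeffs, delay_line):
            acc += h * d                # one MAC per clock cycle
            mac_cycles += 1
        outputs.append(acc)
    return outputs, mac_cycles


coeffs = [1] * 256                      # hypothetical 256-tap moving sum
outputs, cycles = fir_single_mac([1, 2, 3], coeffs)
print(outputs)                          # running sums of the input
print(cycles // 3)                      # MAC cycles spent per output sample
```

The inner loop is the bottleneck: every output sample waits for all 256 taps to pass through the same multiplier.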



# Multiply Accumulate Multiple Engines

- Parallel processing maximizes data throughput
  - Supports any level of parallelism
  - Optimal performance/cost tradeoff
- 256-tap FIR filter
  - 256 multiply and accumulate (MAC) operations per data sample
  - One output every clock cycle
- Flexible architecture
  - Distributed DSP resources (LUTs, registers, multipliers, and memory)



# Virtex-II: Enabling High-Performance DSP

Virtex-II introduced the embedded 18x18 multiplier



- Situated between the block RAMs and the CLB array to enable high-performance multiply-accumulate operations
- This dramatically increased multiplier speed and density compared to LUT-based multipliers and enabled FPGA-based DSP

# The Virtex-4 SX platform

- Virtex-4 introduced a new DSP block with both multiply and accumulate functionality
- For the first time, a true “MAC” unit was offered in a Xilinx FPGA. This block was called the DSP48 because of its 48-bit output precision
- Additional adder modes allow subtract and shift functions to support scaling of results
- Integral registers guarantee high-speed pipelined datapaths for maximum clock frequency
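To see what the 48-bit output precision buys, here is a minimal sketch of a signed 18x18 multiply feeding a 48-bit accumulator. It is a behavioural model, not the DSP48 itself: the names `wrap_signed` and `mac` are hypothetical, pipeline registers are ignored, and two's-complement wraparound on overflow is an assumption of the model.

```python
ACC_BITS = 48        # accumulator width named on the slide
OPERAND_BITS = 18    # width of each multiplier input


def wrap_signed(value, bits):
    """Wrap an integer into the signed two's-complement range of `bits` bits."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def mac(acc, a, b, subtract=False):
    """One MAC step: acc +/- a*b, wrapped to the 48-bit accumulator."""
    lo, hi = -(1 << (OPERAND_BITS - 1)), (1 << (OPERAND_BITS - 1)) - 1
    if not (lo <= a <= hi and lo <= b <= hi):
        raise ValueError("operands must fit in 18 signed bits")
    product = a * b                      # fits in 36 bits
    total = acc - product if subtract else acc + product
    return wrap_signed(total, ACC_BITS)


# Accumulate the largest possible product until the accumulator wraps.
worst = -(1 << 17)                       # most negative 18-bit operand
acc = 0
for _ in range(8191):
    acc = mac(acc, worst, worst)
print(acc == 8191 * (1 << 34))           # still exact after 8191 steps
print(mac(acc, worst, worst) < 0)        # one more step wraps around
```

A 36-bit product in a 48-bit accumulator leaves 12 guard bits, which is why thousands of worst-case products can be summed before the result overflows.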

# DSP48 Block

Includes a high performance arithmetic unit and a multiplier



# DSP48 Block

Dynamically Programmable DSP Op Modes

| OpMode                    | Z[6] | Z[5] | Z[4] | Y[3] | Y[2] | X[1] | X[0] | Output                   |
|---------------------------|------|------|------|------|------|------|------|--------------------------|
| Zero                      | 0 | 0 | 0 | 0      | 0 | 0 | 0 | +/- Cin                  |
| Hold P                    | 0 | 0 | 0 | 0      | 0 | 1 | 0 | +/- (P + Cin)            |
| A:B Select                | 0 | 0 | 0 | 0      | 0 | 1 | 1 | +/- (A:B + Cin)          |
| Multiply                  | 0 | 0 | 0 | 0      | 1 | 0 | 1 | +/- (A * B + Cin)        |
| C Select                  | 0 | 0 | 0 | 1      | 1 | 0 | 0 | +/- (C + Cin)            |
| Feedback Add              | 0 | 0 | 0 | 1      | 1 | 1 | 0 | +/- (C + P + Cin)        |
| 36-Bit Adder              | 0 | 0 | 0 | 1      | 1 | 1 | 1 | +/- (A:B + C + Cin)      |
| P Cascade Select          | 0 | 0 | 1 | 0      | 0 | 0 | 0 | PCIN +/- Cin             |
| P Cascade Feedback Add    | 0 | 0 | 1 | 0      | 0 | 1 | 0 | PCIN +/- (P + Cin)       |
| P Cascade Add             | 0 | 0 | 1 | 0      | 0 | 1 | 1 | PCIN +/- (A:B + Cin)     |
| P Cascade Multiply Add    | 0 | 0 | 1 | 0      | 1 | 0 | 1 | PCIN +/- (A * B + Cin)   |
| P Cascade Add             | 0 | 0 | 1 | 1      | 1 | 0 | 0 | PCIN +/- (C + Cin)       |
| P Cascade Feedback Add Add | 0 | 0 | 1 | 1      | 1 | 1 | 0 | PCIN +/- (C + P + Cin)   |
| P Cascade Add Add         | 0 | 0 | 1 | 1      | 1 | 1 | 1 | PCIN +/- (A:B + C + Cin) |
| Hold P                    | 0 | 1 | 0 | 0      | 0 | 0 | 0 | P +/- Cin                |
| Double Feedback Add       | 0 | 1 | 0 | 0      | 0 | 1 | 0 | P +/- (P + Cin)          |
| Feedback Add              | 0 | 1 | 0 | 0      | 0 | 1 | 1 | P +/- (A:B + Cin)        |
| Multiply-Accumulate       | 0 | 1 | 0 | 0      | 1 | 0 | 1 | P +/- (A * B + Cin)      |
| Feedback Add              | 0 | 1 | 0 | 1      | 1 | 0 | 0 | P +/- (C + Cin)          |
| Double Feedback Add       | 0 | 1 | 0 | 1      | 1 | 1 | 0 | P +/- (C + P + Cin)      |
| Feedback Add Add          | 0 | 1 | 0 | 1      | 1 | 1 | 1 | P +/- (A:B + C + Cin)    |
| C Select                  | 0 | 1 | 1 | 0      | 0 | 0 | 0 | C +/- Cin                |
| Feedback Add              | 0 | 1 | 1 | 0      | 0 | 1 | 0 | C +/- (P + Cin)          |
| 36-Bit Adder              | 0 | 1 | 1 | 0      | 0 | 1 | 1 | C +/- (A:B + Cin)        |
| Multiply-Add              | 0 | 1 | 1 | 0      | 1 | 0 | 1 | C +/- (A * B + Cin)      |
| Double                    | 0 | 1 | 1 | 1      | 1 | 0 | 0 | C +/- (C + Cin)          |
| Double Add Feedback Add   | 0 | 1 | 1 | 1      | 1 | 1 | 0 | C +/- (C + P + Cin)      |
| Double Add                | 0 | 1 | 1 | 1      | 1 | 1 | 1 | C +/- (A:B + C + Cin)    |

- Enables time-division multiplexing for DSP
- Over 40 different modes
- Each XtremeDSP Slice individually controllable
- Change operation in a single clock cycle
- Control functionality from logic, memory or processor
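The OpMode table can be read as three multiplexer selections (Z on bits 6:4, Y on bits 3:2, X on bits 1:0) feeding an adder that computes Z +/- (X + Y + Cin). The sketch below decodes only the encodings that appear in the table; the function names are hypothetical, 48-bit wrapping is left out, and treating the multiplier as a joint X/Y selection is a modelling assumption drawn from the rows above.

```python
# Mux encodings read off the OpMode table; other encodings are left out.
X_MUX = {0b00: "0", 0b01: "M", 0b10: "P", 0b11: "A:B"}
Y_MUX = {0b00: "0", 0b01: "M", 0b11: "C"}
Z_MUX = {0b000: "0", 0b001: "PCIN", 0b010: "P", 0b011: "C"}


def decode_opmode(opmode):
    """Split a 7-bit OpMode into its (Z, Y, X) multiplexer selections."""
    z = Z_MUX[(opmode >> 4) & 0b111]
    y = Y_MUX[(opmode >> 2) & 0b11]
    x = X_MUX[opmode & 0b11]
    if (x == "M") != (y == "M"):
        raise ValueError("X and Y must select the multiplier together")
    return z, y, x


def dsp48_output(opmode, a=0, b=0, ab=0, c=0, p=0, pcin=0, cin=0,
                 subtract=False):
    """Evaluate Z +/- (X + Y + Cin) for one OpMode (no 48-bit wrapping)."""
    z, y, x = decode_opmode(opmode)
    operands = {"0": 0, "P": p, "A:B": ab, "C": c, "PCIN": pcin}
    if x == "M":
        xy = a * b                       # X and Y jointly carry the product
    else:
        xy = operands[x] + operands[y]
    inner = xy + cin
    return operands[z] - inner if subtract else operands[z] + inner


print(decode_opmode(0b0100101))          # the Multiply-Accumulate row
print(dsp48_output(0b0100101, a=2, b=3, p=10))
```

Because the OpMode is just an input word, changing it from one clock cycle to the next turns the same slice from a multiplier into an accumulator or an adder.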

# DSP48 Block

Useful For More Than DSP

- 6:1 high-speed, 36-bit multiplexer
  - Uses four XtremeDSP Slices and op-modes
  - 500 MHz performance using no programmable logic
    - Saves the 1584 logic cells (LCs) needed to build the equivalent function in logic
- Dynamic 18-bit barrel shifter
  - Uses two XtremeDSP Slices
  - Uses dedicated cascade routing and the integrated 17-bit shift
    - Saves the 1449 LCs needed to build the equivalent function in logic
- 36-bit loadable counter
  - Uses a single XtremeDSP Slice and achieves 500 MHz performance
    - Saves the 540 LCs needed to build the equivalent function in logic

# Virtex-5 – DSP48E

- Virtex-5 SX introduced several improvements in the DSP48E (“enhanced”) DSP block
- The adder block was modified to become a multifunctional ALU. A pattern comparator was added to support the detection of saturation, overflow and underflow conditions
- A 48-bit carry chain supports the propagation of partial sum and product carries, so multiple DSP48E blocks can be chained to give higher bit precision
- ALU opcodes are dynamically controlled, allowing functional changes on a cycle-by-cycle basis
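The pattern compare can be pictured as a masked equality test on the 48-bit result. The sketch below is an illustration only: the function names are hypothetical, and the convention that a mask bit of 1 means "ignore this bit" is an assumption to check against the DSP48E user guide.

```python
WIDTH = 48
FULL = (1 << WIDTH) - 1


def pattern_detect(p, pattern, mask):
    """True when P equals `pattern` on every bit where `mask` is 0.

    Assumption: a mask bit of 1 means the bit is ignored.
    """
    return ((p ^ pattern) & ~mask & FULL) == 0


def fits_unsigned(p, bits):
    """Overflow check: are all result bits above the low `bits` bits zero?"""
    mask = (1 << bits) - 1               # ignore the low `bits` bits
    return pattern_detect(p, pattern=0, mask=mask)


print(fits_unsigned(0xFFFF, 16))         # value still fits in 16 bits
print(fits_unsigned(0x10000, 16))        # value has overflowed 16 bits
```

Comparing only the upper bits against an all-zeros pattern is one way such a detector can flag overflow without any fabric logic.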

# DSP48E Block

Includes a high performance ALU, pattern compare, and a multiplier



450 MHz operation in the slowest speed grade

# Spartan-3A DSP

- Incorporates the primary features from earlier Virtex family DSP48 blocks
- The DSP48A block provides full MAC support with a pre-adder stage, a multiplier, and an add/accumulate stage
- Dedicated DSP blocks offer the lowest cost per MAC in an FPGA
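A pre-adder pays off in filters with symmetric coefficients: the two samples that share a coefficient are added first, so one multiplier serves two taps. The sketch below illustrates that general technique under assumed integer data; `fir_direct`, `fir_preadd` and the example values are hypothetical.

```python
def fir_direct(window, coeffs):
    """Reference result: one multiply per tap."""
    return sum(h * x for h, x in zip(coeffs, window))


def fir_preadd(window, coeffs):
    """Symmetric FIR using a pre-adder; returns (result, multiplies)."""
    n = len(coeffs)
    if list(coeffs) != list(coeffs)[::-1]:
        raise ValueError("coefficients must be symmetric")
    acc = 0
    multiplies = 0
    for k in range(n // 2):
        # Pre-add the two samples that share coefficient k, then multiply.
        acc += coeffs[k] * (window[k] + window[n - 1 - k])
        multiplies += 1
    if n % 2:                            # middle tap of an odd-length filter
        acc += coeffs[n // 2] * window[n // 2]
        multiplies += 1
    return acc, multiplies


coeffs = [1, 2, 3, 3, 2, 1]              # hypothetical symmetric taps
window = [5, -1, 4, 0, 2, 7]
print(fir_direct(window, coeffs))
print(fir_preadd(window, coeffs))        # same result, half the multiplies
```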

# DSP48A Block

Incorporates the primary features of the Virtex-4 DSP48 and adds a pre-adder stage

- Integrated XtremeDSP Slice
  - Application optimized capacity
    - 3400A – 126 DSP48As
    - 1800A – 84 DSP48As
  - Integrated pre-adder optimized for filters
  - 250 MHz operation, standard speed grade
  - Compatible with Virtex-DSP



- Increased memory capacity and performance
  - Also important for embedded processing, complex IP, etc

# The Xilinx 7 Series FPGAs

## Industry's First Unified Architecture

- Industry's Lowest Power and First Unified Architecture
  - Spanning Low-Cost to Ultra High-End applications
- Three new device families with breakthrough innovations in power efficiency, performance-capacity and price-performance

|                         | Artix-7        | Kintex-7                          | Virtex-7                              |
|-------------------------|----------------|-----------------------------------|---------------------------------------|
| Positioning             | Lowest Power & Cost | Industry's Best Price/Performance | Industry's Highest System Performance |
| Logic Cells             | 20K – 355K     | 30K – 410K                        | 285K – 2,000K                         |
| DSP Slices              | 40 – 700       | 120 – 1,540                       | 700 – 3,960                           |
| Max. Transceivers       | 4              | 16                                | 80                                    |
| Transceiver Performance | 3.75 Gbps      | 6.6 Gbps, 10.3 Gbps               | 10.3 Gbps, 13.1 Gbps, 28 Gbps         |
| Memory Performance      | 800 Mbps       | 2133 Mbps                         | 2133 Mbps                             |
| Max. SelectIO™          | 450            | 500                               | 1200                                  |
| SelectIO™ Voltages      | 3.3 V and below | 3.3 V and below; 1.8 V and below | 3.3 V and below; 1.8 V and below      |

# 7 Series: DSP48E1



Figure 2-1: 7 Series FPGA DSP48E1 Slice

Signals marked with an asterisk in the figure are dedicated routing paths internal to the DSP48E1 column. They are not accessible via fabric routing resources.

# 7 Series: DSP48E1

- Wider functionality in DSP48E1 than DSP48A1:
  - Multiplier width is increased from $18 \times 18$ in the Spartan-6 family to $25 \times 18$ in the 7 series
  - The A register width is increased from 18 bits in the Spartan-6 family to 30 bits in the 7 series:
    - A and B registers can be concatenated in the 7 series
    - In the 7 series, the A register (rather than the B register) feeds the pre-adder
  - Cascading capability on both pipeline paths for larger multipliers and larger post-adders
- Unique features in DSP48E1 over DSP48A1:
  - Arithmetic logic unit (ALU)
  - SIMD mode
  - Pattern detector
  - 17-bit shifter
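SIMD mode can be pictured as cutting the carry chain of the 48-bit adder so that it behaves as several narrower, independent adders. The sketch below assumes a four-lane, 12-bit split (and a two-lane, 24-bit split via `lanes=2`); the helper names are hypothetical, and per-lane carry-out is simply dropped in this model.

```python
def pack(values, lane_bits):
    """Pack small unsigned integers into one word, lane 0 in the low bits."""
    word = 0
    for i, v in enumerate(values):
        word |= (v & ((1 << lane_bits) - 1)) << (i * lane_bits)
    return word


def unpack(word, lanes, lane_bits):
    """Split a word back into its lanes, lane 0 first."""
    lane_mask = (1 << lane_bits) - 1
    return [(word >> (i * lane_bits)) & lane_mask for i in range(lanes)]


def simd_add(a, b, lanes=4, width=48):
    """Add two words lane by lane; no carry crosses a lane boundary."""
    lane_bits = width // lanes
    lane_mask = (1 << lane_bits) - 1
    result = 0
    for i in range(lanes):
        shift = i * lane_bits
        lane_sum = ((a >> shift) & lane_mask) + ((b >> shift) & lane_mask)
        result |= (lane_sum & lane_mask) << shift   # carry-out is dropped
    return result


a = pack([0xFFF, 1, 2, 3], 12)
b = pack([1, 1, 1, 1], 12)
print(unpack(simd_add(a, b), 4, 12))     # lane 0 wraps without touching lane 1
print(unpack(a + b, 4, 12))              # plain add: the carry leaks into lane 1
```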

# Why should I use FPGAs for DSP?

# Reason 1: FPGAs handle high computational workloads

Speed up FIR filters by implementing them with a parallel architecture

## Programmable DSP - Sequential



$$\frac{1 \text{ GHz}}{256 \text{ clock cycles}} \approx 4 \text{ MSPS}$$

## FPGA - Fully Parallel Implementation



$$\frac{500 \text{ MHz}}{1 \text{ clock cycle}} = 500 \text{ MSPS}$$
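Both figures follow from the same relation: sample rate equals clock frequency divided by the clock cycles spent per sample. A small sketch, assuming the taps are shared evenly among the available MAC units (the function name `sample_rate_msps` and the 16-unit case are hypothetical):

```python
def sample_rate_msps(clock_hz, taps, mac_units):
    """Sample rate in MSPS when `taps` MACs are shared by `mac_units` engines."""
    cycles_per_sample = -(-taps // mac_units)    # ceiling division
    return clock_hz / cycles_per_sample / 1e6


print(sample_rate_msps(1e9, 256, 1))      # sequential DSP processor at 1 GHz
print(sample_rate_msps(500e6, 256, 256))  # fully parallel FPGA at 500 MHz
print(sample_rate_msps(500e6, 256, 16))   # an intermediate point, 16 MAC units
```

The intermediate case shows that the level of parallelism is a design parameter: any number of MAC units between 1 and 256 gives a valid point between the two extremes.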

## Example 256 TAP Filter Implementation

# Reason 2: FPGAs are ideal for multi-channel DSP Designs

Multiple channels can run in parallel, or be time-multiplexed through a single filter



- Many low-sample-rate channels can be multiplexed (e.g. TDM) and processed in the FPGA at a high rate
- Interpolation (zero insertion) can also drive sample rates higher
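Time-multiplexing several channels through one filter comes down to keeping a separate delay line per channel while sharing the arithmetic. The sketch below is a behavioural illustration with a hypothetical function name and made-up data; it assumes the channels arrive interleaved sample by sample.

```python
def tdm_fir(interleaved, coeffs, channels):
    """Filter sample-interleaved channels with one shared FIR datapath."""
    # One delay line per channel; the multiply-add logic is shared.
    delay = [[0] * len(coeffs) for _ in range(channels)]
    out = []
    for i, x in enumerate(interleaved):
        ch = i % channels                # channel owning this time slot
        delay[ch] = [x] + delay[ch][:-1]
        out.append(sum(h * d for h, d in zip(coeffs, delay[ch])))
    return out


# Two channels, interleaved: ch0 = 1, 2, 3 and ch1 = 10, 20, 30
print(tdm_fir([1, 10, 2, 20, 3, 30], coeffs=[1, 1], channels=2))
```

The output stays interleaved in the same order as the input, so each channel sees exactly the result it would get from a dedicated filter.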

# Reason 3: Customize Architectures to Suit your Goals

FPGAs allow Cost/Performance tradeoffs



# Reason 4: Lower System Cost through Integration

Implement Interface Logic within FPGA to connect DSP functions to I/O and Memory Devices



# The XtremeDSP Slice Advantage

Without XtremeDSP Slice, Parallel Adder Tree Consumes Logic Resources



# The XtremeDSP Slice Advantage

With XtremeDSP Slice, Parallel Adder Tree Consumes Zero Logic Resources

Parallel Adder Cascade Implementation



32-tap filter implemented entirely with XtremeDSP Slices
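The difference between the two approaches can be sketched as follows. An adder tree needs separate adders to combine the products, whereas in a cascade each slice adds its own product to the sum arriving from its neighbour. This model ignores pipeline timing, and the function names and data are hypothetical.

```python
def adder_tree(products):
    """Sum products pairwise; returns (total, number of separate adders)."""
    level = list(products)
    adders = 0
    while len(level) > 1:
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        adders += len(nxt)
        if len(level) % 2:               # odd element passes to the next level
            nxt.append(level[-1])
        level = nxt
    return level[0], adders


def adder_cascade(samples, coeffs):
    """Chain of slices, each computing P = PCIN + A * B."""
    pcin = 0
    for a, b in zip(samples, coeffs):
        pcin = pcin + a * b              # the add happens inside the slice
    return pcin


samples = list(range(1, 33))             # 32 hypothetical input samples
coeffs = [2] * 32                        # 32 hypothetical coefficients
total, adders = adder_tree([a * b for a, b in zip(samples, coeffs)])
print(total, adders)                     # 31 adders outside the multipliers
print(adder_cascade(samples, coeffs))    # same sum, no separate adders
```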