

# A Generated Multirate Signal Analysis RISC-V SoC in 16nm FinFET

Steve Bailey\*, Jaeduk Han\*, Paul Rigge\*, Richard Lin\*, Eric Chang\*, Howard Mao\*, Zhongkai Wang\*, Chick Markley\*, Adam Izraelevitz\*, Angie Wang\*, Nathan Narevsky\*, Woorham Bae\*, Steve Shauck†, Sergio Montano†, Justin Norsworthy†, Munir Razzaque†, Wen Hau Ma†, Akalu Lentiro†, Matthew Doerlein†, Darin Heckendorf†, Jim McGrath†, Franco DeSeta†, Ronen Shoham†, Mike Stellfox†, Mark Snowden†, Joseph Cole†, Dan Fuhrman†, Brian Richards\*, Jonathan Bachrach\*, Elad Alon\*, and Borivoje Nikolic\*

\*EECS, University of California, Berkeley    †Northrop Grumman Corporation    ‡Cadence Design Systems, Inc

**Abstract**—This paper demonstrates a signal analysis SoC consisting of a general-purpose RISC-V core with vector extensions and a fixed-function signal-processing accelerator. Both the core and the accelerators are instances produced by novel generators that allow for a wide range of parameter configurations and rapid design space exploration. The signal processing chain consists of generated instances of a time-interleaved ADC followed by a digital tuner, FIR filter, polyphase filter, and FFT all connected to the processor via an AXI4 bus. The 5 mm×5 mm chip is implemented in a 16 nm FinFET process and operates at 410 MHz at 750 mV drawing 600 mW. Presented applications show coupled functionality of the processor and accelerator performing spectrometry and radar receive processing, and a comparison with other state-of-the-art ASICs proves that generators can produce competitive designs.

## I. INTRODUCTION

There is a continuing need for energy efficiency improvements through hardware specialization, yet custom IC development costs have become prohibitively large. A limiting factor in hardware design efforts is the lack of design reuse, currently relegated to (sometimes hardened) IP targeting the chosen application. Proposed solutions include raising the level of design abstraction (e.g., HLS [1]), embedding the design process into a reusable generator [2], [3], or having one expert engineer replicate a simple design [4]. However, the open literature reports no complete examples of performance-competitive, mixed-signal SoC designs that demonstrate a comprehensive use of both digital and mixed-signal generators for both design and verification.

This work demonstrates a complex SoC containing a signal-processing accelerator and a general-purpose processor from a set of novel, open-source digital and analog generators [5] written in Chisel [6] and BAG [7]. The general-purpose processor is a generated RISC-V core with new ISA accelerators, and the signal processing accelerator comes from a new generator of streaming signal processing functions. Section II describes the architecture produced by the generator. Testing and measurement results are given in Section III. The presented instance is general enough to apply in a variety of signal processing contexts, and Section IV demonstrates several of these applications.

## II. SOC ARCHITECTURE

Figure 1 shows the system architecture. The SoC is divided into a general-purpose processor and a custom signal processing accelerator. Communication between the processor and the accelerator is handled through a memory-mapped IO manager.

### A. General-Purpose Processor

The chip includes a general-purpose processor that connects a host with the chip, programs the signal-processing accelerator, moves data, and computes what other on-chip accelerators cannot. The 64-bit single-issue in-order RISC-V Rocket CPU includes a single-/double-precision (SP/DP) floating-point unit (FPU). A direct memory access (DMA) accelerator offloads memory movement between the processors from the CPU. The 4-lane high-performance Hwacha vector accelerator implements a decoupled vector-fetch architecture and can perform compute-intensive parallel workloads not handled by the signal processing accelerator. A serial adapter tethers the CPU to the FPGA host, which writes programs to the on-chip 8 MB main memory before booting the core. A custom clock receiver supplies the core clock, and an asynchronous FIFO permits low-speed communication between the host and Rocket CPU.

Fig. 1. Block diagram of the SoC architecture.

Fig. 2. Detailed diagram of the processing elements in the DSP accelerator. Red boxes indicate memory-mapped IO SCR. Green overlays show generator parameters. Blue text gives fixed-point data type parameters chosen. CQ = complex fixed-point number.

### B. Digital Signal-Processing Accelerator

The digital signal-processing accelerator architecture is reminiscent of SDF [8], [9], with actors (here processing elements) communicating through streaming interfaces rather than FIFO tokens. The Chisel generator contains memory-mapped IO registers taking commands from either the CPU or a JTAG debug port. Asynchronous FIFOs buffer data between the core clock, DSP clock, and JTAG clock domains. Separate AXI4 crossbars access status and control registers (SCRs) and data buffers (SAMs). For testing, a 512 KB pattern generator and 512 KB logic analyzer allow direct access to inputs and outputs of individual processing elements (PEs). A chain of PEs receives data from a BAG-generated 8-bit time-interleaved successive approximation (TISAR) ADC with lookup table (LUT)-based static calibration. The ADC also provides a clock to the PEs and AXI4 crossbars. The UART provides a backup interface into the RISC-V core, and the JTAG provides a backup interface into the accelerator.

### C. Processing Elements

The selected PEs implement a signal analysis accelerator targeting spectral analysis or radar receive chain processing. Figure 2 shows the parameterization of PEs in green atop the final implementation diagram. The ADC LUT outputs 9 bits for testing, so a custom bit manipulator (BM) PE truncates this to 8 bits. The next two PEs form a digital down-converter (DDC). A 32-entry LUT-based digital tuner mixes the input signal with a complex sinusoid, and a fully programmable 136-tap complex FIR filter shapes the signal and decimates it by 8. A 12-tap fixed-function polyphase filter multiplies the time-series data by a sinc function to window each FFT bin and reduce frequency-domain spectral leakage. A 128-point radix-2 FFT, composed of 32-point biplex pipelined FFTs and a 4-point direct-form FFT, produces the complex spectrum output. The chain generator supports arbitrary ordering and duplication of PEs, so the chosen arrangement of PEs represents just one possible DSP accelerator configuration.
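As a rough behavioral illustration of the DDC stage described above, the following Python sketch mixes a real input with a complex sinusoid read from a 32-entry LUT, then applies an FIR filter and decimates by 8. The function names, the 8-tap boxcar coefficients, and the floating-point arithmetic are assumptions made for illustration only; the chip uses fixed-point data types and a programmable 136-tap complex filter.

```python
import cmath
import math

LUT_SIZE = 32      # tuner LUT depth, from the text
DECIMATION = 8     # FIR decimation factor, from the text

def make_tuner_lut(k, size=LUT_SIZE):
    """Complex sinusoid exp(-j*2*pi*k*n/size); k picks the LO (hypothetical interface)."""
    return [cmath.exp(-2j * math.pi * k * n / size) for n in range(size)]

def tune(samples, lut):
    """Mix the samples with the LUT contents, which repeat every len(lut) samples."""
    return [x * lut[n % len(lut)] for n, x in enumerate(samples)]

def fir_decimate(samples, taps, factor=DECIMATION):
    """Direct-form FIR computed only at every `factor`-th output sample."""
    out = []
    # Start at the first index where every tap has a valid input sample.
    for n in range(len(taps) - 1, len(samples), factor):
        acc = 0j
        for i, h in enumerate(taps):
            acc += h * samples[n - i]
        out.append(acc)
    return out

def ddc(samples, k, taps):
    """Tuner followed by the decimating FIR filter."""
    return fir_decimate(tune(samples, make_tuner_lut(k)), taps)

# Example: a real tone at 4/32 cycles per sample, tuned down to DC.
tone = [math.cos(2 * math.pi * 4 * n / 32) for n in range(256)]
baseband = ddc(tone, k=4, taps=[1 / 8] * 8)
```

In this example the mixer moves one half of the real tone to DC and the other half to 8/32 cycles per sample, which the boxcar filter nulls, so `baseband` settles at half the input amplitude.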

### D. Instance Verification

Verifying and testing the generated instances is aided by coupling the design with design-for-test (DFT) structures.

Generated test benches adjust to chosen design parameters, and unit tests are run both in simulation and on the fabricated chip. Unit test vectors are generated in Python and then passed to a generated Universal Verification Methodology (UVM) testbench in Cadence's Verification Workbench (VWB) to verify the instance. A pattern generator and logic analyzer connect to each PE's IO and perform the same unit-level verification on the chip after fabrication. A JTAG debug module provides access to the signal-processing accelerator as a backup to the CPU. The CPU and signal-processing accelerator pass all ISA and unit-level tests, as well as comprehensive benchmarks (e.g., Dhrystone) and kernels (e.g., matrix multiply).
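To make the idea of parameter-aware test vectors concrete, here is a minimal Python sketch of a golden model and vector generator for the bit manipulator PE. It is not taken from the project's test suite: the function names are hypothetical, and it assumes that truncation drops the least significant bit of a signed value, which the text does not specify.

```python
import random

def truncate(value, in_bits=9, out_bits=8):
    """Golden model: drop the low (in_bits - out_bits) bits (assumed behavior)."""
    return value >> (in_bits - out_bits)

def make_vectors(count, in_bits=9, out_bits=8, seed=0):
    """Return (stimulus, expected) pairs: four corner cases, then random values."""
    rng = random.Random(seed)  # fixed seed keeps simulation and chip runs identical
    lo = -(1 << (in_bits - 1))
    hi = (1 << (in_bits - 1)) - 1
    stimulus = [lo, hi, 0, -1] + [rng.randint(lo, hi) for _ in range(count)]
    return [(x, truncate(x, in_bits, out_bits)) for x in stimulus]

vectors = make_vectors(16)
```

Because the bit widths are arguments rather than constants, the same script can follow a change in the generator parameters, which is the property the generated test benches rely on.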

## III. TESTING RESULTS AND MEASUREMENTS

The chip is implemented in TSMC's 16 nm FinFET technology and signed off at 300 MHz for both the core and DSP clock domains at 0.72 V and 125°C. Figure 3 shows the 5 mm by 5 mm annotated layout, die photo, and chip summary. The 8 MB main memory, Hwacha vector accelerator, and various other memories comprise most of the area. Also visible is the 136-tap fully programmable FIR filter, composed of many complex multipliers and adders. By using the Hwacha vector accelerator, at these conditions the general-purpose processor achieves 23.4 GFLOPS/W running double-precision matrix multiply on $256 \times 256$ matrices. For the general-purpose processor, throughput is measured by moving FFT output data from the SAM to the CPU memory and accumulating spectra using either the vector co-processor or the scalar ALU and DMA. Throughput is measured for the signal-analysis processor at the maximum operating frequency (see Figure 7 for more on spectral rates). Efficiency for the signal-analysis processor accounts for all PEs, and one operation is anything from a real 8-bit add to a 17-bit multiply, with complex adds and multiplies broken into their real operations.

Fig. 3. Chip layout, die photo, and summary

Fig. 4. Both processors function under similar operating condition ranges. The general-purpose processor consumes more power because of the 8 MB main memory.

Figure 4 shows the shmoo and power plots for the two processors. Both function down to 0.56 V (the nominal supply is 0.8 V) and up to 410 MHz. A success on the shmoo plot requires the general-purpose processor to pass all ISA unit tests and the signal-analysis accelerator to pass all PE unit tests. Annotated on the shmoo plots are the corner values, i.e., the minimum voltage at the maximum frequency and the maximum frequency at the minimum supply voltage. Power measurements are averaged over the same unit tests as the shmoo. The chip consumes less than 1 W total for all modes.

Figure 5 shows the power and a typical calibration result for the 8-slice time-interleaved SAR ADC, which features a fractional radix to produce 8 real output bits from 9 total bits. The ADC reaches a maximum of 6.6 ENOB per slice and 6.4 ENOB total at 6 GS/s under a 0.9 V supply. Analog supplies are independent from digital supplies.
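For readers who prefer decibels, the ENOB figures can be converted with the standard ideal-quantizer relation. This is a textbook conversion applied to the reported numbers, not an additional measurement:

$$\mathrm{SINAD} = 6.02 \cdot \mathrm{ENOB} + 1.76\ \text{dB}$$

$$6.02 \times 6.4 + 1.76 \approx 40.3\ \text{dB (total)}, \qquad 6.02 \times 6.6 + 1.76 \approx 41.5\ \text{dB (per slice)}$$

Assuming the 8 slices sample in uniform rotation, each slice converts at $6\ \text{GS/s} / 8 = 0.75\ \text{GS/s}$.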

## IV. SIGNAL ANALYSIS APPLICATIONS

In this section we present two applications running on the SoC. Using the pattern generator to produce arbitrary waveforms and replace the ADC as the input source hastened the development and debugging of these programs by decoupling program errors from ADC miscalibration, input noise, or instrument errors. The applications utilize many SoC features, including the complete signal processing chain, DMA, vector accelerator, and general-purpose processor. We compare the results with similar, fixed-function processors.

Fig. 5. Typical ADC power consumption is less than 50 mW at 0.9 V. Calibration of the ADC reduces noise and spurs.

### A. Spectrometry

Atmospheric spectrometers monitor molecule emissions to determine the composition of gases. Given the low SNR of these emissions, spectrometers with a wide bandwidth and long accumulation time are desired. A 512-point FFT is formed by sweeping the tuner frequency to allow the signal processor to analyze up to four frequency bands with a 128-point FFT engine (the filter decimates the data rate by eight, but half the bands are symmetric because the input is real-valued). Figure 6 shows the signal processing path of a spectrometer input dataset. The low-pass filter, with constrained equiripple coefficients designed in MATLAB, passes the lower eighth of the spectrum to avoid aliasing when down-converting. Its stopband is at 40 dB below the passband, resulting in visible aliasing above the noise floor in the combined spectrum. For each band, the tuner is set and the FFT outputs are stored in the SAM before being moved into the general-purpose processor's memory by the DMA. The power is calculated and accumulated using the vector accelerator. The use of these accelerators boosts the data processing rate for this application by over 10x, as seen in Figure 7. While not designed specifically for spectrometry, this work is competitive with other published ASIC spectrometers, as seen in Table I.

Fig. 6. Spectrometer signal processing example. Snapshots captured in the SAMs. (1) A real-valued signal is sampled through the calibrated ADC, producing a symmetric spectrum. (2) Four tuner LO frequencies are mixed with the input, producing four frequency-shifted spectra. (3) These spectra are low-pass filtered and down-converted by 8, resulting in four separate frequency bands. (4) The bands are Fourier transformed, accumulated, and combined on the CPU. This figure shows 100 accumulated spectra.

Fig. 7. Using the vector and DMA accelerators speeds up spectral accumulation by over 10x. This plot includes the overhead of sweeping the tuner frequency to monitor four frequency bands, so each spectrum is 512 channels.
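The accumulation and band-combining step of the spectrometer can be sketched in a few lines of Python. This is a behavioral illustration only, under several assumptions: a naive DFT stands in for the on-chip FFT, the function names are hypothetical, and the reordering that places negative frequencies first within each band is one plausible convention rather than the documented one.

```python
import cmath
import math

FFT_SIZE = 128   # FFT length, from the text
NUM_BANDS = 4    # tuner settings swept per combined spectrum, from the text

def dft(x):
    """Naive O(N^2) DFT standing in for the hardware FFT."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def accumulate_power(frames):
    """Sum |X[k]|^2 over every frame captured for one band."""
    acc = [0.0] * FFT_SIZE
    for frame in frames:
        for k, value in enumerate(dft(frame)):
            acc[k] += value.real ** 2 + value.imag ** 2
    return acc

def combine_bands(band_spectra):
    """Concatenate per-band accumulations into one wide spectrum."""
    combined = []
    for spectrum in band_spectra:
        half = len(spectrum) // 2
        # Assumed convention: negative-frequency bins first within each band.
        combined.extend(spectrum[half:] + spectrum[:half])
    return combined

# Example: three frames of a complex tone in bin 5, reused for all four bands.
frame = [cmath.exp(2j * math.pi * 5 * t / FFT_SIZE) for t in range(FFT_SIZE)]
band = accumulate_power([frame] * 3)
wide = combine_bands([band] * NUM_BANDS)
```

Four 128-channel bands yield the 512-channel spectrum mentioned in the text, and the accumulation depth is bounded only by the number of frames the software chooses to sum.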

### B. Radar

Unlike atmospheric spectrometry, radar operates on fixed or variable short pulses or frequency-modulated continuous wave (FMCW) signals. These higher SNR signals require less accumulation, but processing speed limits the detectable range resolution. Figure 8 shows an example measured spectrogram of 4 $\mu$s fixed-frequency pulses, repeating every 8 $\mu$s. The tuner is set to view frequency bands containing the expected signal, and the FFT outputs a spectrum every 171 ns when the ADC operates at 6 GS/s. For unmodulated pulsed signals as in Figure 8, the maximum pulse repetition frequency (PRF) this SoC can resolve is 2.9 MHz (2.9 million pulses per second), leading to a minimum range resolution of

$$\frac{c}{2 \times \mathrm{PRF}} = 51.7\ \text{m}. \quad (1)$$
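The numbers in this paragraph can be reproduced with a short calculation. The Python sketch below uses hypothetical function names and assumes that at least two spectra per pulse period are needed to distinguish the pulse from the gap, which is one way to arrive at the 2.9 MHz figure; the text does not state the derivation.

```python
C = 299_792_458.0   # speed of light in m/s
ADC_RATE = 6e9      # ADC sample rate in samples/s, from the text
DECIMATION = 8      # FIR decimation factor, from the text
FFT_SIZE = 128      # FFT length, from the text

def spectrum_period(adc_rate=ADC_RATE, decimation=DECIMATION, fft_size=FFT_SIZE):
    """Time needed to collect the input samples for one FFT frame, in seconds."""
    return fft_size * decimation / adc_rate

def prf_limit(period):
    """PRF at which one pulse period spans exactly two spectra (assumption)."""
    return 1.0 / (2.0 * period)

def range_from_prf(prf):
    """Range figure of Eq. (1): c / (2 * PRF), in meters."""
    return C / (2.0 * prf)

period = spectrum_period()           # about 170.7 ns, quoted as 171 ns
prf = prf_limit(period)              # about 2.93 MHz, quoted as 2.9 MHz
range_m = range_from_prf(2.9e6)      # about 51.7 m with the rounded PRF
```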

It is possible to increase resolution by implementing pulse compression. The vector accelerator has sufficient throughput to convolve the received signal with the expected signal, and the FFT may be reused to perform an IFFT to recover the compressed signal. At 6 GS/s and by using a single frequency band with a 750 MHz wide linear frequency-modulated (LFM) chirp, this system has a minimum range resolution of 0.2 m.

TABLE I  
COMPARISON OF STATE-OF-THE-ART ASIC SPECTROMETERS

|                      | CICC'09 [10]    | CICC'15 [11]  | CICC'18 [12]    | This Work       |
|----------------------|-----------------|---------------|-----------------|-----------------|
| Technology           | 90nm CMOS       | 65nm CMOS     | 28nm FDSOI      | 16nm FinFET     |
| Bandwidth            | 0.75 GHz        | 1.1 GHz       | **8.5 GHz**     | 3.0 GHz         |
| FFT Size             | **8192 pts**    | 512 pts       | **8192 pts**    | 128 to 512 pts  |
| Integrated ADC       | No              | **Yes**       | No              | **Yes**         |
| Power                | 1500 mW\*       | **188 mW**    | 5200 mW         | 586 mW          |
| ADC Output           | **8 bits**      | 7 bits        | 3 bits          | **8 bits**      |
| Can post-process     | No              | No            | No              | **Yes**         |
| On-chip Accum. Depth | 16M Spectra     | 1024 Spectra  | 65520 Spectra   | **Infinite**    |

\*excludes ADC power

Fig. 8. Measured spectrogram of a 4 $\mu$s pulse at 876 MHz.
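The 0.2 m figure for the LFM chirp is consistent with the standard pulse-compression relation between range resolution $\Delta R$ and chirp bandwidth $B$. The symbols $\Delta R$ and $B$ are introduced here for illustration, and $c$ is rounded to $3 \times 10^{8}$ m/s:

$$\Delta R = \frac{c}{2B} = \frac{3 \times 10^{8}\ \text{m/s}}{2 \times 750 \times 10^{6}\ \text{Hz}} = 0.2\ \text{m}$$

The 750 MHz bandwidth matches one decimated band, since $6\ \text{GS/s} / 8 = 750\ \text{MS/s}$ of complex samples.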

## V. CONCLUSION

This work demonstrates an ASIC, designed by using parameterized digital and analog generators, that achieves a peak efficiency of over 19 TOPS/W and 23 GFLOPS/W in 16 nm CMOS. The implemented RISC-V signal analysis SoC, generated from Chisel and BAG frameworks, performs spectrometry and radar signal processing with performance comparable to the state of the art. On-chip DFT facilitates quick bring-up and validation of the design instance. Generators used in this work are open source and may be easily adjusted and reused for a variety of applications [5].

## ACKNOWLEDGMENT

This work was funded in part by the DARPA CRAFT program (HR0011-16-C-0052), BWRC, and ADEPT (Intel iSTC on Agile Design).

## REFERENCES

- [1] D. L. Rosenband and Arvind, "Hardware synthesis from guarded atomic actions with performance specifications," in *ICCAD*, Nov. 2005.
- [2] O. Shacham, S. Galal, S. Sankaranarayanan *et al.*, "Avoiding game over: Bringing design to the next level," in *DAC*, Jun. 2012.
- [3] B. Nikolic, "Simpler, more efficient design," in *ESSCIRC*, Sep. 2015.
- [4] A. Olofsson, "Epiphany-V: A 1024 processor 64-bit RISC system-on-chip," *CoRR*, vol. abs/1610.01832, 2016. [Online]. Available: <http://arxiv.org/abs/1610.01832>
- [5] <https://github.com/ucb-art/craft2-chip>.
- [6] J. Bachrach, H. Vo, B. Richards *et al.*, "Chisel: Constructing hardware in a Scala embedded language," in *DAC*, Jun. 2012.
- [7] E. Chang, J. Han, W. Bae *et al.*, "BAG2: A process-portable framework for generator-based AMS circuit design," in *CICC*, Apr. 2018.
- [8] E. A. Lee and D. G. Messerschmitt, "Synchronous data flow," *Proceedings of the IEEE*, vol. 75, no. 9, Sep. 1987.
- [9] L. Li, T. Fanni, T. Viitanen *et al.*, "Low power design methodology for signal processing systems using lightweight dataflow techniques," in *DASIP*, Oct. 2016.
- [10] B. Richards, N. Nicolici, H. Chen *et al.*, "A 1.5GS/s 4096-point digital spectrum analyzer for space-borne applications," in *CICC*, Sep. 2009.
- [11] F. Hsiao, A. Tang, Y. Kim *et al.*, "A 2.2GS/s 188mW spectrometer processor in 65nm CMOS for supporting low-power THz planetary instruments," in *CICC*, Sep. 2015.
- [12] S. Bailey, J. Wright, N. McMeekin *et al.*, "A 28nm FDSOI 8192-point digital ASIC spectrometer from a Chisel generator," in *CICC*, Apr. 2018.