

# EL6443 VLSI-II Project: 256x4b MBIST

Vidyut Singh (vps4038), Abrar Fahim (af4175)

**Abstract**—MBIST, or Memory Built-in Self-Test, is an industry-standard for testing embedded memories of a system. The embedded memory of a system is tested using various sets of algorithms to detect different types of faults that may reside inside of memory, such as Stuck At Fault (SAF), Transitional Delay Fault (TDF), Coupling Fault (CF), Neighbourhood Pattern Sensitive Faults (NPSF). Since the method is self-testing, a single start signal is provided and the test is carried out on its own using stored patterns.

## I. PROJECT OVERVIEW

This project aims to design an MBIST for testing a 256 x 4 bit SRAM cell using the checkerboard and blanket of 0 and 1 pattern where both the BIST and the SRAM cell operate on the same clock.

The project's Behavioral description (RTL Design) is written in Verilog, and the RTL design functionality is tested using the HDL compiler and visualization tools – VCS and Verdi. Synthesis and optimization of the design is done using Genus.

## II. INITIAL DESIGN PRINCIPLE AND OPERATION

### A. Design Components



Fig. 1: Schematic of MBIST Dataflow.

The MBIST {Fig. 1} uses the following basic elements in its design to produce a blanket of 0s and 1s, checkerboard and reverse checkerboard patterns:

**Memory** – 256 word x 4 bit SRAM that is going to be tested by the MBIST.

**Counter** – 12-bit counter to generate addresses and decoder inputs. The counter can be considered the heart of the MBIST. It generates the address for the memory and generates the decoder inputs. Based on the input, the decoder generates the patterns, which will be either stored (write function) in the memory module or sent to the comparator to compare data

retrieved (read function) from the memory module. When the counter generates a stop bit, the counter value again resets to zero.

**Decoder** – 4-bit wide decoder to generate 3 types of patterns – ‘0000’, ‘0101’ and ‘0011’ – and their inverse. All decoder patterns are shown in Fig. 2

| din[2:0], din[3] | out_ppln | d_out |
|------------------|----------|-------|
| 00 00            | 0000     | 1111  |
| 00 01            |          | 0000  |
| 00 10            |          | 0000  |
| 00 11            |          | 1111  |
| 01 00            | 0101     | 1010  |
| 01 01            |          | 0101  |
| 01 10            |          | 0101  |
| 01 11            |          | 1010  |
| 10 00            | 0011     | 1100  |
| 10 01            |          | 0011  |
| 10 10            |          | 0011  |
| 10 11            |          | 1100  |
| 11 xx            | xxxx     | xxxx  |

Fig. 2: Patterns generated by decoder.

**BIST Controller** – Two-state FSM that alternates between functional and test mode states. The BIST controller can be considered the brain of the MBIST. The BIST controller switches to test mode on receiving the start signal and waits for the stop bit to be generated to send the reset signal to the counter.

**Multiplexer** – These choose between the functional memory address and data or the MBIST generated values based on the test mode signal.

**Comparator** – 2-input comparator to check that decoder and memory outputs are equal.

### B. Design Parameters

The size of the SRAM is 256 x 4 bit. This implies that there are 256 words, and each word has a data length of 4 bits. Therefore, the address required to access each word in the memory is  $\log_2(256) = 8$  bits in length.

A single bit is used to indicate the Read or Write operation to the system.

The pattern generation decoder uses a 4 bit input. Among the 4 bits, 2 bits are directly used for generating the pattern. Two additional bits are used for generating the checkerboard pattern and the alternation of the checkerboard patterns. Here, the 0th bit of the counter gets reused to allow for alternating patterns to be applied at each memory address.



Fig. 3: Translation of counter values to decoder input, address input, and read/write bit.

Therefore, the total length of the counter = 3 bits + 1 bits + 8 bits = 12 bits. Fig. 3 shows the mapping of counter bits.

### C. Parametrization

The design is parameterized {Fig. 4} – allowing for varying sizes of SRAM modules to be tested. The design supports a minimum word length of 4 bits as that is the smallest output length the decoder can generate. The two parameters used are **wcount** for the word count and **wlength** for the word length.

```
//TOP LEVEL MODULE DESIGN:
//Module Definition:
module TOP #(parameter wcount=256, wlength=4)(
    input wire start, rst, clk, rwbarin,
    input wire [wlength-1:0] datain,
    input wire [$clog2(wcount)-1:0] address,
    output wire [wlength-1:0] dataout,
    output reg fail
);

```

Fig. 4: Parametrizing the MBIST design.

## III. DESIGN IMPROVEMENTS

The initial design provided was extensively pipelined and modified to support the required clock period of **0.6 ns**. Over the courses of the following 4 stages we were not only able to achieve the required clock speed, but also ensure our design operates at 2x the speed as well.

### A. Stage I - RTL Improvements

**Registering module outputs** – The intial design was updated so that combinational elements such as the decoder, multiplexers and comparator have registered outputs to break the a single into smaller paths with lower propagation delay – shown in Fig. 5.

**Pipelining address decode** – The 256 word SRAM was split into smaller 4x64 word SRAM cells and one additional stage of pipelining was added internally for address decode and memory read. Hence, a memory read/write takes a total of 2 clock cycles, shown in Fig. 6.

```
//Comparator:
module comparator #(parameter wlength=4)(
    input wire clk,
    input wire [wlength-1:0] in1, in2,
    output reg out
);
begin
    assign check = (in1 == in2) ? 1'b1 : 1'b0;
    always@(posedge clk) begin
        out <= check;
    end
endmodule

//MUX:
module mux #(parameter length=4)(
    input clk,
    input wire [length-1:0] in0, in1,
    output reg [length-1:0] out
);
begin
    always@(posedge clk) begin
        case(sel)
            'd0: out <= in0;
            'd1: out <= in1;
            default: out <= {length{1'b0}};
        endcase
    end
endmodule
```

Fig. 5: Registering module outputs.

```
always@(posedge clk) begin
    /*Write*/
    mem_sel <= addr[$clog2(wcount)-1:$clog2(wcount)-2];
    mem_addr <= addr[$clog2(wcount)-3:0];
    datain_reg <= datain;
    we_reg <= we;
    if (we_reg) begin
        case(mem_sel)
            'd0: ram1[mem_addr] <= datain_reg;
            'd1: ram2[mem_addr] <= datain_reg;
            'd2: ram3[mem_addr] <= datain_reg;
            'd3: ram4[mem_addr] <= datain_reg;
            default: ; //do nothing
        endcase
    end
    mem_sel_reg <= mem_sel;
    mem_addr_reg <= mem_addr;
end

/* Continuous assignment implies read ready */
This is the natural behavior of the Tri-state RAM
always@(*) begin
    case(mem_sel_reg)
        'd0: dataout = ram1[mem_addr_reg];
        'd1: dataout = ram2[mem_addr_reg];
        'd2: dataout = ram3[mem_addr_reg];
        'd3: dataout = ram4[mem_addr_reg];
        default: ; //do nothing
    endcase
end
```

Fig. 6: Pipelining address decode.

**Pipelining pattern decode** – Decoder logic is also split into two combinational paths with a pipeline, where the first stage decides the pattern to send and the second stage decides whether the pattern needs to alternate. The decoder takes a total of 2 clock cycles to generate the required pattern, as shown in Fig. 7.

```
//CB/RCB DECODER:
module decoder #(parameter wlength=4)(
    input wire clk,
    input wire [3:0] in,
    output reg [wlength-1:0] out
);

reg [wlength-1:0] out_temp_reg;
reg negate;

always@(posedge clk) begin
    case(in[2:1])
        'd0: out_temp_reg <= {wlength/4{4'b0000}};
        'd1: out_temp_reg <= {wlength/4{4'b0101}};
        'd2: out_temp_reg <= {wlength/4{4'b0011}};
        default: out_temp_reg <= {wlength{1'bX}};
    endcase
    negate <= in[3]~^in[0];
    out <= negate ? ~out_temp_reg : out_temp_reg;
end

endmodule
```

Fig. 7: Pipelining pattern decode.

**Additional pipelining** – Additional pipelining was done to ensure that signals arrive at their destination at the desired clock cycle. Fig. 8 shows the cycle-level information of each pipeline stage in the design.



Fig. 8: Final Timing Waveform.



(a) Initialization – sending start bit.

(b) Operation – write and read comparison.

(c) Stop – exiting test mode.

Fig. 9: Simulation Waveforms.

## B. Stage II - Simulation

This design was simulated using VCS/Verdi to verify functionality. Fig. 9 show three waveforms from the initialization, operation and stop stages of the MBIST.

## C. Stage III - Fixing Genus Issues

Genus threw two errors during synthesis – “TUI-61: A required object parameter could not be found” and “Unable to map design without a tristate buffer or inverter. [MAP-1] [map\_library\_sanity\_check]”.

Fixing the TUI-61 error required providing the parameter values during elaboration in Genus using the following command – `elaborate -parameters {256 4}`

The MAP-1 error was solved by using ‘x’ (unknown or indeterminate logic value) instead of ‘z’ (high impedance or tri-state signal) in the default case statements inside the decoder, multiplexer and comparator module.

## D. Stage IV - Fixing Timing Violations

After running synthesis, the Genus output log showed a timing violation of 181ps {Fig. 10} from the memory address register to the comparator.



Fig. 10: Timing Violation 1.

We further granularized the memory module to 8x32 word memories as opposed to 4x64 word memories, and also added a pipeline stage between the memory output and the comparator input to improve timing. The negative slack was improved by 20ps with this update, however the timing violation persists. {Fig. 11}



Fig. 11: Timing Violation 2.

We modified the memory better divide the address decode for memory read. In the first cycle, data from each memory block is stored in a data buffer and the required data buffer is picked in the next cycle. This can be loosely considered as a hierarchical memory. Even with this improvement, we saw negative slack reduced by 25ps, but the timing violation was not resolved. {Fig. 12}

```

Path 1: VIOLATED (-136 ps) Setup Check with Pin RAM_MEM_dataout_buffer6_reg[2]/CLK->O
  Group: clk
  Startpoint: (R) RAM_MEM_mem_addr_reg[2]/CLK
  Clock: (R) clk
  Endpoint: (R) RAM_MEM_dataout_buffer6_reg[2]/O
  Clock: (R) clk

    Capture      Launch
  Clock Edge:+ 1          0
  Src Latency:+ 0          0
  Net Latency:+ 0 (I)     0 (I)
  Arrival:=   1          0

  Setup:-       6
  Required Time:- 5
  Launch Clock:- 0
  Data Path:- 131
  Slack:= -136
#-----#
#-----# Timing Point Flags Arc Edge Cell Fanout Load Trans Delay Arrival Instance
#-----# # (ff) (ps) (ps) (ps) Location
#-----#
  RAM MEM mem_addr reg[2]/CLK - - R (arrival) 376 - 0 0 0 (-,-)
  RAM MEM mem_addr reg[2]/QN - CLK->ON R DFFH0Nx1 ASAP7_75t_SL 3 3.3 23 31 31 0 (-,-)
  g38126/Y - A->Y F INVx2 ASAP7_75t_SL 6 5.3 16 10 41 0 (-,-)
  g38001/Y - A->Y F NOR2xp3 ASAP7_75t_SL 3 1.2 12 15 56 0 (-,-)
  g37951/Y - A->Y R TWD02 ASAP7_75t_SL 2 2.9 10 7 63 0 (-,-)
  g37720 2346/Y - A->Y R ORx2 ASAP7_75t_SL 9 12.4 15 17 80 0 (-,-)
  g37633/Y - A->Y F INVx11 ASAP7_75t_SL 32 27.4 17 10 89 0 (-,-)
  g36278 5115/Y - A->Y R A0T22xp5 ASAP7_75t_SL 1 1.0 18 11 100 0 (-,-)
  g1078 - B->Y R NAND2x1 ASAP7_75t_SL 1 1.5 13 18 110 0 (-,-)
  g117/Y - B->Y R NAND2x1 ASAP7_75t_SL 1 1.2 10 5 124 0 (-,-)
  g116/Y - A->Y R NOR2xp7 ASAP7_75t_SL 1 1.0 12 7 131 0 (-,-)
  RAM MEM dataout buffer6 reg[2]/O <<< - R DFFH0Nx2 ASAP7_75t_SL 1 - - 0 131 0 (-,-)
#-----#

```

Fig. 12: Timing Violation 3.

So, we updated the timing constraint from 0.6 to 1.0 in the SDC file, and even used the parameters to test a significantly smaller 64 word memory. But it still showed timing violation; this made us question the time scale unit of the SDC, which was not mentioned in the SDC file provided in the project description. Therefore, we updated the SDC file by setting the time unit in ns using the following command – `set_units -time ns`.

```

Path 1: MET (323 ps) Late External Delay Assertion at pin dataout[3]
  Group: clk
  Startpoint: (R) RAM_MEM.mem_sel_reg[1]/CLK
  Clock: (R) clk
  Endpoint: (F) dataout[3]
  Clock: (R) clk

    Capture      Launch
  Clock Edge:+ 600          0
  Src Latency:+ 0          10
  Net Latency:+ 0 (I)     0 (I)
  Arrival:= 600          10

  Output Delay:- 150
  Uncertainty:- 10
  Required Time:- 440
  Launch Clock:- 10
  Data Path:- 107
  Slack:= 323
#-----#
#-----# Exceptions/Constraints:
#-----# output delay 150           mbist.sdc line 4
#-----#
#-----# Timing Point Flags Arc Edge Cell Fanout Load Trans Delay Arrival Instance
#-----# # (ff) (ps) (ps) (ps) Location
#-----#
  RAM MEM mem sel reg[1]/CLK - - R (arrival) 3 - 0 0 10 (-,-)
  RAM MEM mem sel reg[1]/QN - CLK->ON R DFFH0Nx1 ASAP7_75t_SL 4 2.9 21 30 40 (-,-)
  g27949 6161/Y - A->Y F NOR2xp3 ASAP7_75t_SL 2 1.8 29 17 57 0 (-,-)
  g27816 6417/Y - A->Y R INVx11 ASAP7_75t_SL 1 0.7 17 7 64 0 (-,-)
  g25846 6417/Y - B->Y F NOR2xp3 ASAP7_75t_SL 4 1.6 48 25 89 0 (-,-)
  g25835 6161/Y - A2->Y R A0T22xp5 ASAP7_75t_SL 1 0.9 23 12 102 0 (-,-)
  g25819 4319/Y - A->Y F NAND4xp25 ASAP7_75t_SL 2 1.3 30 15 117 0 (-,-)
  dataout[3] <<< - F (port) - - - 0 117 0 (-,-)
#-----#

```

Fig. 13: Timing met with clock period = 0.6ns

With this update, we were able to meet the timing for the MBIST with a clock period = 0.6ns. {Fig. 13}

### E. Stage V - Improving Clock Speed

Finally, since we were seeing a large positive slack of 323ps with the SDC update, we decided to increase the clock speed – reducing the clock period to 0.3ns.

Hence, we were able to meet timing even at 2x the required clock speed. {Fig. 14}

```

Path 1: MET (10 ps) Late External Delay Assertion at pin dataout[1]
  Group: clk
  Startpoint: (R) RAM_MEM.mem_sel_reg[2]/CLK
  Clock: (R) clk
  Endpoint: (F) dataout[1]
  Clock: (R) clk

    Capture      Launch
  Clock Edge:+ 300          0
  Src Latency:+ 0          10
  Net Latency:+ 0 (I)     0 (I)
  Arrival:= 300          10

  Output Delay:- 150
  Uncertainty:- 10
  Required Time:- 140
  Launch Clock:- 10
  Data Path:- 112
  Slack:= 18
#-----#
#-----# Exceptions/Constraints:
#-----# output_delay 150           mbist.sdc line 4_1
#-----#
#-----# Timing Point Flags Arc Edge Cell Fanout Load Trans Delay Arrival Instance
#-----# # (ff) (ps) (ps) (ps) Location
#-----#
  RAM MEM mem sel reg[2]/CLK - - R (arrival) 3 - 0 0 10 (-,-)
  RAM MEM mem sel reg[2]/QN - CLK->ON F DFFH0Nx1 ASAP7_75t_SL 3 2.3 18 26 36 (-,-)
  g28030/Y - A->Y R INVx11 ASAP7_75t_SL 2 1.5 8 6 44 0 (-,-)
  g28016/Y - B->Y F NOR2xp3 ASAP7_75t_SL 2 1.8 29 15 59 0 (-,-)
  g27885/Y - A->Y R INVx11 ASAP7_75t_SL 1 0.7 11 7 66 0 (-,-)
  g27715 9315/Y - B->Y F NOR2xp3 ASAP7_75t_L 4 3.6 52 28 94 0 (-,-)
  g2329 8428/Y - B2->Y R AND4xp25 ASAP7_75t_SL 1 0.3 23 16 100 0 (-,-)
  g25846 5197/Y - D->Y F NAND4xp25 ASAP7_75t_SL 2 1.3 31 13 122 0 (-,-)
  dataout[1] <<< - F (port) - - - 0 122 0 (-,-)
#-----#

```

Fig. 14: Timing met with 2x faster clock (period = 0.3ns)

## IV. CONCLUSION

The MBIST designed in this project successfully generated the test pattern required for testing a 256 x 4 bit SRAM cell. The RTL design met all the timing constraints mentioned in the SDC and can even perform at half the time-period mentioned in the SDC.

The entire RTL code, genus timing reports as well as run scripts and input files have been uploaded to GitHub.<sup>1</sup>

## V. ACKNOWLEDGEMENT

This project work would not have been possible without the tremendous efforts provided by the course instructor Prof. Azeez Bhavnagarwala and the Teaching Assistant, Aishwarya Raj, in developing the foundational background knowledge upon which this project stands.

<sup>1</sup><https://github.com/catchyviddy/MBIST-256x4b>