zahrayousefijamarani/HBM_high_bandwidth_memory
HBM High Bandwidth Memory

Paper Wide I/O DRAM [1]:

Problem:

As market trends drive the integration of more features into a single chip, mobile DRAM must provide not only low power consumption but also high capacity and high speed.

Solution:

Stack multiple Wide-I/O memories, each with a large number of I/O pins. Memory density can be expanded by stacking multiple chips using TSV (through-silicon via) technology.

Architecture:

  • single or multiple stacked Wide-I/O DRAMs
  • 1 Gb Wide-I/O DRAM and SEM image of micro bumps
  1. The figure shows the chip architecture with 4 channels and 16 segmented 64 Mb arrays. The whole chip is made up of 4 partitions, symmetric with respect to the chip center; each partition consists of 4 × 64 Mb arrays, peripheral circuits, and micro bumps.
  2. Test pads: Direct access (DA) mode is implemented to support failure analysis in SIP type packages. In DA mode, only 32 pins are needed to test all 4 channels.
  3. All 4 channels are fully independent, and each channel can be configured with 2 or 4 banks with 128 DQs. Column addresses are fixed as CA[0:6], and each channel has a 46 × 6 micro-bump array through which DQ0 to DQ127 are connected.
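As a quick sanity check, the pin and bandwidth figures above are consistent with the 12.8 GB/s headline of [1], assuming a 200 Mb/s per-pin data rate (the per-pin rate is an assumption; only the 4 channels and 128 DQs per channel are stated in the text):

```python
# Back-of-the-envelope check of the Wide-I/O figures above.
channels = 4
dqs_per_channel = 128
pin_rate_gbps = 0.2                                   # Gb/s per pin (assumed)

total_dq_pins = channels * dqs_per_channel            # 512 data pins
bandwidth_GBps = total_dq_pins * pin_rate_gbps / 8    # bits -> bytes

print(total_dq_pins)      # 512
print(bandwidth_GBps)     # 12.8
```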

First standard of HBM (link)

Paper HBM-DRAM [2]:

Problems:

One of the major obstacles is testability: because of their small diameter, it is very difficult to test the micro bumps directly.

In addition, in wide-I/O memory, a chip stacked in a system-in-package (SiP) cannot be detached, so a single cell or TSV failure makes the whole system fail.

Managing power is also a problem: large IR drop and SSO (simultaneous switching output) noise can severely disturb data transmission and reception.

Solutions:

This paper addresses these problems by adding a base logic die beneath the memory stack, which supports better testability and improves reliability.

It is also advantageous to place the PHY interface logic in the base logic die, between the interposer and the stacked DRAM with its thousands of TSVs. The PHY in the base logic die enables repair of chip-to-chip connection failures and provides power redistribution.

Architecture:

  • The fundamental structure of HBM is composed of a 4-Hi (4-high) core DRAM stack and a base logic die at the bottom.

  • The core DRAM consists of two channels, where each channel has 1 Gb density with 128 I/Os and eight independent banks.

  • Each channel of a core DRAM die has independent address and data TSVs with point-to-point (P2P) connections, isolating the operation of every channel.

  • The power and ground of each channel are not isolated and share a common plane.

  • The ballout area is located at the center of the die; its size is about 6 mm × 3 mm.

  • The PHY (physical layer), located at the top of the die, is the main interface between the DRAM and the memory controller.

  • The PHY has a total of eight channels, where each channel consists of one AWORD and four channel-interleaved DWORDs. In total, eight AWORDs and 32 DWORDs are located in the PHY area.

  • MBIST (memory BIST) and IEEE 1500 circuits, implemented in RTL for reliability and testing, are located at the bottom.
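The AWORD/DWORD counts above follow directly from the channel organization, and the 128 GB/s aggregate of [2] is consistent with 128 DQs per channel at an assumed 1 Gb/s per-pin rate (the per-pin rate is not stated in the text):

```python
# Sanity check of the HBM PHY organization: 8 channels, each with
# one AWORD and four channel-interleaved DWORDs.
channels = 8
awords = channels * 1                      # one AWORD per channel
dwords = channels * 4                      # four DWORDs per channel

# Aggregate bandwidth, assuming 128 DQs/channel at 1 Gb/s/pin.
bandwidth_GBps = channels * 128 * 1 / 8

print(awords, dwords, bandwidth_GBps)      # 8 32 128.0
```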

Paper HBM-GEN2(HBM2) [3]:

Problems:

higher bandwidth

Solutions:

Double the bandwidth from 128 GB/s to more than 256 GB/s.

Support pseudo-channel mode and 8H stacks.


Architecture:

  • In pseudo-channel mode, a legacy channel is divided into two pseudo channels that share the command/address pins. Thus, one HBM stack has 16 pseudo channels instead of 8 legacy channels.

  • In the 4H/8H case, each die of the HBM is composed of two channels, and each channel has two pseudo channels (PC0/PC1), each consisting of 16 banks (4 bank groups, 4 banks per group).

  • 1 channel = 2 pseudo channels = 2 × (4 bank groups) = 2 × (4 × 4 banks) = 32 banks

  • 128-bit data width per bank

  • In the 2H case, one pseudo channel is divided into two separate channels, each with eight banks, which is necessary to keep the same bandwidth as in the 4H/8H case.
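The bank arithmetic in the bullets above can be checked directly. The channel, pseudo-channel, and bank counts are from the text; the 2 Gb/s per-pin rate is an assumption chosen to match the "more than 256 GB/s" target:

```python
# Bank arithmetic for HBM2 pseudo-channel mode.
legacy_channels = 8
pseudo_per_channel = 2
bank_groups_per_pseudo = 4
banks_per_group = 4

pseudo_channels = legacy_channels * pseudo_per_channel   # 16 pseudo channels
banks_per_channel = (pseudo_per_channel
                     * bank_groups_per_pseudo
                     * banks_per_group)                  # 32 banks per channel

# Aggregate bandwidth at an assumed 2 Gb/s/pin with 128 DQs per channel.
bandwidth_GBps = legacy_channels * 128 * 2 / 8

print(pseudo_channels, banks_per_channel, bandwidth_GBps)   # 16 32 256.0
```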

Paper HBM-PIM [4]:

Problems:

The energy consumed by interconnections limits the scaling of system performance: on-chip interconnect grows with the enormous accelerator size, and expanding DRAM bandwidth incurs a power overhead.

Solutions:

Processing-in-memory (PIM) architecture.


Architecture:

  • Embedding processing units into a logic base.

  • PIM core for the proposed PIM-HBM architecture is assumed to be a streaming multiprocessor (SM), which is a unit core of a graphics processing unit (GPU).


SM Architecture [5]:


Paper HBM2E [6]:

Problems:

higher bandwidth

Solutions:

Increase the bandwidth up to 640 GB/s (5 Gb/s/pin).

stable bit-cell operation
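The 640 GB/s figure follows from the 5 Gb/s/pin rate if the stack exposes the standard 1024-DQ HBM interface (the 1024-pin count is an assumption; only 640 GB/s and 5 Gb/s/pin are stated above):

```python
# HBM2E bandwidth check: per-pin rate times DQ count.
dq_pins = 1024                              # assumed standard HBM DQ count
pin_rate_gbps = 5                           # 5 Gb/s per pin, from the text

bandwidth_GBps = dq_pins * pin_rate_gbps / 8   # bits -> bytes

print(bandwidth_GBps)                       # 640.0
```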

Architecture:



Paper FIM-DRAM [7]:

Problems:

GPUs and TPUs can operate at peak performance only when they receive data from memory as quickly as they process it, which requires off-chip memory with high bandwidth and large capacity. HBM has so far met the bandwidth and capacity requirements, but recent AI workloads such as recurrent neural networks demand an even higher bandwidth than HBM provides.

Further increase in off-chip bandwidth is often limited by power constraints.

Solutions:

Decrease the demand for off-chip bandwidth with unconventional architectures such as processing-in-memory.

Integrating a 16-wide single-instruction multiple-data engine within the memory banks.

Architecture:

  • Half of the cell array in each bank was removed and replaced with the programmable computing unit (PCU).
  • Two banks share one PCU, and there are 8 PCUs per pseudo-channel.
  • In normal mode, only read and write operations are available. In FIM mode, data can be moved from the PCU block to the cells or from the cells to the PCU block, or written directly into the PCU registers.
  • Architecture of PCU is shown below:

    • The register group consists of:
      1. A command-register file (CRF) for instruction memory
      2. A general-purpose register file (GRF) for weight and accumulation
      3. A scalar register file (SRF) to store constants for MAC operation
    • The PCU is controlled by conventional memory commands from the host via newly implemented control paths, enabling in-DRAM computation.
    • The execution unit in the PCU uses a variable four-stage pipeline.
    • The execution unit has a one-stage pipeline for the JUMP instruction, which supports looping.
    • Computed results from the execution units are directed by the destination operand and stored in the GRF.
    • The PCU can store 32 instructions in the CRF.
  • Flow of FIM-DRAM
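A toy sketch may make the register organization concrete. The CRF/GRF/SRF split and the 32-instruction CRF limit come from the text; the instruction format, register counts, and MAC semantics below are illustrative assumptions, not the paper's actual ISA:

```python
# Toy model of the PCU register organization described above.
CRF_DEPTH = 32

crf = []              # command-register file: instruction memory
grf = [0.0] * 16      # general-purpose register file (entry count assumed)
srf = [2.0]           # scalar register file: constants for MAC (value illustrative)

def load_program(instrs):
    """Load up to 32 instructions into the CRF."""
    assert len(instrs) <= CRF_DEPTH, "CRF holds at most 32 instructions"
    crf[:] = list(instrs)

def step(pc):
    """Execute one instruction; results go to the GRF via the dst operand."""
    op, dst, src = crf[pc]
    if op == "MAC":                       # grf[dst] += grf[src] * srf[0]
        grf[dst] += grf[src] * srf[0]
    return pc + 1                         # JUMP/looping omitted in this sketch

grf[1] = 3.0
load_program([("MAC", 0, 1)])
step(0)
print(grf[0])         # 6.0
```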

Paper HBM3 [8]:

Problem:

It is notable that, despite a thousandfold increase in system-level performance over the past decade, memory performance increased by only 125× over the same period.

Architecture:

  • 12-high die stack and 16 independent channels.
  • The proposed architecture includes four channels per core die with four slices forming a single unit rank, or a total of 12 slices corresponding to three ranks.
  • A unit channel is subdivided into two pseudo channels, where each pseudo channel consists of 16 banks in four bank groups with 160-Mb density, including 16 Mb of parity cells for ECC and 8 Mb of metadata cells for a system-level reliability check.
  • Each bank group has 272 dedicated I/O lines, and four sets of BK IOs share two sets of global lines located in the TSV area with time-division multiplexing.
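The rank and slice figures above can be cross-checked directly; the slice and rank counts are from the text, while the 1024-DQ interface and 7 Gb/s per-pin rate are assumptions chosen to match the 896 GB/s headline of [8]:

```python
# HBM3 stack arithmetic for the 12-high device described above.
slices = 12
slices_per_rank = 4                          # four slices form a unit rank
ranks = slices // slices_per_rank            # 3 ranks

dq_pins = 1024                               # assumed standard HBM DQ count
pin_rate_gbps = 7                            # assumed, to match 896 GB/s
bandwidth_GBps = dq_pins * pin_rate_gbps / 8

print(ranks, bandwidth_GBps)                 # 3 896.0
```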

Paper HBM3 [9]:

Problem:

performance and reliability

Architecture:


References:

[1] Kim JS, Oh CS, Lee H, Lee D, Hwang HR, Hwang S, Na B, Moon J, Kim JG, Park H, Ryu JW. A 1.2 V 12.8 GB/s 2 Gb mobile wide-I/O DRAM with 4x128 I/Os using TSV based stacking. IEEE Journal of Solid-State Circuits. 2011 Sep 23;47(1):107-16.

[2] Lee DU, Kim KW, Kim KW, Lee KS, Byeon SJ, Kim JH, Cho JH, Lee J, Chun JH. A 1.2 V 8 Gb 8-channel 128 GB/s high-bandwidth memory (HBM) stacked DRAM with effective I/O test circuits. IEEE Journal of Solid-State Circuits. 2014 Oct 14;50(1):191-203.

[3] Sohn K, Yun WJ, Oh R, Oh CS, Seo SY, Park MS, Shin DH, Jung WC, Shin SH, Ryu JM, Yu HS. A 1.2 V 20 nm 307 GB/s HBM DRAM with at-speed wafer-level IO test scheme and adaptive refresh considering temperature distribution. IEEE Journal of Solid-State Circuits. 2016 Sep 13;52(1):250-60.

[4] Kim S, Kim S, Cho K, Shin T, Park H, Lho D, Park S, Son K, Park G, Kim J. Processing-in-memory in high bandwidth memory (PIM-HBM) architecture with energy-efficient and low latency channels for high bandwidth system. In2019 IEEE 28th Conference on Electrical Performance of Electronic Packaging and Systems (EPEPS) 2019 Oct 6 (pp. 1-3). IEEE.

[5] NVIDIA. "NVIDIA Tesla V100 GPU Architecture." White paper, 2017 (link).

[6] Oh CS, Chun KC, Byun YY, Kim YK, Kim SY, Ryu Y, Park J, Kim S, Cha S, Shin D, Lee J. 22.1 A 1.1 V 16GB 640GB/s HBM2E DRAM with a data-bus window-extension technique and a synergetic on-die ECC scheme. In2020 IEEE International Solid-State Circuits Conference-(ISSCC) 2020 Feb 16 (pp. 330-332). IEEE.

[7] Kwon YC, Lee SH, Lee J, Kwon SH, Ryu JM, Son JP, O S, Yu HS, Lee H, Kim SY, Cho Y. 25.4 A 20nm 6GB function-in-memory DRAM, based on HBM2 with a 1.2TFLOPS programmable computing unit using bank-level parallelism, for machine learning applications. In2021 IEEE International Solid-State Circuits Conference (ISSCC) 2021 Feb 13 (pp. 350-352). IEEE.

[8] Park MJ, Lee J, Cho K, Park J, Moon J, Lee SH, Kim TK, Oh S, Choi S, Choi Y, Cho HS. A 192-Gb 12-high 896-Gb/s HBM3 DRAM with a TSV auto-calibration scheme and machine-learning-based layout optimization. IEEE Journal of Solid-State Circuits. 2022 Aug 17;58(1):256-69.

[9] Ryu Y, Ahn SG, Lee JH, Park J, Kim YK, Kim H, Song YG, Cho HW, Cho S, Song SH, Lee H. A 16 GB 1024 GB/s HBM3 DRAM With Source-Synchronized Bus Design and On-Die Error Control Scheme for Enhanced RAS Features. IEEE Journal of Solid-State Circuits. 2023 Jan 4;58(4):1051-61.

Definitions:

  • TSV: In electronic engineering, a through-silicon via (TSV) or through-chip via is a vertical electrical connection (via) that passes completely through a silicon wafer or die.
  • DQ: DQ is the label for the data pins in DDR circuitry. The DQ pins serve as inputs during a write operation and outputs during a read. The memory's data bus is bidirectional, so a data pin can be D when it is an input or Q when it is an output, hence the name DQ.

About

Repo of high bandwidth memory papers, codes, ...
