“Ideally one would desire an indefinitely large memory capacity such that any particular… word would be immediately available… We are… forced to recognize the possibility of constructing a hierarchy of memories each of which has greater capacity than the preceding but which is less quickly accessible.”
– A. W. Burks, H. H. Goldstine, and J. von Neumann, Preliminary Discussion of the Logical Design of an Electronic Computing Instrument (1946).
In this post, I summarize Chapter 2—Memory Hierarchy Design—from Computer Architecture: A Quantitative Approach by Hennessy and Patterson.
Introduction to Memory Hierarchy
Modern computer systems aim to balance performance and cost in the face of inherent memory limitations. Ideally, a system would have unlimited fast memory, but this is not practically achievable due to cost, power, and physical constraints. Instead, architects build a memory hierarchy that optimizes for access time, bandwidth, and efficiency.
Core Ideas of Memory Hierarchy
- The design of memory systems takes advantage of the principle of locality:
- Temporal locality: Recently accessed data is likely to be used again soon.
- Spatial locality: Data near recently accessed addresses is likely to be used soon.
- These allow for non-uniform memory access patterns that are still predictable.
- The hierarchy exploits cost-performance trade-offs in memory technologies:
- Fast memory (e.g., SRAM) is expensive per byte.
- Cheap memory (e.g., DRAM, SSD) is slower.
- Goal: Create a system that is:
- Almost as cheap as the lowest-cost memory.
- Almost as fast as the fastest memory.
Inclusion Property
- In most hierarchies, lower levels (like L3 or main memory) contain all the data from higher levels (L1/L2).
- This inclusion property simplifies consistency and coherency management, though it is not universal.
The Performance Gap Problem
- Since 1980, the performance of processors has dramatically outpaced memory latency improvements.
- This growing gap makes efficient memory hierarchy design crucial.
- For example, in 2017, companies like AMD, Intel, and NVIDIA began deploying High Bandwidth Memory (HBM) to reduce this gap.

Figure 2.2. Starting with 1980 performance as a baseline, the gap in performance, measured as the difference in the time between processor memory requests (for a single processor or core) and the latency of a DRAM access (assuming a single DRAM and a single memory bank), is plotted over time. In mid-2017, AMD, Intel and Nvidia all announced chip sets using versions of HBM technology.
Bandwidth Demands in Multicore Systems
Consider the Intel Core i7-6700 as a real-world case:
- Each core generates two memory references per core-cycle
- With 4 cores at 4.2 GHz:
- 32.8 billion 64-bit data refs/sec = ~244 GiB/sec
- 12.8 billion 128-bit instruction refs/sec = ~191 GiB/sec
- Total theoretical demand: Over 400 GiB/sec
- Actual DRAM bandwidth is much lower: ~32.1 GiB/sec
- The memory system must bridge this massive bandwidth mismatch; a quick sketch of the arithmetic follows below.
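This back-of-the-envelope Python sketch reproduces the peak-demand numbers in the list above; the reference rates and widths come from the i7-6700 example, and it is rough arithmetic rather than a measurement:

```python
# Back-of-the-envelope peak bandwidth demand for the i7-6700 example above.
GIB = 2**30  # bytes per GiB

data_refs_per_sec = 32.8e9    # 64-bit data references per second
instr_refs_per_sec = 12.8e9   # 128-bit instruction references per second

data_bw = data_refs_per_sec * 8      # 64 bits = 8 bytes
instr_bw = instr_refs_per_sec * 16   # 128 bits = 16 bytes

print(f"data demand:  {data_bw / GIB:.0f} GiB/s")               # ~244 GiB/s
print(f"instr demand: {instr_bw / GIB:.0f} GiB/s")              # ~191 GiB/s
print(f"total demand: {(data_bw + instr_bw) / GIB:.0f} GiB/s")  # well over 400 GiB/s
```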
Hardware Solutions to Bandwidth and Latency
- Multi-porting and pipelining of caches
- Three-level cache hierarchy:
- Private L1 & L2 per core
- Shared L3
- Some systems include L4 DRAM caches (stacked or embedded DRAM)
- Split L1 for instruction and data
Design Focus Areas
Designers aim to optimize:
- Average Memory Access Time (AMAT) – see the sketch after this list
- Cache access time
- Miss rate
- Miss penalty
- Power consumption:
- Static power (leakage)
- Dynamic power (access, switching)
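These quantities combine in the standard relation AMAT = hit time + miss rate × miss penalty. The sketch below uses illustrative, assumed values rather than figures from the chapter:

```python
def amat(hit_time_ns, miss_rate, miss_penalty_ns):
    """Average Memory Access Time = hit time + miss rate * miss penalty."""
    return hit_time_ns + miss_rate * miss_penalty_ns

# Illustrative (assumed) values: 1 ns L1 hit, 2% miss rate, 60 ns miss penalty.
print(amat(hit_time_ns=1.0, miss_rate=0.02, miss_penalty_ns=60.0), "ns")  # 2.2 ns
```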
Basics of Memory Hierarchies: A Quick Review
- Cache Concepts:
- Hit / Miss, Block/Line, Tag – see the address-splitting sketch after this list
- Spatial & Temporal Locality
- Set associativity:
- Direct-mapped, n-way set associative, Fully-associative
- Write Policies:
- Write-through: Write to memory on every store
- Write-back: Write to memory only when evicted
- Write buffers: Hide write latency
- Three Cs of Misses:
- Compulsory: First access to a block
- Capacity: Cache can’t hold all active data
- Conflict: Limited associativity causes evictions
- 4th C: Coherency – Arises in multithreaded/multicore systems
- Covered in Chapter 5, “Multiprocessors and Thread-Level Parallelism”
- Misses per instruction is a better measure than miss rate alone
- Best metric: Average Memory Access Time (AMAT)
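To tie the cache concepts above together, the sketch below splits a byte address into tag, index, and block offset for a hypothetical direct-mapped cache; the 32 KiB size and 64-byte blocks are assumed for illustration only:

```python
# Splitting an address for a direct-mapped cache (illustrative parameters).
CACHE_SIZE = 32 * 1024   # 32 KiB (assumed)
BLOCK_SIZE = 64          # 64-byte blocks (assumed)
NUM_BLOCKS = CACHE_SIZE // BLOCK_SIZE       # 512 blocks
OFFSET_BITS = (BLOCK_SIZE - 1).bit_length() # 6 bits of block offset
INDEX_BITS = (NUM_BLOCKS - 1).bit_length()  # 9 bits of index

def split_address(addr):
    offset = addr & (BLOCK_SIZE - 1)                  # byte within the block
    index = (addr >> OFFSET_BITS) & (NUM_BLOCKS - 1)  # which cache block to check
    tag = addr >> (OFFSET_BITS + INDEX_BITS)          # compared on lookup to detect a hit
    return tag, index, offset

print(split_address(0x12345678))  # (9320, 345, 56)
```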
Latency Tolerance Techniques
Modern processors use various techniques to hide or tolerate memory latency:
- Speculative execution
- Multithreading
Six Key Cache Optimizations
- Larger block size – reduces compulsory misses
- Bigger caches – reduces capacity misses
- Higher associativity – reduces conflict misses
- Multi-level caches – hides miss penalty from higher levels
- Prioritizing read misses over writes – reduces the latency seen by loads; write buffers let stores complete without stalling, and a read miss checks the buffer and forwards matching data to avoid RAW hazards
- Avoid address translation (e.g., TLB lookups) during cache indexing to reduce hit time
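One common way to avoid translation on the critical path is a virtually indexed, physically tagged (VIPT) L1: if the index and block-offset bits all lie within the page offset, the cache can be indexed in parallel with the TLB lookup. The arithmetic below uses assumed parameters (4 KiB pages, 64-byte blocks, a 32 KiB 8-way L1) to check when that condition holds:

```python
import math

PAGE_SIZE = 4096         # 4 KiB pages (assumed)
CACHE_SIZE = 32 * 1024   # 32 KiB L1 (assumed)
BLOCK_SIZE = 64          # 64-byte blocks (assumed)
WAYS = 8                 # 8-way set associative (assumed)

sets = CACHE_SIZE // (BLOCK_SIZE * WAYS)                     # 64 sets
index_plus_offset_bits = int(math.log2(sets * BLOCK_SIZE))   # 12 bits
page_offset_bits = int(math.log2(PAGE_SIZE))                 # 12 bits

# True: the cache index comes entirely from untranslated page-offset bits,
# so indexing can overlap the TLB lookup.
print(index_plus_offset_bits <= page_offset_bits)
```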
Memory Technology and Optimization
Modern memory systems aim to balance latency, bandwidth, power, and cost. This section reviews the underlying memory technologies and key architectural strategies for improving memory performance.
Memory Latency Metrics
- Access Time: Time from sending a read request to when the data arrives.
- Cycle Time: Minimum time between two independent memory requests.
1. SRAM Technology
Static RAM (SRAM) is fast, does not require refresh, and has nearly equal access and cycle times.
- Structure: 6 transistors per bit.
- Usage: Typically used for caches.
- Previously in separate chips.
- Now implemented as on-chip memory, e.g., up to 60 MB of shared cache on 24-core CPUs.
Typical Cache Access Latency and Capacity (Server Scale):
Level | Access Time | Size |
---|---|---|
L1 | ~1 ns | ~64 KB |
L2 | 3–10 ns | ~256 KB |
L3 | 10–20 ns | ~16–64 MB |
DRAM | 50–100 ns | 32–256 GB |
Why Larger Caches Are Slower and Power-Hungry:
- Access time ∝ number of blocks (excluding hit detection and way selection in a set-associative cache)
- Static power ∝ number of bits
- Dynamic power ∝ number of blocks

(Figures from the UC Berkeley lecture: the six-transistor SRAM cell, general SRAM structure, address decoder structure, and building larger memories.)
2. DRAM Technology
Dynamic RAM (DRAM) uses capacitors and requires regular refreshing.
- 1 transistor + 1 capacitor per bit
- Sensing a charge above the halfway reference reads as 1
- Sensing a charge below the halfway reference reads as 0
- Addressing: Split into two stages
- The first half of the address: RAS (Row Access Strobe) for row
- The other half: CAS (Column Access Strobe) for column
- Reading:
- Accessing a row (RAS) in DRAM loads it into a row buffer, after which CAS signals select the desired columns to be read.
- Reading a row is destructive, so its contents must be restored (written back) from the row buffer before the row is closed
- Refreshing:
- To maintain data integrity, DRAM cells must periodically refill the charges that naturally leak over time.
- All rows must be refreshed within a 64 ms window, and refresh typically consumes less than 5% of total memory access time (a rough estimate follows below).
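A rough sanity check of the under-5% claim, using assumed device parameters (8192 rows, a 64 ms refresh window, roughly 50 ns to refresh one row):

```python
# Rough DRAM refresh overhead estimate (all parameters are assumptions).
rows = 8192                  # rows that must be refreshed per bank
refresh_window_s = 64e-3     # every row refreshed within 64 ms
row_refresh_time_s = 50e-9   # ~50 ns to refresh one row (assumed)

overhead = rows * row_refresh_time_s / refresh_window_s
print(f"refresh overhead: {overhead:.2%}")  # 0.64%, comfortably under 5%
```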
3. Improving DRAM Performance: SDRAMs
Synchronous DRAM (SDRAM) adds a clock signal to coordinate with the memory controller, whereas earlier DRAMs operated asynchronously and incurred synchronization overhead with the controller on every transfer.
- Burst transfer mode allows multiple words to be accessed per request
- Wider DRAM interfaces (4-, 8-, or 16-bit buses in DDR2 and DDR3) increase memory bandwidth – see the peak-bandwidth sketch after this list
- DDR (Double Data Rate) DRAM transfers data on both rising and falling clock edges
- Multiple banks enable:
- Interleaved and overlapped access to different banks
- Improved access time
- Better power management
- Address = Bank number + Row address + Column address
- Access Steps in SDRAMs:
- The memory controller first activates a bank and a specific row (via the Row Address Strobe, or RAS).
- It then sends the column address (CAS) to select the desired data.
- Depending on the request, the system reads a single item or initiates a burst transfer for multiple items.
- Pre-charge operation
- After accessing a row, the DRAM requires a pre-charge operation to close the current row before accessing another row in the same bank, which introduces a pre-charge delay.
- However, if subsequent accesses target different rows in different banks, they can be overlapped, avoiding this delay and improving memory throughput.
- DIMMs (Dual Inline Memory Modules) group 4–16 DRAM chips
- Typical width: 8 bytes + ECC
- Power Saving Techniques:
- Lower voltages (e.g., 1.2V in DDR4)
- Multi-bank: Only one row active per bank
- Power-down mode (clock gated except for refresh)
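As a quick illustration of how transfer rate and interface width combine, here is a sketch of the peak bandwidth of a single DIMM; the DDR4-2400 transfer rate is an assumed example, and the 8-byte width matches the typical DIMM width noted above:

```python
# Peak DIMM bandwidth = transfers per second * bytes per transfer (assumed example).
transfers_per_sec = 2400e6   # DDR4-2400: 2400 million transfers per second
dimm_width_bytes = 8         # typical DIMM data width, excluding ECC bits

peak_bw = transfers_per_sec * dimm_width_bytes
print(f"peak DIMM bandwidth: {peak_bw / 1e9:.1f} GB/s")  # 19.2 GB/s
```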
4. Graphics DRAMs (GDDR)
Graphics DRAMs, a special class of DRAMs based on SDRAM designs, are optimized for bandwidth rather than latency.
- GDDR5 design is based on DDR3
- Wider interface (32-bit vs. 4, 8, 16-bit typical designs)
- Directly soldered to GPU (not expandable)
- 2–5× bandwidth vs DDR3
5. Packaging Innovation: HBM
High Bandwidth Memory (HBM) integrates DRAM closely with compute chips.
- Stacked or embedded DRAM in the same package
- Lower latency
- Higher bandwidth
- Successor to GDDR5
Packaging Techniques:
- 3D Stacking
- DRAM dies on top of CPU die using solder bumps
- Requires heat dissipation
- e.g., 8 chips, 8 GiB, 1 TB/s
- 2.5D Stacking
- DRAM stacked on interposer substrate
- Less thermal coupling, still high BW
6. Flash Memory
Flash is a type of EEPROM and is therefore nonvolatile.
- NAND Flash is higher density than NOR Flash
- Drawbacks:
- Reads are sequential and page-based (vs. random access)
- Slow writes (erase-before-write, block-level erasure)
Flash vs DRAM:
Feature | Flash | DRAM |
---|---|---|
Read Latency (2 KiB) | ~75 µs (~300× faster than HDD) | ~500 ns (~150× faster than Flash) |
Write Latency | ~1500× slower than DRAM (erase before overwrite) | Fast |
Nonvolatile | Yes | No |
Write cycles | ~100,000 | Effectively unlimited |
Cost | ~10× cheaper | More expensive |
Flash Controller Responsibilities:
- Redundant block mapping
- Page-level transfer and caching
- Write leveling (to extend lifespan)
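To illustrate the idea behind write leveling, here is a minimal sketch of a controller that remaps each logical-block write to the least-worn physical block; it is a toy model under assumed behavior, not how a real flash translation layer is implemented:

```python
# Toy write-leveling sketch: spread erases evenly across physical blocks.
class WriteLevelingMapper:
    def __init__(self, num_physical_blocks):
        self.erase_counts = [0] * num_physical_blocks
        self.mapping = {}  # logical block -> physical block

    def write(self, logical_block):
        # Pick the physical block with the fewest erases so wear stays even.
        phys = min(range(len(self.erase_counts)), key=self.erase_counts.__getitem__)
        self.erase_counts[phys] += 1
        self.mapping[logical_block] = phys
        return phys

mapper = WriteLevelingMapper(num_physical_blocks=4)
for _ in range(8):
    mapper.write(logical_block=0)   # repeatedly rewriting the same logical block
print(mapper.erase_counts)          # [2, 2, 2, 2] -- wear is spread evenly
```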
7. Phase-Change Memory (PCM)
Phase-change memory uses heat to switch between crystalline and amorphous states.
- Bits sit at the crosspoints of a 2-D array (related resistive technologies include memristors)
- Read: Measure resistance
- Write: Apply current to change phase
Intel and Micron’s 3D XPoint memory (2017) offered:
- Much better write endurance than Flash
- 2–3× faster reads
8. Enhancing Memory System Dependability
Memory errors can be hard (permanent) or soft (transient).
ECC and Protection:
- I$ (Instruction cache) → Parity bit only
- D$ + Main Memory → Error Correction Code (ECC)
- Typical overhead: 1 ECC bit per 8 data bits (e.g., 64-bit word → 8 ECC bits)
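The 1-ECC-bit-per-8-data-bits overhead follows from Hamming SEC-DED arithmetic: a single-error-correcting code needs k check bits with 2^k ≥ data bits + k + 1, plus one extra bit for double-error detection. A small sketch of that calculation:

```python
def secded_check_bits(data_bits):
    """Minimum check bits for a single-error-correct, double-error-detect code."""
    k = 0
    while 2**k < data_bits + k + 1:   # Hamming bound for single-error correction
        k += 1
    return k + 1                      # plus one parity bit for double-error detection

for width in (8, 16, 32, 64):
    print(f"{width}-bit word -> {secded_check_bits(width)} check bits")
# 64-bit word -> 8 check bits, i.e., the 1-per-8 overhead above
```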
Chipkill:
- Redundant ECC scheme similar to RAID
- Can recover from entire chip failure
- Used in IBM and SUN servers and in Google clusters
- Intel calls it SDDC
Example Error Rates (10,000 nodes, 4 GiB each):
Protection | Frequency of Unrecoverable Errors |
---|---|
Parity Only | Every ~17 minutes |
ECC Only | Every ~7.5 hours |
Chipkill | Every ~2 months |
Reference
- Chapter 2 in Computer Architecture: A Quantitative Approach (6th ed.) by Hennessy and Patterson (2017)
- CS250 VLSI Systems Design Lecture 8: Memory by John Wawrzynek, Krste Asanovic, with John Lazzaro and Yunsup Lee (TA), CS250, UC Berkeley, Fall 2010
- EE241 – Spring 2011, Advanced Digital Integrated Circuits, Lecture 9: SRAM