Helpful Papers for Learning AI Accelerators

Jong Hoon

When I first started exploring the world of AI accelerators, I found it helpful to read a set of well-written papers and project reports. They introduced me to key ideas such as roofline models, blocking strategies, dataflow mapping, sparsity, mixed precision, and real chip designs.

This list is organized by publication date, with short notes on why each work is useful. I hope it serves as a helpful guide for anyone just beginning to learn how AI accelerators work.

2009 — The Roofline Model

Berkeley Roofline: An Insightful Visual Performance Model
A classic performance model that shows the balance between memory bandwidth and compute capacity. Very approachable for visualizing bottlenecks.

2016 — Early CNN Accelerators and ISA

MIT Eyeriss (ISCA ’16)
Introduces the Row-Stationary dataflow, a gentle entry point into how data movement dominates energy use.

Cambricon ISA (ISCA ’16)
A domain-specific instruction set for neural networks, showing how hardware/software co-design starts at the ISA level.

A Systematic Approach to Blocking CNNs (2016, arXiv)
Explains how simple loop blocking can drastically reduce DRAM traffic. Clear diagrams make it easy to follow.

Cambricon-X (MICRO ’16)
Shows how sparsity can be used to cut down unnecessary operations and memory use.

2017 — Google’s TPU

In-Datacenter Performance Analysis of a TPU (ISCA ’17)
A real-world example of how Google built an inference ASIC. The paper is accessible and provides a concrete view of performance and memory limits.

2018 — Beyond Dense CNNs

Interstellar (ASPLOS ’20, preprint ’18)
Uses Halide’s scheduling language to describe different accelerator dataflows in a unified way. Helpful for beginners because it shows many designs side by side.

Cambricon-S (MICRO ’18)
Tackles the irregularity of sparse networks with a mix of software and hardware ideas.

2019 — Modeling Dataflow and Reuse

MAESTRO (MICRO ’19)
A tool that helps quantify reuse, buffer sizes, and bandwidth. A friendly starting point for understanding the notion of dataflow and mapping strategies.

2021 — Mixed Precision and Mobile NPUs

Rethinking Floating Point Overheads (MLSys ’21)
Explains how mixed precision can be supported efficiently. Good for learning why hardware doesn’t always stick to one number format.

Samsung Reconfigurable NPU (ISCA ’21)
A commercial example: how Samsung built a flexible NPU for mobile devices.

Griffin (HPCA ’22, preprint ’21)
A clear explanation of how to handle both sparse and dense models efficiently.

2024 — Meta MTIA

Meta MTIA v2 (2024 blog)
Meta’s own custom accelerator for recommendation systems. The blog post format makes it approachable without too much jargon.

2025 — Extending Rooflines for ML Accelerators

How to Keep Pushing ML Accelerator Performance? Know Your Rooflines! (2025)
An updated look at the roofline model, now applied to modern ML accelerators, covering mixed precision and new memory technologies.

Closing Thoughts

This list isn’t meant to be a set of “must-reads,” but rather a collection of papers I personally enjoyed while building my own understanding of AI accelerators. It started as my own reading list, and over time I found that these works gave me both the theory and the real-world perspective—from the early Roofline model and dataflow representations all the way to modern ASICs like TPU, Samsung’s NPU, and Meta’s MTIA.

Looking back, I think these papers made it much easier for me to take my first steps with confidence. I hope that by sharing them, they can also serve as a gentle guide for anyone just beginning their own journey into the world of AI accelerators.

Helpful Papers for Learning AI Accelerators

2009 — The Roofline Model

2016 — Early CNN Accelerators and ISA

2017 — Google’s TPU

2018 — Beyond Dense CNNs

2019 — Modeling Dataflow and Reuse

2021 — Mixed Precision and Mobile NPUs

2024 — Meta MTIA

2025 — Extending Rooflines for ML Accelerators

Closing Thoughts

Editor's Pick

Helpful Papers for Learning AI Accelerators

Multiprocessors and Thread-Level Parallelism– Notes from Hennessy & Patterson

Data-Level Parallelism – Notes from Hennessy & Patterson

Leave A Comment

Related Posts

Helpful Papers for Learning AI Accelerators

Multiprocessors and Thread-Level Parallelism– Notes from Hennessy & Patterson

Data-Level Parallelism – Notes from Hennessy & Patterson