Domain-Specific Hardware Accelerators

Deep learning operations

  • A guide to convolution arithmetic for deep learning, Vincent Dumoulin, 2016 pdf_annotated (see the output-size sketch after this list)
  • Inception module in GoogLeNet pdf_annotated
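
As a companion to Dumoulin's arithmetic guide, a minimal sketch (Python; shapes illustrative) of the standard output-size relation o = floor((i + 2p - k)/s) + 1:

    import math

    def conv_output_size(i, k, s=1, p=0):
        """Spatial output size of a convolution along one dimension.
        i: input size, k: kernel size, s: stride, p: zero padding."""
        return math.floor((i + 2 * p - k) / s) + 1

    print(conv_output_size(32, 3, s=1, p=1))   # 32: 'same' convolution
    print(conv_output_size(224, 7, s=2, p=3))  # 112: strided conv halves the map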

Review on DL acceleration

  • Matrix Computation link
  • Efficient Methods and Hardware for Deep Learning, Stanford cs231n, Song Han, 2017 link
  • Efficient Processing of Deep Neural Networks: A Tutorial and Survey, Sze, MIT pdf (see the MAC-count sketch after this list)
  • Tutorial on Deep Learning Acceleration, Vivienne Sze, MIT link
  • High-Performance Hardware for Machine Learning, W. Dally, 2015 link
  • Acceleration of Deep Learning in Algorithm & CMOS-based Hardware by me (2019) pdf
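
A quantity these surveys keep returning to (Sze's tutorial in particular) is the multiply-accumulate (MAC) count of a layer, since energy and throughput are typically reported per MAC. A minimal sketch with illustrative shapes:

    def conv_macs(h_out, w_out, c_out, c_in, k):
        """MACs for one conv layer: each output element costs
        c_in * k * k multiply-accumulates (bias ignored)."""
        return h_out * w_out * c_out * c_in * k * k

    # Illustrative layer: 3x3 conv, 64 -> 128 channels, 56x56 output map.
    print(f"{conv_macs(56, 56, 128, 64, 3) / 1e9:.2f} GMACs")  # ~0.23 GMACs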

Techniques for DL acceleration

  • Data Parallelism VS Model Parallelism in Distributed Deep Learning Training link
  • One weird trick for parallelizing convolutional neural networks, Alex Krizhevsky, 2014 link
  • BinaryConnect: Training Deep Neural Networks with binary weights during propagations, Matthieu Courbariaux, 2015 link (see the binarization sketch after this list)
  • Measuring the Limits of Data Parallel Training for Neural Networks, J. Lee, Mar. 2019 link
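
A minimal sketch of the BinaryConnect recipe from the Courbariaux paper: propagate with weights binarized to +/-1, but apply updates to the underlying real-valued weights (NumPy; the gradient below is a random stand-in for a real backprop gradient):

    import numpy as np

    rng = np.random.default_rng(0)
    W_real = rng.normal(0, 0.1, size=(4, 4))  # full-precision master weights

    def binarize(w):
        # Deterministic binarization: sign(w), mapped to {-1, +1}
        return np.where(w >= 0, 1.0, -1.0)

    lr = 0.01
    for step in range(3):
        W_bin = binarize(W_real)              # used in forward/backward passes
        grad = rng.normal(size=W_real.shape)  # stand-in gradient w.r.t. W_bin
        W_real -= lr * grad                   # update the real-valued weights
        W_real = np.clip(W_real, -1.0, 1.0)   # clip to [-1, 1], as in the paper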

Industrial trend

  • BrainWave: Accelerating Persistent Neural Networks at Datacenter Scale, Microsoft, 2017 link
  • TPU: In-datacenter performance analysis of a tensor processing unit, Google, 2017 link (see the int8 matmul sketch after this list)
  • TPU: Machine Learning for Systems and Systems for Machine Learning, Google, 2017 link
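
The TPU paper's matrix unit multiplies 8-bit operands and accumulates at 32-bit width; a minimal NumPy sketch of that quantized-accumulation pattern (shapes and the scale are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.integers(-128, 128, size=(8, 16), dtype=np.int8)  # int8 activations
    W = rng.integers(-128, 128, size=(16, 4), dtype=np.int8)  # int8 weights

    # Widen before multiplying so products accumulate in int32 without overflow.
    acc = A.astype(np.int32) @ W.astype(np.int32)

    scale = 0.02  # hypothetical per-tensor dequantization scale
    out = acc * scale
    print(acc.dtype, out.shape)  # int32 (8, 4)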

Stanford CS217 reading list (2019)

  • Hardware Accelerators for Machine Learning, Stanford cs217 link
  • Is Dark Silicon Useful? by M. B. Taylor, 2012 pdf_annotated
  • Why Systolic Architecture? by H. T. Kung, 1982 pdf_annotated
  • Anatomy of High Performance Matrix Multiplication by K. Goto pdf_annotated
  • Dark Memory and Accelerator-Rich System Optimization in the Dark Silicon Era by A. Pedram, 2016 pdf_annotated
  • TABLA: A Unified Template-based Framework for Accelerating Statistical Machine Learning by D. Mahajan, 2016 pdf_annotated
  • Codesign Tradeoffs for High-Performance, Low-Power Linear Algebra Architectures by A. Pedram, 2012 pdf_annotated
  • Spatial: A Language and Compiler for Application Accelerators by K. Olukotun, 2018 pdf_annotated
  • Aladdin: A Pre-RTL, Power-Performance Accelerator Simulator Enabling Large Design Space Exploration of Customized Architectures by Y. S. Shao, 2014 pdf_annotated
  • Roofline: An Insightful Visual Performance Model for Floating-Point Programs and Multicore Architectures by D. Patterson pdf_annotated (see the roofline sketch after this list)
  • In-Datacenter Performance Analysis of a Tensor Processing Unit by Google pdf_annotated
  • NVIDIA Tesla V100 GPU Architecture pdf_annotated
  • Efficient Processing of Deep Neural Networks: A Tutorial and Survey by V. Sze pdf_annotated
  • A Systematic Approach to Blocking Convolutional Neural Networks by M. Horowitz pdf_annotated
  • Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks pdf_annotated
  • Brooks’ Deep Learning for Computer Architects, see Chapter 5 pdf_annotated
  • High Performance Zero-Memory Overhead Direct Convolutions by T. Low pdf_annotated
  • Fast Algorithms for Convolutional Neural Networks (Winograd) by A. Lavin pdf_annotated
  • CATERPILLAR: Coarse Grain Reconfigurable Architecture for Accelerating the Training of Deep Neural Networks by A. Pedram pdf_annotated
  • SCALEDEEP: A Scalable Compute Architecture for Learning and Evaluating Deep Networks by A. Raghunathan pdf
  • Simon Knowles: Designing Processors for Intelligence video
  • An overview of gradient descent optimization algorithms link
  • Large Batch Training of Convolutional Networks by Y. You pdf
  • DSD: Dense-Sparse-Dense Training for Deep Neural Networks by S. Han pdf
  • High-Accuracy Low-Precision Training by C. De Sa pdf
  • EIE: Efficient Inference Engine on Compressed Deep Neural Network by S. Han pdf
  • A Cloud-Scale Acceleration Architecture by Microsoft pdf
  • Serving DNNs in Real Time at Datacenter Scale with Project Brainwave by Microsoft pdf
  • DAWNBench: An End-to-End Deep Learning Benchmark and Competition by M. Zaharia pdf
  • MLPerf: A broad ML benchmark suite for measuring performance of ML software frameworks, ML hardware accelerators, and ML cloud platforms link
  • Revisiting Small Batch Training for Deep Neural Networks by C. Luschi pdf
  • NIPS 2017 Workshop: Deep Learning At Supercomputer Scale link
  • Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training by W. Dally pdf
  • Plasticine: A Reconfigurable Architecture For Parallel Patterns by K. Olukotun pdf
  • Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective by Facebook pdf
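
Several readings above (the roofline paper, the TPU analysis) reason in terms of arithmetic intensity; a minimal sketch of the roofline bound, with hypothetical machine numbers:

    def roofline(peak_flops, peak_bw, intensity):
        """Attainable FLOP/s = min(peak compute, bandwidth * intensity),
        where intensity is FLOPs per byte moved from memory."""
        return min(peak_flops, peak_bw * intensity)

    # Hypothetical accelerator: 100 TFLOP/s peak, 900 GB/s memory bandwidth.
    peak, bw = 100e12, 900e9
    for ai in (1, 10, 100, 1000):
        print(f"AI={ai:4d} FLOP/byte -> {roofline(peak, bw, ai) / 1e12:6.1f} TFLOP/s")
    # Kernels below the ridge point (100e12 / 900e9 ~ 111 FLOP/byte) are
    # bandwidth-bound; above it, compute-bound.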