# Domain-Specific Hardware Accelerators

- Deep learning operations
- Review on DL acceleration
- Techniques for DL acceleration
- Industrial trend
- Stanford CS217 reading list (2019)

## Deep learning operations

- A guide to convolution arithmetic for deep learning, Vincent Dumoulin, 2016 pdf_annotated
- Inception module in GoogLeNet pdf_annotated
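The output-size relationship covered in Dumoulin's convolution-arithmetic guide can be sketched in a few lines. The helper name below is illustrative, not from the guide itself:

```python
# Illustrative helper for the standard 2D convolution arithmetic covered in
# Dumoulin's guide: output spatial size given input size, kernel, padding, stride.
def conv_output_size(i, k, p=0, s=1):
    """Output size for input size i, kernel size k, padding p, stride s."""
    return (i + 2 * p - k) // s + 1

# A 5x5 input with a 3x3 kernel, no padding, stride 1 -> 3x3 output.
print(conv_output_size(5, 3))       # -> 3
# "Same" padding (p = (k - 1) // 2 for odd k, stride 1) preserves the size.
print(conv_output_size(5, 3, p=1))  # -> 5
```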

## Review on DL acceleration

- Matrix Computation link
- Efficient Methods and Hardware for Deep Learning, Stanford CS231n, Song Han, 2017 link
- Efficient Processing of Deep Neural Networks: A Tutorial and Survey, Sze, MIT pdf
- Tutorial on Deep Learning Acceleration, Vivienne Sze, MIT link
- High-Performance Hardware for Machine Learning, W. Dally, 2015 link
- Acceleration of Deep Learning in Algorithm & CMOS-based Hardware by me (2019) pdf

## Techniques for DL acceleration

- One weird trick for parallelizing convolutional neural networks, Alex Krizhevsky, 2014 link
- BinaryConnect: Training Deep Neural Networks with binary weights during propagations, Matthieu Courbariaux, 2015 link
- Measuring the Limits of Data Parallel Training for Neural Networks, J. Lee, Mar. 2019 link
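The core trick in BinaryConnect can be sketched very compactly: real-valued weights are kept for the gradient update, but a binarized copy in {-1, +1} is used during forward and backward propagation. The function below is an illustrative sketch, not code from the paper; treating a weight of exactly 0 as +1 is a convention, not something the paper specifies:

```python
# Minimal sketch of BinaryConnect's deterministic binarization: weights are
# mapped to {-1, +1} via their sign for propagation, while the real-valued
# weights are retained for the parameter update. Names are illustrative.
def binarize(weights):
    """Deterministic sign binarization; 0 maps to +1 by convention here."""
    return [1.0 if w >= 0 else -1.0 for w in weights]

w_real = [0.7, -0.2, 0.0, -1.3]   # real-valued weights kept by the optimizer
w_bin = binarize(w_real)          # binarized copy used in propagation
print(w_bin)                      # -> [1.0, -1.0, 1.0, -1.0]
```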

## Industrial trend

- BrainWave: Accelerating Persistent Neural Networks at Datacenter Scale, Microsoft, 2017 link
- TPU: In-datacenter performance analysis of a tensor processing unit, Google, 2017 link
- TPU: Machine Learning for Systems and Systems for Machine Learning, Google, 2017 link

## Stanford CS217 reading list (2019)

- Hardware Accelerators for Machine Learning, Stanford cs217 link
- Is Dark Silicon Useful? by M. B. Taylor, 2012 pdf_annotated
- Why Systolic Architecture? by H. T. Kung, 1982 pdf_annotated
- Anatomy of High Performance Matrix Multiplication by K. Goto pdf_annotated
- Dark Memory and Accelerator-Rich System Optimization in the Dark Silicon Era by A. Pedram, 2016 pdf_annotated
- TABLA: A Unified Template-based Framework for Accelerating Statistical Machine Learning by D. Mahajan, 2016 pdf_annotated
- Codesign Tradeoffs for High-Performance, Low-Power Linear Algebra Architectures by A. Pedram, 2012 pdf_annotated
- Spatial: A Language and Compiler for Application Accelerators by K. Olukotun, 2018 pdf_annotated
- Aladdin: A Pre-RTL, Power-Performance Accelerator Simulator Enabling Large Design Space Exploration of Customized Architectures by Y. S. Shao, 2014 pdf_annotated
- Roofline: An Insightful Visual Performance Model for Floating-Point Programs and Multicore Architectures by D. Patterson pdf_annotated
- In-Datacenter Performance Analysis of a Tensor Processing Unit by Google pdf_annotated
- NVIDIA Tesla V100 GPU Architecture pdf_annotated
- Efficient Processing of Deep Neural Networks: A Tutorial and Survey by V. Sze pdf_annotated
- A Systematic Approach to Blocking Convolutional Neural Networks by M. Horowitz pdf_annotated
- Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks pdf_annotated
- Deep Learning for Computer Architects by D. Brooks et al., see Chapter 5 pdf_annotated
- High Performance Zero-Memory Overhead Direct Convolutions by T. Low pdf_annotated
- Fast Algorithms for Convolutional Neural Networks (Winograd) by A. Lavin pdf_annotated
- CATERPILLAR: Coarse Grain Reconfigurable Architecture for Accelerating the Training of Deep Neural Networks by A. Pedram pdf_annotated
- SCALEDEEP: A Scalable Compute Architecture for Learning and Evaluating Deep Networks by A. Raghunathan pdf
- Simon Knowles: Designing Processors for Intelligence video
- An overview of gradient descent optimization algorithms link
- Large Batch Training of Convolutional Networks by Y. You pdf
- DSD: Dense-Sparse-Dense Training for Deep Neural Networks by S. Han pdf
- High-Accuracy Low-Precision Training by C. D. Sa pdf
- EIE: Efficient Inference Engine on Compressed Deep Neural Network by S. Han pdf
- A Cloud-Scale Acceleration Architecture by Microsoft pdf
- Serving DNNs in Real Time at Datacenter Scale with Project Brainwave by Microsoft pdf
- DAWNBench: An End-to-End Deep Learning Benchmark and Competition by M. Zaharia pdf
- MLPerf: A broad ML benchmark suite for measuring performance of ML software frameworks, ML hardware accelerators, and ML cloud platforms link
- Revisiting Small Batch Training for Deep Neural Networks by C. Luschi pdf
- NIPS 2017 Workshop: Deep Learning At Supercomputer Scale link
- Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training by W. Dally pdf
- Plasticine: A Reconfigurable Architecture For Parallel Patterns by K. Olukotun pdf
- Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective by Facebook pdf
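The roofline model from the Williams/Patterson paper in the list above reduces to a one-line bound: attainable performance is the lesser of peak compute and memory bandwidth times arithmetic intensity. A back-of-the-envelope sketch, using made-up machine parameters rather than any real chip's specs:

```python
# Sketch of the roofline performance bound: a kernel is memory-bound below
# the ridge point (intensity = peak / bandwidth) and compute-bound above it.
def roofline(peak_gflops, bandwidth_gbs, intensity_flops_per_byte):
    """Attainable GFLOP/s under the roofline model."""
    return min(peak_gflops, bandwidth_gbs * intensity_flops_per_byte)

# Hypothetical machine: 100 GFLOP/s peak, 10 GB/s DRAM bandwidth
# (ridge point at 10 FLOPs/byte).
PEAK, BW = 100.0, 10.0
print(roofline(PEAK, BW, 2.0))   # memory-bound: 20.0 GFLOP/s
print(roofline(PEAK, BW, 50.0))  # compute-bound: 100.0 GFLOP/s
```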