Domain-Specific Hardware Accelerators

Deep learning operations

  • A guide to convolution arithmetic for deep learning, Vincent Dumoulin, 2016 pdf_annotated (see the output-size sketch after this list)
  • Inception module in GoogLeNet pdf_annotated
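
As a companion to Dumoulin's arithmetic guide, a minimal sketch (Python; shapes illustrative) of the standard output-size relation o = floor((i + 2p - k)/s) + 1:

    import math

    def conv_output_size(i, k, s=1, p=0):
        """Spatial output size of a convolution along one dimension.
        i: input size, k: kernel size, s: stride, p: zero padding."""
        return math.floor((i + 2 * p - k) / s) + 1

    print(conv_output_size(32, 3, s=1, p=1))   # 32: 'same' convolution
    print(conv_output_size(224, 7, s=2, p=3))  # 112: strided conv halves the map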

Review on DL acceleration

  • Matrix Computation link
  • Efficient Methods and Hardware for Deep Learning, Stanford cs231n, Song Han, 2017 link
  • Efficient Processing of Deep Neural Networks: A Tutorial and Survey, Sze, MIT pdf (see the MAC-count sketch after this list)
  • Tutorial on Deep Learning Acceleration, Vivienne Sze, MIT link
  • High-Performance Hardware for Machine Learning, W. Dally, 2015 link
  • Acceleration of Deep Learning in Algorithm & CMOS-based Hardware by me (2019) pdf
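
A quantity these surveys keep returning to (Sze's tutorial in particular) is the multiply-accumulate (MAC) count of a layer, since energy and throughput are typically reported per MAC. A minimal sketch with illustrative shapes:

    def conv_macs(h_out, w_out, c_out, c_in, k):
        """MACs for one conv layer: each output element costs
        c_in * k * k multiply-accumulates (bias ignored)."""
        return h_out * w_out * c_out * c_in * k * k

    # Illustrative layer: 3x3 conv, 64 -> 128 channels, 56x56 output map.
    print(f"{conv_macs(56, 56, 128, 64, 3) / 1e9:.2f} GMACs")  # ~0.23 GMACs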

Techniques for DL acceleration

  • Data Parallelism VS Model Parallelism in Distributed Deep Learning Training link
  • One weird trick for parallelizing convolutional neural networks, Alex Krizhevsky, 2014 link
  • BinaryConnect: Training Deep Neural Networks with binary weights during propagations, Matthieu Courbariaux, 2015 link (see the binarization sketch after this list)
  • Measuring the Limits of Data Parallel Training for Neural Networks, J. Lee, Mar. 2019 link
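
A minimal sketch of the BinaryConnect recipe from the Courbariaux paper: propagate with weights binarized to +/-1, but apply updates to the underlying real-valued weights (NumPy; the gradient below is a random stand-in for a real backprop gradient):

    import numpy as np

    rng = np.random.default_rng(0)
    W_real = rng.normal(0, 0.1, size=(4, 4))  # full-precision master weights

    def binarize(w):
        # Deterministic binarization: sign(w), mapped to {-1, +1}
        return np.where(w >= 0, 1.0, -1.0)

    lr = 0.01
    for step in range(3):
        W_bin = binarize(W_real)              # used in forward/backward passes
        grad = rng.normal(size=W_real.shape)  # stand-in gradient w.r.t. W_bin
        W_real -= lr * grad                   # update the real-valued weights
        W_real = np.clip(W_real, -1.0, 1.0)   # clip to [-1, 1], as in the paper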

Industrial trend

  • BrainWave: Accelerating Persistent Neural Networks at Datacenter Scale, Microsoft, 2017 link
  • TPU: In-datacenter performance analysis of a tensor processing unit, Google, 2017 link (see the int8 matmul sketch after this list)
  • TPU: Machine Learning for Systems and Systems for Machine Learning, Google, 2017 link
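
The TPU paper's matrix unit multiplies 8-bit operands and accumulates at 32-bit width; a minimal NumPy sketch of that quantized-accumulation pattern (shapes and the scale are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.integers(-128, 128, size=(8, 16), dtype=np.int8)  # int8 activations
    W = rng.integers(-128, 128, size=(16, 4), dtype=np.int8)  # int8 weights

    # Widen before multiplying so products accumulate in int32 without overflow.
    acc = A.astype(np.int32) @ W.astype(np.int32)

    scale = 0.02  # hypothetical per-tensor dequantization scale
    out = acc * scale
    print(acc.dtype, out.shape)  # int32 (8, 4)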

Stanford CS217 reading list (2019)

  • Hardware Accelerators for Machine Learning, Stanford cs217 link
  • Is Dark Silicon Useful? by M. B. Taylor, 2012 pdf_annotated
  • Why Systolic Architecture? by H. T. Kung, 1982 pdf_annotated
  • Anatomy of High Performance Matrix Multiplication by K. Goto pdf_annotated
  • Dark Memory and Accelerator-Rich System Optimization in the Dark Silicon Era by A. Pedram, 2016 pdf_annotated
  • TABLA: A Unified Template-based Framework for Accelerating Statistical Machine Learning by D. Mahajan, 2016 pdf_annotated
  • Codesign Tradeoffs for High-Performance, Low-Power Linear Algebra Architectures by A. Pedram, 2012 pdf_annotated
  • Spatial: A Language and Compiler for Application Accelerators by K. Olukotun, 2018 pdf_annotated
  • Aladdin: A Pre-RTL, Power-Performance Accelerator Simulator Enabling Large Design Space Exploration of Customized Architectures by Y. S. Shao, 2014 pdf_annotated
  • Roofline: An Insightful Visual Performance Model for Floating-Point Programs and Multicore Architectures by D. Patterson pdf_annotated (see the roofline sketch after this list)
  • In-Datacenter Performance Analysis of a Tensor Processing Unit by Google pdf_annotated
  • NVIDIA Tesla V100 GPU Architecture pdf_annotated
  • Efficient Processing of Deep Neural Networks: A Tutorial and Survey by V. Sze pdf_annotated
  • A Systematic Approach to Blocking Convolutional Neural Networks by M. Horowitz pdf_annotated
  • Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks pdf_annotated
  • Brooks’ Deep Learning for Computer Architects, see Chapter 5 pdf_annotated
  • High Performance Zero-Memory Overhead Direct Convolutions by T. Low pdf_annotated
  • Fast Algorithms for Convolutional Neural Networks (Winograd) by A. Lavin pdf_annotated
  • CATERPILLAR: Coarse Grain Reconfigurable Architecture for Accelerating the Training of Deep Neural Networks by A. Pedram pdf_annotated
  • SCALEDEEP: A Scalable Compute Architecture for Learning and Evaluating Deep Networks by A. Raghunathan pdf
  • Simon Knowles: Designing Processors for Intelligence video
  • An overview of gradient descent optimization algorithms link
  • Large Batch Training of Convolutional Networks by Y. You pdf
  • DSD: Dense-Sparse-Dense Training for Deep Neural Networks by S. Han pdf
  • High-Accuracy Low-Precision Training by C. De Sa pdf
  • EIE: Efficient Inference Engine on Compressed Deep Neural Network by S. Han pdf
  • A Cloud-Scale Acceleration Architecture by Microsoft pdf
  • Serving DNNs in Real Time at Datacenter Scale with Project Brainwave by Microsoft pdf
  • DAWNBench: An End-to-End Deep Learning Benchmark and Competition by M. Zaharia pdf
  • MLPerf: A broad ML benchmark suite for measuring performance of ML software frameworks, ML hardware accelerators, and ML cloud platforms link
  • Revisiting Small Batch Training for Deep Neural Networks by C. Luschi pdf
  • NIPS 2017 Workshop: Deep Learning At Supercomputer Scale link
  • Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training by W. Dally pdf
  • Plasticine: A Reconfigurable Architecture For Parallel Patterns by K. Olukotun pdf
  • Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective by Facebook pdf
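
Several readings above (the roofline paper, the TPU analysis) reason in terms of arithmetic intensity; a minimal sketch of the roofline bound, with hypothetical machine numbers:

    def roofline(peak_flops, peak_bw, intensity):
        """Attainable FLOP/s = min(peak compute, bandwidth * intensity),
        where intensity is FLOPs per byte moved from memory."""
        return min(peak_flops, peak_bw * intensity)

    # Hypothetical accelerator: 100 TFLOP/s peak, 900 GB/s memory bandwidth.
    peak, bw = 100e12, 900e9
    for ai in (1, 10, 100, 1000):
        print(f"AI={ai:4d} FLOP/byte -> {roofline(peak, bw, ai) / 1e12:6.1f} TFLOP/s")
    # Kernels below the ridge point (100e12 / 900e9 ~ 111 FLOP/byte) are
    # bandwidth-bound; above it, compute-bound.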