Domain-Specific Hardware Accelerators

Deep learning operations

  • A guide to convolution arithmetic for deep learning, Vincent Dumoulin, 2016 pdf_annotated
  • Inception module in GoogLeNet pdf_annotated

Reviews of DL acceleration

  • Matrix Computation link
  • Efficient Methods and Hardware for Deep Learning, Stanford CS231n, Song Han, 2017 link
  • Efficient Processing of Deep Neural Networks: A Tutorial and Survey, Vivien Sze, MIT pdf
  • Tutorial on Deep Learning Acceleration, Vivien Sze, MIT link
  • High-Performance Hardware for Machine Learning, W. Dally, 2015 link
  • Acceleration of Deep Learning in Algorithm & CMOS-based Hardware by me (2019) pdf

Techniques for DL acceleration

  • One weird trick for parallelizing convolutional neural networks, Alex Krizhevsky, 2014 link
  • BinaryConnect: Training Deep Neural Networks with binary weights during propagations, Matthieu Courbariaux, 2016 link
  • Measuring the Limits of Data Parallel Training for Neural Networks, J. Lee, Mar. 2019 link

Industry trends

  • BrainWave: Accelerating Persistent Neural Networks at Datacenter Scale, Microsoft, 2017 link
  • TPU: In-datacenter performance analysis of a tensor processing unit, Google, 2017 link
  • TPU: Machine Learning for Systems and Systems for Machine Learning, Google, 2017 link

Stanford CS217 reading list (2019)

  • Hardware Accelerators for Machine Learning, Stanford CS217 link
  • Is Dark Silicon Useful? by M. B. Taylor, 2012 pdf_annotated
  • Why Systolic Architectures? by H. T. Kung, 1982 pdf_annotated
  • Anatomy of High Performance Matrix Multiplication by K. Goto pdf_annotated
  • Dark Memory and Accelerator-Rich System Optimization in the Dark Silicon Era by A. Pedram, 2016 pdf_annotated
  • TABLA: A Unified Template-based Framework for Accelerating Statistical Machine Learning by D. Mahajan, 2016 pdf_annotated
  • Codesign Tradeoffs for High-Performance, Low-Power Linear Algebra Architectures by A. Pedram, 2012 pdf_annotated
  • Spatial: A Language and Compiler for Application Accelerators by K. Olukotun, 2018 pdf_annotated
  • Aladdin: A Pre-RTL, Power-Performance Accelerator Simulator Enabling Large Design Space Exploration of Customized Architectures by Y. S. Shao, 2014 pdf_annotated
  • Roofline: An Insightful Visual Performance Model for Floating-Point Programs and Multicore Architectures by D. Patterson pdf_annotated
  • In-Datacenter Performance Analysis of a Tensor Processing Unit by Google pdf_annotated
  • NVIDIA Tesla V100 GPU Architecture pdf_annotated
  • Efficient Processing of Deep Neural Networks: A Tutorial and Survey by V. Sze pdf_annotated
  • A Systematic Approach to Blocking Convolutional Neural Networks by M. Horowitz pdf_annotated
  • Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks pdf_annotated
  • Brooks’ Deep Learning for Computer Architects, see Chapter 5 pdf_annotated
  • High Performance Zero-Memory Overhead Direct Convolutions by T. Low pdf_annotated
  • Fast Algorithms for Convolutional Neural Networks (Winograd) by A. Lavin pdf_annotated
  • CATERPILLAR: Coarse Grain Reconfigurable Architecture for Accelerating the Training of Deep Neural Networks by A. Pedram pdf_annotated
  • SCALEDEEP: A Scalable Compute Architecture for Learning and Evaluating Deep Networks by A. Raghunathan pdf
  • Simon Knowles: Designing Processors for Intelligence video
  • An overview of gradient descent optimization algorithms link
  • Large Batch Training of Convolutional Networks by Y. You pdf
  • DSD: Dense-Sparse-Dense Training for Deep Neural Networks by S. Han pdf
  • High-Accuracy Low-Precision Training by C. De Sa pdf
  • EIE: Efficient Inference Engine on Compressed Deep Neural Network by S. Han pdf
  • A Cloud-Scale Acceleration Architecture by Microsoft pdf
  • Serving DNNs in Real Time at Datacenter Scale with Project Brainwave by Microsoft pdf
  • DAWNBench: An End-to-End Deep Learning Benchmark and Competition by M. Zaharia pdf
  • MLPerf: A broad ML benchmark suite for measuring performance of ML software frameworks, ML hardware accelerators, and ML cloud platforms link
  • Revisiting Small Batch Training for Deep Neural Networks by C. Luschi pdf
  • NIPS 2017 Workshop: Deep Learning At Supercomputer Scale link
  • Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training by W. Dally pdf
  • Plasticine: A Reconfigurable Architecture for Parallel Patterns by K. Olukotun pdf
  • Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective by Facebook pdf
