Huizi Mao’s research while affiliated with Stanford University and other places


Publications (22)


Fig. 1. EIE opened a new opportunity to build hardware accelerators for sparse and compressed neural networks.
Retrospective: EIE: Efficient Inference Engine on Sparse and Compressed Neural Network
Preprint · June 2023 · 193 Reads · Xingyu Liu · Huizi Mao · [...]

EIE proposed to accelerate pruned and compressed neural networks, exploiting weight sparsity, activation sparsity, and 4-bit weight-sharing in neural network accelerators. Since its publication at ISCA'16, it has opened a new design space to accelerate pruned and sparse neural networks and spawned many algorithm-hardware co-designs for model compression and acceleration, both in academia and commercial AI chips. In retrospect, we review the background of this project, summarize the pros and cons, and discuss new opportunities where pruning, sparsity, and low precision can accelerate emerging deep learning workloads.
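The three ideas named in the abstract (weight sparsity, activation sparsity, and weight sharing) can be illustrated with a small software sketch. This is not the EIE hardware design or its storage format; the function names `compress` and `sparse_matvec`, the column-wise list layout, and the nearest-entry quantization are assumptions made only for illustration. A 4-bit index would allow a codebook of up to 16 shared values.

```python
def compress(weights, codebook):
    """Encode a dense matrix column by column, dropping zero weights.

    Each surviving weight is stored as (row_index, codebook_index), where the
    codebook index points at the nearest shared value (hypothetical layout).
    """
    n_cols = len(weights[0])
    columns = []
    for j in range(n_cols):
        col = []
        for i, row in enumerate(weights):
            w = row[j]
            if w != 0.0:  # weight sparsity: zeros are never stored
                idx = min(range(len(codebook)), key=lambda k: abs(codebook[k] - w))
                col.append((i, idx))
        columns.append(col)
    return columns


def sparse_matvec(columns, codebook, activations, n_rows):
    """Compute W @ a using only nonzero weights and nonzero activations."""
    out = [0.0] * n_rows
    for j, a in enumerate(activations):
        if a == 0.0:
            continue  # activation sparsity: skip the whole column
        for i, idx in columns[j]:
            out[i] += codebook[idx] * a  # weight sharing: look up the shared value
    return out
```

Calling `compress` once on a pruned matrix and then `sparse_matvec` per input vector gives the same result as a dense product whenever every nonzero weight equals one of the codebook values; otherwise the result reflects the quantization error.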



BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird's-Eye View Representation

May 2022 · 337 Reads · 1 Citation

Multi-sensor fusion is essential for an accurate and reliable autonomous driving system. Recent approaches are based on point-level fusion: augmenting the LiDAR point cloud with camera features. However, the camera-to-LiDAR projection throws away the semantic density of camera features, hindering the effectiveness of such methods, especially for semantic-oriented tasks (such as 3D scene segmentation). In this paper, we break this deeply-rooted convention with BEVFusion, an efficient and generic multi-task multi-sensor fusion framework. It unifies multi-modal features in the shared bird's-eye view (BEV) representation space, which nicely preserves both geometric and semantic information. To achieve this, we diagnose and lift key efficiency bottlenecks in the view transformation with optimized BEV pooling, reducing latency by more than 40x. BEVFusion is fundamentally task-agnostic and seamlessly supports different 3D perception tasks with almost no architectural changes. It establishes the new state of the art on nuScenes, achieving 1.3% higher mAP and NDS on 3D object detection and 13.6% higher mIoU on BEV map segmentation, with 1.9x lower computation cost.


Figure 1: For short-range template matching, PatchNet combines the simplicity of correlation filter methods and the learnability of Siamese networks. PatchNet works by fitting a very efficient CNN on a patch-wise correlation map instead of image pixels.
Figure 2: Similar low-level features widely exist in consecutive frames, although they may move to different relative positions. This type of redundancy inspires a new approach to reduce computation: learning a CNN on correlation features instead of directly on pixels.
Figure 4: Localization accuracy versus scale for 1000 randomly selected ImageNet-VID objects. PatchNet works on a larger range of scales than full template correlation.
Figure 8: The building block of aggregation subnet. The lower localization path is normal conv-pool layers, while the upper bounding box regression path actively adds the partial offset from the pooling stage. Input offset to the first block is set to zero.
Results on UAV dataset with SiamRPN-2x as the baseline model.
PatchNet -- Short-range Template Matching for Efficient Video Processing

March 2021 · 276 Reads

Object recognition is a fundamental problem in many video processing tasks; accurately locating seen objects at low computation cost paves the way for on-device video recognition. We propose PatchNet, an efficient convolutional neural network to match objects in adjacent video frames. It learns the patchwise correlation features instead of pixel features. PatchNet is very compact, running at just 58 MFLOPs, 5x simpler than MobileNetV2. We demonstrate its application on two tasks, video object detection and visual object tracking. On ImageNet VID, PatchNet reduces the FLOPs of R-FCN ResNet-101 by 5x and EfficientDet-D0 by 3.4x with less than 1% mAP loss. On OTB2015, PatchNet reduces the FLOPs of SiamFC and SiamRPN by 2.5x with no accuracy loss. Experiments on Jetson Nano further demonstrate 2.8x to 4.3x speed-ups associated with the FLOPs reduction. Code is open sourced at https://github.com/RalphMao/PatchNet.


A Delay Metric for Video Object Detection: What Average Precision Fails to Tell

August 2019 · 132 Reads

Average precision (AP) is a widely used metric to evaluate detection accuracy of image and video object detectors. In this paper, we analyze object detection from videos and point out that AP alone is not sufficient to capture the temporal nature of video object detection. To tackle this problem, we propose a comprehensive metric, average delay (AD), to measure and compare detection delay. To facilitate delay evaluation, we carefully select a subset of ImageNet VID, which we name as ImageNet VIDT with an emphasis on complex trajectories. By extensively evaluating a wide range of detectors on VIDT, we show that most methods drastically increase the detection delay but still preserve AP well. In other words, AP is not sensitive enough to reflect the temporal characteristics of a video object detector. Our results suggest that video object detection methods should be additionally evaluated with a delay metric, particularly for latency-critical applications such as autonomous vehicle perception.


CaTDet: Cascaded Tracked Detector for Efficient Object Detection from Video

September 2018 · 90 Reads

Detecting objects in a video is a compute-intensive task. In this paper we propose CaTDet, a system to speed up object detection by leveraging the temporal correlation in video. CaTDet consists of two DNN models that form a cascaded detector, and an additional tracker to predict regions of interest based on historic detections. We also propose a new metric, mean Delay (mD), which is designed for latency-critical video applications. Experiments on the KITTI dataset show that CaTDet reduces operation count by 5.1-8.7x with the same mean Average Precision (mAP) as the single-model Faster R-CNN detector and incurs an additional delay of 0.3 frames. On the CityPersons dataset, CaTDet achieves a 13.0x reduction in operations with 0.8% mAP loss.



Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training

December 2017 · 617 Reads · 1,134 Citations

Large-scale distributed training requires significant communication bandwidth for gradient exchange that limits the scalability of multi-node training, and requires expensive high-bandwidth network infrastructure. The situation gets even worse with distributed training on mobile devices (federated learning), which suffers from higher latency, lower throughput, and intermittent poor connections. In this paper, we find 99.9% of the gradient exchange in distributed SGD is redundant, and propose Deep Gradient Compression (DGC) to greatly reduce the communication bandwidth. To preserve accuracy during compression, DGC employs four methods: momentum correction, local gradient clipping, momentum factor masking, and warm-up training. We have applied Deep Gradient Compression to image classification, speech recognition, and language modeling with multiple datasets including Cifar10, ImageNet, Penn Treebank, and Librispeech Corpus. On these scenarios, Deep Gradient Compression achieves a gradient compression ratio from 270x to 600x without losing accuracy, cutting the gradient size of ResNet-50 from 97MB to 0.35MB, and for DeepSpeech from 488MB to 0.74MB. Deep gradient compression enables large-scale distributed training on inexpensive commodity 1Gbps Ethernet and facilitates distributed training on mobile.
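The core idea of sending only a small fraction of the gradient can be sketched as follows. This is a simplified illustration, not the DGC implementation: it assumes magnitude-based top-k selection with unsent values accumulated locally for later steps, and it omits the four accuracy-preserving methods listed in the abstract (momentum correction, local gradient clipping, momentum factor masking, and warm-up training). The name `sparsify_step` and the `keep_ratio` parameter are hypothetical.

```python
def sparsify_step(gradient, residual, keep_ratio):
    """Select the largest-magnitude entries to send; keep the rest locally.

    gradient:   list of floats from the current step
    residual:   list of floats left unsent by earlier steps
    keep_ratio: fraction of entries to transmit (e.g. 0.001 for 99.9% sparsity)
    Returns (sparse_update, new_residual), where sparse_update maps
    index -> value for the entries that would be communicated.
    """
    # Add the new gradient to whatever was held back previously.
    accumulated = [r + g for r, g in zip(residual, gradient)]
    k = max(1, int(len(accumulated) * keep_ratio))
    # Indices of the k entries with the largest absolute value.
    top = sorted(range(len(accumulated)),
                 key=lambda i: abs(accumulated[i]), reverse=True)[:k]
    sparse_update = {i: accumulated[i] for i in top}
    # Sent entries are cleared; unsent entries wait for a later step.
    new_residual = [0.0 if i in sparse_update else v
                    for i, v in enumerate(accumulated)]
    return sparse_update, new_residual
```

In this sketch a worker would call `sparsify_step` once per iteration, exchange only `sparse_update`, and carry `new_residual` into the next call, so small gradients are delayed rather than discarded.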



Table 1: Comparison of accuracies with the same density/sparsity.
Figure 2: Example of Sub-kernel Vector, Filter and Kernel. 
Figure 4: Accuracy-Sparsity Curve of AlexNet obtained by iterative pruning. 
Figure 8: A simplified dataflow of SCNN architecture. Weights and activations are both stored in sparse format. Bypass is possible when the same output address is referenced again. 
Exploring the Regularity of Sparse Structure in Convolutional Neural Networks

May 2017 · 1,428 Reads · 183 Citations

Sparsity helps reduce the computational complexity of deep neural networks by skipping zeros. Taking advantage of sparsity is listed as a high priority in next generation DNN accelerators such as TPU. The structure of sparsity, i.e., the granularity of pruning, affects the efficiency of hardware accelerator design as well as the prediction accuracy. Coarse-grained pruning creates regular sparsity patterns, making it more amenable for hardware acceleration but more challenging to maintain the same accuracy. In this paper we quantitatively measure the trade-off between sparsity regularity and prediction accuracy, providing insights into how to maintain accuracy while having a more structured sparsity pattern. Our experimental results show that coarse-grained pruning can achieve a sparsity ratio similar to unstructured pruning without loss of accuracy. Moreover, due to the index saving effect, coarse-grained pruning is able to obtain a better compression ratio than fine-grained sparsity at the same accuracy threshold. Based on the recent sparse convolutional neural network accelerator (SCNN), our experiments further demonstrate that coarse-grained sparsity saves about 2x the memory references compared to fine-grained sparsity. Since memory reference is more than two orders of magnitude more expensive than arithmetic operations, the regularity of sparse structure leads to more efficient hardware design.
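The difference between fine-grained and coarse-grained pruning can be shown with a minimal sketch. It assumes simple magnitude-based pruning on a 2D weight matrix, with whole rows standing in for the coarser units studied in the paper; the function names `prune_fine` and `prune_coarse` are hypothetical, and no retraining step is included.

```python
def prune_fine(matrix, sparsity):
    """Zero the individual weights with the smallest absolute value."""
    positions = [(i, j) for i, row in enumerate(matrix) for j in range(len(row))]
    positions.sort(key=lambda p: abs(matrix[p[0]][p[1]]))
    drop = set(positions[:int(len(positions) * sparsity)])
    return [[0.0 if (i, j) in drop else w for j, w in enumerate(row)]
            for i, row in enumerate(matrix)]


def prune_coarse(matrix, sparsity):
    """Zero whole rows, ranked by the sum of absolute weights in each row."""
    order = sorted(range(len(matrix)),
                   key=lambda i: sum(abs(w) for w in matrix[i]))
    drop = set(order[:int(len(matrix) * sparsity)])
    return [[0.0] * len(row) if i in drop else list(row)
            for i, row in enumerate(matrix)]
```

At the same sparsity, the coarse variant only needs one index per surviving row, while the fine variant needs one per surviving weight, which is one way to picture the index saving effect the abstract mentions. The coarse variant may also remove some large weights that the fine variant would keep, which is the accuracy side of the trade-off.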


Citations (16)


... Specifically, we focus on multi-modal 3D perception models because public datasets today are collected by different sensors. Furthermore, we note that state-of-the-art 3D perception models on most tasks and benchmarks such as BEV-Fusion [33], CMT [72], SparseLIF [81] and UniTR [58] are all multimodal models that combine camera images and LiDAR point clouds. Thus, our work focuses on LiDAR-camera fusion-based 3D perception models. ...

Reference:

Self-Supervised Pre-training with Combined Datasets for 3D Perception in Autonomous Driving
BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird's-Eye View Representation
  • Citing Conference Paper
  • May 2023

... Attack under Defenses. To evaluate the attack effectiveness under defenses, we apply two defense strategies and test LFBA on four datasets: gradient compression (Shokri and Shmatikov 2015;Lin et al. 2018;Fu et al. 2022), and adding Gaussian noise to the gradients (Fu et al. 2022). We use the compression rates of 0.8, 0.6, and 0.4, and the Gaussian noise standard deviations of 0.0001, 0.001, and 0.01 for evaluation. ...

Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training
  • Citing Article
  • December 2017

... Finally, we selected relatively small LLMs due to limited computational power. Models with a larger size should be considered to further improve the predictions, leveraging techniques such as quantization to reduce computational and memory requirements [40,41]. ...

Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding
  • Citing Conference Paper
  • October 2016

... Different ways of using sparsity. Savings due to weight and activation sparsity are achieved using the efficient inference engine (EIE) CNN accelerator [21] in an image classification task. Savings due to weight and delta activation sparsity are achieved using the Spartus RNN accelerator [22] on a speech recognition task. ...

EIE: Efficient Inference Engine on Compressed Deep Neural Network
  • Citing Conference Paper
  • February 2016

... This irregularity significantly hinders the utilization of performance-critical features in modern GPUs, such as coalesced memory access and warp-level synchronization. To overcome these limitations, structured sparsity has been introduced [37,39,42], effectively eliminating performance issues by representing data with regular sparse patterns. Also, this new paradigm of sparse computing has been supported by NVIDIA GPUs in its Sparse Tensor Cores (SpTCs) since the Ampere architecture, featuring a 2× peak performance boost compared to its dense counterparts [40]. ...

Exploring the Granularity of Sparsity in Convolutional Neural Networks
  • Citing Conference Paper
  • July 2017

... TensorFlow Lite and PyTorch Mobile offer post-training quantization tools that allow size reduction without significant accuracy loss. Deep compression methods, such as those proposed by [13], further optimize models through pruning and weight sharing. ...

Deep compression and EIE: Efficient inference engine on compressed deep neural network
  • Citing Conference Paper
  • August 2016

... In their study, Mao et al. [26] delved into different granularities of sparsity, producing notable findings worth highlighting. Firstly, they observed that very coarse-grained sparsity, such as filter sparsity and channel sparsity, facilitates ease of implementation on hardware accelerators. ...

Exploring the Regularity of Sparse Structure in Convolutional Neural Networks

... While high-precision classification models, such as ResNet and VGG, have demonstrated strong performance, their high computational complexity (e.g., ResNet-18 contains 11.7M parameters) limits their applicability to real-time, embedded in-vehicle platforms [11]. Lightweight models [12][13][14] and model compression techniques [15][16][17][18] can reduce computational overhead but often suffer from accuracy degradation when trained on small datasets. Therefore, achieving high accuracy, low latency, and strong generalizability in rail surface condition identification remains a critical challenge. ...

Trained Ternary Quantization
  • Citing Article
  • December 2016