Towards Optimal VPU Compiler Cost Modeling by using Neural Networks to Infer Hardware Performances



Calculating the most efficient schedule of work in a neural network compiler is a difficult task. Many parameters must be accounted for, each of which can positively or negatively affect that schedule depending on its configuration: how work is shared between distributed targets, how tensors are subdivided to fit in memory, whether individual optimizations are enabled, and so on. Traditionally, neural network compilers determine how to set these values by building a graph of choices and choosing the path with minimal 'cost'. These choices and their corresponding costs are usually determined by an algorithm crafted by engineers with deep knowledge of the target platform. However, when the number of options available to a compiler is large, it is very difficult to ensure that such models consistently produce an optimal schedule for all scenarios while still completing compilation in an acceptable timeframe. This paper presents 'VPUNN', a neural network-based cost model trained on low-level task profiling that consistently outperforms the state-of-the-art cost modeling in Intel's line of VPU processors.
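The scheduling approach described above can be sketched as a shortest-path search over a graph of configuration choices, where the edge-cost function is pluggable: a handcrafted heuristic in the traditional setting, or a learned model such as VPUNN. The graph, node names, and cost function below are invented for illustration and are not from the paper.

```python
import heapq

def cheapest_schedule(choice_graph, start, goal, cost_fn):
    """Dijkstra over a graph of compiler configuration choices.

    cost_fn(edge) supplies the edge weight; swapping it out is how a
    learned cost model could replace a handcrafted one.
    """
    frontier = [(0.0, start, [])]
    seen = set()
    while frontier:
        cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return cost, path + [node]
        if node in seen:
            continue
        seen.add(node)
        for nxt, edge in choice_graph.get(node, []):
            heapq.heappush(frontier, (cost + cost_fn(edge), nxt, path + [node]))
    return float("inf"), []

# Toy choice graph: split a tensor across 1 or 2 compute engines.
graph = {
    "start": [("split_1", {"tiles": 1}), ("split_2", {"tiles": 2})],
    "split_1": [("done", {"tiles": 0})],
    "split_2": [("done", {"tiles": 0})],
}
# Stand-in heuristic: cost falls as work is split across more tiles.
handcrafted_cost = lambda e: 10.0 / max(e["tiles"], 1)
cost, path = cheapest_schedule(graph, "start", "done", handcrafted_cost)
# Picks the 2-way split: total cost 15.0 via start -> split_2 -> done.
```

The point of the sketch is the separation of concerns: the search is generic, and only `cost_fn` encodes platform knowledge, which is the component VPUNN replaces with a trained network.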


This article demonstrates that the convolutional operation can be converted to matrix multiplication, which is computed in the same way as a fully connected layer. The article helps newcomers to neural networks understand how the fully connected layer and the convolutional layer work in the backend. To keep the article concise and readable, we consider only the linear case; it extends easily to the non-linear case by plugging the values into a non-linear function such as $\sigma(x)$, denoted $x^{\prime}$.
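The conversion the article describes is commonly implemented by unrolling input patches into rows (often called "im2col"), after which the kernel acts like a fully connected layer's weight vector. A minimal sketch for the linear, no-padding, stride-1 case (function name and shapes are illustrative, not taken from the article):

```python
import numpy as np

def conv2d_as_matmul(x, k):
    """2-D convolution (cross-correlation form) via one matrix multiply."""
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    # Each row of `patches` is one flattened receptive field (im2col).
    patches = np.array([x[i:i + kh, j:j + kw].ravel()
                        for i in range(oh) for j in range(ow)])
    # The flattened kernel plays the role of fully connected weights.
    return (patches @ k.ravel()).reshape(oh, ow)

x = np.arange(16, dtype=float).reshape(4, 4)
k = np.ones((2, 2))
out = conv2d_as_matmul(x, k)  # 3x3 output; out[0, 0] sums x[0:2, 0:2] = 10.0
```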
The success of deep learning at computer vision tasks has led to an ever-increasing number of applications on edge devices, often with the use of edge AI hardware accelerators like the Intel Movidius Vision Processing Unit (VPU). Performing computer vision tasks on edge devices is challenging: many Convolutional Neural Networks (CNNs) are too complex to run on edge devices with limited computing power. This has created large interest in designing efficient CNNs, and one promising way of doing this is through Neural Architecture Search (NAS). NAS aims to automate the design of neural networks, and it can also optimize multiple objectives together, like accuracy and efficiency, which is difficult for humans. In this paper, we use a differentiable NAS method to find efficient CNNs for the VPU that achieve state-of-the-art classification accuracy on ImageNet. Our NAS-designed model outperforms MobileNetV2, having almost 1% higher top-1 accuracy while being 13% faster on the MyriadX VPU. To the best of our knowledge, this is the first time a VPU-specific CNN has been designed using a NAS algorithm. Our results also reiterate the fact that efficient networks must be designed for each specific hardware. We show that efficient networks targeted at other devices do not perform as well on the VPU.
Similarity search finds application in database systems handling complex data such as images or videos, which are typically represented by high-dimensional features and require specific indexing structures. This paper tackles the problem of better utilizing GPUs for this task. While GPUs excel at data-parallel tasks such as distance computation, prior approaches in this domain are bottlenecked by algorithms that expose less parallelism, such as $k$-min selection, or make poor use of the memory hierarchy. We propose a novel design for $k$-selection. We apply it in different similarity search scenarios, by optimizing brute-force, approximate and compressed-domain search based on product quantization. In all these setups, we outperform the state of the art by large margins. Our implementation operates at up to 55 percent of theoretical peak performance, enabling a nearest neighbor implementation that is 8.5× faster than prior GPU state of the art. It enables the construction of a high-accuracy $k$-NN graph on 95 million images from the Yfcc100M dataset in 35 minutes, and of a graph connecting 1 billion vectors in less than 12 hours on 4 Maxwell Titan X GPUs. We have open-sourced our approach for the sake of comparison and reproducibility.
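The structure the abstract describes, a highly parallel distance computation followed by a $k$-selection step, can be illustrated on the CPU with NumPy. This is only a sketch of the brute-force setup, not the paper's GPU implementation; the function name is invented.

```python
import numpy as np

def knn_bruteforce(queries, database, k):
    """Exact k-NN: one matrix multiply for distances, then k-selection."""
    # Squared L2 distances via ||q - d||^2 = ||q||^2 - 2 q.d + ||d||^2;
    # the dominant cost is the q @ d.T term, a dense matrix multiply.
    d2 = (np.sum(queries**2, axis=1, keepdims=True)
          - 2.0 * queries @ database.T
          + np.sum(database**2, axis=1))
    # k-selection: argpartition finds the k smallest per row without a
    # full sort, mirroring why k-selection is treated as its own problem.
    idx = np.argpartition(d2, k - 1, axis=1)[:, :k]
    order = np.argsort(np.take_along_axis(d2, idx, axis=1), axis=1)
    return np.take_along_axis(idx, order, axis=1)

q = np.array([[0.0, 0.0]])
db = np.array([[3.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
nearest = knn_bruteforce(q, db, 2)  # indices 1 then 2 are closest to origin
```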
Neural networks are powerful and flexible models that work well for many difficult learning tasks in image, speech and natural language understanding. Despite their success, neural networks are still hard to design. In this paper, we use a recurrent network to generate the model descriptions of neural networks and train this RNN with reinforcement learning to maximize the expected accuracy of the generated architectures on a validation set. On the CIFAR-10 dataset, our method, starting from scratch, can design a novel network architecture that rivals the best human-invented architecture in terms of test set accuracy. Our CIFAR-10 model achieves a test error rate of 3.84, which is only 0.1 percent worse and 1.2x faster than the current state-of-the-art model. On the Penn Treebank dataset, our model can compose a novel recurrent cell that outperforms the widely-used LSTM cell, and other state-of-the-art baselines. Our cell achieves a test set perplexity of 62.4 on the Penn Treebank, which is 3.6 perplexity better than the previous state-of-the-art.
An expert system for recommending routing and sealing (ROSE) of asphalt concrete pavements in cold areas was developed. The system incorporates data transmitted by 41 variables, such as pavement serviceability, age, and types of pavement surface distress, and encodes expertise derived from recent research and development studies and from experience. The system recommendations are given as a desirability of routing and sealing on a scale from 0 to 10. The interactive version of ROSE was developed and calibrated using an expert system development shell. This resulted in significant savings in programming, testing, and calibration. An automatic version of ROSE was implemented in FORTRAN and successfully applied to about 900 pavement sections, representing about 7200 km of highway.
Pratik Fegade, Tianqi Chen, Phillip B. Gibbons, and Todd C. Mowry. The CoRa tensor compiler: Compilation for ragged tensors with minimal padding. arXiv:2110.10221, 2021.
Ruiqi Guo, Philip Sun, Erik M. Lindgren, Quan Geng, David Simcha, Felix Chern, and Sanjiv Kumar. Accelerating large-scale inference with anisotropic vector quantization. In ICML, 2020.
Intel Habana. Predicting convolution performance of deep learning accelerators. Pre-publication manuscript without individual authorship, 2021.
Hang Qi, Evan R. Sparks, and Ameet S. Talwalkar. Paleo: A performance model for deep neural networks. In ICLR, 2017.