Gurshaant Malik’s scientific contributions

What is this page?


This page lists works of an author who doesn't have a ResearchGate profile or hasn't added the works to their profile yet. It is automatically generated from public (personal) data to further our legitimate goal of comprehensive and accurate scientific recordkeeping. If you are this author and want this page removed, please let us know.

Publications (2)


Figure 2: The LMU and implicit self-attention architecture, along with output dimensions. In the illustration, n refers to the sequence length, q is the order (a reduced order is also used), and d is the embedding dimension. Normalization layers and skip connections are not shown. One variant uses the FFN component right after the input; the other variant uses global attention.
Figure 3: Cross-entropy scores in nats, averaged across all tokens in the sequence. Transformer and LSTM fits are from Kaplan et al. (2020). Our models perform better than Transformer and LSTM models up to 1 million non-embedding parameters.
Figure 5: Approximately matching the loss between transformers and LMUs requires 10x more training for the transformer. The LMU-and-attention model continues to significantly outperform transformers with 10x less training.
Table: Parameter counts and compute (forward pass) for one layer of the network, per token. The first row indicates the number of FLOPs when following the implementation in Section A.1.
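The fits referenced in the Figure 3 caption follow the power-law form used by Kaplan et al. (2020), L(N) = (Nc / N)^alpha, where N is the non-embedding parameter count. As a purely illustrative sketch, such a fit can be recovered by linear regression in log-log space; the data points below are made-up placeholders, not values from the paper.

    import numpy as np

    # Fit L(N) = (Nc / N)**alpha by linear regression in log-log space:
    # log L = alpha*log(Nc) - alpha*log(N), so the slope gives -alpha.
    def fit_power_law(n_params, loss):
        slope, intercept = np.polyfit(np.log(n_params), np.log(loss), 1)
        alpha = -slope
        n_c = np.exp(intercept / alpha)
        return alpha, n_c

    # Placeholder points purely to exercise the fit; not results from the paper.
    n_params = np.array([1e5, 1e6, 1e7, 1e8])   # non-embedding parameters
    loss = np.array([5.2, 4.3, 3.6, 3.0])       # cross-entropy in nats
    alpha, n_c = fit_power_law(n_params, loss)
    print(f"alpha ~ {alpha:.2f}, Nc ~ {n_c:.2e}")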
Language Modeling using LMUs: 10x Better Data Efficiency or Improved Scaling Compared to Transformers
  • Preprint
  • File available

October 2021 · 300 Reads

Narsimha Chilkuri · Aaron Voelker · [...]

Recent studies have demonstrated that the performance of transformers on the task of language modeling obeys a power-law relationship with model size over six orders of magnitude. While transformers exhibit impressive scaling, their performance hinges on processing large amounts of data, and their computational and memory requirements grow quadratically with sequence length. Motivated by these considerations, we construct a Legendre Memory Unit based model that introduces a general prior for sequence processing and exhibits an O(n) and O(n ln n) (or better) dependency for memory and computation, respectively. Over three orders of magnitude, we show that our new architecture attains the same accuracy as transformers with 10x fewer tokens. We also show that for the same amount of training our model improves the loss over transformers about as much as transformers improve over LSTMs. Additionally, we demonstrate that adding global self-attention complements our architecture and the augmented model improves performance even further.
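The Legendre Memory Unit at the core of this architecture maintains a linear state-space memory of its input. The sketch below is a minimal NumPy illustration of that memory recurrence, assuming the standard (A, B) construction from Voelker et al. (2019) and zero-order-hold discretization; it is not the implementation described in the preprint, and the order q = 32 and window theta = 128 are arbitrary example values.

    import numpy as np
    from scipy.linalg import expm

    def lmu_matrices(q, theta):
        # Continuous-time (A, B) for an LMU memory of order q and window theta,
        # following the standard Legendre/Pade construction (Voelker et al., 2019).
        A = np.zeros((q, q))
        for r in range(q):
            for c in range(q):
                A[r, c] = (2 * r + 1) * (-1.0 if r < c else (-1.0) ** (r - c + 1))
        B = np.array([(2 * i + 1) * (-1.0) ** i for i in range(q)]).reshape(q, 1)
        return A / theta, B / theta

    def discretize_zoh(A, B, dt=1.0):
        # Zero-order hold: exponentiating the augmented matrix [[A, B], [0, 0]]
        # yields the discrete pair (Ad, Bd) in one step.
        q = A.shape[0]
        M = np.zeros((q + 1, q + 1))
        M[:q, :q] = A * dt
        M[:q, q:] = B * dt
        Me = expm(M)
        return Me[:q, :q], Me[:q, q:]

    def lmu_memory(x, q=32, theta=128.0):
        # Run the fixed (non-trainable) memory recurrence m[t] = Ad m[t-1] + Bd x[t]
        # over a 1-D sequence; because the recurrence is linear, it can also be
        # evaluated in parallel as a convolution, which is what keeps training
        # within the O(n ln n) budget mentioned in the abstract.
        A, B = lmu_matrices(q, theta)
        Ad, Bd = discretize_zoh(A, B)
        m = np.zeros((q, 1))
        states = []
        for u in x:
            m = Ad @ m + Bd * u
            states.append(m.ravel())
        return np.stack(states)      # shape: (len(x), q)

    # Example: compress a sliding window of a noisy signal into q coefficients.
    signal = np.sin(np.linspace(0, 10, 1000)) + 0.1 * np.random.randn(1000)
    print(lmu_memory(signal).shape)  # (1000, 32)

In the full model these memory states feed the trainable projection, FFN, and optional global-attention components shown in Figure 2.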


Hardware Aware Training for Efficient Keyword Spotting on General Purpose and Specialized Hardware

September 2020 · 31 Reads

Keyword spotting (KWS) provides a critical user interface for many mobile and edge applications, including phones, wearables, and cars. As KWS systems are typically 'always on', maximizing both accuracy and power efficiency is central to their utility. In this work we use hardware aware training (HAT) to build new KWS neural networks based on the Legendre Memory Unit (LMU) that achieve state-of-the-art (SotA) accuracy and low parameter counts. This allows the neural network to run efficiently on standard hardware (212 µW). We also characterize the power requirements of custom designed accelerator hardware that achieves SotA power efficiency of 8.79 µW, beating general purpose low power hardware (a microcontroller) by 24x and special purpose ASICs by 16x.
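The abstract does not detail the hardware-aware training procedure itself; one common ingredient of such training is simulating the target hardware's fixed-point arithmetic in the forward pass (quantization-aware training). The sketch below is a generic NumPy illustration of that idea only, not the method from the paper; the fake_quantize helper, bit width, and layer sizes are hypothetical.

    import numpy as np

    def fake_quantize(w, bits=8):
        # Round weights to a symmetric fixed-point grid while keeping float storage,
        # so the forward pass "sees" what low-precision hardware would compute.
        scale = np.max(np.abs(w)) / (2 ** (bits - 1) - 1)
        return w if scale == 0 else np.round(w / scale) * scale

    # A dense layer evaluated with quantized weights, as a training-time
    # simulation of low-precision inference hardware.
    rng = np.random.default_rng(0)
    W = rng.normal(size=(40, 64)).astype(np.float32)     # e.g. 40 audio features -> 64 units
    x = rng.normal(size=(40,)).astype(np.float32)
    y = np.maximum(fake_quantize(W, bits=8).T @ x, 0.0)  # ReLU on the quantized matmul
    print(y.shape)  # (64,)

In a full HAT setup the quantization would sit inside the training loop (with gradients passed straight through), alongside any other constraints imposed by the target microcontroller or accelerator.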