## About

Publications: 731

Reads: 93,351

Citations: 52,302

## Publications

We explore whether Large Language Models (LLMs) are capable of logical reasoning with distorted facts, which we call Deduction under Perturbed Evidence (DUPE). DUPE presents a unique challenge to LLMs since they typically rely on their parameters, which encode mostly accurate information, to reason and make inferences. However, in DUPE, LLMs must r...

In this paper, we introduce Parallel Attention and Feed-Forward Net Design (PAF) for transformer models. Transformer models are indisputably the backbone of all Natural Language Processing applications. Therefore, any efforts aimed at improving their efficiency are guaranteed to have an enormous impact. Transformer models consist of many layers and...

We present a design framework called Conversational Learning with Analytical Step-by-Step Strategies (CLASS) for developing high-performance Intelligent Tutoring Systems (ITS). The CLASS framework aims to empower ITS with two critical capabilities: imparting tutor-like step-by-step guidance and enabling tutor-like conversations in natural lang...

The first step toward investigating the effectiveness of a treatment via a randomized trial is to split the population into control and treatment groups then compare the average response of the treatment group receiving the treatment to the control group receiving the placebo. To ensure that the difference between the two groups is caused only by t...

Pairwise dot product-based self-attention is key to the success of transformers, which achieve state-of-the-art performance across a variety of applications in language and vision but are costly to compute. It has been shown that most attention scores and keys in transformers are redundant and can be removed without loss of accuracy. In this paper,...

Accurate estimation of the states of a nonlinear dynamical system is crucial for its design, synthesis, and analysis. Particle filters are estimators constructed by simulating trajectories from a sampling distribution and averaging them based on their importance weight. For particle filters to be computationally tractable, it must be feasible to...

Implicit neural representations (INRs) have recently advanced numerous vision-related areas. INR performance depends strongly on the choice of the nonlinear activation function employed in its multilayer perceptron (MLP) network. A wide range of nonlinearities have been explored, but, unfortunately, current INRs designed to have high accuracy also...

Microfluidics can split samples into thousands or millions of partitions such as droplets or nanowells. Partitions capture analytes according to a Poisson distribution, and in diagnostics, the analyte concentration is commonly calculated with a closed-form solution via maximum likelihood estimation (MLE). Here, we present a generalization of MLE wi...
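The closed-form estimate mentioned in this abstract can be illustrated with a minimal sketch. It assumes the standard digital-assay setting in which each partition is only read as positive or negative, so that under Poisson loading the mean number of analytes per partition is estimated as -ln(1 - k/n) for k positive partitions out of n. The function name and the `partition_volume` parameter are hypothetical, and the generalization the paper presents is not shown here.

```python
import math


def poisson_mle_concentration(positive, total, partition_volume):
    """Closed-form MLE of analyte concentration from binary partition readouts.

    Assumption (not taken from the abstract): a partition is "positive" if it
    captured at least one analyte, so P(negative) = exp(-lam), where lam is the
    mean number of analytes per partition.
    """
    if total <= 0 or partition_volume <= 0:
        raise ValueError("total and partition_volume must be positive")
    if not 0 <= positive < total:
        # With every partition positive the estimate diverges (assay saturated).
        raise ValueError("need 0 <= positive < total")
    # MLE of the Poisson mean per partition from the fraction of negatives.
    mean_per_partition = -math.log(1.0 - positive / total)
    # Convert to a concentration (analytes per unit volume).
    return mean_per_partition / partition_volume


# Example: half of 1000 unit-volume partitions are positive.
estimate = poisson_mle_concentration(500, 1000, 1.0)
```

With half of the partitions positive, the estimated mean occupancy is ln 2, which is higher than the naive positive fraction of 0.5 because some positive partitions hold more than one analyte.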

This paper investigates the problem of Named Entity Recognition (NER) for extreme low-resource languages with only a few hundred tagged data samples. NER is a fundamental task in Natural Language Processing (NLP). A critical driver accelerating NER systems' progress is the existence of large-scale language corpora that enable NER systems to achieve...

Foveated imaging provides a better tradeoff between situational awareness (field of view) and resolution and is critical in long-wavelength infrared regimes because of the size, weight, power, and cost of thermal sensors. We demonstrate computational foveated imaging by exploiting the ability of a meta-optical frontend to discriminate between diffe...

We develop new theoretical results on matrix perturbation to shed light on the impact of architecture on the performance of a deep network. In particular, we explain analytically what deep learning practitioners have long observed empirically: the parameters of some deep architectures (e.g., residual networks, ResNets, and Dense networks, DenseNets...

We study the generalization behavior of transfer learning of deep neural networks (DNNs). We adopt the overparameterization perspective -- featuring interpolation of the training data (i.e., approximately zero train error) and the double descent phenomenon -- to explain the delicate effect of the transfer learning setting on generalization performa...

We take a random matrix theory approach to random sketching and show an asymptotic first-order equivalence of the regularized sketched pseudoinverse of a positive semidefinite matrix to a certain evaluation of the resolvent of the same matrix. We focus on real-valued regularization and extend previous results on an asymptotic equivalence of random...

We study the interpolation capabilities of implicit neural representations (INRs) of images. In principle, INRs promise a number of advantages, such as continuous derivatives and arbitrary sampling, being freed from the restrictions of a raster grid. However, empirically, INRs have been observed to poorly interpolate between the pixels of the fit i...

Transformer models trained on massive text corpora have become the de facto models for a wide range of natural language processing tasks. However, learning effective word representations for function words remains challenging. Multimodal learning, which visually grounds transformer models in imagery, can overcome the challenges to some extent; howe...

We introduce a new neural signal model designed for efficient high-resolution representation of large-scale signals. The key innovation in our multiscale implicit neural representation (MINER) is an internal representation via a Laplacian pyramid, which provides a sparse multiscale decomposition of the signal that captures orthogonal parts of the s...

A critically important, ubiquitous, and yet poorly understood ingredient in modern deep networks (DNs) is batch normalization (BN), which centers and normalizes the feature maps. To date, only limited progress has been made understanding why BN boosts DN learning and inference performance; work has focused exclusively on showing that BN smooths a D...

Transformers have achieved remarkable success in sequence modeling and beyond but suffer from quadratic computational and memory complexities with respect to the length of the input sequence. Leveraging techniques including sparse and linear attention and hashing tricks, efficient transformers have been proposed to reduce the quadratic complexity of...

Multi-head attention is a driving force behind state-of-the-art transformers, which achieve remarkable performance across a variety of natural language processing (NLP) and computer vision tasks. It has been observed that for many applications, those attention heads learn redundant embedding, and most of them can be removed without degrading the pe...

Does a neural network's privacy have to be at odds with its accuracy? In this work, we study the effects the number of training epochs and parameters have on a neural network's vulnerability to membership inference (MI) attacks, which aim to extract potentially private information about the training data. We first demonstrate how the number of trai...

Automated scoring of open-ended student responses has the potential to significantly reduce human grader effort. Recent advances in automated scoring often leverage textual representations based on pre-trained language models such as BERT and GPT as input to scoring models. Most existing approaches train a separate model for each item/question, whi...

Compressed sensing (CS) is a signal processing technique that enables the efficient recovery of a sparse high-dimensional signal from low-dimensional measurements. In the multiple measurement vector (MMV) framework, a set of signals with the same support must be recovered from their corresponding measurements. Here, we present the first exploration...

DeepTensor is a computationally efficient framework for low-rank decomposition of matrices and tensors using deep generative networks. We decompose a tensor as the product of low-rank tensor factors (e.g., a matrix as the outer product of two vectors), where each low-rank tensor is generated by a deep network (DN) that is trained in a self-supervis...

We discuss methods for visualizing neural network decision boundaries and decision regions. We use these visualizations to investigate issues related to reproducibility and generalization in neural network training. We observe that changes in model architecture (and its associated inductive bias) cause visible changes in decision boundaries, while m...

Centroid-based clustering methods such as k-means, k-medoids and k-centers are heavily applied as a go-to tool in exploratory data analysis. In many cases, those methods are used to obtain representative centroids of the data manifold for visualization or summarization of a dataset. Real-world datasets often contain inherent abnormalities, e.g., re...

We present Polarity Sampling, a theoretically justified plug-and-play method for controlling the generation quality and diversity of pre-trained deep generative networks (DGNs). Leveraging the fact that DGNs are, or can be approximated by, continuous piecewise affine splines, we derive the analytical DGN output space distribution as a function of th...

Recurrent Neural Networks (RNNs) are important tools for processing sequential data such as time-series or video. Interpretability is defined as the ability to be understood by a person and is different from explainability, which is the ability to be explained in a mathematical formulation. A key interpretability issue with RNNs is that it is not c...

Knowledge tracing refers to the problem of estimating each student's knowledge component/skill mastery level from their past responses to questions in educational applications. One direct benefit knowledge tracing methods provide is the ability to predict each student's performance on the future questions. However, one key limitation of most existi...

$K$-means is one of the most widely used centroid-based clustering algorithms, with performance tied to the data's embedding. Intricate data embeddings have been designed to push $K$-means performance at the cost of reduced theoretical guarantees and interpretability of the results. Instead, we propose preserving the intrinsic data space and augme...

We introduce a new neural signal representation designed for the efficient high-resolution representation of large-scale signals. The key innovation in our multiscale implicit neural representation (MINER) is an internal representation via a Laplacian pyramid, which provides a sparse multiscale representation of the signal that captures orthogonal...

A surprising phenomenon in modern machine learning is the ability of a highly overparameterized model to generalize well (small error on the test data) even when it is trained to memorize the training data (zero error on the training data). This has led to an arms race towards increasingly overparameterized models (cf. deep learning). In this pap...

Graph filtering is the cornerstone operation in graph signal processing (GSP). Thus, understanding it is key in developing potent GSP methods. Graph filters are local and distributed linear operations, whose output depends only on the local neighborhood of each node. Moreover, a graph filter's output can be computed separately at each node by carry...

Time-Frequency Signal Analysis and Processing (TFSAP) is a collection of theory and algorithms used for analysis and processing of nonstationary signals, as found in a wide range of applications including telecommunications, radar, and biomedical engineering. This book brings together the main knowledge of TFSAP, from theory to applications, in a u...

The first step towards investigating the effectiveness of a treatment is to split the population into the control and the treatment groups, then compare the average responses of the two groups to the treatment. In order to ensure that the difference in the two groups is only caused by the treatment, it is crucial for the control and the treatment g...

Deep Generative Networks (DGNs) are extensively employed in Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and their variants to approximate the data manifold, and data distribution on that manifold. However, training samples are often obtained based on preferences, costs, or convenience producing artifacts in the empirica...

Deep neural networks (DNs) provide superhuman performance in numerous computer vision tasks, yet it remains unclear exactly which of a DN's units contribute to a particular decision. NeuroView is a new family of DN architectures that are interpretable/explainable by design. Each member of the family is derived from a standard DN architecture by vec...

Deep neural networks have become essential for numerous applications, such as vision, RL, and classification, due to their strong empirical performance. Unfortunately, these networks are quite difficult to interpret, and this limits their applicability in settings where interpretability is important for safety, such as medical imaging. One type of de...

We develop a measure for evaluating the performance of generative networks given two sets of images. A popular performance measure currently used to do this is the Fréchet Inception Distance (FID). However, FID assumes that images featurized using the penultimate layer of Inception-v3 follow a Gaussian distribution. This assumption allows FID to...

We study the problem of generating arithmetic math word problems (MWPs) given a math equation that specifies the mathematical computation and a context that specifies the problem scenario. Existing approaches are prone to generating MWPs that are either mathematically invalid or have unsatisfactory language quality. They also either ignore the cont...

The rapid recent progress in machine learning (ML) has raised a number of scientific questions that challenge the longstanding dogma of the field. One of the most important riddles is the good empirical generalization of overparameterized models. Overparameterized models are excessively complex with respect to the size of the training dataset, whic...

Motivation: The scalable design of safe guide RNA sequences for CRISPR gene editing depends on the computational “scoring” of DNA locations that may be edited. As there is no widely accepted benchmark dataset to compare scoring models, we present a curated “TrueOT” dataset that contains thoroughly validated datapoints to best reflect the properties...

Inspired by the success of Convolutional Neural Networks (CNNs) for supervised prediction in images, we design the Deconvolutional Generative Model (DGM), a new probabilistic generative model whose inference calculations correspond to those in a given CNN architecture. The DGM uses a CNN to design the prior distribution in the probabilistic model....

Among the most successful methods for sparsifying deep (neural) networks are those that adaptively mask the network weights throughout training. By examining this masking, or dropout, in the linear case, we uncover a duality between such adaptive methods and regularization through the so-called "$\eta$-trick" that casts both as iteratively reweight...

In this work, we explore weakly supervised machine learning for classifying questions into distinct Bloom’s Taxonomy levels. Bloom’s levels provide important information that guides teachers and adaptive learning algorithms in selecting appropriate questions for their students. However, manually providing Bloom labels is expensive and labor-intensi...

We study overparameterization in generative adversarial networks (GANs) that can interpolate the training data. We show that overparameterization can improve generalization performance and accelerate the training process. We study the generalization error as a function of latent space dimension and identify two main behaviors, depending on the lear...

Online education platforms enable teachers to share a large number of educational resources such as questions to form exercises and quizzes for students. With large volumes of available questions, it is important to have an automated way to quantify their properties and intelligently select them for students, enabling effective and personalized lea...

Feedback on student answers and even during intermediate steps in their solutions to open-ended questions is an important element in math education. Such feedback can help students correct their errors and ultimately lead to improved learning outcomes. Most existing approaches for automated student solution analysis and feedback require manually co...

We introduce a novel video-rate hyperspectral imager with high spatial, temporal and spectral resolutions. Our key hypothesis is that spectral profiles of pixels within each super-pixel tend to be similar. Hence, a scene-adaptive spatial sampling of a hyperspectral scene, guided by its super-pixel segmented image, is capable of obtaining high-qual...

Knowledge graphs link entities through relations to provide a structured representation of real world facts. However, they are often incomplete, because they are based on only a small fraction of all plausible facts. The task of knowledge graph completion via link prediction aims to overcome this challenge by inferring missing facts represented as...

This competition concerns educational diagnostic questions, which are pedagogically effective, multiple-choice questions (MCQs) whose distractors embody misconceptions. With a large and ever-increasing number of such questions, it becomes overwhelming for teachers to know which questions are the best ones to use for their students. We thus seek to...

Jacobian-vector products (JVPs) form the backbone of many recent developments in Deep Networks (DNs), with applications including faster constrained optimization, regularization with generalization guarantees, and adversarial example sensitivity assessments. Unfortunately, JVPs are computationally expensive for real-world DN architectures and requi...

In this paper, we study the importance of pruning in Deep Networks (DNs) and motivate it based on the current absence of data-aware weight initialization. Current DN initializations, focusing primarily on maintaining first order statistics of the feature maps through depth, force practitioners to overparametrize a model in order to reach high perfo...

Ridge-like regularization often leads to improved generalization performance of machine learning models by mitigating overfitting. While ridge-regularized machine learning methods are widely used in many important applications, direct training via optimization could become challenging in huge data scenarios with millions of examples and features. W...

We introduce an end-to-end (E2E) deep neural network architecture designed to perform seismo-volcanic monitoring focused on detecting change. Due to the complexity of volcanic processes, this requires a polyphonic detection, segmentation, and classification approach. Through evolving epistemic uncertainty, invoking a Bayesian network strategy, we d...

We build a rigorous bridge between deep networks (DNs) and approximation theory via spline functions and operators. Our key result is that a large class of DNs can be written as a composition of max-affine spline operators (MASOs) that provide a powerful portal through which we view and analyze their inner workings. For instance, conditioned on the...
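The max-affine view described in this abstract can be illustrated on the simplest case. The sketch below assumes a single fully connected ReLU layer and checks that each output unit equals the maximum of two affine functions of the input: the zero map and the unit's own affine map. All function names are hypothetical, and this textbook special case is not the paper's general construction.

```python
def affine(weights, bias, x):
    """Compute W x + b for a weight matrix given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu_layer(weights, bias, x):
    """Standard fully connected layer followed by ReLU."""
    return [max(0.0, z) for z in affine(weights, bias, x)]


def relu_as_max_affine(weights, bias):
    """Rewrite the ReLU layer as per-unit sets of affine pieces.

    For unit k the pieces are (zero slope, zero offset) and (row k, bias k).
    """
    slopes = [[[0.0] * len(row), list(row)] for row in weights]
    offsets = [[0.0, b] for b in bias]
    return slopes, offsets


def max_affine_layer(slopes, offsets, x):
    """Evaluate a max-affine operator: each output is a max over affine pieces."""
    return [
        max(sum(w * xi for w, xi in zip(piece, x)) + b
            for piece, b in zip(unit_slopes, unit_offsets))
        for unit_slopes, unit_offsets in zip(slopes, offsets)
    ]


# Hypothetical 2-unit layer used only for the comparison.
W = [[1.0, -2.0], [0.5, 0.5]]
b = [0.1, -1.0]
slopes, offsets = relu_as_max_affine(W, b)
inputs = [[2.0, 0.5], [0.3, 0.7], [-1.0, 4.0]]
agree = all(relu_layer(W, b, x) == max_affine_layer(slopes, offsets, x)
            for x in inputs)
```

Because each unit is a maximum of affine functions, the layer is piecewise affine and convex in its input, which is the property the spline viewpoint builds on.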

We design an interpretable clustering algorithm aware of the nonlinear structure of image manifolds. Our approach leverages the interpretability of $K$-means applied in the image space while addressing its clustering performance issues. Specifically, we develop a measure of similarity between images and centroids that encompasses a general class of...

Kernels derived from deep neural networks (DNNs) in the infinite-width limit provide not only high performance in a range of machine learning tasks but also new theoretical insights into DNN training dynamics and generalization. In this paper, we extend the family of kernels associated with recurrent neural networks (RNNs), which were previously derived...

High dimensionality poses many challenges to the use of data, from visualization and interpretation, to prediction and storage for historical preservation. Techniques abound to reduce the dimensionality of fixed-length sequences, yet these methods rarely generalize to variable-length sequences. To address this gap, we extend existing methods that r...