# Tomaso A. PoggioMassachusetts Institute of Technology | MIT · Center for Brains, Minds and Machines

Tomaso A. Poggio

McDermott Professor at MIT

## About

715

Publications

203,696

Reads

**How we measure 'reads'**

A 'read' is counted each time someone views a publication summary (such as the title, abstract, and list of authors), clicks on a figure, or views or downloads the full-text. Learn more

90,373

Citations

Citations since 2017

Introduction

Tomaso A. Poggio, is the Eugene McDermott Professor in the BCS Department at MIT and a member of CSAIL and the McGovern Institute. He is an honorary member of the Neuroscience Research Program, a member of the American Academy of Arts and Sciences, a Founding Fellow of AAAI, a founding member of the McGovern Institute for Brain Research. Among other prizes, he received the Laurea Honoris Causa from the University of Pavia, the 2003 Gabor Award and the Okawa Prize 2009. His h-index is > 100.

Additional affiliations

June 1981 - present

Education

September 1966 - September 1972

## Publications

Publications (715)

In this paper, we investigate the Rademacher complexity of deep sparse neural networks, where each neuron receives a small number of inputs. We prove generalization bounds for multilayered sparse ReLU neural networks, including convolutional neural networks. These bounds differ from previous ones, as they consider the norms of the convolutional fil...

We overview several properties—old and new—of training overparameterized deep networks under the square loss. We first consider a model of the dynamics of gradient flow under the square loss in deep homogeneous rectified linear unit networks. We study the convergence to a solution with the absolute minimum ρ , which is the product of the Frobenius...

In this paper, we study kernel ridge-less regression, including the case of interpolating solutions. We prove that maximizing the leave-one-out ([Formula: see text]) stability minimizes the expected error. Further, we also prove that the minimum norm solution — to which gradient algorithms are known to converge — is the most stable solution. More p...

Neural network classifiers are known to be highly vulnerable to adversarial perturbations in their inputs. Under the hypothesis that adversarial examples lie outside of the sub-manifold of natural images, previous work has investigated the impact of principal components in data on adversarial robustness. In this paper we show that there exists a ve...

One of the challenges facing artificial intelligence research today is designing systems capable of utilizing systematic reasoning to generalize to new tasks. The Abstraction and Reasoning Corpus (ARC) measures such a capability through a set of visual reasoning tasks. In this paper we report incremental progress on ARC and lay the foundations for...

We review and apply a computational theory based on the hypothesis that the feedforward path of the ventral stream in visual cortex’s main function is the encoding of invariant representations of images. A key justification of the theory is provided by a result linking invariant representations to small sample complexity for image recognition - tha...

One of the challenges facing artificial intelligence research today is designing systems capable of utilizing systematic reasoning to generalize to new tasks. The Abstraction and Reasoning Corpus (ARC) measures such a capability through a set of visual reasoning tasks. In this paper we report incremental progress on ARC and lay the foundations for...

Multi-baseline stereo is any number of techniques for computing depth maps from several, typically many, photographs of a scene with known camera parameters.

Recent theoretical results show that gradient descent on deep neural networks under exponential loss functions locally maximizes classification margin, which is equivalent to minimizing the norm of the weight matrices under margin constraints. This property of the solution however does not fully characterize the generalization performance. We motiv...

In this paper, we propose an adaptation to the area under the curve (AUC) metric to measure the adversarial robustness of a model over a particular $\epsilon$-interval $[\epsilon_0, \epsilon_1]$ (interval of adversarial perturbation strengths) that facilitates unbiased comparisons across models when they have different initial $\epsilon_0$ performa...

Deep ReLU networks trained with the square loss have been observed to perform well in classification tasks. We provide here a theoretical justification based on analysis of the associated gradient flow. We show that convergence to a solution with the absolute minimum norm is expected when normalization techniques such as Batch Normalization (BN) or...

The spatially-varying field of the human visual system has recently received a resurgence of interest with the development of virtual reality (VR) and neural networks. The computational demands of high resolution rendering desired for VR can be offset by savings in the periphery, while neural networks trained with foveated input have shown perceptu...

2020 Institute of Electrical Engineers of Japan. Published by Wiley Periodicals LLC. During the last few years, significant progress has been made in the theoretical understanding of deep networks. We review our contributions in the areas of approximation theory and optimization. We also introduce a new approach based on cross-validation leave-one-...

A convolutional neural network strongly robust to adversarial perturbations at reasonable computational and performance cost has not yet been demonstrated. The primate visual ventral stream seems to be robust to small perturbations in visual stimuli but the underlying mechanisms that give rise to this robust perception are not understood. In this w...

The main success stories of deep learning, starting with ImageNet, depend on convolutional networks, which on certain tasks perform significantly better than traditional shallow classifiers, such as support vector machines. Is there something special about deep convolutional networks that other learning machines do not possess? Recent results in ap...

While deep learning is successful in a number of applications, it is not yet well understood theoretically. A theoretical characterization of deep learning should answer questions about their approximation power, the dynamics of optimization, and good out-of-sample performance, despite overparameterization and the absence of explicit regularization...

Overparametrized deep networks predict well, despite the lack of an explicit complexity control during training, such as an explicit regularization term. For exponential-type loss functions, we solve this puzzle by showing an effective regularization effect of gradient descent in terms of the normalized weights that are relevant for classification....

In solving a system of $n$ linear equations in $d$ variables $Ax=b$, the condition number of the $n,d$ matrix $A$ measures how much errors in the data $b$ affect the solution $x$. Bounds of this type are important in many inverse problems. An example is machine learning where the key task is to estimate an underlying function from a set of measurem...

While deep learning is successful in a number of applications, it is not yet well understood theoretically. A satisfactory theoretical characterization of deep learning however, is beginning to emerge. It covers the following questions: 1) representation power of deep networks 2) optimization of the empirical risk 3) generalization properties of gr...

We review recent observations on the dynamical systems induced by gradient descent methods used for training deep networks and summarize properties of the solutions they converge to. Recent results illuminate the absence of overfitting in the special case of linear networks for binary classification. They prove that minimization of loss functions s...

The backpropagation (BP) algorithm is often thought to be biologically implausible in the brain. One of the main reasons is that BP requires symmetric weight matrices in the feedforward and feedback pathways. To address this "weight transport problem" (Grossberg, 1987), two more biologically plausible algorithms, proposed by Liao et al. (2016) and...

In this paper, we propose the use of data symmetries, in the sense of equivalences under signal transformations, as priors for learning symmetry-adapted data representations, i.e., representations that are equivariant to these transformations. We rely on a group-theoretic definition of equivariance and provide conditions for enforcing a learned rep...

Given two networks with the same training loss on a dataset, when would they have drastically different test losses and errors? Better understanding of this question of generalization may improve practical applications of deep networks. In this paper we show that with cross-entropy loss it is surprisingly simple to induce significantly different ge...

A main puzzle of deep neural networks (DNNs) revolves around the apparent absence of "overfitting", defined in this paper as follows: the expected error does not get worse when increasing the number of neurons or of iterations of gradient descent. This is surprising because of the large capacity demonstrated by DNNs to fit randomly labeled data and...

An open problem around deep networks is the apparent absence of over-fitting despite large over-parametrization which allows perfect fitting of the training data. In this paper, we explain this phenomenon when each unit evaluates a trigonometric polynomial. It is well understood in the theory of function approximation that approximation by trigonom...

In Theory IIb we characterize with a mix of theory and experiments the optimization of deep convolutional networks by Stochastic Gradient Descent. The main new result in this paper is theoretical and experimental evidence for the following conjecture about SGD: SGD concentrates in probability -- like the classical Langevin equation -- on large volu...

We introduce SITD (Spatial IQ Test Dataset), a dataset used to evaluate the capabilities of computational models for pattern recognition and visual reasoning. SITD is a generator of images in the style of the Raven Progressive Matrices (RPM), a common IQ (Intelligence Quotient) test used to test analytical intelligence. RPMs are purely visual, and...

A main puzzle of deep networks revolves around the absence of overfitting despite overparametrization and despite the large capacity demonstrated by zero training error on randomly labeled data. In this note, we show that the dynamical systems associated with gradient descent minimization of nonlinear networks behave near zero stable minima of the...

Spearman Correlation Coefficient between the dissimilarity structure constructed using the representation of 50 videos computed from the Spatiotemporal Convolutional Neural Network with learned templates and the neural data over all possible choices of the neural data time bin.
Neural data is most informative for action content of the stimulus at t...

a) Classification accuracy, within and across changes in 3D viewpoint for a Recurrent Convolutional Neural Network. This architecture does not outperform a purely feedforward baseline. b) A Recurrent Convolutional Neural Network does not produce a dissimilarity structure that better agrees with the neural data than a purely feedforward baseline.
(T...

Recurrent neural networks and RSA over time.
(DOCX)

The paper reviews and extends an emerging body of theoretical results on deep learning including the conditions under which it can be exponentially better than shallow learning. A class of deep convolutional networks represent an important special case of these conditions, though weight sharing is not the main reason for their exponential advantage...

Tuning properties of simple cells in cortical V1 can be described in terms of a “universal shape” characterized quantitatively by parameter values which hold across different species (Jones and Palmer 1987; Ringach 2002; Niell and Stryker 2008). This puzzling set of findings begs for a general explanation grounded on an evolutionarily important com...

In this work, we focus on the problem of image instance retrieval with deep descriptors extracted from pruned Convolutional Neural Networks (CNN). The objective is to heavily prune convolutional edges while maintaining retrieval performance. To this end, we introduce both data-independent and data-dependent heuristics to prune convolutional edges,...

The goal of this work is the computation of very compact binary hashes for image instance retrieval. Our approach has two novel contributions. The first one is Nested Invariance Pooling (NIP), a method inspired from i-theory, a mathematical theory for computing group invariant transformations with feed-forward neural networks. NIP is able to produc...

Previous theoretical work on deep learning and neural network optimization tend to focus on avoiding saddle points and local minima. However, the practical observation is that, at least for the most successful Deep Convolutional Neural Networks (DCNNs) for visual processing, practitioners can always increase the network size to fit the training dat...

For hydrocarbon exploration, large volumes of data are acquired and used in physical modeling-based workflows to identify geologic features of interest such as fault networks, salt bodies, or, in general, elements of petroleum systems. The adjoint modeling step, which transforms the data into the model space, and subsequent interpretation can be ve...

While the universal approximation property holds both for hierarchical and shallow networks, deep networks can approximate the class of compositional functions as well as shallow networks but with exponentially lower number of training parameters and sample complexity. Compositional functions are obtained as a hierarchy of local constituent functio...

Image instance retrieval is the problem of retrieving images from a database which contain the same object. Convolutional Neural Network (CNN) based descriptors are becoming the dominant approach for generating {\it global image descriptors} for the instance retrieval problem. One major drawback of CNN-based {\it global descriptors} is that uncompr...

The paper reviews an emerging body of theoretical results on deep learning including the conditions under which it can be exponentially better than shallow learning. Deep convolutional networks represent an important special case of these conditions, though weight sharing is not the main reason for their exponential advantage. Explanation of a few...

We systematically explored a spectrum of normalization algorithms related to Batch Normalization (BN) and propose a generalized formulation that simultaneously solves two major limitations of BN: (1) online learning and (2) recurrent learning. Our proposal is simpler and more biologically-plausible. Unlike previous approaches, our technique can be...

The paper briefy reviews several recent results on hierarchical architectures for learning from examples, that may formally explain the conditions under which Deep Convolutional Neural Networks perform much better in function approximation problems than shallow, one-hidden layer architectures. The paper announces new results for a non-smooth activa...

Recognizing the actions of others from visual stimuli is a crucial aspect of human visual perception that allows individuals to respond to social cues. Humans are able to identify similar behaviors and discriminate between distinct actions despite transformations, like changes in viewpoint or actor, that substantially alter the visual appearance of...

The primate brain contains a hierarchy of visual areas, dubbed the ventral stream, which rapidly computes object representations that are both specific for object identity and relatively robust against identity-preserving transformations like depth-rotations. Current computational models of object recognition, including recent deep learning network...

There is a widespread interest among scientists in understanding a specific and well defined form of intelligence, that is human intelligence. For this reason we propose a stronger version of the original Turing test. In particular, we describe here an open-ended set of Turing++ questions that we are developing at the Center for Brains, Minds, and...

We discuss relations between Residual Networks (ResNet), Recurrent Neural Networks (RNNs) and the primate visual cortex. We begin with the observation that a shallow RNN is exactly equivalent to a very deep ResNet with weight sharing among the layers. A direct implementation of such a RNN, although having orders of magnitude fewer parameters, leads...

Faces are an important and unique class of visual stimuli, and have been of interest to neuroscientists for many years. Faces are known to elicit certain characteristic behavioral markers, collectively labeled "holistic processing", while non-face objects are not processed holistically. However, little is known about the underlying neural mechanism...

We describe computational tasks - especially in vision - that correspond to compositional/hierarchical functions. While the universal approximation property holds both for hierarchical and shallow networks, we prove that deep (hierarchical) networks can approximate the class of compositional functions with the same accuracy as shallow networks but...

The ability to recognize the actions of others from visual input is essential
to humans' daily lives. The neural computations underlying action recognition,
however, are still poorly understood. We use magnetoencephalography (MEG)
decoding and a computational model to study action recognition from a novel
dataset of well-controlled, naturalistic vi...

Is visual cortex made up of general-purpose information processing machinery, or does it consist of a collection of specialized modules? If prior knowledge, acquired from learning a set of objects is only transferable to new objects that share properties with the old, then the recognition system's optimal organization must be one containing special...

Learning embeddings of entities and relations is an efficient and versatile
method to perform machine learning on relational data such as knowledge graphs.
In this work, we propose holographic embeddings (HolE) to learn compositional
vector space representations of entire knowledge graphs. The proposed method is
related to holographic models of ass...

Gradient backpropagation (BP) requires symmetric feedforward and feedback
connections--the same weights must be used for forward and backward passes.
This "weight transport problem" [1] is thought to be the crux of BP's
biological implausibility. Using 15 different classification datasets, we
systematically study to what extent BP really depends on...

n the framework of a theory for invariant sensory signal representations, a signature which is invariant and selective for speech sounds can be obtained through projections in template signals and pooling over their transformations under a group. For locally compact groups, e.g., translations, the theory explains the resilience of convolutional neu...

The human brain can rapidly parse a constant stream of visual input. The majority of visual neuroscience studies, however, focus on responses to static, still-frame images. Here we use magnetoencephalography (MEG) decoding and a computational model to study invariant action recognition in videos. We created a well-controlled, naturalistic dataset t...

In i-theory a typical layer of a hierarchical architecture consists of HW
modules pooling the dot products of the inputs to the layer with the
transformations of a few templates under a group. Such layers include as
special cases the convolutional layers of Deep Convolutional Networks (DCNs) as
well as the non-convolutional layers (when the group c...

Reducing the amount of human supervision is a key problem in machine learn- ing and a natural approach is that of exploiting the relations (structure) among different tasks. This is the idea at the core of multi-task learning. In this context a fundamental question is how to incorporate the tasks structure in the learning problem. We tackle this qu...

Learning to predict multi-label outputs is challenging, but in many problems
there is a natural metric on the outputs that can be used to improve
predictions. In this paper we develop a loss function for multi-label learning,
based on the Wasserstein distance. The Wasserstein distance provides a natural
notion of dissimilarity for probability measu...

We analyze in this paper a random feature map based on a theory of invariance
I-theory introduced recently. More specifically, a group invariant signal
signature is obtained through cumulative distributions of group transformed
random projections. Our analysis bridges invariant feature learning with kernel
methods, as we show that this feature map...

The present phase of Machine Learning is characterized by supervised learning algorithms relying on large sets of labeled examples ( ). The next phase is likely to focus on algorithms capable of learning from very few labeled examples ( ), like humans seem able to do. We propose an approach to this problem and describe the underlying theory, based...

Is visual cortex made up of general-purpose information processing machinery, or does it consist of a collection of specialized modules? If prior knowledge, acquired from learning a set of objects is only transferable to new objects that share properties with the old, then the recognition system's optimal organization must be one containing special...

Reducing the amount of human supervision is a key problem in machine learning
and a natural approach is that of exploiting the relations (structure) among
different tasks. This is the idea at the core of multi-task learning. In this
context a fundamental question is how to incorporate the tasks structure in the
learning problem.We tackle this quest...

We study the problem of learning from data representations that are invariant to transformations, and at the same time selective,
in the sense that two points have the same representation if one is the transformation of the other. The mathematical results
here sharpen some of the key claims of i-theory—a recent theory of feedforward processing in s...

The macaque Superior Temporal Sulcus (STS) is a brain area that receives and integrates inputs from both the ventral and dorsal visual processing streams (thought to specialize in form and motion processing respectively). For the processing of articulated actions, prior work has shown that even a small population of STS neurons contains sufficient...

This work was supported by the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF - 1231216.

One approach to computer object recognition and modeling the brain's ventral stream involves unsupervised learning of representations that are invariant to common transformations. However, applications of these ideas have usually been limited to 2D affine transformations, e.g., translation and scaling, since they are easiest to solve via convolutio...

Populations of neurons in inferotemporal cortex (IT) maintain an explicit
code for object identity that also tolerates transformations of object
appearance e.g., position, scale, viewing angle [1, 2, 3]. Though the learning
rules are not known, recent results [4, 5, 6] suggest the operation of an
unsupervised temporal-association-based method e.g.,...

We propose a multi-layer feature extraction framework for speech, capable of providing invariant representations. A set of templates is generated by sampling the result of applying smooth, identity-preserving transformations (such as vocal tract length and tempo variations) to arbitrarily-selected speech signals. Templates are then stored as the we...

Extracting discriminant, transformation-invariant features from raw audio signals remains a serious challenge for speech recognition. The issue of speaker variability is central to this problem , as changes in accent, dialect, gender, and age alter the sound waveform of speech units at multiple levels (phonemes, words, or phrases). Approaches for d...

Recognition of speech, and in particular the ability to generalize and learn
from small sets of labelled examples like humans do, depends on an appropriate
representation of the acoustic input. We formulate the problem of finding
robust speech features for supervised learning with small sample complexity as
a problem of learning representations of...

Faces are a class of visual stimuli with unique significance, for a variety
of reasons. They are ubiquitous throughout the course of a person's life, and
face recognition is crucial for daily social interaction. Faces are also unlike
any other stimulus class in terms of certain physical stimulus characteristics.
Furthermore, faces have been empiric...

We develop a sampling extension of M-theory focused on invariance to scale
and translation. Quite surprisingly, the theory predicts an architecture of
early vision with increasing receptive field sizes and a high resolution fovea
-- in agreement with data about the cortical magnification factor, V1 and the
retina. From the slope of the inverse of t...

We have developed a computer system for reconstructing and analyzing three dimensional flight trajectories of flies. Its application to the study of the free flight behaviour of the fruitfly Drosophila melanogaster is described. The main results are: a) Drosophila males only occasionally track other flies; b) in such cases the fly's angular velocit...