Naftali Tishby

  • Professor
  • Professor (Full) at Hebrew University of Jerusalem

About

290
Publications
59,629
Reads
22,003
Citations
Current institution
Hebrew University of Jerusalem
Current position
  • Professor (Full)

Publications (290)
Article
Full-text available
Biological systems often choose actions without an explicit reward signal, a phenomenon known as intrinsic motivation. The computational principles underlying this behavior remain poorly understood. In this study, we investigate an information-theoretic approach to intrinsic motivation, based on maximizing an agent's empowerment (the mutual informa...
Preprint
Full-text available
Biological systems often choose actions without an explicit reward signal, a phenomenon known as intrinsic motivation. The computational principles underlying this behavior remain poorly understood. In this study, we investigate an information-theoretic approach to intrinsic motivation, based on maximizing an agent's empowerment (the mutual informa...
Article
Full-text available
The attentional blink (AB) effect is the reduced probability of reporting a second target (T2) that appears shortly after a first one (T1) within a rapidly presented sequence of distractors. The AB effect has been shown to be reduced following intensive mental training in the form of mindfulness meditation, with a corresponding reduction in T1-evok...
Article
It has been proposed that semantic systems evolve under pressure for efficiency. This hypothesis has so far been supported largely indirectly, by synchronic cross-language comparison, rather than directly by diachronic data. Here, we directly test this hypothesis in the domain of color naming, by analyzing recent diachronic data from Nafaanra, a la...
Preprint
Full-text available
It has been proposed that semantic systems evolve under pressure for efficiency. This hypothesis has so far been supported largely indirectly, by synchronic cross-language comparison, rather than directly by diachronic data. Here, we directly test this hypothesis in the domain of color naming, by analyzing recent diachronic data from Nafaanra, a la...
Preprint
In Rate Distortion (RD) problems one seeks reduced representations of a source that meet a target distortion constraint. Such optimal representations undergo topological transitions at some critical rate values, when their cardinality or dimensionality change. We study the convergence time of the Arimoto-Blahut alternating projection algorithms, us...
Article
Full-text available
We introduce a novel methodology for describing animal behavior as a tradeoff between value and complexity, using the Morris Water Maze navigation task as a concrete example. We develop a dynamical system model of the Water Maze navigation task, solve its optimal control under varying complexity constraints, and analyze the learning process in term...
Article
Full-text available
Objective. One of the recent developments in the field of brain–computer interfaces (BCI) is the reinforcement learning (RL) based BCI paradigm, which uses neural error responses as the reward feedback on the agent’s action. While having several advantages over motor imagery based BCI, the reliability of RL-BCI is critically dependent on the decodi...
Preprint
Full-text available
The Information Bottleneck (IB) framework is a general characterization of optimal representations obtained using a principled approach for balancing accuracy and complexity. Here we present a new framework, the Dual Information Bottleneck (dualIB), which resolves some of the known drawbacks of the IB. We provide a theoretical analysis of the dualI...
Article
Full-text available
Canonical Correlation Analysis (CCA) is a linear representation learning method that seeks maximally correlated variables in multi-view data. Nonlinear CCA extends this notion to a broader family of transformations, which are more powerful in many real-world applications. Given the joint probability, the Alternating Conditional Expectation (ACE) al...
Article
Full-text available
The limited capacity of recent memory inevitably leads to partial memory of past stimuli. There is also evidence that behavioral and neural responses to novel or rare stimuli are dependent on one’s memory of past stimuli. Thus, these responses may serve as a probe of different individuals’ remembering and forgetting characteristics. Here, we utiliz...
Article
Machine learning (ML) encompasses a broad range of algorithms and modeling tools used for a vast array of data processing tasks, which has entered most scientific disciplines in recent years. This article reviews in a selective way the recent research on the interface between machine learning and the physical sciences. This includes conceptual deve...
Article
We describe a novel model of human eye gaze behavior under workload, derived from the basic principle of information constrained control. The model assumes two distributions over the visual field: A saliency distribution, which is nongoal oriented, and a reward task-related distribution. The eye gaze behavior is determined by the tradeoff between t...
Preprint
Full-text available
It has been argued that semantic categories across languages reflect pressure for efficient communication. Recently, this idea has been cast in terms of a general information-theoretic principle of efficiency, the Information Bottleneck (IB) principle, and it has been shown that this principle accounts for the emergence and evolution of named color...
Preprint
Full-text available
The limited capacity of recent memory inevitably leads to partial memory of past stimuli. There is also evidence that behavioral and neural responses to novel or rare stimuli are dependent on one’s memory of past stimuli. Thus, these responses may serve as a probe of different individuals’ remembering and forgetting characteristics. Here, we utiliz...
Article
Full-text available
Colour naming across languages has traditionally been held to reflect the structure of colour perception. At the same time, it has often, and increasingly, been suggested that colour naming may be shaped by patterns of communicative need. However, much remains unknown about the factors involved in communicative need, how need interacts with percept...
Preprint
Full-text available
Machine learning encompasses a broad range of algorithms and modeling tools used for a vast array of data processing tasks, which has entered most scientific disciplines in recent years. We review in a selective way the recent research on the interface between machine learning and physical sciences. This includes conceptual developments in machine l...
Article
Gibson et al. (2017) argued that color naming is shaped by patterns of communicative need. In support of this claim, they showed that color naming systems across languages support more precise communication about warm colors than cool colors, and that the objects we talk about tend to be warm‐colored rather than cool‐colored. Here, we present new a...
Preprint
Full-text available
Canonical Correlation Analysis (CCA) is a linear representation learning method that seeks maximally correlated variables in multi-view data. Non-linear CCA extends this notion to a broader family of transformations, which are more powerful for many real-world applications. Given the joint probability, the Alternating Conditional Expectation (ACE)...
Article
Full-text available
Significance: Semantic typology documents and explains how languages vary in their structuring of meaning. Information theory provides a formal model of communication that includes a precise definition of efficient compression. We show that color-naming systems across languages achieve near-optimal compression and that this principle explains much o...
Preprint
Full-text available
Gibson et al. (2017) argued that color naming is shaped by patterns of communicative need. In support of this claim, they showed that color naming systems across languages support more precise communication about warm colors than cool colors, and that the objects we talk about tend to be warm-colored rather than cool-colored. Here, we present new a...
Preprint
The attentional blink (AB) effect is the reduced ability of subjects to report a second target stimuli (T2) among a rapidly presented series of non-target stimuli, when it appears within a time window of about 200-500 ms after a first target (T1). We present a simple dynamical systems model explaining the AB as resulting from the temporal response dyn...
Article
Full-text available
In an era of big data there is a growing need for memory-bounded learning algorithms. In the last few years researchers have investigated what cannot be learned under memory constraints. In this paper we focus on the complementary question of what can be learned under memory constraints. We show that if a hypothesis class fulfills a combinatorial c...
Preprint
Maintaining efficient semantic representations of the environment is a major challenge both for humans and for machines. While human languages represent useful solutions to this problem, it is not yet clear what computational principle could give rise to similar solutions in machines. In this work we propose an answer to this open question. We sugg...
Article
Full-text available
Understanding the computational implications of specific synaptic connectivity patterns is a fundamental goal in neuroscience. In particular, the computational role of ubiquitous electrical synapses operating via gap junctions remains elusive. In the fly visual system, the cells in the vertical-system network, which play a key role in visual proces...
Data
(A) The information about the axis of rotation encoded by the axonal voltages of the VS 5-6-7 triplet with the integration window extending from 10 ms to 50 ms, with (in blue) and without (in orange) GJs, respectively. (B) The colored bars show the information about the axis of rotation encoded by the axonal voltages of the VS 5-6-7 triplet for GJs...
Data
Goodness of fit for the Gaussian copula combining dendrite input and stimuli θ. (A) The ten most significant principal components from the currents and the percentages of variance that they explain individually based on the natural stimuli. Note that 90% of the variance can be explained with the two most significant principal components. (B) The qu...
Data
Optic flow of rotation in the counterclockwise direction around the axis θ (black dashed line). Note that this rotation yields no motion at the axis itself. The further away the respective azimuth degree is from the axis θ (up to 90°), the greater the rotation. (TIF)
Data
Emergence of hyperacuity and improvement in discrimination with GJs for the representation by VS 5-6-7 triplet. (A) The discriminability, d’, between θ and θ' (θ—θ’ = Δθ) for all axes of rotation as a function of Δθ, with (yellow) and without (blue) GJs. Error bars indicate one standard deviation. Note that only the blue curve intersects the hypera...
Data
Both smoothing (reducing trial-to-trial variability) and improving correlation contribute to better encoding of the axis of rotation. (A) Joint axonal voltage response of VS5 versus VS6 in the absence of GJs. A total of 1000 samples for both θ = 0° (green) and for θ = 60° (red) in response to natural stimuli are shown (see Materials and Methods). Thei...
Data
Encodings by triplets of VS neurons are divided into clusters according to their tuning spacing. Encoding of triplets for natural stimuli (with GJs), color coded according to the triplet tuning spacing (see text). Note that, e.g., the cluster in red contains both triplets with spacing of 32° and triplets with spacing of 48° and contain VS1 or V...
Data
With GJs, encoding by the VS 5-6-7 triplet shows hyperacuity level discrimination. (DOCX)
Article
Full-text available
The Information Bottleneck (IB) is a conceptual method for extracting the most compact, yet informative, representation of a set of variables, with respect to the target. It generalizes the notion of minimal sufficient statistics from classical parametric statistics to a broader information-theoretic sense. The IB curve defines the optimal trade-of...
Article
Full-text available
The interaction between an artificial agent and its environment is bi-directional. The agent extracts relevant information from the environment, and affects the environment by its actions in return to accumulate high expected reward. Standard reinforcement learning (RL) deals with the expected reward maximization. However, there are always informat...
Article
Full-text available
We suggest analyzing neural networks through the prism of space constraints. We observe that most training algorithms applied in practice use bounded memory, which enables us to use a new notion introduced in the study of space-time tradeoffs that we call mixing complexity. This notion was devised in order to measure the (in)ability to learn using...
Article
Full-text available
Despite their great success, there is still no comprehensive theoretical understanding of learning with Deep Neural Networks (DNNs) or their inner organization. Previous work [Tishby & Zaslavsky (2015)] proposed to analyze DNNs in the Information Plane; i.e., the plane of the Mutual Information values that each layer preserves on the input and...
Article
Full-text available
Stochastic dynamic control systems relate in a probabilistic fashion the space of control signals to the space of corresponding future states. Consequently, stochastic dynamic systems can be interpreted as an information channel between the control space and the state space. In this work we study this control-to-state information capacity of sto...
Conference Paper
Retentive (memory-utilizing) sensing-acting agents, when they are distributed or power-constrained, often operate under limitations on the communication between their sensing, memory and acting components. This requires them to trade off the external cost that they incur with the capacity of their communication channels, which translates into the s...
Conference Paper
With the increased demand for power efficiency in feedback-control systems, communication is becoming a limiting factor, raising the need to trade off the external cost that they incur with the capacity of the controller's communication channels. With a proper design of the channels, this translates into a sequential rate-distortion problem, where...
Conference Paper
It is well known that options can make planning more efficient, among their many benefits. Thus far, algorithms for autonomously discovering a set of useful options were heuristic. Naturally, a principled way of finding a set of useful options may be more promising and insightful. In this paper we suggest a mathematical characterization of good set...
Article
Full-text available
The attentional blink (AB) effect is the reduced ability of subjects to report a second target stimuli (T2) among a rapidly presented series of non-target stimuli, when it appears within a time window of about 200-500 ms after a first target (T1). We present a simple dynamical systems model explaining the AB as resulting from the temporal response...
Article
Full-text available
A crucial aspect of all life is the ability to use past events in order to guide future behavior. To do that, creatures need the ability to predict future events. Indeed, predictability has been shown to affect neuronal responses in many animals and under many conditions. Clearly, the quality of predictions should depend on the amount and d...
Conference Paper
Full-text available
Model-free reinforcement learning algorithms, such as Q-learning, perform poorly in the early stages of learning in noisy environments, because much effort is spent unlearning biased estimates of the state-action value function. The bias results from selecting, among several noisy estimates, the apparent optimum, which may actually be suboptimal. W...
Preprint
Retentive (memory-utilizing) sensing-acting agents may operate under limitations on the communication between their sensing, memory and acting components, requiring them to trade off the external cost that they incur with the capacity of their communication channels. In this paper we formulate this problem as a sequential rate-distortion problem of...
Preprint
With the increased demand for power efficiency in feedback-control systems, communication is becoming a limiting factor, raising the need to trade off the external cost that they incur with the capacity of the controller's communication channels. With a proper design of the channels, this translates into a sequential rate-distortion problem, where...
Article
Full-text available
There is a remarkable consensus that human and non-human subjects experience temporal distortions in many stages of their perceptual and decision-making systems. Similarly, intertemporal choice research has shown that decision-makers undervalue future outcomes relative to immediate ones. Here we combine techniques from information theory and artifi...
Preprint
In POMDPs, information about the hidden state, delivered through observations, is both valuable to the agent, allowing it to base its actions on better informed internal states, and a "curse", exploding the size and diversity of the internal state space. One attempt to deal with this is to focus on reactive policies, that only base their actions on...
Article
Full-text available
Bounded rationality, that is, decision-making and planning under resource limitations, is widely regarded as an important open problem in artificial intelligence, reinforcement learning, computational neuroscience and economics. This paper offers a consolidated presentation of a theory of bounded rationality based on information-theoretic ideas. We...
Article
Full-text available
We present new tools for categorizing chords based on corpus data, applicable to a variety of representations from Roman numerals to MIDI notes. Using methods from information theory, we propose that harmonic theories should be evaluated by at least two criteria, accuracy (how well the theory describes the musical surface) and complexity (the effic...
Article
Full-text available
Linear models have been used in several contexts to study the mechanisms that underpin sensorimotor synchronization. Given that their parameters are often linked to psychological processes such as phase correction and period correction, the fit of the parameters to experimental data is an important practical question. We present a unified method fo...
Article
The mechanisms that support sensorimotor synchronization — that is, the temporal coordination of movement with an external rhythm — are often investigated using linear computational models. The main method used for estimating the parameters of this type of model was established in the seminal work of Vorberg and Schulze (2002), and is based on fitt...
Article
Full-text available
Deep Neural Networks (DNNs) are analyzed via the theoretical framework of the information bottleneck (IB) principle. We first show that any DNN can be quantified by the mutual information between the layers and the input and output variables. Using this representation we can calculate the optimal information theoretic limits of the DNN and obtain f...
Conference Paper
Full-text available
We suggest a unified view of two known prediction algorithms: Context Tree Weighting (CTW) and Prediction Suffix Tree (PST), by formulating them as information limited control problems. Using a unified view of planning and information gathering we suggest a new algorithm that combines the advantages of these two extreme algorithms and interpolates...
Conference Paper
Full-text available
Previous work has shown that classical sequential decision making rules, including expectimax and minimax, are limit cases of a more general class of bounded rational planning problems that trade off the value and the complexity of the solution, as measured by its information divergence from a given reference. This allows modeling a range of novel...
Article
Full-text available
The common approaches to feature extraction in speech processing are generative and parametric although they are highly sensitive to violations of their model assumptions. Here, we advocate the non-parametric Information Bottleneck (IB). IB is an information theoretic approach that extends minimal sufficient statistics. However, unlike minimal suff...
Preprint
The Information bottleneck method is an unsupervised non-parametric data organization technique. Given a joint distribution P(A,B), this method constructs a new variable T that extracts partitions, or clusters, over the values of A that are informative about B. The information bottleneck has already been applied to document classification, gene exp...
Article
Full-text available
The problem of finding a reduced dimensionality representation of categorical variables while preserving their most relevant characteristics is fundamental for the analysis of complex data. Specifically, given a co-occurrence matrix of two variables, one often seeks a compact representation of one variable which preserves information about the othe...
Article
Full-text available
Exponential models of distributions are widely used in machine learning for classification and modelling. It is well known that they can be interpreted as maximum entropy models under empirical expectation constraints. In this work, we argue that for classification tasks, mutual information is a more suitable information theoretic measure to be o...
Conference Paper
In Passive POMDPs actions do not affect the world state, but still incur costs. When the agent is bounded by information-processing constraints, it can only keep an approximation of the belief. We present a variational principle for the problem of maintaining the information which is most useful for minimizing the cost, and introduce an efficient a...
Article
Full-text available
We obtain a tight distribution-specific characterization of the sample complexity of large-margin classification with L2 regularization: We introduce the margin-adapted dimension, which is a simple function of the second order statistics of the data distribution, and show distribution-specific upper and lower bounds on the sample complexity, both g...
Chapter
Full-text available
Interactions between an organism and its environment are commonly treated in the framework of Markov Decision Processes (MDP). While standard MDP is aimed at maximizing expected future rewards (value), the circular flow of information between the agent and its environment is generally ignored. In particular, the information gained from the environ...
Article
Video surveillance is becoming the technology of choice for monitoring crowded areas for security threats. While video provides ample information for human inspectors, there is a great need for robust automated techniques that can efficiently detect anomalous behavior in streaming video from single or multiple cameras. In this work we synergistical...
Article
Full-text available
In the supervised learning setting termed Multiple-Instance Learning (MIL), the examples are bags of instances, and the bag label is a function of the labels of its instances. Typically, this function is the Boolean OR. The learner observes a sample of bags and the bag labels, but not the instance labels that determine the bag labels. The learner i...
Article
Full-text available
Previous reinforcement-learning models of the basal ganglia network have highlighted the role of dopamine in encoding the mismatch between prediction and reality. Far less attention has been paid to the computational goals and algorithms of the main-axis (actor). Here, we construct a top-down model of the basal ganglia with emphasis on the role of...
Article
Full-text available
Spectral clustering is a modern and well known method for performing data clustering. However, it depends on the availability of a similarity matrix, which in many applications can be non-trivial to obtain. In this paper, we focus on the problem of performing spectral clustering under a budget constraint, where there is a limit on the number of ent...
Chapter
Full-text available
The perception-action-cycle is often defined as "the circular flow of information between an organism and its environment in the course of a sensory guided sequence of actions towards a goal". The question we address in this paper is in what sense this "flow of information" can be described by Shannon's measures of information introduced in his mat...
Article
Full-text available
We derive PAC-Bayesian generalization bounds for supervised and unsupervised learning models based on clustering, such as co-clustering, matrix tri-factorization, graphical models, graph clustering, and pairwise clustering. We begin with the analysis of co-clustering, which is a widely used approach to the analysis of data matrices. We distinguish...
Article
Full-text available
We obtain a tight distribution-specific characterization of the sample complexity of large-margin classification with L_2 regularization: We introduce the \gamma-adapted-dimension, which is a simple function of the spectrum of a distribution's covariance matrix, and show distribution-specific upper and lower bounds on the sample complexity, both go...
Article
Full-text available
Clustering stability methods are a family of widely used model selection techniques for data clustering. Their unifying theme is that an appropriate model should result in a clustering which is robust with respect to various kinds of perturbations. Despite their relative success, not much is known theoretically on why or when do they work, or even...
Article
Full-text available
We applied the PAC-Bayesian framework to derive generalization bounds for co-clustering. The analysis yielded regularization terms that were absent in the preceding formulations of this task. The bounds suggested that co-clustering should optimize a trade-off between its empirical performance and the mutual information that the cluster variables...
Article
Full-text available
We consider a supervised learning setting in which the main cost of learning is the number of training labels and one can obtain a single label for a bag of examples, indicating only if a positive example exists in the bag, as in Multi-Instance Learning. We thus propose to create a training sample of bags, and to use the obtained labels to learn t...
Conference Paper
Full-text available
This paper explores a novel approach for the extraction of relevant information in speaker recognition tasks. This approach uses a principled information theoretic framework - the Information Bottleneck method (IB). In our application, the method compresses the acoustic data while preserving mostly the relevant information for speaker identifi...
Article
Full-text available
Biological systems need to process information in real time and must trade off accuracy of presentation and coding costs. Here we operationalize this trade-off and develop an information-theoretic framework that selectively extracts information of the input past that is predictive about the output future, obtaining a generalized eigenvalue problem....
Article
Full-text available
The study of complex information processing systems requires appropriate theoretical tools to help unravel their underlying design principles. Information theory is one such tool, and has been utilized extensively in the study of the neural code. Although much progress has been made in information theoretic methodology, there is still no satisfying...
Article
Full-text available
We derive a PAC-Bayesian generalization bound for density estimation. Similar to the PAC-Bayesian generalization bound for classification, the result has the appealingly simple form of a tradeoff between empirical performance and the KL-divergence of the posterior from the prior. Moreover, the PAC-Bayesian generalization bound for classificati...
Conference Paper
Full-text available
In the supervised learning setting termed Multiple-Instance Learning (MIL), the examples are bags of instances, and the bag label is a function of the labels of its instances, typically a Boolean OR. The learner observes the bag labels but not the instance labels that generated them. MIL has numerous applications, and many heuristic algorithms have...
