Arnold W. M. Smeulders's research while affiliated with University of Amsterdam and other places

Publications (484)

Preprint
The standard approach to contrastive learning is to maximize the agreement between different views of the data. The views are ordered in pairs, such that they are either positive, encoding different views of the same object, or negative, corresponding to views of different objects. The supervisory signal comes from maximizing the total similarity o...
Preprint
Full-text available
The goal of this paper is Human-object Interaction (HO-I) detection. HO-I detection aims to find interacting human-object regions and classify their interaction from an image. Researchers have obtained significant improvements in recent years by relying on strong HO-I alignment supervision from [5]. HO-I alignment supervision pairs humans with their inter...
Preprint
Full-text available
Robustness against unwanted perturbations is an important aspect of deploying neural network classifiers in the real world. Common natural perturbations include noise, saturation, occlusion, viewpoint changes, and blur deformations. All of them can be modelled by the newly proposed transform-augmented convolutional networks. While many approaches f...
Preprint
We consider the problem of information compression from high dimensional data. Where many studies consider the problem of compression by non-invertible transformations, we emphasize the importance of invertible compression. We introduce a new class of likelihood-based autoencoders with pseudo bijective architecture, which we call Pseudo Invertible En...
Preprint
Tracking multiple objects individually differs from tracking groups of related objects. When an object is part of a group, its trajectory depends on the trajectories of the other group members. Most of the current state-of-the-art trackers follow the approach of tracking each object independently, with a mechanism to handle the overlapping tr...
Preprint
Full-text available
We focus on building robustness in the convolutions of neural visual classifiers, especially against natural perturbations like elastic deformations, occlusions and Gaussian noise. Existing CNNs show outstanding performance on clean images, but fail to tackle naturally occurring perturbations. In this paper, we start from elastic perturbations, whi...
Preprint
Scale is often seen as a given, disturbing factor in many vision tasks. When treated this way, it is one of the reasons we need more data during learning. In recent work, scale equivariance was added to convolutional neural networks. It was shown to be effective for a range of tasks. We aim for accurate scale-equivariant convolutional neural networks (SE...
Article
Full-text available
In this paper, our aim is to provide human-understandable, intuitive factual and counterfactual explanations for the decisions of neural networks. Humans tend to reinforce their decisions by providing attributes and counterattributes. Hence, in this work, we utilize attributes as well as examples to provide explanations. In order to provide countere...
Preprint
Full-text available
We focus on the robustness of neural networks for classification. To permit a fair comparison between methods to achieve robustness, we first introduce a standard based on the mensuration of a classifier's degradation. Then, we propose natural perturbed training to robustify the network. Natural perturbations will be encountered in practice: the di...
Preprint
This paper studies visual search using structured queries. The structure is in the form of a 2D composition that encodes the position and the category of the objects. The transformation of the position and the category of the objects leads to a continuous-valued relationship between visual compositions, which carries highly beneficial information,...
Preprint
Human-object interaction recognition aims to identify the relationship between a human subject and an object. Researchers incorporate global scene context into the early layers of deep Convolutional Neural Networks as a solution. They report a significant increase in performance, since interactions are generally correlated with the scene (\i...
Preprint
Full-text available
In this paper we aim to explore the general robustness of neural network classifiers by utilizing adversarial as well as natural perturbations. Different from previous works which mainly focus on studying the robustness of neural networks against adversarial perturbations, we also evaluate their robustness on natural perturbations before and after...
Preprint
Siamese trackers turn tracking into similarity estimation between a template and the candidate regions in the frame. Mathematically, one of the key ingredients of success of the similarity function is translation equivariance. Non-translation-equivariant architectures induce a positional bias during training, so the location of the target will be h...
Preprint
Occlusion is one of the most difficult challenges in object tracking to model. This is because unlike other challenges, where data augmentation can be of help, occlusion is hard to simulate as the occluding object can be anything in any shape. In this paper, we propose a simple solution to simulate the effects of occlusion in the latent space. Spec...
Preprint
Human-object interaction (HOI) detection is a core task in computer vision. The goal is to localize all human-object pairs and recognize their interactions. An interaction defined by a <verb, noun> tuple leads to a long-tailed visual recognition challenge since many combinations are rarely represented. The performance of the proposed models is limi...
Preprint
Neural operations such as convolutions, self-attention, and vector aggregation are the go-to choices for recognizing short-range actions. However, they have three limitations in modeling long-range activities. This paper presents PIC, Permutation Invariant Convolution, a novel neural layer to model the temporal structure of long-range activities. It has...
Preprint
For many of the physical phenomena around us, we have developed sophisticated models explaining their behavior. Nevertheless, measuring physical properties from visual observations is challenging due to the high number of causally underlying physical parameters -- including material properties and external forces. In this paper, we propose to measu...
Preprint
Full-text available
In this paper, we aim to explain the decisions of neural networks by utilizing multimodal information, that is, counter-intuitive attributes and counter visual examples which appear when perturbed samples are introduced. Different from previous work on interpreting decisions using saliency maps, text, or visual patches, we propose to use attributes a...
Preprint
For many of the physical phenomena around us, we have developed sophisticated models explaining their behavior. Nevertheless, inferring specifics from visual observations is challenging due to the high number of causally underlying physical parameters -- including material properties and external forces. This paper addresses the problem of inferrin...
Preprint
Full-text available
In this paper, we aim to understand and explain the decisions of deep neural networks by studying the behavior of predicted attributes when adversarial examples are introduced. We study the changes in attributes for clean as well as adversarial images in both standard and adversarially robust networks. We propose a metric to quantify the robustness...
Preprint
The effectiveness of Convolutional Neural Networks (CNNs) has been substantially attributed to their built-in property of translation equivariance. However, CNNs do not have embedded mechanisms to handle other types of transformations. In this work, we pay attention to scale changes, which regularly appear in various tasks due to the changing dista...
Article
Full-text available
Visual repetition is ubiquitous in our world. It appears in human activity (sports, cooking), animal behavior (a bee’s waggle dance), natural phenomena (leaves in the wind) and in urban environments (flashing lights). Estimating visual repetition from realistic video is challenging as periodic motion is rarely perfectly static and stationary. To be...
Preprint
Full-text available
With the transformative technologies and the rapidly changing global R&D landscape, the multimedia and multimodal community is now faced with many new opportunities and uncertainties. With the open source dissemination platform and pervasive computing resources, new research results are being discovered at an unprecedented pace. In addition, the ra...
Preprint
Updating the tracker model with adverse bounding box predictions adds an unavoidable bias term to the learning. This bias term, which we refer to as model decay, offsets the learning and causes tracking drift. While its adverse effect might not be visible in short-term tracking, accumulation of this bias over the long term can eventually lead to a pe...
Preprint
Many human activities take minutes to unfold. To represent them, related works opt for statistical pooling, which neglects the temporal structure. Others opt for convolutional methods, such as CNN and Non-Local. While successful in learning temporal concepts, they fall short of modeling minutes-long temporal dependencies. We propose VideoGraph, a method...
Preprint
Full-text available
The vulnerability of deep computer vision systems to imperceptible and carefully crafted noise has raised questions regarding the robustness of their decisions. We take a step back and approach this problem from an orthogonal direction. We propose to enable black-box neural networks to justify their reasoning both for clean and for adversarial example...
Preprint
The goal of this paper is to retrieve an image based on instance, attribute and category similarity notions. Different from existing works, which usually address only one of these entities in isolation, we introduce a cooperative embedding to integrate them while preserving their specific level of semantic representation. An algebraic structure def...
Chapter
Full-text available
The Visual Object Tracking challenge VOT2018 is the sixth annual tracker benchmarking activity organized by the VOT initiative. Results of over eighty trackers are presented; many are state-of-the-art trackers published at major computer vision conferences or in journals in recent years. The evaluation included the standard VOT and other popula...
Preprint
This paper focuses on the temporal aspect for recognizing human activities in videos; an important visual cue that has long been either disregarded or ill-used. We revisit the conventional definition of an activity and restrict it to "Complex Action": a set of one-actions with a weak temporal pattern that serves a specific purpose. Related works us...
Chapter
We introduce the OxUvA dataset and benchmark for evaluating single-object tracking algorithms. Benchmarks have enabled great strides in the field of object tracking by defining standardized evaluations on large sets of diverse videos. However, these works have focused exclusively on sequences that are just tens of seconds in length and in which the...
Preprint
Distinction among nearby poses and among symmetries of an object is challenging. In this paper, we propose a unified, group-theoretic approach to tackle both. Different from existing works which directly predict absolute pose, our method measures the pose of an object relative to another pose, i.e., the pose difference. The proposed method generate...
Preprint
Full-text available
Visual repetition is ubiquitous in our world. It appears in human activity (sports, cooking), animal behavior (a bee's waggle dance), natural phenomena (leaves in the wind) and in urban environments (flashing lights). Estimating visual repetition from realistic video is challenging as periodic motion is rarely perfectly static and stationary. To be...
Article
Full-text available
We introduce a new video dataset and benchmark to assess single-object tracking algorithms. Benchmarks have enabled great strides in the field of object tracking by defining standardized evaluations on large sets of diverse videos. However, these works have focused exclusively on sequences only a few tens of seconds long, and where the target object...
Article
Full-text available
This method introduces an efficient manner of learning action categories without the need for feature estimation. The approach starts from low-level values, in a similar style to the successful CNN methods. However, rather than extracting general image features, we learn to predict specific video representations from raw video data. The benefit of s...
Article
Full-text available
This paper proposes motion prediction in single still images by learning it from a set of videos. The underlying assumption is that similar motion is characterized by similar appearance. The proposed method learns local motion patterns given a specific appearance and adds the predicted motion in a number of applications. This work (i) introduces a no...
Article
Full-text available
This work incorporates the multi-modality of the data distribution into a Gaussian Process regression model. We approach the problem from a discriminative perspective by learning, jointly over the training data, the target space variance in the neighborhood of a certain sample through metric learning. We start by using data centers rather than all...
Article
Full-text available
We consider the problem of estimating repetition in video, such as performing push-ups, cutting a melon or playing violin. Existing work shows good results under the assumption of static and stationary periodicity. As realistic video is rarely perfectly static and stationary, the often-preferred Fourier-based measurement is inapt. Instead, we adop...
Article
It is widely believed that the success of deep convolutional networks is based on progressively discarding uninformative variability about the input with respect to the problem at hand. This is supported empirically by the difficulty of recovering images from their hidden representations, in most commonly used network architectures. In this paper w...
Article
Long-term tracking requires extreme stability to the multitude of model updates and robustness to the disappearance and loss of the target, as such events will inevitably happen. For motivation, we have taken 10 randomly selected OTB-sequences, doubled each by attaching a reversed version and repeated each double sequence 20 times. On most of these repetit...
Preprint
Convolutional neural networks (CNNs) have recently emerged as promising models of human vision based on their ability to predict hemodynamic brain responses to visual stimuli measured with functional magnetic resonance imaging (fMRI). However, the degree to which CNNs can predict temporal dynamics of visual object recognition reflected in neural me...
Article
Light source position estimation is a difficult yet important problem in computer vision. A common approach for estimating the light source position (LSP) assumes Lambert’s law. However, in real-world scenes, Lambert’s law does not hold for all different types of surfaces. Instead of assuming that all surfaces follow Lambert’s law, our approach...
Conference Paper
Event detection in unconstrained videos is conceived as content-based video retrieval with two modalities: textual and visual. Given a text describing a novel event, the goal is to rank related videos accordingly. This task is zero-exemplar: no video examples are given for the novel event. Related works train a bank of concept detectors on extern...
Conference Paper
For humans, one picture usually suffices to identify an object of search. "I am looking for this little girl, have you seen her?" or "Do you have such another one?" are two ways to specify a target even to someone who has never seen the object of search before. Searching from one example in digital multimedia retrieval is a hard problem. From the one e...
Article
Filters in convolutional networks are typically parameterized in a pixel basis, which does not take prior knowledge about the visual world into account. We investigate the generalized notion of frames, which can be designed with image properties in mind, as alternatives to this parameterization. We show that frame-based ResNets and DenseNets can impro...
Article
Full-text available
Event detection in unconstrained videos is conceived as content-based video retrieval with two modalities: textual and visual. Given a text describing a novel event, the goal is to rank related videos accordingly. This task is zero-exemplar: no video examples are given for the novel event. Related works train a bank of concept detectors on externa...
Article
Deep neural network algorithms are difficult to analyze because they lack structure that would allow one to understand the properties of the underlying transforms and invariants. Multiscale hierarchical convolutional networks are structured deep convolutional networks where layers are indexed by progressively higher dimensional attributes, which are learned from t...
Article
Text in natural images typically adds meaning to an object or scene. In particular, text specifies which business places serve drinks (e.g. cafe, teahouse) or food (e.g. restaurant, pizzeria), and what kind of service is provided (e.g. massage, repair). The mere presence of text, its words and meaning are closely related to the semantics of the obj...
Article
This paper explores new evaluation perspectives for image captioning and introduces a noun translation task that achieves comparable image caption generation performance by translating from a set of nouns to captions. This implies that in image captioning, all word categories other than nouns can be evoked by a powerful language model without sacr...
Article
Full-text available
In this paper we propose to represent a scene as an abstraction of 'things'. We start from 'things' as generated by modern object proposals, and we investigate their immediately observable properties: position, size, aspect ratio and color, and those only. Where the recent successes and excitement of the field lie in object identification, we repre...
Article
Full-text available
A number of recent studies have shown that deep neural networks (DNN) map to the human visual hierarchy. However, based on a large number of subjects and accounting for the correlations between DNN layers, we show that there is no one-to-one mapping of DNN layers to the human visual system. This suggests that the depth of DNN, which is also critica...
Article
Full-text available
In this paper we present a segmentation proposal method which employs a box-hypotheses generation step followed by a lightweight segmentation strategy. Inspired by interactive segmentation, for each automatically placed bounding-box we compute a precise segmentation mask. We introduce diversity in segmentation strategies enhancing a generic model p...
Article
This paper aims for generic instance search from one example where the instance can be an arbitrary object like shoes, not just near-planar and one-sided instances like buildings and logos. First, we evaluate state-of-the-art instance search methods on this problem. We observe that what works for buildings loses its generality on shoes. Second, we...
Article
In this paper we present a tracker that is radically different from state-of-the-art trackers: we apply no model updating, no occlusion detection, no combination of trackers, no geometric matching, and still deliver state-of-the-art tracking performance, as demonstrated on the popular online tracking benchmark (OTB) and six very challenging YouTu...
Article
Learning powerful feature representations with CNNs is hard when training data are limited. Pre-training is one way to overcome this, but it requires large datasets sufficiently similar to the target domain. Another option is to design priors into the model, which can range from tuned hyperparameters to fully engineered representations like Scatter...
Article
This work considers the task of object proposal scoring by integrating the consistency between state-of-the-art object proposal algorithms. It represents a novel way of thinking about proposals, as it starts with the assumption that consistent proposals are most likely centered on objects in the image. We pose the box-consistency problem as a large...
Article
Biologically inspired computational models replicate the hierarchical visual processing in the human ventral stream. One such recent model, the Convolutional Neural Network (CNN), has achieved state-of-the-art performance on automatic visual recognition tasks. The CNN architecture contains successive layers of convolution and pooling, and resembles the...