Goutam Bhat's research while affiliated with ETH Zurich and other places

Publications (33)

Preprint
Optimization-based tracking methods have been widely successful by integrating a target model prediction module, providing effective global reasoning by minimizing an objective function. While this inductive bias integrates valuable domain knowledge, it limits the expressivity of the tracking network. In this work, we therefore propose a tracker ar...
Preprint
We propose a deep reparametrization of the maximum a posteriori formulation commonly employed in multi-frame image restoration tasks. Our approach is derived by introducing a learned error metric and a latent representation of the target image, which transforms the MAP objective to a deep feature space. The deep reparametrization allows us to direc...
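To make the reparametrization concrete, the following is a schematic contrast between a generic multi-frame MAP objective and the kind of feature-space objective the abstract describes; the encoder $E$, decoder $D$, error metric $\rho$, degradation operators $A_i$, and regularizer $R_\theta$ are illustrative placeholders rather than the paper's exact notation.

\hat{y} = \arg\min_{y} \sum_{i} \left\| x_i - A_i y \right\|^{2} + R(y)
\quad\longrightarrow\quad
\hat{z} = \arg\min_{z} \sum_{i} \rho\!\left( E(x_i) - A_i z \right) + R_{\theta}(z), \qquad \hat{y} = D(\hat{z}).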
Preprint
Full-text available
This paper reviews the NTIRE 2021 challenge on burst super-resolution. Given a noisy RAW burst as input, the task in the challenge was to generate a clean RGB image with 4 times higher resolution. The challenge contained two tracks: Track 1 evaluating on synthetically generated data, and Track 2 using real-world bursts from a mobile camera. In the fin...
Preprint
While single-image super-resolution (SISR) has attracted substantial interest in recent years, the proposed approaches are limited to learning image priors in order to add high-frequency details. In contrast, multi-frame super-resolution (MFSR) offers the possibility of reconstructing rich details by combining signal information from multiple shift...
Preprint
Full-text available
Segmenting objects in videos is a fundamental computer vision task. The current deep learning-based paradigm offers a powerful but data-hungry solution. However, current datasets are limited by the cost and human effort of annotating object masks in videos. This effectively limits the performance and generalization capabilities of existing video s...
Chapter
While deep learning-based classification is generally tackled using standardized approaches, a wide variety of techniques are employed for regression. In computer vision, one particularly popular such technique is that of confidence-based regression, which entails predicting a confidence value for each input-target pair (x, y). While this approach...
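As a rough illustration of how confidence-based regression is used at inference, the sketch below maximizes a trained confidence network f(x, y) over the target y by gradient ascent; the network f, step size, and iteration count are assumptions made for this example, not the chapter's exact procedure.

import torch

def refine_by_confidence(f, x, y_init, steps=10, lr=1e-2):
    # Gradient-based maximization of the predicted confidence f(x, y) over y,
    # starting from an initial estimate y_init (e.g. a proposal box).
    y = y_init.clone().requires_grad_(True)
    for _ in range(steps):
        conf = f(x, y).sum()                    # scalar confidence score
        (grad,) = torch.autograd.grad(conf, y)  # d(conf)/dy
        with torch.no_grad():
            y += lr * grad                      # ascend the confidence surface
    return y.detach()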
Chapter
Video object segmentation (VOS) is a highly challenging problem, since the target object is only defined by a first-frame reference mask during inference. The problem of how to capture and utilize this limited information to accurately segment the target remains a fundamental research question. We address this by introducing an end-to-end trainable...
Chapter
Current state-of-the-art trackers rely only on a target appearance model in order to localize the object in each frame. Such approaches are, however, prone to fail in the case of, e.g., fast appearance changes or the presence of distractor objects, where a target appearance model alone is insufficient for robust tracking. Having knowledge about the presenc...
Preprint
Video object segmentation (VOS) is a highly challenging problem, since the target object is only defined during inference with a given first-frame reference mask. The problem of how to capture and utilize this limited target information remains a fundamental research question. We address this by introducing an end-to-end trainable VOS architecture...
Preprint
Current state-of-the-art trackers rely only on a target appearance model in order to localize the object in each frame. Such approaches are, however, prone to fail in the case of, e.g., fast appearance changes or the presence of distractor objects, where a target appearance model alone is insufficient for robust tracking. Having knowledge about the presenc...
Chapter
Full-text available
The Visual Object Tracking challenge VOT2020 is the eighth annual tracker benchmarking activity organized by the VOT initiative. Results of 58 trackers are presented; many are state-of-the-art trackers published at major computer vision conferences or in journals in recent years. The VOT2020 challenge was composed of five sub-challenges focusin...
Preprint
Full-text available
While deep learning-based classification is generally addressed using standardized approaches, a wide variety of techniques are employed for regression. In computer vision, one particularly popular such technique is that of confidence-based regression, which entails predicting a confidence value for each input-target pair (x, y). While this approac...
Preprint
The current drive towards end-to-end trainable computer vision systems poses major challenges for the task of visual tracking. In contrast to most other vision problems, tracking requires the learning of a robust target-specific appearance model online, during the inference stage. To be end-to-end trainable, the online learning of the target mod...
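One common way to make such online learning end-to-end trainable is to unroll a few optimization steps of the target model inside the network's forward pass, so gradients flow back into the feature extractor. The sketch below is a schematic version of this idea with a plain gradient-descent inner loop and an L2 classification loss; the filter size, step length, and loss are illustrative choices, not the paper's actual optimizer.

import torch
import torch.nn.functional as F

def learn_target_model(feat, label, num_iter=5, step=1.0, reg=1e-2):
    # feat:  (N, C, H, W) features from N training frames of the target.
    # label: (N, 1, H, W) desired confidence scores (e.g. a Gaussian at the target).
    # A target-specific filter is learned by a few unrolled gradient steps; because
    # every step is an ordinary tensor op, the whole tracker stays differentiable.
    filt = torch.zeros(1, feat.shape[1], 3, 3, device=feat.device, requires_grad=True)
    for _ in range(num_iter):
        scores = F.conv2d(feat, filt, padding=1)
        loss = F.mse_loss(scores, label) + reg * (filt ** 2).sum()
        (grad,) = torch.autograd.grad(loss, filt, create_graph=True)
        filt = filt - step * grad
    return filt  # apply to test-frame features with F.conv2d(test_feat, filt, padding=1)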
Chapter
Full-text available
The Visual Object Tracking challenge VOT2018 is the sixth annual tracker benchmarking activity organized by the VOT initiative. Results of over eighty trackers are presented; many are state-of-the-art trackers published at major computer vision conferences or in journals in recent years. The evaluation included the standard VOT and other popula...
Chapter
Trackers based on discriminative correlation filters (DCF) have recently seen widespread success, and in this work we dive into their numerical core. DCF-based trackers interleave learning of the target detector and target state inference based on this detector. Whereas the original formulation includes a closed-form solution for the filter learning...
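For reference, the closed-form filter solution mentioned here reduces, in its simplest single-channel form, to a per-element division in the Fourier domain (the classical MOSSE-style solution); the snippet below shows that textbook formula rather than the chapter's full multi-channel formulation.

import numpy as np

def dcf_closed_form(x, y, lam=1e-2):
    # x: training patch (H, W); y: desired Gaussian response (H, W); lam: regularization.
    X = np.fft.fft2(x)
    Y = np.fft.fft2(y)
    H = (np.conj(X) * Y) / (np.conj(X) * X + lam)  # element-wise closed-form filter
    return H

# Detection on a new patch z: response = np.real(np.fft.ifft2(H * np.fft.fft2(z)))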
Preprint
While recent years have witnessed astonishing improvements in visual tracking robustness, the advancements in tracking accuracy have been severely limited. As the focus has been directed towards the development of powerful classifiers, the problem of accurate target state estimation has been largely overlooked. Instead, the majority of methods reso...
Chapter
In the field of generic object tracking, numerous attempts have been made to exploit deep features. Despite all expectations, deep trackers are yet to reach an outstanding level of performance compared to methods solely based on handcrafted features. In this paper, we investigate this key issue and propose an approach to unlock the true potential of...
Article
Full-text available
In the field of generic object tracking, numerous attempts have been made to exploit deep features. Despite all expectations, deep trackers are yet to reach an outstanding level of performance compared to methods solely based on handcrafted features. In this paper, we investigate this key issue and propose an approach to unlock the true potential of...
Article
Generic visual tracking is a challenging computer vision problem, with numerous applications. Most existing approaches rely on appearance information by employing either hand-crafted features or deep RGB features extracted from convolutional neural networks. Despite their success, these approaches struggle in cases of ambiguous appearance informatio...
Conference Paper
Semantic segmentation of 3D point clouds is a challenging problem with numerous real-world applications. While deep learning has revolutionized the field of image semantic segmentation, its impact on point cloud data has been limited so far. Recent attempts, based on 3D deep learning approaches (3D-CNNs), have achieved below-expected results. Such...
Conference Paper
Visual object tracking performance has improved significantly in recent years. Most trackers are based on either of two paradigms: online learning of an appearance model or the use of a pre-trained object detector. Methods based on online learning provide high accuracy, but are prone to model drift. The model drift occurs when the tracker fails to...
Article
Full-text available
In recent years, Discriminative Correlation Filter (DCF) based methods have significantly advanced the state-of-the-art in tracking. However, in the pursuit of ever-increasing tracking performance, their characteristic speed and real-time capability have gradually faded. Further, the increasingly complex models, with a massive number of trainable par...

Citations

... Joint Visual Tracking and Segmentation. A group of tracking methods has already identified the advantages of predicting a segmentation mask instead of a bounding box [53,59,46,31,50,41]. Siam-RCNN is a box-centric tracker that uses a pretrained box2seg network to predict the segmentation mask given a bounding box prediction. ...
... Bhat et al. [2021a] learn a CNN with an attention module to align, demosaick, and super-resolve a burst of raw images. In a follow-up work, Bhat et al. [2021b] minimize a penalized energy including a data term comparing the sum of parameterized feature residuals. Lecouat et al. [2021] instead learn a hybrid neural network alternating between aligning the images with the Lucas-Kanade algorithm [Lucas and Kanade 1981], predicting an HR image by solving a model-based least-squares problem, and evaluating a learned prior function. ...
... Different from the previous reset-based evaluation protocol [32], VOT2020 [31] proposes a new, more reasonable anchor-based evaluation protocol. As with STARK [58] and DualTFR [56], we use Alpha-Refine [59] to generate masks for evaluation, since the VOT2020 ground truths are annotated as segmentation masks. ...
... This challenge is one of the NTIRE 2022 associated challenges: spectral recovery [4], spectral demosaicing [3], perceptual image quality assessment [13], inpainting [39], night photography rendering [10], efficient super-resolution [30], learning the super-resolution space [31], super-resolution and quality enhancement of compressed video [49], high dynamic range [37], stereo super-resolution [47], burst super-resolution [6]. ...
... After the pre-training stage, the statistics of batch normalization layers are fixed and not updated during the RL fine-tuning stage. ... [20], UAV123 [27], and TNL2K [37], while strengthening the robustness (R) on VOT2020 [22]. Note that the re-initialization policy of VOT evaluation does not match the reward system of SLT, which is designed for one-pass evaluation. ...
... The proposed algorithm outperforms the conventional algorithms by meaningful margins. Notice that Gustafsson et al. [17] and Berg et al. [3] employ the deeper ResNet50 [20] as their backbone networks, whereas the proposed ρ-regressors use VGG16. Nevertheless, the proposed algorithm improves the MAE score by more than 0.18. ...
... As a fundamental task in computer vision, tracking is widely used in many fields, such as video monitoring (Tian et al. 2011), robotics (Sakagami et al. 2002) and UAV vision (Du et al. 2018). Substantial progress has been achieved, mainly owing to deep feature extraction (Krizhevsky, Sutskever, and Hinton 2012; Szegedy et al. 2015; He et al. 2016; Vaswani et al. 2017), adaptive appearance modeling (Henriques et al. 2008; Danelljan et al. 2019; Bhat et al. 2019, 2020), and correlation matching (Bertinetto et al. 2016; Li et al. 2018; Zhang et al. 2020). Meanwhile, challenges remain, arising from varying appearance and motion states, distractor interference, background clutter, etc. ...
... The shrinkage term s is collected together with the key k and the value v from the working and the long-term memory. The collection is simply implemented as a concatenation in the last dimension: $k = k^{w} \oplus k^{lt}$ and $v = v^{w} \oplus v^{lt}$, where the superscripts 'w' and 'lt' denote working and long-term memory, respectively. The working memory consists of the key $k^{w} \in \mathbb{R}^{C_k \times THW}$ and the value $v^{w} \in \mathbb{R}^{C_v \times THW}$, where T is the number of working-memory frames. ...
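In code, the concatenation described in this snippet amounts to the following few lines; the channel counts and spatio-temporal sizes are made-up placeholders for illustration.

import torch

C_k, C_v = 64, 512                         # key/value channel dims (illustrative)
THW_w, THW_lt = 3 * 24 * 24, 9 * 24 * 24   # working / long-term memory sizes (illustrative)
k_w, v_w = torch.randn(C_k, THW_w), torch.randn(C_v, THW_w)
k_lt, v_lt = torch.randn(C_k, THW_lt), torch.randn(C_v, THW_lt)
k = torch.cat([k_w, k_lt], dim=-1)   # (C_k, THW_w + THW_lt)
v = torch.cat([v_w, v_lt], dim=-1)   # (C_v, THW_w + THW_lt)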
... Considering the 2-DoF translation determined by the position response, the search space of 5-DoF can be obtained. The experimental results on seven benchmarks (OTB2015 [60], TColor-128 [42], UAV123 [50], POT [43], and VOT2016-2019 [32][33][34][35]) show that our proposed method is superior to most advanced visual tracking algorithms using hand-crafted features. ...