# International Journal of Computer Vision

Online ISSN: 1573-1405
Print ISSN: 0920-5691
## Recent publications
• Fangling Jiang
• Qi Li
• Pengcheng Liu
• [...]
• Zhenan Sun
Face anti-spoofing has been widely studied in recent years to ensure the security of face recognition systems; however, it suffers from poor generalization to unseen samples. Most previous methods align the marginal distributions of multiple source domains to learn domain-invariant features that mitigate domain shift. However, these marginal alignments ignore the category information of samples from different domains; features of one category from one domain can therefore be misaligned with features of different categories from other domains, even when the marginal distributions across domains are well aligned as a whole. In this paper, we propose a simple but effective conditional domain adversarial framework whose main goal is to align the conditional distributions across domains and thereby learn domain-invariant conditional features. Specifically, we first construct a parallel domain structure and a corresponding regularization to reduce the negative influence of finite samples and of the diversity of spoof face images on the conditional distribution alignment. Then, based on the parallel domain structure, a feature extractor and a global domain classifier play a conditional domain adversarial game that makes the features of the same category indistinguishable across domains. Moreover, intra-domain and cross-domain discrimination regularization are exploited in conjunction with conditional domain adversarial training to minimize the classification error of the class predictors. Extensive qualitative and quantitative experiments demonstrate that the proposed method learns well-generalized features from fewer source domains and achieves state-of-the-art performance on six public datasets.
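One common way to make a domain classifier category-aware, in the spirit of the conditional alignment described above, is to feed it the outer product of features and class predictions. The sketch below is a generic illustration of such conditioning, not this paper's exact construction; the function and variable names are ours.

```python
import numpy as np

def conditional_domain_input(features, class_probs):
    """Multilinear conditioning: pair each feature with its class
    prediction via an outer product, so a domain classifier sees
    category information during adversarial alignment.

    features    : (N, D) extracted features
    class_probs : (N, K) softmax class predictions
    returns     : (N, D*K) conditioned inputs for the domain classifier
    """
    n = features.shape[0]
    # outer[n, d, k] = features[n, d] * class_probs[n, k]
    outer = np.einsum('nd,nk->ndk', features, class_probs)
    return outer.reshape(n, -1)
```

A domain classifier trained on these conditioned inputs can only fool the feature extractor by aligning samples of matching categories, which is the intuition behind conditional (rather than marginal) adversarial alignment.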

Video snapshot compressive imaging (SCI) is a promising technique for capturing high-speed videos: it shifts the burden of imaging speed from the detector to mask modulation and needs only a single measurement to capture multiple frames. The algorithm that reconstructs the high-speed frames from this measurement plays a vital role in SCI. In this paper, we consider a promising reconstruction framework, namely plug-and-play (PnP), which is more flexible with respect to the encoding process than other deep learning networks. One drawback of existing PnP algorithms is that they use a pre-trained denoising network as the plugged prior, while the training data of that network may differ from the task encountered in real applications. To this end, we propose an online PnP algorithm that adaptively updates the network's parameters within the PnP iterations, making the denoising network more applicable to the data at hand in SCI reconstruction. Furthermore, for color video imaging, RGB frames need to be recovered from Bayer-pattern measurements, a step known as demosaicing in the camera pipeline. To address this challenge, we design a two-stage reconstruction framework that optimizes these two coupled ill-posed problems and introduce a deep demosaicing prior tailored to video demosaicing in SCI. Extensive results on both simulated and real datasets verify the superiority of our adaptive deep PnP algorithm. Code is available at https://github.com/xyvirtualgroup/AdaptivePnP_SCI.
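The online-PnP idea can be illustrated with a toy iteration. The sketch below is our own minimal stand-in, not the authors' algorithm: the coded measurement is modeled as the mask-weighted sum of frames, the plugged "denoiser" is plain soft-thresholding, and its threshold is adapted inside the loop as a stand-in for updating a denoising network's parameters online.

```python
import numpy as np

def pnp_sci_sketch(y, masks, iters=50, rho=0.1, tau=0.05):
    """Toy PnP reconstruction for video SCI (illustrative only).

    y     : (H, W) single coded measurement
    masks : (T, H, W) binary modulation masks
    Alternates a data-fidelity gradient step with a plugged-in prior;
    the soft-threshold level tau plays the role of the denoiser's
    parameters and is updated within the loop ("online" adaptation).
    """
    T = masks.shape[0]
    x = np.stack([y] * T) * masks                 # crude initialization
    for _ in range(iters):
        # gradient step on 0.5 * || sum_t masks[t]*x[t] - y ||^2
        resid = (masks * x).sum(axis=0) - y
        x = x - rho * masks * resid
        # plugged prior: soft-thresholding as a stand-in for a CNN denoiser
        x = np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)
        # adapt the prior's parameter inside the iteration
        tau *= 0.9
    return x
```

In the paper the plugged prior is a deep denoising network whose weights are updated during the iterations; here the decaying threshold merely mirrors that structure.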

Person search aims to simultaneously localize and identify a query person in uncropped images. To achieve this goal, state-of-the-art models typically add a re-id branch on top of two-stage detectors like Faster R-CNN. Owing to the ROI-Align operation, this pipeline yields promising accuracy, as re-id features are explicitly aligned with the corresponding object regions, but at the same time it introduces high computational overhead due to dense object anchors. In this work, we present an anchor-free approach to efficiently tackle this challenging task, built on the following dedicated designs. First, we select an anchor-free detector (i.e., FCOS) as the prototype of our framework. Lacking dense object anchors, it exhibits significantly higher efficiency than existing person search models. Second, when this anchor-free detector is directly adapted to person search, several misalignment issues arise at different levels (i.e., scale, region, and task). To address these issues, we propose an aligned feature aggregation module that generates more discriminative and robust feature embeddings. Accordingly, we name our framework the Feature-Aligned Person Search Network (AlignPS). Third, drawing on the advantages of both anchor-based and anchor-free models, we further augment AlignPS with an ROI-Align head, which significantly improves the robustness of re-id features while keeping our model highly efficient. Our framework not only achieves state-of-the-art or competitive performance on two challenging person search benchmarks, but can also be extended to other challenging search tasks such as animal and object search. All source code, data, and trained models are available at: https://github.com/daodaofr/alignps.

A complete representation of 3D objects requires characterizing the space of deformations in an interpretable manner, from articulations of a single instance to changes in shape across categories. In this work, we improve on a prior generative model of geometric disentanglement for 3D shapes, wherein the space of object geometry is factorized into rigid orientation, non-rigid pose, and intrinsic shape. The resulting model can be trained from raw 3D shapes, without correspondences, labels, or even rigid alignment, using a combination of classical spectral geometry and probabilistic disentanglement of a structured latent representation space. Our improvements include more sophisticated handling of rotational invariance and the use of a diffeomorphic flow network to bridge latent and spectral space. The geometric structuring of the latent space imparts an interpretable characterization of the deformation space of an object. Furthermore, it enables tasks like pose transfer and pose-aware retrieval without requiring supervision. We evaluate our model on its generative modelling, representation learning, and disentanglement performance, showing improved rotation invariance and intrinsic-extrinsic factorization quality over the prior model.

Despite the significant improvements that self-supervised representation learning has brought to learning from unlabeled data, no methods have been developed that explain what influences the learned representation. We address this need with our proposed approach, RELAX, the first attribution-based method for explaining representations. Our approach can also model the uncertainty in its explanations, which is essential for producing trustworthy explanations. RELAX explains representations by measuring similarities in representation space between an input and masked-out versions of itself, providing intuitive explanations that significantly outperform gradient-based baselines. We provide theoretical interpretations of RELAX and conduct a novel analysis of feature extractors trained with supervised and unsupervised learning, yielding insights into the different learning strategies. Moreover, we conduct a user study to assess how well the proposed approach aligns with human intuition and show that the proposed method outperforms the baselines in both the quantitative and the human evaluation studies. Finally, we illustrate the usability of RELAX in several use cases and highlight that incorporating uncertainty can be essential for providing faithful explanations, taking a crucial step towards explaining representations.
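The masked-similarity idea can be sketched in a few lines: mask random parts of the input, embed both versions, and credit the kept pixels in proportion to how similar the masked representation stays to the original. This is our minimal rendition of the principle described above; the encoder and mask distribution here are placeholders, not RELAX's actual components, and the uncertainty modeling is omitted.

```python
import numpy as np

def masked_similarity_attribution(x, encoder, n_masks=200, p_keep=0.5, seed=0):
    """Attribution by masked-representation similarity (illustrative sketch).

    x       : (H, W) input image
    encoder : callable mapping an image to a 1-D representation
    Pixels that, when kept, leave the representation close to the
    original's accumulate high importance.
    """
    rng = np.random.default_rng(seed)
    h = encoder(x)
    importance = np.zeros_like(x, dtype=float)
    coverage = np.zeros_like(x, dtype=float)
    for _ in range(n_masks):
        m = (rng.random(x.shape) < p_keep).astype(float)   # random keep-mask
        hm = encoder(x * m)
        # cosine similarity between original and masked representations
        sim = h @ hm / (np.linalg.norm(h) * np.linalg.norm(hm) + 1e-12)
        importance += sim * m
        coverage += m
    return importance / np.maximum(coverage, 1.0)          # per-pixel average
```

With a real feature extractor in place of `encoder`, the resulting map highlights the regions the representation depends on.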

Graph convolution network (GCN) based methods for 3D human pose estimation usually aggregate the immediate features of single-hop nodes; they are therefore unaware of the correlations among multi-hop nodes and neglect the long-range dependencies needed to predict complex poses. In addition, they typically operate either on single-scale or on sequentially down-sampled multi-scale graph representations, resulting in the loss of contextual information or spatial details. To address these problems, this paper proposes a parallel hop-aware graph attention network (PHGANet) for 3D human pose estimation, which learns enriched hop-aware correlations of the skeleton joints while maintaining spatially precise representations of the human graph. Specifically, we propose a hop-aware skeletal graph attention (HSGAT) module to capture the semantic correlation of multi-hop nodes, which first calculates skeleton-based 1-hop attention and then disseminates it to arbitrary hops via graph connectivity. To alleviate the redundant noise introduced by interactions with distant nodes, HSGAT uses an attenuation strategy that separates the attention from distinct hops and adaptively assigns them learnable attenuation weights according to their distances. Upon HSGAT, we further build PHGANet with multiple parallel branches of stacked HSGAT modules to learn enriched hop-aware correlations of human skeletal structures at different scales. In addition, a joint centrality encoding scheme is proposed to introduce node importance as a bias in the learned graph representation, making the core joints (e.g., neck and pelvis) more influential during node aggregation. Experimental results indicate that PHGANet performs favorably against state-of-the-art methods on the Human3.6M and MPI-INF-3DHP benchmarks. Models and code are available at https://github.com/ChenyangWang95/PHGANet/.
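Disseminating 1-hop attention to farther hops can be sketched as chaining the 1-hop attention matrix along graph connectivity and re-weighting each hop with an attenuation coefficient. This is our illustrative reading of the mechanism described above, with fixed weights standing in for the learnable attenuation.

```python
import numpy as np

def hop_aware_attention(att1, hop_weights):
    """Spread 1-hop attention to multiple hops with per-hop attenuation.

    att1        : (N, N) 1-hop attention, nonzero only on skeleton edges
    hop_weights : attenuation weight per hop (learnable in the paper,
                  fixed here for illustration)
    The k-hop interaction is approximated by chaining 1-hop attention k
    times; each hop is scaled by its attenuation weight and summed.
    """
    out = np.zeros_like(att1)
    hop = np.eye(att1.shape[0])
    for w in hop_weights:
        hop = hop @ att1                 # one more hop along the graph
        out += w * hop
    # row-normalize so each node's aggregation weights sum to 1
    return out / np.maximum(out.sum(axis=1, keepdims=True), 1e-12)
```

Note how nodes with no direct edge still receive attention mass through intermediate joints, which is the long-range dependency single-hop GCNs miss.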

Action recognition on extreme low-resolution videos, e.g., at a resolution of 12×16 pixels, plays a vital role in far-view surveillance and privacy-preserving multimedia analysis. As low-resolution videos often contain only limited information, it is difficult to recognize actions in them. Given that the same action may be captured in videos at both high resolution (HR) and extreme low resolution (eLR), it is worth studying how to utilize the relevant HR data to improve eLR action recognition. In this work, we propose a novel Confident Spatial-Temporal Attention Transfer (CSTAT) for eLR action recognition. CSTAT acquires information from HR data by reducing the attention differences with a transfer-learning strategy. Besides, the confidence of the supervisory signal is taken into consideration for a more reliable transfer process. Experimental results demonstrate that the proposed method can effectively improve the accuracy of eLR action recognition and achieves state-of-the-art performance on 12×16 HMDB51, 12×16 Kinetics-400, and 12×16 Something-Something v2.

The ability to capture detailed interactions among individuals in a social group is foundational to our study of animal behavior and neuroscience. Recent advances in deep learning and computer vision are driving rapid progress in methods that can record the actions and interactions of multiple individuals simultaneously. Many social species, such as birds, however, live deeply embedded in a three-dimensional world. This world introduces additional perceptual challenges such as occlusions, orientation-dependent appearance, large variation in apparent size, and poor sensor coverage for 3D reconstruction, that are not encountered by applications studying animals that move and interact only on 2D planes. Here we introduce a system for studying the behavioral dynamics of a group of songbirds as they move throughout a 3D aviary. We study the complexities that arise when tracking a group of closely interacting animals in three dimensions and introduce a novel dataset for evaluating multi-view trackers. Finally, we analyze captured ethogram data and demonstrate that social context affects the distribution of sequential interactions between birds in the aviary.

By mapping a truncated optimization method into a deep neural network, deep proximal unrolling networks have attracted attention in compressive sensing due to their good interpretability and high performance. Each stage in such networks corresponds to one iteration of the optimization. Viewing the network from the perspective of the human brain's memory processing, we find that there exist two categories of memory transmission: intra-stage and inter-stage. For intra-stage transmission, existing methods increase the number of parameters to maximize the information flow. For inter-stage transmission, there are also two mechanisms. One transfers information between adjacent stages, which can be regarded as short-term memory and is usually severely lost. The other ensures that earlier stages affect the current stage, which has not been explicitly studied. In this paper, a novel deep proximal unrolling network with persistent memory is proposed, dubbed the deep Memory-Augmented Proximal Unrolling Network (MAPUN). We design a memory-augmented proximal mapping module that ensures maximum information flow both intra- and inter-stage. Specifically, we present a self-modulated block that adaptively develops feature modulation for intra-stage flow and introduce two types of memory augmentation mechanisms for inter-stage flow, namely High-throughput Short-term Memory (HSM) and Cross-stage Long-term Memory (CLM). HSM allows the network to transmit multi-channel short-term memory, which greatly reduces information loss between adjacent stages. CLM develops the dependency of deep information across cascading stages, which greatly enhances the network's representation capability. Extensive experiments show that our MAPUN outperforms existing state-of-the-art methods.

We propose a novel minimal solver for sphere fitting via its 2D central projection, i.e., a special ellipse. The input of the presented algorithm consists of contour points detected in a camera image. General ellipse fitting problems require five contour points. However, by taking advantage of the isotropic spherical target, three points are enough to define the parameters of the cone tangent to the sphere, which in turn yield the sought ellipse parameters. Similarly, the sphere center can be estimated from the cone if the radius is known. The proposed geometric methods are rapid, numerically stable, and easy to implement. Experimental results on synthetic, photorealistic, and real images showcase the superiority of the proposed solutions over state-of-the-art methods. A real-world LiDAR-camera calibration application justifies the utility of the sphere-based approach, resulting in an error below a few centimeters.
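The known-radius case admits a one-line closed form: with the camera at the origin, the tangent cone's unit axis a and half-angle alpha give the sphere center as c = a·r/sin(alpha), since sin(alpha) = r/‖c‖. The snippet below sketches only this geometric relation (the axis and half-angle are assumed to have been estimated from the contour already); it is not the authors' full solver, and the names are ours.

```python
import numpy as np

def sphere_center_from_cone(axis, half_angle, radius):
    """Sphere center from its tangent cone, camera at the origin.

    sin(half_angle) = radius / distance, so the center lies on the
    cone axis at distance radius / sin(half_angle).
    """
    a = np.asarray(axis, dtype=float)
    a = a / np.linalg.norm(a)
    return a * (radius / np.sin(half_angle))

# synthetic check: build the cone from a known sphere, then invert it
c_true = np.array([0.2, -0.1, 2.0])
r = 0.3
alpha = np.arcsin(r / np.linalg.norm(c_true))   # tangent-cone half-angle
c_est = sphere_center_from_cone(c_true / np.linalg.norm(c_true), alpha, r)
```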

In this paper, we investigate absolute and relative pose estimation under refraction, which are essential problems for refractive structure from motion. To cope with refraction effects, we first formulate geometric constraints from which we establish iterative algorithms to optimize absolute and relative pose. By distinguishing two scenarios according to the geometric relationship between the camera and the refractive interface, we derive the corresponding solutions to solve the optimization problems efficiently. In the scenario where the geometry between the camera and the refractive interface is fixed (e.g., underwater imaging), we also show, using the virtual camera transformation, that the refractive epipolar constraint for relative pose can be written as a summation of the classical essential matrix and two correction terms caused by refraction. Thanks to its succinct form, the resulting refractive epipolar constraint can be optimized efficiently. We evaluate our proposed algorithms on synthetic data, showing superior accuracy and computational efficiency compared to state-of-the-art (SOTA) methods. We further demonstrate the application of the proposed algorithms to refractive structure from motion on real data. Our code (Hu et al., RefractiveSfM, https://github.com/diku-dk/RefractiveSfM, 2022) and datasets (Hu et al., DIKU Refractive Scenes Dataset 2022, Data, 2022) are publicly available.

Birds of prey rely on vision to execute flight manoeuvres that are key to their survival, such as intercepting fast-moving targets or navigating through clutter. A better understanding of the role played by vision during these manoeuvres is not only relevant within the field of animal behaviour, but could also have applications for autonomous drones. In this paper, we present a novel method that uses computer vision tools to analyse the role of active vision in bird flight, and demonstrate its use to answer behavioural questions. Combining motion capture data from Harris’ hawks with a hybrid 3D model of the environment, we render RGB images, semantic maps, depth information and optic flow outputs that characterise the visual experience of the bird in flight. In contrast with previous approaches, our method allows us to consider different camera models and alternative gaze strategies for the purposes of hypothesis testing, allows us to consider visual input over the complete visual field of the bird, and is not limited by the technical specifications and performance of a head-mounted camera light enough to attach to a bird’s head in flight. We present pilot data from three sample flights: a pursuit flight, in which a hawk intercepts a moving target, and two obstacle avoidance flights. With this approach, we provide a reproducible method that facilitates the collection of large volumes of data across many individuals, opening up new avenues for data-driven models of animal behaviour.

In-situ visual observations of marine organisms are crucial for developing an understanding of their behaviour and its relation to the surrounding ecosystem. Typically, these observations are collected via divers, tags, and remotely operated or human-piloted vehicles. Recently, however, autonomous underwater vehicles equipped with cameras and embedded computers with GPU capabilities are being developed for a variety of applications, and in particular can supplement these existing data collection mechanisms where human operation or tagging is difficult. Existing approaches have focused on fully-supervised tracking methods, but labelled data for many underwater species are severely lacking. Semi-supervised trackers may offer alternative tracking solutions because they require less data than their fully-supervised counterparts. However, because no realistic underwater tracking datasets exist, the performance of semi-supervised tracking algorithms in the marine domain is not well understood. To better evaluate their performance and utility, in this paper we provide (1) a novel dataset specific to marine animals, located at http://warp.whoi.edu/vmat/, (2) an evaluation of state-of-the-art semi-supervised algorithms in the context of underwater animal tracking, and (3) an evaluation of real-world performance through demonstrations using a semi-supervised algorithm on board an autonomous underwater vehicle to track marine animals in the wild.

Three-dimensional markerless pose estimation from multi-view video is emerging as an exciting method for quantifying the behavior of freely moving animals. Nevertheless, scientifically precise 3D animal pose estimation remains challenging, primarily due to a lack of large training and benchmark datasets and the immaturity of algorithms tailored to the demands of animal experiments and body plans. Existing techniques employ fully supervised convolutional neural networks (CNNs) trained to predict body keypoints in individual video frames, but this demands a large collection of labeled training samples to achieve desirable 3D tracking performance. Here, we introduce a semi-supervised learning strategy that incorporates unlabeled video frames via a simple temporal constraint applied during training. In freely moving mice, our new approach improves the current state-of-the-art performance of multi-view volumetric 3D pose estimation and further enhances the temporal stability and skeletal consistency of 3D tracking.
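A temporal constraint of the kind described above can be as simple as penalizing large frame-to-frame jumps in the predicted 3D keypoints of unlabeled frames. The sketch below is our own toy version of such a smoothness penalty, not the authors' exact loss.

```python
import numpy as np

def temporal_smoothness_loss(keypoints):
    """Penalize frame-to-frame jumps of predicted 3D keypoints.

    keypoints : (T, J, 3) predicted 3D positions over T frames, J joints
    Returns the mean squared per-joint displacement between consecutive
    frames; applied to unlabeled video, it encourages temporally stable
    and skeletally consistent tracking without extra labels.
    """
    diffs = np.diff(keypoints, axis=0)             # (T-1, J, 3)
    return float(np.mean(np.sum(diffs ** 2, axis=-1)))
```

In training, a term like this would be added to the supervised keypoint loss and evaluated on unlabeled frames, which is how unlabeled video contributes gradient signal.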

Each year, underwater remotely operated vehicles (ROVs) collect thousands of hours of video of unexplored ocean habitats, revealing a plethora of information about biodiversity on Earth. However, fully utilizing this information remains a challenge, as proper annotation and analysis require trained scientists' time, which is both limited and costly. To this end, we present the Dataset for Underwater Substrate and Invertebrate Analysis (DUSIA), a benchmark suite and growing large-scale dataset to train, validate, and test methods for temporally localizing four underwater substrates and for temporally and spatially localizing 59 underwater invertebrate species. DUSIA currently includes over ten hours of footage across 25 videos captured in 1080p at 30 fps by an ROV following pre-planned transects across the ocean floor near the Channel Islands of California. Each video includes annotations indicating the start and end times of substrates across the video, in addition to counts of species of interest. Some frames are annotated with precise bounding box locations for invertebrate species of interest, as seen in Fig. 1. To our knowledge, DUSIA is the first dataset of its kind for deep sea exploration, with video from a moving camera, that includes substrate annotations and invertebrate species present at significant depths where sunlight does not penetrate. Additionally, we present the novel context-driven object detector (CDD), in which explicit substrate classification guides an object detection network to simultaneously predict a substrate and the species classes associated with it. We also present a method for improving training on partially annotated bounding box frames. Finally, we offer a baseline method for automating the counting of invertebrate species of interest.

We explore using body gestures for hidden emotional state analysis. As an important non-verbal communication channel, human body gestures can convey emotional information during social communication. Previous works have focused mainly on facial expressions, speech, or expressive body gestures to interpret classical expressive emotions. In contrast, we focus on a specific group of body gestures, called micro-gestures (MGs), used in psychology research to interpret inner human feelings. MGs are subtle and spontaneous body movements that have been shown, together with micro-expressions, to be more reliable than ordinary facial expressions for conveying hidden emotional information. In this work, a comprehensive study of MGs is presented from the computer vision perspective, including a novel spontaneous micro-gesture (SMG) dataset with two emotional stress states and a comprehensive statistical analysis of the correlations between MGs and emotional states. Novel frameworks are further presented, together with various state-of-the-art methods as benchmarks, for automatic classification, online recognition of MGs, and emotional stress state recognition. The dataset and methods presented could inspire a new way of utilizing body gestures for human emotion understanding and bring a new direction to the emotion AI community. The source code and dataset are available at: https://github.com/mikecheninoulu/SMG.

Remote photoplethysmography (rPPG), which aims at measuring heart activity and physiological signals from facial video without any contact, has great potential in many applications (e.g., remote healthcare and affective computing). Recent deep learning approaches focus on mining subtle rPPG clues using convolutional neural networks with limited spatio-temporal receptive fields, which neglect the long-range spatio-temporal perception and interaction needed for rPPG modeling. In this paper, we propose two end-to-end video transformer based architectures, namely PhysFormer and PhysFormer++, to adaptively aggregate both local and global spatio-temporal features for rPPG representation enhancement. As key modules in PhysFormer, the temporal difference transformers first enhance the quasi-periodic rPPG features with temporal difference guided global attention and then refine the local spatio-temporal representation against interference. To better exploit the temporal contextual and periodic rPPG clues, we also extend PhysFormer to the two-pathway SlowFast based PhysFormer++ with temporal difference periodic and cross-attention transformers. Furthermore, we propose label distribution learning and a curriculum learning inspired dynamic constraint in the frequency domain, which provide elaborate supervision for PhysFormer and PhysFormer++ and alleviate overfitting. Comprehensive experiments on four benchmark datasets show superior performance in both intra- and cross-dataset testing. Unlike most transformer networks, which need pretraining on large-scale datasets, the proposed PhysFormer family can easily be trained from scratch on rPPG datasets, which makes it promising as a novel transformer baseline for the rPPG community.
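One way to read "temporal difference guided attention" is that the attention weights are computed from frame-to-frame feature differences, so that quasi-periodic changes (like pulse) dominate over static appearance. The sketch below is our speculative, simplified illustration of that idea, not PhysFormer's actual module; all names are ours.

```python
import numpy as np

def temporal_difference_attention(feats):
    """Self-attention over time where the attention logits come from
    temporal differences of the per-frame features (toy illustration).

    feats : (T, C) per-frame features
    """
    # frame deltas; prepend the first frame so shapes match (first delta is 0)
    diffs = np.diff(feats, axis=0, prepend=feats[:1])
    scale = np.sqrt(feats.shape[1])
    logits = diffs @ diffs.T / scale          # similarity of temporal changes
    logits -= logits.max(axis=1, keepdims=True)   # numerically stable softmax
    attn = np.exp(logits)
    attn /= attn.sum(axis=1, keepdims=True)
    return attn @ feats                        # attend over the original features
```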

Zero-shot learning (ZSL) requires one to associate visual and semantic information observed from data of seen classes, so that test data of unseen classes can be recognized based on the described semantic representation. Aiming at synthesizing visual data from given semantic inputs, hallucination-based ZSL approaches may suffer from mode collapse and bias owing to their limited ability to model the desired visual features for unseen categories. In this paper, we present a generative model, the Cross-Modal Consistency GAN (CMC-GAN), which performs semantics-guided intra-category knowledge transfer across image categories, so that data hallucination for unseen classes can be achieved with proper semantics and sufficient visual diversity. In our experiments, we perform standard and generalized ZSL on four benchmark datasets, confirming the effectiveness of our approach over state-of-the-art ZSL methods.

Modern image-based deblurring methods usually degrade in low-light conditions, since such images are dominated by poorly visible dark regions alongside a few saturated bright regions, limiting the effective features that can be extracted for deblurring. In contrast, event cameras trigger events with a very high dynamic range and low latency; they hardly suffer from saturation and naturally encode dense temporal information about motion. However, in low-light conditions existing event-based deblurring methods become less robust, since events triggered in dark regions are often severely contaminated by noise, leading to inaccurate reconstruction of the corresponding intensity values. Besides, because they directly adopt the event-based double integral model for pixel-wise reconstruction, they can only handle the low-resolution grayscale active pixel sensor images provided by the DAVIS camera, which cannot meet the requirements of daily photography. In this paper, to apply events to deblurring low-light images robustly, we propose a unified two-stage framework along with a motion-aware neural network tailored to it, reconstructing the sharp image under the guidance of high-fidelity motion clues extracted from events. We also build an RGB-DAVIS hybrid camera system to demonstrate that our method can deblur high-resolution RGB images thanks to the natural advantages of the two-stage framework. Experimental results show our method achieves state-of-the-art performance on both synthetic and real-world images.

We propose a new task called sentimental visual captioning, which generates captions with the inherent sentiment reflected by the input image or video. In contrast to the stylized visual captioning task, which requires a predefined style independent of the image or video, our new task automatically analyzes the inherent sentiment tendency of the visual content. With this in mind, we propose a multimodal Transformer model, namely Senti-Transformer, for sentimental visual captioning, which integrates both content and sentiment information from multiple modalities and incorporates prior sentimental knowledge to generate sentimental sentences. Specifically, we extract prior knowledge from a sentimental corpus to obtain sentimental textual information and design a multi-head Transformer encoder to encode multimodal features. We then decompose the attention layer in the middle of the Transformer decoder to focus on the important features of each modality, and the attended features are integrated through an intra- and inter-modality fusion mechanism to generate sentimental sentences. To effectively train the proposed model using the external sentimental corpus as well as the paired images or videos and factual sentences in existing captioning datasets, we propose a two-stage training strategy that first learns to incorporate sentimental elements into the sentences via a regularization term and then learns to generate fluent and relevant sentences with inherent sentimental styles via reinforcement learning with a sentimental reward. Extensive experiments on both image and video datasets demonstrate the effectiveness and superiority of our Senti-Transformer on sentimental visual captioning. Source code is available at https://github.com/ezeli/InSentiCap_ext.

Lifting the 2D human pose to the 3D pose is an important yet challenging task. Existing 3D human pose estimation methods suffer from (1) the inherent ambiguity between 2D and 3D data and (2) the lack of well-labeled 2D–3D pose pairs in the wild. Human beings can imagine the 3D human pose from a 2D image or a set of 2D body key-points with little ambiguity, which should be attributed to the prior knowledge of the human body that we have acquired. Inspired by this, we propose a new framework that leverages labeled 3D human poses to learn a 3D concept of the human body and thereby reduce ambiguity. To reach consensus on the body concept from the 2D pose, our key insight is to treat the 2D human pose and the 3D human pose as two different domains. By adapting the two domains, the body knowledge learned from 3D poses is applied to 2D poses and guides the 2D pose encoder to generate an informative 3D "imagination" as an embedding for pose lifting. Benefiting from the domain-adaptation perspective, the proposed framework unifies supervised and semi-supervised 3D pose estimation in a principled way. Extensive experiments demonstrate that the proposed approach achieves state-of-the-art performance on standard benchmarks. More importantly, we validate that the explicitly learned 3D body concept effectively alleviates the 2D–3D ambiguity, improves generalization, and enables the network to leverage abundant unlabeled 2D data.

In recent years, 3D data has been widely used in archaeology and in the conservation and restoration of cultural properties. Virtual restoration, which reconstructs the original state in virtual space, is one of the promising applications of 3D scanning data. Although many studies of virtual restoration have been conducted, it is still challenging to restore cultural properties that consist of multiple deformable components, because the original shape cannot be identified uniquely. As a solution to this problem, we propose a non-rigid 3D shape assembly method for virtually restoring wooden ships composed of multiple timbers. A deformed timber can be well represented by a ruled surface. We propose a free-form deformation method based on a ruled surface and an assembly method that aligns the deformable components mutually. The method employs a bottom-up approach that does not require reference data for the target objects. The proposed framework narrows down the search space for optimization using the physical constraints of wooden materials, and it can also obtain optimal solutions. We also present an experimental result in which we virtually restored King Khufu's first solar boat. The boat was originally constructed by assembling several timbers. It was reconstructed as a real object and is currently exhibited at a museum; unfortunately, however, the entire shape of the boat is slightly distorted. We applied the proposed method using archaeological knowledge and show the virtual restoration results obtained from the acquired 3D data of the boat's components.
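A ruled surface, the family used above to model deformed timbers, is swept by straight lines connecting two directrix curves: S(u, v) = (1 − v)·P(u) + v·Q(u). The snippet below illustrates only this standard definition; the curves are arbitrary stand-ins, not the paper's timber data.

```python
import numpy as np

def ruled_surface(p_curve, q_curve, u, v):
    """Point on a ruled surface: a straight line sweeps between two
    directrix curves P(u) and Q(u) as v runs from 0 to 1:
        S(u, v) = (1 - v) * P(u) + v * Q(u)
    """
    return (1.0 - v) * p_curve(u) + v * q_curve(u)
```

Because every cross-section at fixed u is a straight segment, fitting such a surface to a bent plank constrains the deformation to a low-dimensional, physically plausible family, which is what makes the optimization in the paper tractable.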

Neural Architecture Search (NAS) has demonstrated state-of-the-art performance on various computer vision tasks. Despite the superior performance achieved, the efficiency and generality of existing methods remain limited due to their high computational complexity and narrow applicability. In this paper, we propose an efficient and unified NAS framework termed DDPNAS via dynamic distribution pruning, facilitating a theoretical bound on accuracy and efficiency. In particular, we first sample architectures from a joint categorical distribution. Then the search space is dynamically pruned, and its distribution is updated every few epochs. With the proposed efficient network generation method, we directly obtain the optimal neural architectures under given constraints, which is practical for on-device models across diverse search spaces and constraints. The architectures searched by our method achieve remarkable top-1 accuracies of 97.56% and 77.2% on CIFAR-10 and ImageNet (mobile setting), respectively, with the fastest search process, i.e., only 1.8 GPU hours on a Tesla V100. Code for searching and network generation is available at: https://openi.pcl.ac.cn/PCL_AutoML/XNAS.
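The sample-update-prune loop can be sketched with a toy categorical distribution over per-layer operations; the reward function and update rule below are illustrative stand-ins, not the paper's estimator:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy search space: 4 layers, each choosing one of 5 candidate ops.
n_layers, n_ops = 4, 5
probs = np.full((n_layers, n_ops), 1.0 / n_ops)   # joint categorical distribution
alive = np.ones((n_layers, n_ops), dtype=bool)    # ops not yet pruned

def sample_architecture():
    """Draw one op index per layer from the current categorical distribution."""
    return np.array([rng.choice(n_ops, p=probs[l]) for l in range(n_layers)])

def update_and_prune(rewards, archs, lr=0.5):
    """Shift probability mass toward ops seen in high-reward architectures,
    then prune the currently least-likely live op in each layer."""
    global probs
    scores = np.zeros_like(probs)
    for r, arch in zip(rewards, archs):
        for l, op in enumerate(arch):
            scores[l, op] += r
    probs = probs + lr * scores / (scores.sum(axis=1, keepdims=True) + 1e-9)
    for l in range(n_layers):                     # dynamic distribution pruning
        live = np.flatnonzero(alive[l])
        if live.size > 1:
            worst = live[np.argmin(probs[l, live])]
            alive[l, worst] = False
    probs[~alive] = 0.0
    probs /= probs.sum(axis=1, keepdims=True)     # renormalize per layer

for t in range(n_ops - 1):                        # prune until one op per layer
    archs = [sample_architecture() for _ in range(32)]
    rewards = [float((arch == 0).sum()) for arch in archs]  # toy reward favouring op 0
    update_and_prune(rewards, archs)
```

After `n_ops - 1` pruning rounds each layer is left with a single live operation, i.e., the search has collapsed to one architecture.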

Fine-grained classification with few labeled samples is urgently needed in practice, since fine-grained samples are more difficult and expensive to collect and annotate. Standard few-shot learning (FSL) focuses on generalising across seen and unseen classes at the same level of granularity. Therefore, applying existing FSL methods to this problem requires large amounts of labeled samples for some fine-grained classes. Since samples of coarse-grained classes are much cheaper and easier to obtain, it is desirable to learn knowledge from coarse-grained categories that can be transferred to fine-grained classes with a few samples. In this paper, we propose a novel learning problem called cross-granularity few-shot learning (CG-FSL), where sufficient samples of coarse-grained classes are available for training, but the goal at test time is to classify the fine-grained subclasses. This learning paradigm is consistent with findings in cognitive neuroscience. We first analyse CG-FSL through a Structural Causal Model (SCM) and find that the standard FSL model learned at the coarse-grained level is actually a confounder. We therefore perform backdoor adjustment to decouple the interference and derive a causal CG-FSL model called Meta Attention-Generation Network (MAGN), which is trained in a bilevel optimization manner. We construct benchmarks from several fine-grained image datasets for the CG-FSL problem and empirically show that our model significantly outperforms standard FSL methods and baseline CG-FSL methods.
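Backdoor adjustment replaces the confounded conditional P(Y|X) with P(Y|do(X)) = Σ_z P(Y|X, Z=z) P(Z=z). A toy discrete SCM (all numbers invented for illustration) shows that the two can differ:

```python
import numpy as np

# Toy SCM with a binary confounder Z influencing both treatment X and outcome Y.
p_z = np.array([0.3, 0.7])                    # P(Z)
p_x_given_z = np.array([[0.9, 0.1],           # P(X | Z=0)
                        [0.2, 0.8]])          # P(X | Z=1)
p_y_given_xz = np.array([[[0.8, 0.2], [0.6, 0.4]],   # P(Y | X=0, Z=z)
                         [[0.5, 0.5], [0.1, 0.9]]])  # P(Y | X=1, Z=z)

def p_y_do_x(x):
    """Interventional distribution via backdoor adjustment over Z."""
    return sum(p_z[z] * p_y_given_xz[x][z] for z in range(2))

def p_y_given_x(x):
    """Observational conditional P(Y | X=x), confounded by Z."""
    joint = np.array([p_z[z] * p_x_given_z[z, x] * p_y_given_xz[x][z]
                      for z in range(2)]).sum(0)
    return joint / joint.sum()
```

Here P(Y|do(X=1)) = 0.3·[0.5, 0.5] + 0.7·[0.1, 0.9] = [0.22, 0.78], while the observational P(Y|X=1) weights Z by how likely it is to co-occur with X=1 and therefore differs.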

This paper introduces a set of numerical methods for Riemannian shape analysis of 3D surfaces within the setting of invariant (elastic) second-order Sobolev metrics. More specifically, we address the computation of geodesics and geodesic distances between parametrized or unparametrized immersed surfaces represented as 3D meshes. Building on this, we develop tools for the statistical shape analysis of sets of surfaces, including methods for estimating Karcher means and performing tangent PCA on shape populations, and for computing parallel transport along paths of surfaces. Our proposed approach fundamentally relies on a relaxed variational formulation for the geodesic matching problem via the use of varifold fidelity terms, which enable us to enforce reparametrization independence when computing geodesics between unparametrized surfaces, while also yielding versatile algorithms that allow us to compare surfaces with varying sampling or mesh structures. Importantly, we demonstrate how our relaxed variational framework can be extended to tackle partially observed data. The different benefits of our numerical pipeline are illustrated over various examples, synthetic and real.
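The Karcher-mean estimation used in such statistical shape analysis reduces, on the simplest curved space S¹, to averaging log-map residuals in the tangent space and mapping back with the exponential map. A minimal sketch of that fixed-point scheme (not the paper's surface-space solver):

```python
import numpy as np

def wrap(a):
    """Wrap angles to [-pi, pi): the log map at the current estimate on S^1."""
    return (a + np.pi) % (2 * np.pi) - np.pi

def karcher_mean_s1(angles, iters=100, step=1.0):
    """Karcher (Fréchet) mean of points on the unit circle.

    Repeatedly lifts the data to the tangent space at the current estimate
    (via wrap), averages the residuals, and steps along the average; the
    same template generalizes to Karcher means on richer shape spaces.
    """
    mu = angles[0]
    for _ in range(iters):
        mu = mu + step * np.mean(wrap(np.asarray(angles) - mu))
    return wrap(mu)
```

For nearby points this coincides with the arithmetic mean; for points straddling the cut locus (e.g., near ±π) it returns the geometrically correct average, which a naive arithmetic mean would miss.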

We propose a novel end-to-end curriculum learning approach for sparsely labelled animal datasets that leverages large volumes of unlabelled data to improve supervised species detectors. We exemplify the method in detail on the task of finding great apes in camera trap footage taken in challenging real-world jungle environments. In contrast to previous semi-supervised methods, our approach adjusts learning parameters dynamically over time and gradually improves detection quality by steering training towards virtuous self-reinforcement. To achieve this, we propose integrating pseudo-labelling with curriculum learning policies and show how learning collapse can be avoided. We discuss theoretical arguments, ablations, and significant performance improvements over various state-of-the-art systems when evaluating on the Extended PanAfrican Dataset, which holds approx. 1.8M frames. We also demonstrate that our method can outperform supervised baselines by significant margins on sparse-label versions of other animal datasets such as Bees and Snapshot Serengeti. We note that performance advantages are strongest for the smaller labelled ratios common in ecological applications. Finally, we show that our approach achieves competitive benchmarks for generic object detection on MS-COCO and PASCAL-VOC, indicating the wider applicability of the dynamic learning concepts introduced. We publish all relevant source code, network weights, and data access details for full reproducibility.

Vision transformers have shown great potential in various computer vision tasks owing to their strong capability to model long-range dependency using the self-attention mechanism. Nevertheless, they treat an image as a 1D sequence of visual tokens and lack an intrinsic inductive bias (IB) for modeling local visual structures and dealing with scale variance, which is instead learned implicitly from large-scale training data over longer training schedules. In this paper, we introduce these two IBs and propose the ViTAE transformer, which utilizes a reduction cell for multi-scale features and a normal cell for locality. The two kinds of cells are stacked in both isotropic and multi-stage manners to formulate two families of ViTAE models, i.e., the vanilla ViTAE and ViTAEv2. Experiments on the ImageNet dataset as well as downstream tasks on the MS COCO, ADE20K, and AP10K datasets validate the superiority of our models over baseline and representative models. Moreover, we scale up our ViTAE model to 644M parameters and obtain state-of-the-art classification performance, i.e., 88.5% Top-1 classification accuracy on the ImageNet validation set and the best 91.2% Top-1 classification accuracy on the ImageNet Real validation set, without using extra private data. This demonstrates that the introduced inductive bias still helps when the model size becomes large. The source code and pretrained models are publicly available.

Face frontalization consists of synthesizing a frontal view from a profile one. This paper proposes a frontalization method that preserves non-rigid facial deformations, i.e. facial expressions. It is shown that expression-preserving frontalization boosts the performance of visually assisted speech processing. The method alternates between the estimation of (i) the rigid transformation (scale, rotation, and translation) and (ii) the non-rigid deformation between an arbitrarily-viewed face and a face model. The method has two important merits: it can deal with non-Gaussian errors in the data, and it incorporates a dynamical face deformation model. For that purpose, we use the Student’s t-distribution in combination with a Bayesian filter in order to account for both rigid head motions and time-varying facial deformations, e.g. those caused by speech production. The zero-mean normalized cross-correlation score is used to evaluate the ability of the method to preserve facial expressions. The method is thoroughly evaluated and compared with several state-of-the-art methods, based either on traditional geometric models or on deep learning. Moreover, we show that the method, when incorporated into speech processing pipelines, improves word recognition rates and speech intelligibility scores by a considerable margin.
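The zero-mean normalized cross-correlation score mentioned above is invariant to affine intensity changes, which makes it a natural measure of expression preservation. A minimal sketch:

```python
import numpy as np

def zncc(a, b, eps=1e-12):
    """Zero-mean normalized cross-correlation between two signals/images.

    Returns 1 for inputs equal up to a positive scale and offset,
    -1 for an inverted match, and values near 0 for unrelated inputs.
    """
    a = np.ravel(a).astype(float)
    b = np.ravel(b).astype(float)
    a = a - a.mean()
    b = b - b.mean()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))
```

Because the means are removed and the norms divided out, a frontalized face that preserves an expression's intensity pattern scores near 1 regardless of global brightness or contrast changes.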

The potential of video surveillance can be further explored by using mobile cameras. Drone-mounted cameras at high altitude can provide top views of a scene from a global perspective, while cameras worn by people on the ground can provide first-person views of the same scene with more local detail. To relate these two views for collaborative analysis, we propose to localize the field of view of the first-person-view cameras in the global top view. This is a very challenging problem due to the large view differences and indeterminate camera motions. In this work, we explore the use of sunlight direction as a bridge to relate the two views. Specifically, we design a shadow-direction-aware network to simultaneously locate the shadow vanishing point in the first-person view and the shadow direction in the top view. We then apply multi-view geometry to estimate the yaw and pitch angles of the first-person-view camera in the top view. We build a new synthetic dataset consisting of top-view and first-person-view image pairs for performance evaluation. Quantitative results on this synthetic dataset show the superiority of our method over existing methods, with view-angle estimation errors of 1.61° (pitch angle) and 15.13° (yaw angle). Qualitative results on real images also show the effectiveness of the proposed method.

In computer vision, some attribution methods for explaining CNNs attempt to study how intermediate features affect network predictions. However, they usually ignore the feature hierarchies among the intermediate features. This paper introduces a hierarchical decomposition framework to explain a CNN’s decision-making process in a top-down manner. Specifically, we propose a gradient-based activation propagation (gAP) module that can decompose any intermediate CNN decision to its lower layers and find the supporting features. We then utilize the gAP module to iteratively decompose the network decision into supporting evidence from different CNN layers. The proposed framework can generate a deep hierarchy of strongly associated supporting evidence for the network decision, which provides insight into the decision-making process. Moreover, gAP can be applied to CNN-based models without architecture modification or extra training. Experiments show the effectiveness of the proposed method. The data and source code will be publicly available at https://mmcheng.net/hdecomp/.
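For a (locally) linear scoring head, the gradient-times-activation product decomposes a decision exactly into per-feature contributions, which is the kind of lower-layer supporting evidence a decomposition like gAP propagates. A toy sketch, not the paper's gAP module:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setting: flattened intermediate activations a feed a linear scoring head w.
a = rng.random(8)            # intermediate activations of one layer
w = rng.normal(size=8)       # weights of the scoring head

score = w @ a                # the network decision for one class
grad = w                     # d(score)/d(a) for the linear head
contributions = grad * a     # per-feature supporting evidence

# Features can then be ranked by how strongly they support the decision,
# and the decomposition can be repeated one layer further down.
top_features = np.argsort(contributions)[::-1][:3]
```

The completeness property, that the contributions sum back to the score, is what makes such a decomposition an attribution rather than a mere saliency heuristic.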

Video enhancement is a challenging problem, more so than still-image enhancement, mainly due to high computational cost, larger data volumes, and the difficulty of achieving consistency in the spatio-temporal domain. In practice, these challenges are often coupled with a lack of example pairs, which inhibits the application of supervised learning strategies. To address these challenges, we propose an efficient adversarial video enhancement framework that learns directly from unpaired video examples. In particular, our framework introduces new recurrent cells that consist of interleaved local and global modules for implicit integration of spatial and temporal information. The proposed design allows our recurrent cells to efficiently propagate spatio-temporal information across frames and reduces the need for high-complexity networks. Our setting enables learning from unpaired videos in a cyclic adversarial manner, where the proposed recurrent units are employed in all architectures. Efficient training is accomplished by introducing a single discriminator that learns the joint distribution of the source and target domains simultaneously. The enhancement results demonstrate clear superiority of the proposed video enhancer over state-of-the-art methods in terms of visual quality, quantitative metrics, and inference speed. Notably, our video enhancer is capable of enhancing FullHD video (1080×1920) at over 35 frames per second.

The development of neural relighting techniques has by far outpaced the rate of their corresponding training data ( e.g., OLAT) generation. For example, high-quality relighting from a single portrait image still requires supervision from comprehensive datasets covering broad diversities in gender, race, complexion, and facial geometry. We present a hybrid parametric neural relighting (PN-Relighting) framework for single portrait relighting, using a much smaller OLAT dataset or SMOLAT. At the core of PN-Relighting, we employ parametric 3D faces coupled with appearance inference and implicit material modelling to enrich SMOLAT for handling in-the-wild images. Specifically, we tailor an appearance inference module to generate detailed geometry and albedo on top of the parametric face and develop a neural rendering module to first construct an implicit material representation from SMOLAT and then conduct self-supervised training on in-the-wild image datasets. Comprehensive experiments show that PN-Relighting produces comparable high-quality relighting to TotalRelighting (Pandey et al., 2021), but with a smaller dataset. It further improves shape estimation and naturally supports free-viewpoint rendering and partial skin material editing. PN-Relighting also serves as a data augmenter to produce rich OLAT datasets beyond the original capture.

We give robust recovery results for synchronization on the rotation group, SO(D). In particular, we consider an adversarial corruption setting, where a limited percentage of the observations are arbitrarily corrupted. We develop a novel algorithm that exploits Tukey depth in the tangent space of SO(D). This algorithm, called Depth Descent Synchronization, exactly recovers the underlying rotations up to an outlier percentage of 1/(D(D-1)+2), which corresponds to 1/4 for SO(2) and 1/8 for SO(3). In the case of SO(2), we demonstrate that a variant of this algorithm converges linearly to the ground-truth rotations. We implement this algorithm for the case of SO(3) and demonstrate that it performs competitively against baselines on synthetic data.
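In one dimension the Tukey median reduces to the ordinary median, so the flavour of depth-based descent can be illustrated on SO(2), where rotations are angles modulo 2π. The coordinate-wise median scheme below is a toy stand-in, not the paper's Depth Descent Synchronization algorithm:

```python
import numpy as np

def wrap(a):
    """Wrap angles to [-pi, pi)."""
    return (a + np.pi) % (2 * np.pi) - np.pi

def robust_so2_sync(n, edges, meas, iters=100):
    """Toy robust rotation synchronization on SO(2).

    Each edge (i, j) observes theta_i - theta_j; a minority of edges is
    arbitrarily corrupted.  Every node repeatedly steps by the median of
    its wrapped neighbour residuals (in 1-D, the median maximizes Tukey
    depth), which ignores the adversarial edges.
    """
    theta = np.zeros(n)
    for _ in range(iters):
        for i in range(n):
            res = []
            for (a, b), m in zip(edges, meas):
                if a == i:
                    res.append(wrap(theta[b] + m - theta[i]))
                elif b == i:
                    res.append(wrap(theta[a] - m - theta[i]))
            theta[i] += np.median(res)
    return wrap(theta - theta[0])  # fix the global gauge at node 0

# Complete graph on 6 nodes; two node-disjoint edges are adversarially corrupted.
truth = np.array([0.0, 0.4, -0.3, 0.9, -0.7, 0.2])
edges = [(i, j) for i in range(6) for j in range(i + 1, 6)]
meas = [truth[i] - truth[j] for i, j in edges]
meas[edges.index((0, 1))] = 2.5    # corrupted observation
meas[edges.index((2, 3))] = -2.0   # corrupted observation

est = robust_so2_sync(6, edges, meas)
```

Since each node sees at most one corrupted residual among its five neighbours, the median update is driven entirely by inlier edges and the scheme settles on the ground-truth angles up to a global rotation.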

Recent advancements in deep learning have enabled 3D human body reconstruction from a monocular image, which has broad applications in multiple domains. In this paper, we propose SHARP (SHape Aware Reconstruction of People in loose clothing), a novel end-to-end trainable network that accurately recovers the 3D geometry and appearance of humans in loose clothing from a monocular image. SHARP uses a sparse and efficient fusion strategy to combine a parametric body prior with a non-parametric 2D representation of clothed humans. The parametric body prior enforces geometrical consistency on the body shape and pose, while the non-parametric representation models loose clothing and handles self-occlusions. We also leverage the sparseness of the non-parametric representation for faster training of our network while using losses on 2D maps. Another key contribution is 3DHumans, our new life-like dataset of 3D human body scans with rich geometrical and textural details. We evaluate SHARP on 3DHumans and other publicly available datasets and show superior qualitative and quantitative performance compared with existing state-of-the-art methods.

Existing state-of-the-art methods for 3D point cloud understanding perform well only in a fully supervised manner. To the best of our knowledge, there exists no unified framework that simultaneously solves the downstream high-level understanding tasks of both segmentation and detection, especially when labels are extremely limited. This work presents a general and simple framework for point cloud understanding when labels are limited. Our first contribution is an extensive methodological comparison of traditional and learnt 3D descriptors for the task of weakly supervised 3D scene understanding, validating that our adapted traditional PFH-based 3D descriptors show excellent generalization ability across different domains. Our second contribution is a learning-based region merging strategy based on the affinity provided by both the traditional/learnt 3D descriptors and learnt semantics. The merging process takes both low-level geometric and high-level semantic feature correlations into consideration. Experimental results demonstrate that our framework achieves the best performance on the three most important weakly supervised point cloud understanding tasks, namely semantic segmentation, instance segmentation, and object detection, even when only a very limited number of points is labeled. Our method, termed Region Merging 3D (RM3D), achieves superior performance on the ScanNet data-efficient learning online benchmarks and four other large-scale 3D understanding benchmarks under various experimental settings, outperforming current state-of-the-art methods by a clear margin for various 3D understanding tasks without complicated learning strategies such as active learning.

With the development of computer vision technology, many advanced methods have been successfully applied to animal detection, tracking, recognition, and behavior analysis, which greatly aids ecological protection, biodiversity conservation, and environmental protection. Since existing object-tracking datasets contain various kinds of common objects but rarely focus on wild animals, this paper proposes the first such benchmark, named the Wild Animal Tracking Benchmark (WATB), to encourage further progress in research and applications of visual object tracking. WATB contains more than 203,000 frames and 206 video sequences, covering different kinds of animals from land, sea, and sky. The average video length is over 980 frames. Each video is manually labelled with thirteen challenge attributes, including illumination variation, rotation, and deformation. All frames in the dataset are annotated with axis-aligned bounding boxes. To reveal the performance of existing tracking algorithms and provide baseline results for future research on wild animal tracking, we benchmark a total of 38 state-of-the-art trackers and rank them according to tracking accuracy. The evaluation results demonstrate that trackers based on deep networks perform much better than other trackers, such as correlation filters, and that wild animal tracking remains a significant challenge for the computer vision community. The WATB benchmark and evaluation results are released on the project website https://w-1995.github.io/.
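Ranking trackers by accuracy on axis-aligned bounding boxes typically relies on intersection-over-union overlap between predicted and ground-truth boxes. A minimal sketch of such a per-sequence success score (the exact WATB protocol may differ):

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))  # overlap width
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))  # overlap height
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def success_rate(pred_boxes, gt_boxes, threshold=0.5):
    """Fraction of frames whose predicted box overlaps ground truth above
    the threshold; sweeping the threshold yields a success curve."""
    overlaps = [iou(p, g) for p, g in zip(pred_boxes, gt_boxes)]
    return float(np.mean([o >= threshold for o in overlaps]))
```

Averaging this score over all sequences of a benchmark gives a single number by which trackers can be ranked.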

We develop in this paper a new formulation for the calibration and pose estimation of three-layer flat-refractive geometry. We then extend it into the general n-layer case. We show that contrary to state-of-the-art theory, our new forms yield stable solutions under the presence of high levels of noise while improving the accuracy estimates by two orders of magnitude or more. Our closed-form expressions facilitate a direct and linear solution of the pose parameters and need no a priori knowledge. In developing our forms, we provide new insights into the nature of such systems, specifically showing that the effect of refraction on the system is captured by a single surface replacing all layers. By characterizing this surface form, we solve the n-layer problem using a fixed set of parameters, irrespective of the number of layers. Finally, we identify a new configuration in which the camera obtains central-perspective properties. Such configuration contrasts the axial nature of this system and can simplify subsequent processing. Our analyses demonstrate that reaching the levels of accuracies we record requires only simple means, e.g., a single image of a planar target. Hence, further to advancing the theory of flat-refractive geometry, we also provide a viable framework for its application.
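Flat parallel interfaces preserve the Snell invariant n·sin(θ), which is the basic geometric reason a whole stack of layers can be summarized compactly. A minimal ray-angle sketch under hypothetical refractive indices:

```python
import numpy as np

def refract_through_layers(theta_first, indices):
    """Propagate a ray's in-plane angle through flat parallel interfaces.

    Snell's law makes n_k * sin(theta_k) invariant across every layer,
    so the angle in each medium follows from a single scalar, no matter
    how many layers the stack contains.

    theta_first: incidence angle in the first medium (radians);
    indices: refractive index of each successive medium.
    """
    invariant = indices[0] * np.sin(theta_first)  # preserved across all layers
    return [float(np.arcsin(invariant / n)) for n in indices]

# Hypothetical air -> glass port -> water housing, as in underwater imaging.
angles = refract_through_layers(np.deg2rad(30), indices=[1.0, 1.5, 1.33])
```

Tracing many such rays through the stack and intersecting them with the optical axis is one way to visualize how refraction distorts an otherwise central camera.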

Recent years have witnessed growing interest in RGB-D Salient Object Detection (SOD), which benefits from the ample spatial layout cues embedded in depth maps to help SOD models distinguish salient objects from complex backgrounds or similar surroundings. Despite this progress, this emerging line of research has been considerably hindered by the noise and ambiguity that prevail in raw depth images, as well as by the coarse object boundaries in saliency predictions. To address these issues, we propose a Depth Calibration and Boundary-aware Fusion (DCBF) framework that contains two novel components: (1) a learning strategy to calibrate the latent bias in the original depth maps towards boosting the SOD performance; (2) a boundary-aware multimodal fusion module to fuse the complementary cues from the RGB and depth channels and to improve object boundary quality. In addition, we introduce a new saliency dataset, HiBo-UA, which contains 1515 high-resolution RGB-D images with finely-annotated pixel-level labels. To the best of our knowledge, this is the first RGB-D-based high-resolution saliency dataset, with significantly higher image resolution (nearly 7×) than the widely used DUT-D dataset. The proposed high-resolution dataset, with its richer object boundary details, makes it possible to accurately assess the ability of saliency models to retain fine-grained object boundaries. It also addresses the growing need of our research community for higher-resolution data. Extensive empirical experiments demonstrate the superior performance of our approach against 31 state-of-the-art methods. Notably, our calibrated depth alone can work in a plug-and-play manner; empirically, it is shown to bring noticeable improvements when applied to existing state-of-the-art RGB-D SOD models.

This paper aims to craft adversarial queries for image retrieval, which uses image features for similarity measurement. Many commonly used attack methods were developed in the context of image classification. However, these methods, which attack prediction probabilities, only exert an indirect influence on the image features and are thus found less effective when applied to the retrieval problem. In designing an attack method specifically for image retrieval, we introduce the opposite-direction feature attack (ODFA), a white-box approach that directly attacks query image features to generate adversarial queries. As the name implies, the main idea underpinning ODFA is to impel the original image feature in the opposite direction, similar to a U-turn. This simple idea is experimentally evaluated on five retrieval datasets. We show that the adversarial queries generated by ODFA cause true matches to disappear from the top ranks, and that the attack success rate is consistently higher than that of classifier attack methods. In addition, our method of creating adversarial queries can be extended to multi-scale query inputs and is generalizable to other retrieval models without foreknowing their weights, i.e., the black-box setting.
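The U-turn idea can be sketched with a stand-in linear feature extractor (the real attack differentiates through a deep retrieval network): perturb the query until its feature aligns with the negative of its original feature, so nearest-neighbour search no longer surfaces the true matches.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a retrieval model: a fixed linear feature extractor.
W = rng.normal(size=(16, 32))
extract = lambda x: W @ x

def odfa_attack(x, steps=50, lr=0.05):
    """Opposite-direction feature attack (sketch).

    Maximizes <f(x'), -f(x)/|f(x)|> by gradient ascent; for the linear
    extractor above the gradient w.r.t. x' is simply W^T target.
    """
    target = -extract(x)
    target = target / np.linalg.norm(target)
    x_adv = x.copy()
    for _ in range(steps):
        x_adv += lr * (W.T @ target)   # push the feature toward the U-turn target
    return x_adv

x = rng.random(32)
x_adv = odfa_attack(x)
```

After the attack the adversarial feature points away from the original one, so items that were nearest neighbours of the clean query drop down the ranking.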

Relating behavior to brain activity in animals is a fundamental goal in neuroscience, with practical applications in building robust brain-machine interfaces. However, the domain gap between individuals is a major issue that prevents the training of general models that work on unlabeled subjects. Since 3D pose data can now be reliably extracted from multi-view video sequences without manual intervention, we propose using it to guide the encoding of neural action representations, together with a set of neural and behavioral augmentations that exploit the properties of microscopy imaging. To test our method, we collect a large dataset featuring flies and their neural activity. To reduce the domain gap, during training we mix features of neural and behavioral data across flies that appear to be performing similar actions. To show that our method generalizes to further neural modalities and other downstream tasks, we evaluate it on a human Electrocorticography dataset and on an RGB video dataset of human activities captured from different viewpoints. We believe our work will enable more robust neural decoding algorithms to be used in future brain-machine interfaces.

Depth completion aims to predict a dense depth map from a sparse depth input. The acquisition of dense ground-truth annotations for depth completion settings can be difficult, and, at the same time, a significant domain gap between real LiDAR measurements and synthetic data has prevented the successful training of models in virtual settings. We propose a domain adaptation approach for sparse-to-dense depth completion that is trained from synthetic data, without annotations in the real domain or additional sensors. Our approach simulates the real sensor noise in an RGB + LiDAR set-up and consists of three modules: simulating the real LiDAR input in the synthetic domain via projections, filtering the real noisy LiDAR for supervision, and adapting the synthetic RGB image using a CycleGAN approach. We extensively evaluate these modules on the KITTI depth completion benchmark.
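The projection-based LiDAR simulation can be caricatured by keeping dense synthetic depth only along a few scanlines, mimicking spinning-LiDAR beams projected into the image, plus random dropout; the parameters and layout here are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_lidar(dense_depth, n_scanlines=8, keep_prob=0.9):
    """Sparsify a dense synthetic depth map into a LiDAR-like input.

    Keeps depth only on evenly spaced image rows (a crude stand-in for
    projected LiDAR beams) and randomly drops a fraction of the
    surviving points to mimic missing returns.
    """
    h, w = dense_depth.shape
    sparse = np.zeros_like(dense_depth)
    rows = np.linspace(0, h - 1, n_scanlines).round().astype(int)
    keep = rng.random((n_scanlines, w)) < keep_prob
    sparse[rows] = np.where(keep, dense_depth[rows], 0.0)
    return sparse

dense = rng.uniform(1.0, 80.0, size=(64, 128))   # synthetic ground-truth depth
sparse = simulate_lidar(dense)
```

Training on (sparse, dense) pairs generated this way exposes the network to input sparsity patterns closer to what a real sensor produces than uniformly subsampled depth would.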

Accurate real depth annotations are difficult to acquire, requiring the use of special devices such as a LiDAR sensor. Self-supervised methods try to overcome this problem by processing video or stereo sequences, which may not always be available. Instead, in this paper, we propose a domain adaptation approach to train a monocular depth estimation model using a fully-annotated source dataset and a non-annotated target dataset. We bridge the domain gap by leveraging semantic predictions and low-level edge features to provide guidance for the target domain. We enforce consistency between the main model and a second model trained with semantic segmentation and edge maps, and introduce priors in the form of instance heights. Our approach is evaluated on standard domain adaptation benchmarks for monocular depth estimation and shows consistent improvements over the state of the art. Code is available at https://github.com/alopezgit/DESC.
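An instance-height prior follows from the pinhole model: an object of known real-world height H that appears h pixels tall under focal length f (in pixels) lies at depth z = f·H/h. A minimal sketch with invented numbers:

```python
def depth_from_height(focal_px, real_height_m, pixel_height):
    """Pinhole-camera height prior: z = f * H / h.

    Such per-instance priors give metric-scale guidance in a target
    domain where no depth labels exist.
    """
    return focal_px * real_height_m / pixel_height

# Hypothetical example: a ~1.5 m tall car spanning 75 px with f = 720 px.
z = depth_from_height(focal_px=720, real_height_m=1.5, pixel_height=75)
```

The same instance appearing half as tall in the image would be placed twice as far away, which is exactly the constraint the prior contributes.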

In this work, we focus on outdoor lighting estimation by aggregating individual noisy estimates from images, exploiting the rich image information from wide-angle cameras and/or temporal image sequences. Photographs inherently encode information about the scene lighting in the form of shading and shadows. Recovering the lighting is an inverse rendering problem and, as such, ill-posed. Recent research based on deep neural networks has shown promising results for estimating light from a single image, but with shortcomings in robustness. We tackle this problem by combining lighting estimates from several image views sampled in the angular and temporal domains of an image sequence. For this task, we introduce a transformer architecture that is trained in an end-to-end fashion, without any statistical post-processing as required by previous work. In doing so, we propose a positional encoding that takes camera alignment and ego-motion estimation into account to globally register the individual estimates when computing attention between visual words. We show that our method leads to improved lighting estimation while requiring fewer hyperparameters compared to the state of the art.