Article

Deep High-Resolution Representation Learning for Visual Recognition


Abstract

High-resolution representations are essential for position-sensitive vision problems, such as human pose estimation, semantic segmentation, and object detection. Existing state-of-the-art frameworks first encode the input image as a low-resolution representation through a subnetwork formed by connecting high-to-low resolution convolutions in series (e.g., ResNet, VGGNet), and then recover the high-resolution representation from the encoded low-resolution representation. Instead, our proposed network, named the High-Resolution Network (HRNet), maintains high-resolution representations throughout the whole process. There are two key characteristics: (i) the high-to-low resolution convolution streams are connected in parallel, and (ii) information is repeatedly exchanged across resolutions. The benefit is that the resulting representation is semantically richer and spatially more precise. We show the superiority of the proposed HRNet in a wide range of applications, including human pose estimation, semantic segmentation, and object detection, suggesting that HRNet is a stronger backbone for computer vision problems. All the code is available at https://github.com/HRNet.
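As a rough illustration of the two characteristics above, the following sketch wires two parallel streams together with a single exchange unit that sums features after resizing each stream to the other's resolution; the channel counts, strided-convolution downsampling, and fusion-by-summation are illustrative assumptions rather than the released HRNet code.

```python
# Minimal sketch of HRNet-style cross-resolution fusion (illustrative, not the
# official implementation). Two parallel streams are kept at full and half
# resolution; an exchange unit sums features after resizing each stream to the
# other's resolution.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExchangeUnit(nn.Module):
    def __init__(self, c_high=32, c_low=64):
        super().__init__()
        # high -> low: strided 3x3 conv downsamples and matches channels
        self.down = nn.Conv2d(c_high, c_low, 3, stride=2, padding=1)
        # low -> high: 1x1 conv matches channels, bilinear upsampling restores size
        self.up = nn.Conv2d(c_low, c_high, 1)

    def forward(self, x_high, x_low):
        high_out = x_high + F.interpolate(self.up(x_low),
                                          size=x_high.shape[-2:],
                                          mode="bilinear", align_corners=False)
        low_out = x_low + self.down(x_high)
        return high_out, low_out

x_high = torch.randn(1, 32, 64, 64)   # high-resolution stream
x_low = torch.randn(1, 64, 32, 32)    # low-resolution stream
h, l = ExchangeUnit()(x_high, x_low)
print(h.shape, l.shape)  # torch.Size([1, 32, 64, 64]) torch.Size([1, 64, 32, 32])
```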


... The main multi-person pose estimation networks proposed to date include Deepcut [15], HRNet [16], Personlab [17] and others. Deepcut proposes a method that solves detection and pose estimation together: first, all keypoints in the image are extracted as nodes to form a density map by a CNN, then body positions are obtained by a Faster R-CNN combined with the density map, and finally each person's keypoints are classified separately. ...
... By introducing a lightweight C-UIB module, the EPLC-Pose network reduced the detection time for single-frame images by 16.91% compared to the benchmark network. In addition, the detection time of 8.475 ms for single-frame images is significantly lower than the maximum threshold for real-time video processing (about 100 ms), which verifies the feasibility of real-time application of the EPLC-Pose network in smart classrooms. ...
... In order to investigate the superiority of the improved EPLC-Pose network over currently popular classroom behaviour detection networks and multi-person pose recognition networks, this paper compares EPLC-Pose with these networks (YOLOv5 [32], YOLOv7 [33], YOLOv8pose [28], YOLOv10 [34], G-RMI [35], HRNet [16], DERK [36], YOLO-Pose [37]), and the experimental results are shown in Table IV and Table V. ...
Article
Full-text available
With the development of intelligence, concepts such as "smart classroom" have emerged, and the intelligent recognition of classroom behavior has become a hot research topic. However, the current intelligent recognition of classroom behavior still faces challenges such as complex single-person poses, severe occlusions, sampling from non-real teaching classrooms, and the lack of lightweight recognition networks. We aim to use deep learning-based human pose estimation methods to accurately estimate and predict student behaviors in real classrooms and design a network that prioritizes lightweight architecture. Therefore, in this paper, we propose a lightweight network called EPLC-Pose (Efficient Panoramic Lightweight Classroom Pose Network), a novel architectural approach based on the YOLOv8-Pose framework. The network’s backbone and neck components have been enhanced by replacing the traditional C2F module with the innovative C-UIB module. Due to the advanced design of the C-UIB module, the number of parameters has been significantly reduced. Additionally, to address the issue of limb occlusion, the SEAM attention mechanism is introduced in the Neck part. To evaluate our method, we have created a comprehensive panoramic classroom behavior pose dataset (CPKD) consisting of 6000 images. The network has shown competitive results on both the CPKD and MS COCO datasets. Compared to the baseline model, our approach achieves a 46.37% reduction in parameters, a 48.57% decrease in GFLOPS, and an improvement in mAP50 by 1% and 0.4% respectively.
... HRNet HRNet [27] begins with a high-resolution convolutional stem in its initial stage and progressively incorporates high-to-low resolution streams in subsequent stages. These multi-resolution streams run parallel to each other. ...
... These multi-resolution streams run parallel to each other. The core of HRNet [27] comprises a series of stages where information is repeatedly exchanged between different resolutions. Fig. 3 illustrates the structure of HRNet. ...
... To enhance the feature extraction process, we replace the 3 × 3 convolutions in the downsampling blocks (from high resolution to low resolution) with ShuffleBlock. Additionally, instead of using interpolation as in HRNet [27], we employ 3 × 3 deconvolution layers for upsampling from low resolution to high resolution. This approach enhances the detailed reconstruction of high-resolution features. ...
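The excerpt above swaps HRNet's interpolation-based upsampling for 3 × 3 deconvolutions in the low-to-high path. A minimal sketch contrasting the two options is given below; the channel counts and strides are assumptions, not the cited paper's configuration.

```python
# Two ways to bring a low-resolution stream back to high resolution, as
# contrasted in the excerpt above: bilinear interpolation (HRNet's default)
# versus a learned 3x3 transposed convolution. Channel counts are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

x_low = torch.randn(1, 64, 32, 32)

# (a) interpolation + 1x1 channel projection
proj = nn.Conv2d(64, 32, 1)
up_interp = F.interpolate(proj(x_low), scale_factor=2, mode="bilinear",
                          align_corners=False)

# (b) learned upsampling with a 3x3 transposed convolution (stride 2)
deconv = nn.ConvTranspose2d(64, 32, kernel_size=3, stride=2,
                            padding=1, output_padding=1)
up_deconv = deconv(x_low)

print(up_interp.shape, up_deconv.shape)  # both torch.Size([1, 32, 64, 64])
```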
Article
Full-text available
Despite the significant advancements in learning-based stereo matching algorithms, a significant challenge remains: the high computational cost and memory demands of 3D convolutions, which hinder real-time deployment on resource-constrained platforms like edge devices. In this paper, we present a novel approach that completely avoids the use of 3D convolutions, with the goal of achieving faster inference speeds while maintaining comparable accuracy to existing state-of-the-art methods. Our proposed solution revolves around a 2D cost aggregation technique, which serves as a viable alternative to traditional 3D convolutions, delivering similar results in terms of accuracy. This innovative method significantly reduces computational overhead, facilitating more efficient resource utilization and paving the way for real-time applications. Complementing our 2D cost aggregation module, we introduce a multi-stage feature extractor, designed to enhance feature representation while remaining straightforward and lightweight. The integration of the 2D cost aggregation and multi-stage feature extraction results in an efficient architecture for cost aggregation, simplifying the model and ensuring computational efficiency without sacrificing accuracy. This framework delivers high-performance stereo matching suitable for devices with limited computational capabilities. Through evaluation on benchmark datasets, we demonstrate the effectiveness of our approach, highlighting its potential for real-time 3D perception in applications. By addressing the constraints imposed by 3D convolutions and offering a more pragmatic alternative, this work bridges the gap between high-performance stereo matching algorithms and the realities of edge computing environments.
... This surge is driven by the critical need for reliable performance in autonomous driving, as it directly impacts human safety. Nevertheless, previous approaches [15], [26], [27] relying solely on visual cues from RGB sensors often struggle in adverse scenarios such as low-light conditions, dense fog, and heavy rain. To tackle this issue, thermal sensors that capture the heat signatures of objects have been extensively utilized for achieving reliable semantic segmentation in challenging conditions [17], [18]. ...
... where H, W, N, and C denote the height, width, number of pixels, and number of classes. Pseudo-labels y_PL are generated by HRNet [26] pre-trained on a large-scale RGB dataset [2]. Cross-Spectral Prototypes. ...
... An overview of our framework. (a) In stage 1, both RGB and thermal networks are trained in the daytime using pseudo-labels y_PL generated by HRNet [26] pre-trained on a large-scale RGB dataset [2]. Simultaneously, Masked Mutual Learning (MML) is applied between these student networks, and cross-spectral prototypes η_RT are gradually updated during training. ...
Preprint
In autonomous driving, thermal image semantic segmentation has emerged as a critical research area, owing to its ability to provide robust scene understanding under adverse visual conditions. In particular, unsupervised domain adaptation (UDA) for thermal image segmentation can be an efficient solution to address the lack of labeled thermal datasets. Nevertheless, since these methods do not effectively utilize the complementary information between RGB and thermal images, they significantly decrease performance during domain adaptation. In this paper, we present a comprehensive study on cross-spectral UDA for thermal image semantic segmentation. We first propose a novel masked mutual learning strategy that promotes complementary information exchange by selectively transferring results between each spectral model while masking out uncertain regions. Additionally, we introduce a novel prototypical self-supervised loss designed to enhance the performance of the thermal segmentation model in nighttime scenarios. This approach addresses the limitations of RGB pre-trained networks, which cannot effectively transfer knowledge under low illumination due to the inherent constraints of RGB sensors. In experiments, our method achieves higher performance over previous UDA methods and comparable performance to state-of-the-art supervised methods.
... However, the multi-stage downsampling operation leads to the gradual loss of spatial information, which is particularly crucial for boundary localization of buildings in dense scenes. Some studies have reduced the number of downsampling operations [50], [51] or relied on HRNet for information retention [52], achieving good results in segmentation tasks [53]. However, these networks do not retain rich spatial and scale information in the encoding stage. ...
... UNet [9] is a typical encoder-decoder structure. HRNet [52] is a multi-branch structure that has been widely applied in the remote sensing field. SegNet [76] is a network that records pooling indices for alignment. ...
Article
Full-text available
The rapid development of urban and rural construction has accelerated the demand for segmentation in dense building scenes. However, the issue of inaccurate building localization in such scenes still lacks effective solutions. One of the causes of this problem is the loss of high-frequency information and spatial misalignment caused by repeated sampling. To address this, this paper proposes the Spatial Enhancement and Recalibration Network (SERNet). SERNet divides the feature extraction process into three stages: spatial retention, enhancement, and recalibration. In the first stage, the designed Parallel Path Feature Extraction Architecture (PPFEA) uses a spatial path to retain spatial information and a contextual path to acquire deep semantic features. In the second stage, a spatial reinforcement module based on a perceptual kernel (PKSRM) is proposed. This module predicts the grouping kernel from spatial-path features, obtains the local perceptual kernel through intra-kernel similarity computation, and uses the kernel weights to strengthen local detail information after grouping. In the third stage, the Group Bootstrap Space Calibration Module (GBSCM) is designed: groups are set according to differences in scale variation, guidance information for bias prediction is provided by computing directional and distance differences between features, and high-resolution reconstruction of building features is finally realized through level-by-level calibration. Tested on three building datasets, Massachusetts, WHU Aerial, and WHU Satellite I, the IoU of the proposed method reaches 75.42%, 91.42%, and 67.24%, respectively. The code is available at https://github.com/khyibu/SERNet.
... Large-scale vision detection models are generally ineffective with small-scale training data, as these parameter-rich methods tend to overfit by memorizing the distribution of the small dataset. Therefore, small-scale detection models such as HRNet [15] are preferred. ...
... We employ different scales of ResNet [37], HRNet [15], and Vision Transformer [44] as our backbones to compare their ability to use generated images. Moreover, we use a top-down heatmap head introduced in [38], which contains several deconvolutional layers followed by a convolutional layer to generate heatmaps from low-resolution feature maps. ...
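A hedged sketch of the kind of top-down heatmap head the excerpt describes (a few deconvolutional layers followed by a final convolution that emits one heatmap per keypoint); the layer count, channel widths, and input size are assumptions, not the configuration of [38].

```python
# Sketch of a top-down heatmap head: a few transposed convolutions upsample the
# low-resolution backbone feature map, then a 1x1 conv emits one heatmap per
# keypoint. Sizes are illustrative, not the cited configuration.
import torch
import torch.nn as nn

def heatmap_head(in_channels=2048, mid_channels=256, num_keypoints=17,
                 num_deconv=3):
    layers = []
    c = in_channels
    for _ in range(num_deconv):
        layers += [nn.ConvTranspose2d(c, mid_channels, 4, stride=2, padding=1),
                   nn.BatchNorm2d(mid_channels),
                   nn.ReLU(inplace=True)]
        c = mid_channels
    layers.append(nn.Conv2d(mid_channels, num_keypoints, kernel_size=1))
    return nn.Sequential(*layers)

feat = torch.randn(1, 2048, 8, 6)          # e.g. a ResNet stage-5 output
heatmaps = heatmap_head()(feat)
print(heatmaps.shape)                      # torch.Size([1, 17, 64, 48])
```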
Preprint
Full-text available
Cephalometric landmark detection is essential for orthodontic diagnostics and treatment planning. Nevertheless, the scarcity of samples in data collection and the extensive effort required for manual annotation have significantly impeded the availability of diverse datasets. This limitation has restricted the effectiveness of deep learning-based detection methods, particularly those based on large-scale vision models. To address these challenges, we have developed an innovative data generation method capable of producing diverse cephalometric X-ray images along with corresponding annotations without human intervention. To achieve this, our approach initiates by constructing new cephalometric landmark annotations using anatomical priors. Then, we employ a diffusion-based generator to create realistic X-ray images that correspond closely with these annotations. To achieve precise control in producing samples with different attributes, we introduce a novel prompt cephalometric X-ray image dataset. This dataset includes real cephalometric X-ray images and detailed medical text prompts describing the images. By leveraging these detailed prompts, our method improves the generation process to control different styles and attributes. Facilitated by the large, diverse generated data, we introduce large-scale vision detection models into the cephalometric landmark detection task to improve accuracy. Experimental results demonstrate that training with the generated data substantially enhances the performance. Compared to methods without using the generated data, our approach improves the Success Detection Rate (SDR) by 6.5%, attaining a notable 82.2%. All code and data are available at: https://um-lab.github.io/cepha-generation
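The Success Detection Rate quoted above is commonly defined as the fraction of landmarks predicted within a fixed radius of the ground truth; the sketch below assumes that standard definition, and the 2 mm threshold and pixel spacing are illustrative values, not taken from the paper.

```python
# Success Detection Rate (SDR): share of landmarks whose predicted position is
# within a given radius of the ground truth. The 2 mm threshold is the common
# cephalometric convention, assumed here for illustration.
import numpy as np

def sdr(pred, gt, threshold_mm=2.0, pixel_spacing_mm=0.1):
    """pred, gt: (num_landmarks, 2) arrays of pixel coordinates."""
    dist_mm = np.linalg.norm(pred - gt, axis=1) * pixel_spacing_mm
    return float(np.mean(dist_mm <= threshold_mm))

pred = np.array([[100.0, 120.0], [200.0, 80.0]])
gt = np.array([[101.0, 118.0], [230.0, 80.0]])
print(sdr(pred, gt))  # 0.5  (first landmark within 2 mm, second is not)
```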
... Fusing information across different scales is also challenging. Some studies [21][22][23][24] focused on fusion methods. ...
Article
Full-text available
Image demoiréing is a complex image-restoration task because of the color and shape variations of moiré patterns. With the development of mobile devices, mobile phones can now be used to capture images at multiple resolutions. This difficulty increases when attempting to remove moiré from both low- and high-resolution images, as different resolutions make it challenging for existing methods to match the scales and textures of moiré. To solve these problems, we built a mixed attention residual module (MARM) by combining multi-scale feature extraction and mixed attention methods. Based on MARM, we propose a multi-scale adaptive mixed attention network (MA2Net) that can adapt to input images of different sizes and remove moiré of various shapes. Our model achieved the best results on four public datasets with resolutions ranging from 256×256 to 4k. Extensive experiments demonstrated the effectiveness of our model, which outperformed state-of-the-art methods by a large margin. We also conducted experiments on image deraining to validate the effectiveness of our model in other image-restoration tasks, and MA2Net achieved state-of-the-art performance on the Rain200H dataset.
... The key to fully exploiting the 2D Haar transform is to maintain different resolutions throughout the pipeline. Inspired by HR-Net [38], the module contains several branches, each accounting for a different resolution. It starts from a low-resolution branch and gradually expands the number of high-resolution branches. ...
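For reference, a minimal single-level 2D Haar decomposition is sketched below; it is a generic implementation of the transform the MR-Haar block builds on, not the cited module itself.

```python
# Single-level 2D Haar transform: averaging and differencing along both axes
# split an array into approximation (LL) and detail (LH, HL, HH) sub-bands.
# Generic reference sketch; the MR-Haar block in the cited work builds on this.
import numpy as np

def haar2d(x):
    """x: 2D array with even height and width."""
    # rows: low-pass (average) and high-pass (difference)
    lo = (x[:, 0::2] + x[:, 1::2]) / np.sqrt(2)
    hi = (x[:, 0::2] - x[:, 1::2]) / np.sqrt(2)
    # columns: same filters applied to both row outputs
    ll = (lo[0::2, :] + lo[1::2, :]) / np.sqrt(2)
    lh = (lo[0::2, :] - lo[1::2, :]) / np.sqrt(2)
    hl = (hi[0::2, :] + hi[1::2, :]) / np.sqrt(2)
    hh = (hi[0::2, :] - hi[1::2, :]) / np.sqrt(2)
    return ll, lh, hl, hh

x = np.arange(16, dtype=float).reshape(4, 4)
ll, lh, hl, hh = haar2d(x)
print(ll.shape)  # (2, 2)
```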
Preprint
Full-text available
The 3D human pose is vital for modern computer vision and computer graphics, and its prediction has drawn attention in recent years. 3D human pose prediction aims at forecasting a human's future motion from the previous sequence. State-of-the-art methods overlook that the variability of human motion sequences originates from transitions along both the temporal and spatial axes, which limits their performance and leads them to struggle with precise predictions in complex cases, e.g., arbitrary posing or greeting. To alleviate this problem, a network called HaarMoDic is proposed in this paper, which utilizes the 2D Haar transform to project joints to higher-resolution coordinates where the network can access spatial and temporal information simultaneously. An ablation study shows that the main contributing module within the HaarMoDic network is the Multi-Resolution Haar (MR-Haar) block. Instead of mining one of the two axes or extracting them separately, the MR-Haar block projects whole motion sequences to a mixed-up, higher-resolution coordinate with the 2D Haar transform, allowing the network to draw on information from both axes at different resolutions. With the MR-Haar block, the HaarMoDic network can make predictions referring to a broader range of information. Experimental results demonstrate that HaarMoDic surpasses state-of-the-art methods in every testing interval on the Human3.6M dataset in the Mean Per Joint Position Error (MPJPE) metric.
... To validate the effectiveness of the proposed enhancements, comparative experiments were conducted on the dataset in this study with models such as DeepLabv3 [32], DeepLabv3+, U-Net, HRNet [33], FCN, PSPNet, and LR-ASPP [34]. All models were trained on the same dataset, with experimental settings and parameters taken from the respective papers without modification, except for the number of epochs. ...
... On Endoscapes2023 dataset, the compared methods consist of four fine-tuned segmentation models, namely Mask-RCNN [14], Cascade Mask-RCNN [15], Mask2Former [9], and MaskDINO [10]. On EndoVis2018 dataset, the compared methods include four categories: (1) Three reported methods in the 2018 Robot Scene Segmentation Challenge like NCT, UNC, and OTH [13]; (2) Multiscale feature fusion networks such as U-Net [16], DeepLabv3+ [4], UPerNet [17], and HRNet [18]; (3) Fine-tuned Vision transformer models, Mask2Former and MaskDINO; (4) Specific-designed segmentation methods, such as SegFormer [19], SegNeXt [20], STswinCL [8], and LSKANet [6]. ...
Chapter
Full-text available
Surgical scene segmentation is a fundamental task in robotic-assisted laparoscopic surgery understanding. It often contains various anatomical structures and surgical instruments, where similar local textures and fine-grained structures make segmentation a difficult task. The vision-specific transformer method is a promising way for surgical scene understanding. However, there are still two main challenges. Firstly, the absence of inner-patch information fusion leads to poor segmentation performance. Secondly, the specific characteristics of the anatomy and instruments are not specifically modeled. To tackle the above challenges, we propose a novel Transformer-based framework with an Asymmetric Feature Enhancement module (TAFE), which enhances local information and then actively fuses the improved feature pyramid into the embeddings from transformer encoders by a multiscale interaction attention strategy. The proposed method outperforms the SOTA methods in several different surgical segmentation tasks and additionally proves its ability to recognize fine-grained structure. Code is available at https://github.com/cyuan-sjtu/ViT-asym.
... When sufficient data is available, incorporating larger backbone networks, such as HRNet (J. Wang et al., 2019), ViT (Dosovitskiy et al., 2020), or Swin (Z. Liu et al., 2021), can further enhance model performance, achieving superior analytical results. ...
Preprint
This paper presents a systematic solution for the intelligent recognition and automatic analysis of microscopy images. We developed a data engine that generates high-quality annotated datasets through a combination of the collection of diverse microscopy images from experiments, synthetic data generation and a human-in-the-loop annotation process. To address the unique challenges of microscopy images, we propose a segmentation model capable of robustly detecting both small and large objects. The model effectively identifies and separates thousands of closely situated targets, even in cluttered visual environments. Furthermore, our solution supports the precise automatic recognition of image scale bars, an essential feature in quantitative microscopic analysis. Building upon these components, we have constructed a comprehensive intelligent analysis platform and validated its effectiveness and practicality in real-world applications. This study not only advances automatic recognition in microscopy imaging but also ensures scalability and generalizability across multiple application domains, offering a powerful tool for automated microscopic analysis in interdisciplinary research.
... If more redundant feature maps and a larger receptive field for effective feature extraction are needed, more ghost convolutions must be stacked. Drawing inspiration from the recent success of RT-DETR [44], Fig. 5 shows how we adopt the HG block [43] from its backbone and replace its convolution operations with ghost convolutions. Furthermore, as shown in Fig. 5, we propose the GCHG block, an improved version, to enhance the extraction of advanced semantic characteristics. ...
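A hedged sketch of the ghost convolution idea referenced in the excerpt, following the standard GhostNet formulation (a primary convolution plus cheap depthwise operations whose outputs are concatenated); the ratio and kernel sizes are assumptions rather than the cited GCHG block.

```python
# Sketch of a ghost convolution: a primary convolution produces half the output
# channels, and a cheap depthwise convolution generates the remaining "ghost"
# feature maps, which are concatenated. Follows the GhostNet idea; parameters
# are illustrative.
import torch
import torch.nn as nn

class GhostConv(nn.Module):
    def __init__(self, c_in, c_out, kernel_size=1, ratio=2, dw_size=3):
        super().__init__()
        c_primary = c_out // ratio
        self.primary = nn.Sequential(
            nn.Conv2d(c_in, c_primary, kernel_size, padding=kernel_size // 2,
                      bias=False),
            nn.BatchNorm2d(c_primary), nn.ReLU(inplace=True))
        self.cheap = nn.Sequential(
            nn.Conv2d(c_primary, c_out - c_primary, dw_size,
                      padding=dw_size // 2, groups=c_primary, bias=False),
            nn.BatchNorm2d(c_out - c_primary), nn.ReLU(inplace=True))

    def forward(self, x):
        y = self.primary(x)
        return torch.cat([y, self.cheap(y)], dim=1)

out = GhostConv(64, 128)(torch.randn(1, 64, 32, 32))
print(out.shape)  # torch.Size([1, 128, 32, 32])
```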
Article
Full-text available
Recent years have seen a surge in interest in object detection on remote sensing images for applications such as surveillance and management. However, challenges like small object detection, scale variation, and the presence of closely packed objects in these images hinder accurate detection. Additionally, the motion blur effect further complicates the identification of such objects. To address these issues, we propose enhanced YOLOv9 with a transformer head (YOLOv9-TH). The model introduces an additional prediction head for detecting objects of varying sizes and swaps the original prediction heads for transformer heads to leverage self-attention mechanisms. We further improve YOLOv9-TH using several strategies, including data augmentation, multi-scale testing, multi-model integration, and the introduction of an additional classifier. The cross-stage partial (CSP) method and the ghost convolution hierarchical graph (GCHG) are combined to improve detection accuracy by better utilizing feature maps, widening the receptive field, and precisely extracting multi-scale objects. Additionally, we incorporate the E-SimAM attention mechanism to address low-resolution feature loss. Extensive experiments on the VisDrone2021 and DIOR datasets demonstrate the effectiveness of YOLOv9-TH, showing good improvement in mAP compared to the best existing methods. The YOLOv9-THe achieved 54.2% of mAP50 on the VisDrone2021 dataset and 92.3% of mAP on the DIOR dataset. The results confirm the model's robustness and suitability for real-world applications, particularly for small object detection in remote sensing images.
... Although the choice of pose estimation model introduces some variation in the results, these differences are relatively minor when contrasted with the impact of the detector. For instance, ViTPose exhibits slightly higher recall values compared to HRNet [40] and SimCC [41], but the overall performance differences are slight. This suggests that, although pose estimation models do play a role in the final outcomes, the influence of their choice is significantly less substantial than that of the detector. ...
Article
Full-text available
This paper presents an AI framework for automated detection of personal protective equipment (PPE) compliance in complex construction and industrial environments. Ensuring health and safety standards is essential for protecting workers engaged in construction, repair, or inspection activities. The framework leverages deep learning techniques for worker detection and pose estimation to enable accurate PPE identification under challenging conditions. The framework components are replaceable, and employ the InternImage-L detector for worker detection, ViTPose for pose estimation, and YOLOv7 for PPE recognition. A duplicate removal stage, combined with pose information, ensures PPE items are accurately assigned to individual workers. The approach addresses challenges like shadows, partial occlusions, or densely grouped workers. Evaluated on diverse datasets from real-world industrial settings, the framework achieves competitive precision and recall, particularly for critical PPE like helmets and vests, demonstrating robustness for safety monitoring and proactive risk management.
... 1) HRNet-W48 [18]: A model composed of parallel branches that maintain feature maps at various resolutions. It thus preserves high-resolution features via multi-resolution fusion, where feature maps are exchanged via bilateral connections. ...
Conference Paper
Full-text available
Human pose estimation (HPE) models based on RGB images are widely used in applications such as surveillance, sports analytics, and healthcare. However, they often overlook privacy requirements mandated by data protection laws worldwide. One approach to ensure privacy is to use alternative data types such as LiDAR, WiFi signals, or depth images, but these can result in reduced accuracy compared to RGB-based models. A practical alternative is to apply obfuscation techniques to RGB images, which can help preserve privacy while retaining the benefits of high-resolution visual data. This paper investigates how HPE models perform under different types and intensities of image obfuscation. We focus on two common methods: Gaussian blur and pixelation, applied at multiple intensity levels. We evaluated three state-of-the-art HPE models, HRNet-W48, ViTPose Base, and RTMPose-l, in the COCO dataset, analyzing their robustness against obfuscated images. Key questions include how performance degrades with increasing obfuscation, which body parts are most affected, and the obfuscation thresholds where model fine-tuning becomes necessary. To complement this performance evaluation, we also show heatmaps to explain how obfuscation impacts the model focus on different body parts. Our study shows how to balance privacy and accuracy in HPE systems, offering practical guidance on using obfuscation techniques with existing RGB models to enhance privacy while maintaining performance without retraining.
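A small sketch of the two obfuscation methods the study applies, Gaussian blur and pixelation at adjustable intensities, using OpenCV; the kernel and block sizes are illustrative, not the intensity levels evaluated in the paper.

```python
# Gaussian blur and pixelation at adjustable intensities, the two obfuscation
# methods compared in the cited study. Parameter values are illustrative.
import cv2

def gaussian_obfuscate(img, kernel_size=15):
    # kernel_size must be odd; larger values blur more aggressively
    return cv2.GaussianBlur(img, (kernel_size, kernel_size), 0)

def pixelate(img, block_size=16):
    h, w = img.shape[:2]
    # downscale so each block becomes one pixel, then upscale with nearest neighbour
    small = cv2.resize(img, (max(1, w // block_size), max(1, h // block_size)),
                       interpolation=cv2.INTER_LINEAR)
    return cv2.resize(small, (w, h), interpolation=cv2.INTER_NEAREST)

img = cv2.imread("person.jpg")            # any RGB image of a person
blurred = gaussian_obfuscate(img, 31)
pixelated = pixelate(img, 24)
```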
... The combination of the aforementioned software and hardware provides strong support for our research projects. In our human pose estimation tasks, we compared several widely used pose estimation models, specifically OpenPose, DeepCut, AlphaPose, HRNet [36], FasterPose [37], TransPose [38], and YOLOv8-pose. In this experiment, we uniformly selected 300 epochs, a batch size of 32, an initial learning rate (lr0) of 0.01, and a final learning rate of 0.1. ...
Article
Full-text available
Pose estimation is a crucial task in the field of human motion analysis, and detecting poses is a topic of significant interest. Traditional detection algorithms are not only time-consuming and labor-intensive but also suffer from deficiencies in accuracy and objectivity. To address these issues, we propose an improved pose estimation algorithm based on the YOLOv8 framework. By incorporating a novel attention mechanism, SimDLKA, into the original YOLOv8 model, we enhance the model’s ability to selectively focus on input data, thereby improving its decoupling and flexibility. In the feature fusion module of YOLOv8, we replace the original Bottleneck module with the SimDLKA module and integrate it with the C2F module to form the C2F-SimDLKA structure, which more effectively fuses global semantics, especially for medium to large targets. Furthermore, we introduce a new loss function, DCIOU, based on the YOLOv8 loss function, to improve the forward propagation of model training. Results indicate that our new loss function reduces the loss value by 3–5 compared to other loss functions. Additionally, we have independently constructed a large-scale pose estimation dataset, HP, employing various data augmentation strategies, and utilized the open-source COCO and MPII datasets for model training. Experimental results demonstrate that, compared to the traditional YOLOv8, our improved YOLOv8 algorithm increases the mAP value on the pose estimation dataset by 2.7% and the average frame rate by approximately 3 frames per second. This method provides a valuable reference for pose detection in pose estimation.
... To verify the effectiveness of leveraging the text SDF map in scene text segmentation, we compare the proposed method with the existing semantic segmentation methods (Semantic FPN [27], PSPNet [33], DeeplabV3+ [19], HRNetV2-W48, HRNetV2-W48-OCR [34], DDP [35], and Mask2Former [36]) and scene text segmentation methods (SMANet [25], TexR-Net [9], ARMNet [7], Textformer [26], Dang et al. [16], [17], Yu et al. [8], and WASNet [37]). Table 2 illustrates the performance comparisons on three different scene text segmentation datasets, ICDAR13 [31], Total-Text [32], and TextSeg [9]. ...
Article
Full-text available
Scene text segmentation is to predict pixel-wise text regions from an image, enabling in-image text editing or removal. One of the primary challenges is to remove noises including non-text regions and predict intricate text boundaries. To deal with that, traditional approaches utilize a text detection or recognition module explicitly. However, they are likely to highlight noise around the text. Because they did not sufficiently consider the boundaries of text, they fail to accurately predict the fine details of text. In this paper, we introduce leveraging text signed distance function (SDF) map, which encodes distance information from text boundaries, in scene text segmentation to explicitly provide text boundary information. By spatial cross attention mechanism, we encode the text-attended feature from the text SDF map. Then, both visual and text-attended features are utilized to decode the text segmentation map. Our approach not only mitigates confusion between text and complex backgrounds by eliminating false positives such as logos and texture blobs located far from the text, but also effectively captures fine details of complex text patterns by leveraging text boundary information. Extensive experiments demonstrate that leveraging text SDF map in scene text segmentation provides superior performances on various scene text segmentation datasets.
... Various networks, including fully convolutional networks (FCNs) [18], U-Net [19], DenseNet [20], ResNet [21], and HRNet [22], have been employed for classification and segmentation tasks, particularly for building footprint extraction. ...
Article
Full-text available
The analysis of aerial and satellite images for building footprint detection is one of the major challenges in photogrammetry and remote sensing. This information is useful for various applications, such as urban planning, disaster monitoring, and 3D city modeling. However, it has become a significant challenge due to the diverse characteristics of buildings, such as shape, size, and shadow interference. This study investigated the simultaneous use of aerial and satellite images to improve the accuracy of deep learning models in building footprint detection. For this purpose, aerial images with a spatial resolution of 30 cm and Sentinel-2 satellite imagery were employed. Several satellite-derived spectral indices were extracted from the Sentinel-2 image. Then, U-Net models combined with ResNet-18 and ResNet-34 were trained on these data. The results showed that the combination of the U-Net model with ResNet-34, trained on a dataset obtained by integrating aerial images and satellite indices, referred to as RGB–Sentinel–ResNet34, achieved the best performance among the evaluated models. This model attained an accuracy of 96.99%, an F1-score of 90.57%, and an Intersection over Union of 73.86%. Compared to other models, RGB–Sentinel–ResNet34 showed a significant improvement in accuracy and generalization capability. The findings indicated that the simultaneous use of aerial and satellite data can substantially enhance the accuracy of building footprint detection.
Article
Off‐road autonomous vehicles (OAVs) are becoming increasingly popular for navigating challenging environments in agriculture, military, and exploration applications. These vehicles face unique challenges, such as unpredictable terrain, dynamic obstacles, and varying environmental conditions. Therefore, it is essential to have an efficient terrain classification system to ensure safe and efficient operation of OAVs. This paper provides an overview of recent advances and emerging trends in off‐road terrain classification methods. Through a comprehensive literature review, this study explores the use of sensor modalities and techniques that leverage both appearance and geometry of the terrain for classification tasks. The study discusses learning‐based approaches, particularly deep learning, and highlights the integration of multiple sensor modalities through hybrid multimodal techniques. Finally, this study reviews the available off‐road datasets and explores the use cases and applications of terrain classification across various autonomous domains. Given the rapid advancements in terrain classification, this paper organizes and surveys to provide a comprehensive overview. By offering a structured review of the current landscape, this paper significantly enhances our understanding of terrain classification in unstructured environments, while also highlighting important areas for future research, particularly in deep‐learning‐based advancements.
Preprint
Full-text available
The field of Fake Image Detection and Localization (FIDL) is highly fragmented, encompassing four domains: deepfake detection (Deepfake), image manipulation detection and localization (IMDL), artificial intelligence-generated image detection (AIGC), and document image manipulation localization (Doc). Although individual benchmarks exist in some domains, a unified benchmark for all domains in FIDL remains blank. The absence of a unified benchmark results in significant domain silos, where each domain independently constructs its datasets, models, and evaluation protocols without interoperability, preventing cross-domain comparisons and hindering the development of the entire FIDL field. To close the domain silo barrier, we propose ForensicHub, the first unified benchmark & codebase for all-domain fake image detection and localization. Considering drastic variations on dataset, model, and evaluation configurations across all domains, as well as the scarcity of open-sourced baseline models and the lack of individual benchmarks in some domains, ForensicHub: i) proposes a modular and configuration-driven architecture that decomposes forensic pipelines into interchangeable components across datasets, transforms, models, and evaluators, allowing flexible composition across all domains; ii) fully implements 10 baseline models, 6 backbones, 2 new benchmarks for AIGC and Doc, and integrates 2 existing benchmarks of DeepfakeBench and IMDLBenCo through an adapter-based design; iii) conducts indepth analysis based on the ForensicHub, offering 8 key actionable insights into FIDL model architecture, dataset characteristics, and evaluation standards. ForensicHub represents a significant leap forward in breaking the domain silos in the FIDL field and inspiring future breakthroughs.
Thesis
Full-text available
Semantic segmentation of aerial images is vital for Unmanned Aerial Vehicle (UAVs) applications, such as land cover mapping, surveillance, and identifying flood-affected areas for effective natural disaster management and flood impact mitigation. Traditional CNN-based techniques encounter significant challenges in retaining specific information from deeper layers. Moreover, existing transformer-based architectures often demand high computational resources or produce single-scale, low-resolution features. To address these limitations, we proposed a novel transformer-based model named SwinSegFormer that harnesses the strengths of SegFormer and Swin Transformer (SwinT). Our model was trained on the FloodNet dataset and benchmark evaluations, focusing on challenging classes such as vehicles, pools, and flooded and non-flooded roads, which are crucial to segment for effective disaster management. This potentially allows our model to be utilized in first aid activities during floods. The proposed model achieved notable results with a validation mIoU of 71.99%, mDice of 82.86%, and mAcc of 82.69%. This represents an 8-10% improvement compared to state-of-the-art methods.
Article
Full-text available
Images captured from a drone’s perspective are significantly impacted in terms of target detection algorithm performance due to the notable differences in target scales and the presence of numerous small target objects lacking detailed information. This paper proposes a Remote Sensing Small Target Detector (CF-YOLO) based on the YOLOv11 model to address the challenges of small target detection. Firstly, addressing the issue of small target information loss that may arise from hierarchical convolutional structures, we conduct in-depth research on the Path Aggregation Network (PAN) and innovatively propose a Cross-Scale Feature Pyramid Network (CS-FPN). Secondly, to overcome the problems of positional information deviation and feature redundancy during multi-scale feature fusion, we design a Feature Recalibration Module (FRM) and a Sandwich Fusion Module. We advocate for initial feature fusion through the FRM module, followed by feature enhancement using the Sandwich module. Finally, we optimize and reconstruct the model using the RFAConv module and LSDECD detection head. Experiments show that on the public VisDrone dataset, TinyPerson dataset, and HIT-UAV dataset, CF-YOLO improves the mAP50 by 12.7%, 10.1%, and 3.5%, respectively, compared to the baseline model. Compared to other methods, CF-YOLO demonstrates superior performance.
Article
Full-text available
Landslide mapping is critically important for providing detailed spatial information on hazard extent in a timely manner that ultimately contributes to the protection of human lives and critical infrastructure. In the context of increasing demands for scalable and automated solutions, Earth Observation (EO) data coupled with deep learning offer great potential to enhance the speed and accuracy of emergency mapping. This study explores the utility of a deep learning model with the U-Net architecture for automated landslide mapping using data from optical Sentinel-2 and Synthetic Aperture Radar (SAR) Sentinel-1 satellites. We investigate the effectiveness of various optical (visible, near-infrared, and short-wave infrared) and SAR-derived features (backscatter coefficients, polarimetric features, interferometric coherence), used both independently and in combination. Additionally, we assess the impact of increasing the number of pre-/post-event SAR observations on classification performance. The U-Net models are trained and tested using globally distributed and limited reference data (563 unique patches). Optical features consisted of one pre-/post-event feature, whereas SAR features had three for each reference sample. Our analysis shows that the highest classification accuracies are consistently achieved using optical features (F1-score of 0.83 with visible, near-, and short-wave infrared bands). No substantial improvements were recorded when SAR features were combined with optical features. The usage of the most common optical features (visible and near-infrared) shows the lowest accuracies compared to their combination of short-wave infrared or red-edge bands. Increasing the number of pre-/post-event SAR features improves the SAR-based accuracies. To promote further advancements in automated landslide mapping using deep learning, the landslide reference dataset generated in this study is freely available at (https://doi.org/10.5281/zenodo.15284357).
Article
Transmission lines are an important component of the power transmission network, and their defects directly affect the safety and stability of the power system. This article proposes an improved YOLOv8 model based on the IRMB-SWC module to efficiently detect defects of transmission lines. Firstly, the C2f-iRMB-SWC module adopts the Concatenate to Fusion (C2f) structure of feature fusion, which enhances the integration ability of multi-level features and extracts richer feature information. Secondly, the module introduces the Improved Reception Field Multi Branch (iRMB) design, which extends the receptive field through a multi-branch structure, enabling the model to capture more detailed local features and significantly improve its ability to detect complex defects. Then, by combining the Squeeze and Excitation (SE) mechanism with Sliding Window Convolution (SWC) operations, the model's response to key features was further enhanced and feature representation was optimized. Finally, through experimental evaluation in the task of detecting defects in transmission lines, the improved YOLOv8 model showed a significant improvement in detection accuracy, with a 4.4% increase compared to the original YOLOv8 model. This validates the effectiveness of the IRMB-SWC module in practical applications and meets the accuracy requirements for wire defect detection.
Preprint
Full-text available
Accurate mapping of irrigation methods is crucial for sustainable agricultural practices and food systems. However, existing models that rely solely on spectral features from satellite imagery are ineffective due to the complexity of agricultural landscapes and limited training data, making this a challenging problem. We present Knowledge-Informed Irrigation Mapping (KIIM), a novel Swin-Transformer-based approach that uses (i) a specialized projection matrix to encode crop-to-irrigation probability, (ii) a spatial attention map to identify agricultural lands from non-agricultural lands, (iii) bi-directional cross-attention to focus complementary information from different modalities, and (iv) a weighted ensemble for combining predictions from images and crop information. Our experimentation on five states in the US shows up to 22.9% (IoU) improvement over baseline with a 71.4% (IoU) improvement for hard-to-classify drip irrigation. In addition, we propose a two-phase transfer learning approach to enhance cross-state irrigation mapping, achieving a 51% IoU boost in a state with limited labeled data. The ability to achieve baseline performance with only 40% of the training data highlights its efficiency, reducing the dependency on extensive manual labeling efforts and making large-scale, automated irrigation mapping more feasible and cost-effective.
Article
Accurate and robust relative pose estimation is the first step in ensuring the success of an active debris removal mission. This paper introduces a novel method to detect structural markers on the European Space Agency’s Environmental Satellite (ENVISAT) for safe de-orbiting using image processing and Convolutional Neural Networks (CNNs). Advanced image preprocessing techniques, including noise addition and blurring, are employed to improve marker detection accuracy and robustness from a chaser spacecraft. Additionally, we address the challenges posed by eclipse periods, during which the satellite’s corners are not visible, preventing measurement updates in the Unscented Kalman Filter (UKF). To maintain estimation quality in these periods of data loss, we propose a covariance-inflating approach in which the process noise covariance matrix is adjusted, reflecting the increased uncertainty in state predictions during the eclipse. This adaptation ensures more accurate state estimation and system stability in the absence of measurements. The initial results show promising potential for autonomous removal of space debris, supporting proactive strategies for space sustainability. The effectiveness of our approach suggests that our estimation method, combined with robust noise adaptation, could significantly enhance the safety and efficiency of debris removal operations by implementing more resilient and autonomous systems in actual space missions.
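A hedged sketch of the covariance-inflating idea described above: during the eclipse the filter only predicts, and the process noise covariance is scaled up to reflect the increased uncertainty. The scalar inflation factor and the filter interface (predict/update methods and a Q attribute) are assumptions for illustration, not the paper's implementation.

```python
# Sketch of covariance inflation during a measurement outage: when no corner
# measurements are available (eclipse), the UKF only predicts, and the process
# noise covariance Q is scaled up to reflect the extra uncertainty.
# `ukf` stands for any filter object exposing predict(), update(z), x, P, and Q;
# the inflation factor is an illustrative assumption.
def step(ukf, z, Q_nominal, inflation=5.0):
    if z is None:                      # eclipse: no measurement update
        ukf.Q = Q_nominal * inflation  # inflate process noise
        ukf.predict()
    else:
        ukf.Q = Q_nominal              # nominal noise when corners are visible
        ukf.predict()
        ukf.update(z)
    return ukf.x, ukf.P
```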
Article
Multi-person pose estimation is the task of detecting and regressing the keypoint coordinates of multiple people in a single image. Significant progress has been achieved in recent years, especially with the introduction of transformer-based end-to-end methods. In this paper, we present DualPose, a novel framework that enhances multi-person pose estimation by leveraging a dual-block transformer decoding architecture. Class prediction and keypoint estimation are split into parallel blocks so each sub-task can be separately improved and the risk of interference is reduced. This architecture improves the precision of keypoint localization and the model’s capacity to accurately classify individuals. To improve model performance, the Keypoint-Block uses parallel processing of self-attentions, providing a novel strategy that improves keypoint localization accuracy and precision. Additionally, DualPose incorporates a contrastive denoising (CDN) mechanism, leveraging positive and negative samples to stabilize training and improve robustness. Thanks to CDN, a variety of training samples are created by introducing controlled noise into the ground truth, improving the model’s ability to discern between valid and incorrect keypoints. DualPose achieves state-of-the-art results outperforming recent end-to-end methods, as shown by extensive experiments on the MS COCO and CrowdPose datasets. The code and pretrained models are publicly available.
Article
The types, distribution, trends, and scale of internal defects often reflect whether the welding process parameters are reasonable and indicate directions for optimization, and analyzing them can provide a scientific basis and guidance for addressing the gaps generated by various welding process parameters. To provide more of the necessary data and perspectives, this paper proposes a method for analyzing the size of internal welding defects using segmentation algorithms combined with phased-array ultrasonic technology. The design of the segmentation algorithm for welding phased-array images follows the idea of simplifying scenarios by downplaying deeper conceptual understanding. The proposed method also introduces the equilibrium value (EV) to represent the balance between over-segmentation and under-segmentation. To verify whether the complete method achieves the expected results, the paper demonstrates the plausibility of the existence of the EV and shows its relationship to the metric’s ability to balance positive and negative deviations. Multiple comparative experiments are also conducted to illustrate the feasibility and advantages of the proposed model, OCRNet-CPD, from various aspects; the model simplifies the whole system and potentially reduces computational complexity while ensuring segmentation accuracy. Finally, the proposed method was used to establish a relative size chart for specific weldments, offering the possibility of addressing issues in welding processes from both numerical and image perspectives.
Article
In rapidly urbanizing regions, encroachment on native green spaces has exacerbated ecological issues such as urban heat islands and flooding. Accurate mapping of tree species distribution is therefore vital for sustainable urban management. However, the high heterogeneity of urban landscapes, resulting from the coexistence of diverse land covers, built infrastructure, and anthropogenic activities, often leads to reduced robustness and transferability of remote sensing classification methods across different images and regions. In this study, we used very high–resolution Pléiades imagery and field-verified samples of eight common urban trees and background land covers. By employing transfer learning with advanced segmentation networks, we evaluated each model’s accuracy, robustness, and efficiency. The best-performing network delivered markedly superior classification consistency and required substantially less training time than a model trained from scratch. These findings offer concise, practical guidance for selecting and deploying deep learning methods in urban tree species mapping, supporting improved ecological monitoring and planning.
Article
Background Multicellular tumor spheroids (MCTS) are advanced cell culture systems for assessing the impact of combinatorial radio(chemo)therapy as they exhibit therapeutically relevant in vivo–like characteristics from 3-dimensional cell–cell and cell–matrix interactions to radial pathophysiological gradients. State-of-the-art assays quantify long-term curative endpoints based on collected brightfield image time series from large treated spheroid populations per irradiation dose and treatment arm. These analyses require laborious spheroid segmentation of up to 100,000 images per treatment arm to extract relevant structural information from the images (e.g., diameter, area, volume, and circularity). While several image analysis algorithms are available for spheroid segmentation, they all focus on compact MCTS with a clearly distinguishable outer rim throughout growth. However, they often fail for the common case of treated MCTS, which may partly be detached and destroyed and are usually obscured by dead cell debris. Results To address these issues, we successfully train 2 fully convolutional networks, UNet and HRNet, and optimize their hyperparameters to develop an automatic segmentation for both untreated and treated MCTS. We extensively test the automatic segmentation on larger, independent datasets and observe high accuracy for most images with Jaccard indices around 90%. For cases with lower accuracy, we demonstrate that the deviation is comparable to the interobserver variability. We also test against previously published datasets and spheroid segmentations. Conclusions The developed automatic segmentation can not only be used directly but also integrated into existing spheroid analysis pipelines and tools. This facilitates the analysis of 3-dimensional spheroid assay experiments and contributes to the reproducibility and standardization of this preclinical in vitro model.
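For reference, the Jaccard index used as the accuracy measure above is simply intersection over union between the predicted and reference masks; a minimal sketch on binary spheroid masks follows.

```python
# Jaccard index (intersection over union) between a predicted and a reference
# binary spheroid mask, the accuracy measure quoted above.
import numpy as np

def jaccard(pred, ref):
    pred, ref = pred.astype(bool), ref.astype(bool)
    intersection = np.logical_and(pred, ref).sum()
    union = np.logical_or(pred, ref).sum()
    return intersection / union if union else 1.0

pred = np.zeros((100, 100), dtype=bool); pred[20:80, 20:80] = True
ref = np.zeros((100, 100), dtype=bool); ref[25:85, 25:85] = True
print(round(jaccard(pred, ref), 3))  # ~0.725
```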
Article
Human pose estimation (HPE) is a field focused on estimating human poses by detecting key points in images. HPE includes methods like top-down and bottom-up approaches. The top-down approach uses a two-stage process, first locating and then detecting key points on humans with bounding boxes, whereas the bottom-up approach directly detects individual key points and integrates them to estimate the overall pose. In this article, we address the problem of bounding box detection inaccuracies in certain situations using the top-down method. The detected bounding boxes, which serve as input for the model, impact the accuracy of pose estimation. Occlusions occur when a part of the target’s body is obscured by a person or object and hinder the model’s ability to detect complete bounding boxes. Consequently, the model produces bounding boxes that do not recognize occluded parts, resulting in their exclusion from the input used by the HPE model. To mitigate this issue, we introduce the Restoring Occluded Mask Image for 2D Human Pose Estimation (ROM-Pose), comprising a restoration model and an HPE model. The restoration model is designed to delineate the boundary between the target’s grayscale mask (occluded image) and the blocker’s grayscale mask (occludee image) using the specially created Whole Common Objects in Context (COCO) dataset. Upon identifying the boundary, the restoration model restores the occluded image. This restored image is subsequently overlaid onto the RGB image for use in the HPE model. By integrating occluded parts’ information into the input, the bounding box includes these areas during detection, thus enhancing the HPE model’s ability to recognize them. ROM-Pose achieved a 1.6% improvement in average precision (AP) compared to the baseline.
Article
Full-text available
Semantic segmentation of high-resolution remote sensing imagery is pivotal in decision-making and analysis in a wide array of sectors, including but not limited to water management, agriculture, military operations, and environmental protection. This technique offers detailed and precise feature information, facilitating an accurate imagery interpretation. Despite its importance, existing methods often fall short as they lack a mechanism for spatial location feature screening. These methods tend to treat all extracted features on an equal footing, neglecting their spatial relevance. To overcome these shortcomings, we introduce a groundbreaking approach, the Spatially Adaptive Interaction Network (SAINet), designed for dynamic feature interaction in remote sensing semantic segmentation. SAINet integrates a spatial refinement module that leverages local context information to filter spatial locations and extract prominent regions. This enhancement allows the network to concentrate on pertinent areas, thereby improving the quality of feature representation. Furthermore, we present an innovative spatial interaction module that utilizes a spatial adaptive modulation mechanism. This mechanism dynamically selects and allocates spatial position weights, fostering effective interaction between local salient areas and global information, which in turn boosts the network’s segmentation performance. The adaptability of SAINet allows it to capture more informative features, leading to a significant improvement in segmentation accuracy. We have validated the effectiveness and capability of our proposed approach through experiments on widely recognized public datasets such as DeepGlobe, Vaihingen, and Potsdam.
Conference Paper
Full-text available
Semantic segmentation of surgical instruments plays a critical role in computer-assisted surgery. However, specular reflection and scale variation of instruments are likely to occur in the surgical environment, undesirably altering visual features of instruments, such as color and shape. These issues make semantic segmentation of surgical instruments more challenging. In this paper, a novel network, Pyramid Attention Aggregation Network, is proposed to aggregate multiscale attentive features for surgical instruments. It contains two critical modules: Double Attention Module and Pyramid Upsampling Module. Specifically, the Double Attention Module includes two attention blocks (i.e., position attention block and channel attention block), which model semantic dependencies between positions and channels by capturing joint semantic information and global contexts, respectively. The attentive features generated by the Double Attention Module can distinguish target regions, contributing to solving the specular reflection issue. Moreover, the Pyramid Upsampling Module extracts local details and global contexts by aggregating multi-scale attentive features. It learns the shape and size features of surgical instruments in different receptive fields and thus addresses the scale variation issue. The proposed network achieves state-of-the-art performance on various datasets. It achieves a new record of 97.10% mean IOU on Cata7. Besides, it comes first in the MICCAI EndoVis Challenge 2017 with 9.90% increase on mean IOU.
Conference Paper
Full-text available
This paper reviews the first NTIRE challenge on perceptual image enhancement with the focus on proposed solutions and results. The participating teams were solving a real-world photo enhancement problem, where the goal was to map low-quality photos from the iPhone 3GS device to the same photos captured with Canon 70D DSLR camera. The considered problem embraced a number of computer vision subtasks, such as image denoising, image resolution and sharpness enhancement, image color/contrast/exposure adjustment, etc. The target metric used in this challenge combined fidelity scores (PSNR and SSIM) with solutions' perceptual results measured in a user study. From above 200 registered participants, 13 teams submitted solutions for the final test phase of the challenge. The proposed solutions significantly improved baseline results, defining the state-of-the-art for practical image enhancement.
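One of the fidelity scores used in the challenge, PSNR, can be computed directly from the mean squared error between the enhanced photo and the DSLR reference; a short reference sketch is below (the max_val of 255 assumes 8-bit images).

```python
# Peak signal-to-noise ratio, one of the fidelity scores used in the challenge.
import numpy as np

def psnr(reference, enhanced, max_val=255.0):
    mse = np.mean((reference.astype(np.float64) - enhanced.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")   # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)
```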
Conference Paper
Full-text available
Context is essential for semantic segmentation. Due to the diverse shapes of objects and their complex layout in various scene images, the spatial scales and shapes of contexts for different objects vary greatly. It is thus ineffective or inefficient to aggregate context information from a predefined fixed region. In this work, we propose to generate a scale- and shape-variant semantic mask for each pixel to confine its contextual region. To this end, we first propose a novel paired convolution to infer the semantic correlation of a pixel pair and, based on that, to generate a shape mask. Using the inferred spatial scope of the contextual region, we propose a shape-variant convolution whose receptive field is controlled by the shape mask, which varies with the appearance of the input. In this way, the proposed network aggregates the context information of a pixel from its semantically correlated region instead of a predefined fixed region. Furthermore, this work also proposes a labeling denoising model to reduce wrong predictions caused by noisy low-level features. Without bells and whistles, the proposed segmentation network consistently achieves new state-of-the-art results on six public segmentation datasets.
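The core idea, aggregating context only from a semantically correlated region rather than a fixed window, can be illustrated with a toy numpy sketch in which the per-pixel masks are assumed to be given (in the paper they are predicted by the paired convolution); the learned convolution itself is not reproduced here.

    import numpy as np

    def masked_context(features, masks):
        """Toy sketch: per-pixel context as a mask-confined average of features.

        features: (H, W, C) feature map.
        masks:    (H, W, H, W) soft masks in [0, 1]; masks[i, j] marks the
                  semantically correlated region for pixel (i, j) (assumed given).
        """
        h, w, c = features.shape
        flat = features.reshape(h * w, c)
        m = masks.reshape(h * w, h * w)
        m = m / np.maximum(m.sum(axis=1, keepdims=True), 1e-6)   # normalise each region
        return (m @ flat).reshape(h, w, c)                       # context per pixel

    feats = np.random.rand(8, 8, 16)
    masks = (np.random.rand(8, 8, 8, 8) > 0.5).astype(float)
    print(masked_context(feats, masks).shape)                    # (8, 8, 16)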
Article
Full-text available
Feature pyramids are widely exploited by both state-of-the-art one-stage object detectors (e.g., DSSD, RetinaNet, RefineDet) and two-stage object detectors (e.g., Mask R-CNN, DetNet) to alleviate the problem arising from scale variation across object instances. Although these detectors achieve encouraging results, they have limitations because they simply construct the feature pyramid according to the inherent multi-scale, pyramidal architecture of backbones that were originally designed for object classification. In this work, we present the Multi-Level Feature Pyramid Network (MLFPN) to construct more effective feature pyramids for detecting objects of different scales. First, we fuse multi-level features (i.e., multiple layers) extracted by the backbone as the base feature. Second, we feed the base feature into a block of alternating joint Thinned U-shape Modules and Feature Fusion Modules and exploit the decoder layers of each U-shape module as features for detecting objects. Finally, we gather the decoder layers with equivalent scales (sizes) to construct a feature pyramid for object detection, in which every feature map consists of layers (features) from multiple levels. To evaluate the effectiveness of the proposed MLFPN, we design and train a powerful end-to-end one-stage object detector, which we call M2Det, by integrating it into the SSD architecture, and achieve better detection performance than state-of-the-art one-stage detectors. Specifically, on the MS COCO benchmark, M2Det achieves an AP of 41.0 at 11.8 FPS with single-scale inference and an AP of 44.2 with multi-scale inference, which are new state-of-the-art results among one-stage detectors. The code will be made available at https://github.com/qijiezhao/M2Det.
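A minimal sketch of the first step described above, fusing a shallow and a deep backbone level into a single base feature; the stack of Thinned U-shape Modules and Feature Fusion Modules is omitted, and the module name and channel sizes are illustrative assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class BaseFeatureFusion(nn.Module):
        """Sketch: upsample a deep level, concatenate with a shallow level, fuse by 1x1 conv."""
        def __init__(self, c_shallow, c_deep, c_out):
            super().__init__()
            self.fuse = nn.Conv2d(c_shallow + c_deep, c_out, kernel_size=1)

        def forward(self, shallow, deep):
            deep_up = F.interpolate(deep, size=shallow.shape[-2:], mode='bilinear',
                                    align_corners=False)
            return self.fuse(torch.cat([shallow, deep_up], dim=1))

    shallow = torch.randn(1, 256, 64, 64)   # e.g. a stride-8 backbone feature
    deep = torch.randn(1, 512, 32, 32)      # e.g. a stride-16 backbone feature
    print(BaseFeatureFusion(256, 512, 256)(shallow, deep).shape)  # (1, 256, 64, 64)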
Article
Full-text available
We propose CornerNet, a new approach to object detection in which we detect an object bounding box as a pair of keypoints, the top-left corner and the bottom-right corner, using a single convolutional neural network. By detecting objects as paired keypoints, we eliminate the need to design the set of anchor boxes commonly used in prior single-stage detectors. In addition to our novel formulation, we introduce corner pooling, a new type of pooling layer that helps the network better localize corners. Experiments show that CornerNet achieves a 42.2% AP on MS COCO, outperforming all existing one-stage detectors.
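Corner pooling is concrete enough to sketch: for a top-left corner, each location takes the maximum over everything to its right and the maximum over everything below it, and the two results are summed. A minimal numpy version (bottom-right corner pooling mirrors the directions):

    import numpy as np

    def top_left_corner_pool(feat):
        """Top-left corner pooling on a (H, W) feature map.

        Horizontal pass: max over all columns to the right of each position.
        Vertical pass:   max over all rows below each position.
        The pooled map is the sum of the two passes.
        """
        right_max = np.maximum.accumulate(feat[:, ::-1], axis=1)[:, ::-1]
        bottom_max = np.maximum.accumulate(feat[::-1, :], axis=0)[::-1, :]
        return right_max + bottom_max

    f = np.arange(12, dtype=float).reshape(3, 4)
    print(top_left_corner_pool(f))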
Chapter
Full-text available
We develop a robust multi-scale structure-aware neural network for human pose estimation. This method improves recent deep conv-deconv hourglass models with four key additions: (1) multi-scale supervision to strengthen contextual feature learning in matching body keypoints by combining feature heatmaps across scales, (2) a multi-scale regression network at the end to globally optimize the structural matching of the multi-scale features, (3) a structure-aware loss used in the intermediate supervision and at the regression stage to improve the matching of keypoints and their respective neighbors and to infer higher-order matching configurations, and (4) a keypoint masking training scheme that can effectively fine-tune the network to robustly localize occluded keypoints via adjacent matches. Our method can effectively improve state-of-the-art pose estimation methods that struggle with scale variation, occlusions, and complex multi-person scenarios. The multi-scale supervision tightly integrates with the regression network to (i) localize keypoints using the ensemble of multi-scale features and (ii) infer the global pose configuration by maximizing structural consistencies across multiple keypoints and scales. The keypoint masking training enhances these advantages by focusing learning on hard occlusion samples. Our method achieves the leading position on the MPII challenge leaderboard among state-of-the-art methods.
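Improvement (1), multi-scale supervision, amounts to summing a heatmap loss over several resolutions. A small torch sketch with a plain MSE heatmap loss, assuming ground-truth heatmaps are available at each scale; the structure-aware loss and regression network are not reproduced.

    import torch
    import torch.nn.functional as F

    def multi_scale_heatmap_loss(pred_heatmaps, gt_heatmaps):
        """Sum an MSE heatmap loss over the scales supervised by the network.

        pred_heatmaps / gt_heatmaps: lists of tensors, one (B, K, H_s, W_s) per scale.
        """
        return sum(F.mse_loss(p, g) for p, g in zip(pred_heatmaps, gt_heatmaps))

    preds = [torch.rand(2, 16, s, s) for s in (16, 32, 64)]   # 16 MPII keypoints
    gts = [torch.rand(2, 16, s, s) for s in (16, 32, 64)]
    print(multi_scale_heatmap_loss(preds, gts))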
Article
Region-based detectors like Faster R-CNN and R-FCN have achieved leading performance on object detection benchmarks. However, in Faster R-CNN, RoI pooling is used to extract features of each region, which may harm classification because RoI pooling loses spatial resolution; it also becomes slow when a large number of proposals is used. R-FCN is a fully convolutional structure that uses a position-sensitive pooling layer to extract the prediction score of each region, which speeds up the network by sharing computation across RoIs and prevents the feature map from losing information in RoI pooling. However, R-FCN cannot benefit from a fully connected layer (or global average pooling), which is what enables Faster R-CNN to utilize global context information. In this paper, we propose R-FCN++ to address this issue in two ways: first, we introduce a Global Context Module that improves the classification score maps by adopting large, separable convolutional kernels; second, we introduce a new pooling method that better extracts scores from the score maps by using row-wise or column-wise max pooling. Our approach achieves state-of-the-art single-model results on both the Pascal VOC and MS COCO object detection benchmarks: 87.3% on the Pascal VOC 2012 test set and 42.3% on the COCO 2015 test-dev set. Code will be made publicly available.
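The new pooling step can be sketched directly: instead of averaging a RoI's score map, take row-wise or column-wise maxima and then average them. The way the two directions are combined below is an assumption for illustration; the paper's exact choice may differ.

    import numpy as np

    def row_column_max_score(roi_scores):
        """Toy sketch of row-/column-wise max pooling on a single-class RoI score map (H, W)."""
        row_score = roi_scores.max(axis=1).mean()   # max over each row, then average
        col_score = roi_scores.max(axis=0).mean()   # max over each column, then average
        return 0.5 * (row_score + col_score)        # illustrative combination

    scores = np.random.rand(7, 7)                   # e.g. a 7x7 pooled score map for one class
    print(row_column_max_score(scores))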
Chapter
In this paper, we study the context aggregation problem in semantic segmentation. Motivated by the fact that the label of a pixel is the category of the object that the pixel belongs to, we present a simple yet effective approach, object-contextual representations, which characterizes a pixel by exploiting the representation of the corresponding object class. First, we learn object regions under the supervision of the ground-truth segmentation. Second, we compute the object region representation by aggregating the representations of the pixels lying in the object region. Last, we compute the relation between each pixel and each object region and augment the representation of each pixel with the object-contextual representation, which is a weighted aggregation of all the object region representations. We empirically demonstrate that our method achieves competitive performance on various benchmarks: Cityscapes, ADE20K, LIP, PASCAL-Context, and COCO-Stuff. Our submission "HRNet + OCR + SegFix" achieved 1st place on the Cityscapes leaderboard by the ECCV 2020 submission deadline. Code is available at: https://git.io/openseg and https://git.io/HRNet.OCR.
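The three steps map naturally onto a few tensor operations; the following is a minimal torch sketch, not the released implementation (which wraps each step in learned transforms). The soft object regions are assumed to come from coarse per-class scores.

    import torch

    def object_contextual_representation(pixel_feats, region_probs):
        """Sketch of object-contextual representations.

        pixel_feats:  (B, C, H, W) pixel representations.
        region_probs: (B, K, H, W) soft object-region maps (e.g. coarse class scores).
        Returns the per-pixel object-contextual representation (B, C, H, W).
        """
        b, c, h, w = pixel_feats.shape
        x = pixel_feats.flatten(2)                           # (B, C, HW)
        m = torch.softmax(region_probs.flatten(2), dim=-1)   # (B, K, HW), one weight map per region
        regions = torch.einsum('bkn,bcn->bkc', m, x)         # step 2: region representations
        affinity = torch.softmax(torch.einsum('bcn,bkc->bnk', x, regions), dim=-1)  # pixel-region relation
        context = torch.einsum('bnk,bkc->bcn', affinity, regions)                   # step 3: weighted aggregation
        return context.reshape(b, c, h, w)

    feats = torch.randn(1, 64, 32, 32)
    probs = torch.randn(1, 19, 32, 32)      # e.g. 19 Cityscapes classes
    print(object_contextual_representation(feats, probs).shape)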
Article
In object detection, the intersection over union (IoU) threshold is frequently used to define positives/negatives. The threshold used to train a detector defines its quality. While the commonly used threshold of 0.5 leads to noisy (low-quality) detections, detection performance frequently degrades for larger thresholds. This paradox of high-quality detection has two causes: 1) overfitting, due to vanishing positive samples for large thresholds, and 2) inference-time quality mismatch between detector and test hypotheses. A multi-stage object detection architecture, the Cascade R-CNN, composed of a sequence of detectors trained with increasing IoU thresholds, is proposed to address these problems. The detectors are trained sequentially, using the output of one detector as the training set for the next. This resampling progressively improves hypothesis quality, guaranteeing a positive training set of equivalent size for all detectors and minimizing overfitting. The same cascade is applied at inference to eliminate quality mismatches between hypotheses and detectors. An implementation of the Cascade R-CNN without bells or whistles achieves state-of-the-art performance on the COCO dataset and significantly improves high-quality detection on generic and specific object datasets, including VOC, KITTI, CityPersons, and WiderFace. Finally, the Cascade R-CNN is generalized to instance segmentation, with nontrivial improvements over Mask R-CNN.
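The cascade's core mechanism, relabeling proposals as positives/negatives with a stricter IoU threshold at each stage, fits in a few lines. The numpy sketch below uses the commonly cited 0.5/0.6/0.7 schedule and omits the per-stage box refinement heads; it is an illustration, not the paper's training code.

    import numpy as np

    def iou(boxes, gt):
        """IoU between (N, 4) boxes and a single ground-truth box, both as (x1, y1, x2, y2)."""
        x1 = np.maximum(boxes[:, 0], gt[0])
        y1 = np.maximum(boxes[:, 1], gt[1])
        x2 = np.minimum(boxes[:, 2], gt[2])
        y2 = np.minimum(boxes[:, 3], gt[3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
        area_g = (gt[2] - gt[0]) * (gt[3] - gt[1])
        return inter / (area_b + area_g - inter)

    # Toy cascade: each stage would refine the boxes (refinement omitted here) and
    # then relabel positives with a stricter IoU threshold.
    boxes = np.array([[0, 0, 10, 10], [2, 2, 12, 12], [6, 6, 18, 18]], dtype=float)
    gt = np.array([1, 1, 11, 11], dtype=float)
    for threshold in (0.5, 0.6, 0.7):            # increasing-quality stages
        positives = iou(boxes, gt) >= threshold
        print(threshold, positives)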
Conference Paper
Recently, learning-based algorithms for image inpainting have achieved remarkable progress on squared or irregular holes. However, they often fail to generate plausible textures inside the damaged area because surrounding information is lacking there. A progressive inpainting approach is advantageous for eliminating central blurriness: restore a region well, then update the mask. In this paper, we propose a full-resolution residual network (FRRN) to fill irregular holes, which proves effective for progressive image inpainting. We show that a well-designed residual architecture facilitates feature integration and texture prediction. Additionally, to guarantee completion quality during progressive inpainting, we adopt an N Blocks, One Dilation strategy, which assigns several residual blocks to each dilation step. Correspondingly, a step loss function is applied to improve the quality of the intermediate restorations. The experimental results demonstrate that the proposed FRRN framework for image inpainting is much better than previous methods both quantitatively and qualitatively.
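The progressive scheme fills the hole from its border inward; a minimal sketch of one mask "dilation step" is below, with the residual blocks and the per-step loss omitted. The step size of four pixels is an arbitrary illustrative choice.

    import numpy as np
    from scipy.ndimage import binary_dilation

    def next_mask(known_mask, step_pixels=4):
        """One progressive-inpainting step: grow the known region into the hole.

        known_mask: boolean (H, W) array, True where content is already valid.
        Only the ring between the old and new mask needs to be predicted at this
        step (the residual blocks and step loss are omitted in this sketch).
        """
        return binary_dilation(known_mask, iterations=step_pixels)

    mask = np.zeros((32, 32), dtype=bool)
    mask[:, :8] = True                      # left quarter is known, the rest is a hole
    for _ in range(3):                      # three progressive steps
        mask = next_mask(mask)
        print(int(mask.sum()), "known pixels")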
Article
Human parsing has received considerable interest due to its wide application potential. Nevertheless, it is still unclear how to develop an accurate human parsing system in an efficient and elegant way. In this paper, we identify several useful properties, including feature resolution, global context information, and edge details, and perform rigorous analyses to reveal how to leverage them to benefit the human parsing task. The advantages of these properties result in a simple yet effective Context Embedding with Edge Perceiving (CE2P) framework for single human parsing. Our CE2P is end-to-end trainable and can be easily adopted for multiple human parsing. Benefiting from the superiority of CE2P, we won 1st place on all three human parsing tracks in the 2nd Look into Person (LIP) Challenge. Without any bells and whistles, we achieved 56.50% (mIoU), 45.31% (mean AP^r), and 33.34% (AP^p at 0.5) in Track 1, Track 2, and Track 5, outperforming the state of the art by more than 2.06%, 3.81%, and 1.87%, respectively. We hope our CE2P will serve as a solid baseline and help ease future research in single/multiple human parsing. Code has been made available at https://github.com/liutinglt/CE2P.
Article
Taking feature pyramids into account has become a crucial way to boost object detection performance. While various pyramid representations have been developed, previous works are still inefficient at integrating semantic information across different scales. Moreover, recent object detectors struggle with accurate object localization, mainly due to the coarse definition of "positive" examples at the training and prediction phases. In this paper, we begin by analyzing current pyramid solutions and then propose a novel architecture that reconfigures the feature hierarchy in a flexible yet effective way. In particular, our architecture consists of two lightweight and trainable processes: global attention and local reconfiguration. The global attention emphasizes the global information of each feature scale, while the local reconfiguration captures the local correlations across different scales. Both are non-linear and thus exhibit more expressive ability. We then find that the loss function used to train object detectors is a central cause of the inaccurate localization problem, and we propose to address it by reshaping the standard cross-entropy loss so that it focuses more on accurate predictions. Both the feature reconfiguration and the consistent loss can be used in popular one-stage (SSD, RetinaNet) and two-stage (Faster R-CNN) detection frameworks. Extensive experimental evaluations on the PASCAL VOC 2007, PASCAL VOC 2012, and MS COCO datasets demonstrate that our models achieve consistent and significant gains compared with other state-of-the-art methods.
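One illustrative way to "reshape" cross entropy per example so that accurate (high-confidence, correct) predictions dominate the loss is to upweight each term by the predicted probability of the true class. This is only an illustration of the stated goal; the paper's exact reshaping is not reproduced here, and the exponent gamma is an arbitrary choice.

    import torch
    import torch.nn.functional as F

    def reweighted_cross_entropy(logits, targets, gamma=2.0):
        """Illustrative per-example reweighting of cross entropy (not the paper's exact form).

        The weight p_t ** gamma grows with the predicted probability of the true
        class, so accurate predictions contribute more to the loss and gradient.
        """
        log_p = F.log_softmax(logits, dim=1)
        log_p_t = log_p.gather(1, targets.unsqueeze(1)).squeeze(1)   # log-prob of true class
        p_t = log_p_t.exp()
        return -(p_t ** gamma * log_p_t).mean()

    logits = torch.randn(8, 21)             # e.g. 20 VOC classes + background
    targets = torch.randint(0, 21, (8,))
    print(reweighted_cross_entropy(logits, targets))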
Conference Paper
Many machine vision applications require predictions for every pixel of the input image (for example, semantic segmentation and boundary detection). Models for such problems usually consist of encoders, which decrease spatial resolution while learning a high-dimensional representation, followed by decoders, which recover the original input resolution and produce low-dimensional predictions. While encoders have been studied rigorously, relatively few studies address the decoder side. This paper therefore presents an extensive comparison of a variety of decoders for a variety of pixel-wise prediction tasks. Our contributions are: (1) decoders matter: we observe significant variance in results between different types of decoders on various problems; (2) we introduce a novel decoder, bilinear additive upsampling; (3) we introduce new residual-like connections for decoders; and (4) we identify two decoder types that give consistently high performance.
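Bilinear additive upsampling is concrete enough to sketch: upsample bilinearly, then sum every group of factor² consecutive channels so the total number of activations stays constant. A minimal torch version, written from that description rather than the authors' code:

    import torch
    import torch.nn.functional as F

    def bilinear_additive_upsample(x, factor=2):
        """Sketch of bilinear additive upsampling: upsample spatially, then sum every
        factor**2 consecutive channels so the total number of activations is preserved."""
        b, c, h, w = x.shape
        group = factor ** 2
        assert c % group == 0, "channel count must be divisible by factor**2"
        up = F.interpolate(x, scale_factor=factor, mode='bilinear', align_corners=False)
        return up.view(b, c // group, group, h * factor, w * factor).sum(dim=2)

    x = torch.randn(1, 64, 16, 16)
    print(bilinear_additive_upsample(x).shape)   # torch.Size([1, 16, 32, 32])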
Article
The de facto algorithm for facial landmark estimation involves running a face detector followed by deformable model fitting on the bounding box. This encompasses two basic problems: i) the detection and deformable fitting steps are performed independently, while the detector might not provide the best-suited initialization for the fitting step; ii) face appearance varies hugely across poses, which makes deformable face fitting very challenging, so distinct models have to be used (e.g., one for profile and one for frontal faces). In this work, we propose the first, to the best of our knowledge, joint multi-view convolutional network to handle large pose variations across faces in the wild and elegantly bridge the face detection and facial landmark localization tasks. Existing joint face detection and landmark localization methods focus only on a very small set of landmarks. By contrast, our method can detect and align a large number of landmarks for semi-frontal (68 landmarks) and profile (39 landmarks) faces. We evaluate our model on a plethora of datasets, including standard static image datasets such as IBUG, 300W, COFW, and the latest Menpo benchmark for both semi-frontal and profile faces. Significant improvement over state-of-the-art methods on deformable face tracking is observed on the 300VW benchmark. We also demonstrate state-of-the-art results for face detection on the FDDB and MALF datasets.
Chapter
Semantic segmentation requires both rich spatial information and a sizeable receptive field. However, modern approaches usually compromise spatial resolution to achieve real-time inference speed, which leads to poor performance. In this paper, we address this dilemma with a novel Bilateral Segmentation Network (BiSeNet). We first design a Spatial Path with a small stride to preserve spatial information and generate high-resolution features. Meanwhile, a Context Path with a fast downsampling strategy is employed to obtain a sufficient receptive field. On top of the two paths, we introduce a Feature Fusion Module to combine their features efficiently. The proposed architecture strikes a good balance between speed and segmentation performance on the Cityscapes, CamVid, and COCO-Stuff datasets. Specifically, for a 2048 × 1024 input, we achieve 68.4% mean IoU on the Cityscapes test set at 105 FPS on one NVIDIA Titan XP card, which is significantly faster than existing methods with comparable performance.
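A minimal sketch of the fusion step between the two paths, upsampling the context-path feature, concatenating, and reweighting the fused channels with a global attention vector. Module name, layer choices, and channel sizes are assumptions for illustration, not the released implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class FeatureFusion(nn.Module):
        """Sketch of BiSeNet-style fusion of the spatial- and context-path features."""
        def __init__(self, c_spatial, c_context, c_out):
            super().__init__()
            self.project = nn.Conv2d(c_spatial + c_context, c_out, kernel_size=1)
            self.attend = nn.Sequential(nn.Conv2d(c_out, c_out, 1), nn.Sigmoid())

        def forward(self, spatial_feat, context_feat):
            context_up = F.interpolate(context_feat, size=spatial_feat.shape[-2:],
                                       mode='bilinear', align_corners=False)
            fused = self.project(torch.cat([spatial_feat, context_up], dim=1))
            attn = self.attend(F.adaptive_avg_pool2d(fused, 1))   # channel attention vector
            return fused + fused * attn

    sp = torch.randn(1, 128, 64, 64)    # spatial path: high resolution, small stride
    cx = torch.randn(1, 256, 8, 8)      # context path: fast downsampling, large receptive field
    print(FeatureFusion(128, 256, 256)(sp, cx).shape)   # (1, 256, 64, 64)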
Chapter
We present a box-free bottom-up approach for the tasks of pose estimation and instance segmentation of people in multi-person images using an efficient single-shot model. The proposed PersonLab model tackles both semantic-level reasoning and object-part associations using part-based modeling. Our model employs a convolutional network which learns to detect individual keypoints and predict their relative displacements, allowing us to group keypoints into person pose instances. Further, we propose a part-induced geometric embedding descriptor which allows us to associate semantic person pixels with their corresponding person instance, delivering instance-level person segmentations. Our system is based on a fully-convolutional architecture and allows for efficient inference, with runtime essentially independent of the number of people present in the scene. Trained on COCO data alone, our system achieves COCO test-dev keypoint average precision of 0.665 using single-scale inference and 0.687 using multi-scale inference, significantly outperforming all previous bottom-up pose estimation systems. We are also the first bottom-up method to report competitive results for the person class in the COCO instance segmentation task, achieving a person category average precision of 0.417.
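The grouping step can be illustrated in a few lines: detect a confident seed keypoint and follow the predicted displacement field to the location of the next keypoint of the same person. The real method greedily iterates this over a keypoint tree with short-range refinement; the sketch below only shows a single hop on toy arrays.

    import numpy as np

    def follow_offset(heatmap, offsets):
        """Toy PersonLab-style grouping step.

        heatmap: (H, W) score map of a seed keypoint (e.g. the nose).
        offsets: (H, W, 2) predicted (dy, dx) displacement from the seed keypoint
                 to the next keypoint of the same person.
        Returns the seed location and the predicted location of the next keypoint.
        """
        seed = np.unravel_index(np.argmax(heatmap), heatmap.shape)
        target = np.round(np.array(seed) + offsets[seed]).astype(int)
        target = np.clip(target, 0, np.array(heatmap.shape) - 1)
        return seed, tuple(target)

    hm = np.zeros((32, 32)); hm[10, 12] = 1.0
    off = np.zeros((32, 32, 2)); off[10, 12] = (5.0, -3.0)
    print(follow_offset(hm, off))   # seed at (10, 12), next keypoint at (15, 9)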