Figure 3 - uploaded by Angtian Wang

# Example of robust bounding box voting results. Blue box: ground truth; red box: bounding box by Faster R-CNN; green box: bounding box generated by robustly combining voting results. Our proposed part-based voting mechanism generates probability maps (right) for the object center (cyan point), the top left corner (purple point) and the bottom right corner (yellow point) of the bounding box.


## Contexts in source publication

Context 1
... bounding box voting. While CompositionalNets can be generalized to localize partially occluded objects using our proposed detection layer, estimating the bounding box of an object under occlusion is more difficult because a significant amount of the object might not be visible (Figure 3). We propose to solve this problem by generalizing the part-based voting mechanism in CompositionalNets to vote for the bounding box corners in addition to the object center. ...
Context 2
... particular, we learn additional mixture components that model the expected feature activations $F$ around bounding box corners, $p(F_p \mid \Theta^c_y)$, where $c \in \{ct, bl, tr\}$ indexes the object center $ct$ and the two opposite bounding box corners $\{bl, tr\}$. Figure 3 illustrates the spatial likelihood maps $R^c$ of all three models. We generate a bounding box using the two points that have maximal likelihood. ...
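The last step in Context 2 (building a box from the two most likely points) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `peak` and `box_from_votes`, the nested-list representation of the likelihood maps, and the rule of mirroring a corner through the center when the center is one of the two selected points are all assumptions made for this example.

```python
from typing import Dict, List, Tuple

Grid = List[List[float]]          # 2D spatial likelihood map, indexed [row][col]
Point = Tuple[int, int]           # (row, col)
Box = Tuple[int, int, int, int]   # (x_min, y_min, x_max, y_max)


def peak(likelihood: Grid) -> Tuple[float, Point]:
    """Return the maximum value of a 2D map and its (row, col) location."""
    best, best_pt = likelihood[0][0], (0, 0)
    for r, row in enumerate(likelihood):
        for c, value in enumerate(row):
            if value > best:
                best, best_pt = value, (r, c)
    return best, best_pt


def box_from_votes(maps: Dict[str, Grid]) -> Box:
    """Build a box from the two most confident of the 'ct', 'bl', 'tr' peaks.

    Assumption (hypothetical rule for this sketch): if the center is one of
    the two selected points, the missing corner is obtained by mirroring the
    selected corner through the center. The result is not clipped to the image.
    """
    peaks = {name: peak(maps[name]) for name in ("ct", "bl", "tr")}
    # Rank the three models by the likelihood at their peak; keep the top two.
    first, second = sorted(peaks, key=lambda n: peaks[n][0], reverse=True)[:2]
    p, q = peaks[first][1], peaks[second][1]
    if "ct" in (first, second):
        center = peaks["ct"][1]
        corner = q if first == "ct" else p
        mirrored = (2 * center[0] - corner[0], 2 * center[1] - corner[1])
        p, q = corner, mirrored
    x_min, x_max = sorted((p[1], q[1]))
    y_min, y_max = sorted((p[0], q[0]))
    return x_min, y_min, x_max, y_max
```

Selecting only two of the three points means a heavily occluded corner, whose likelihood map has a weak peak, does not have to contribute to the final box.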

## Similar publications

Chapter
Monocular 3D object detection is a challenging task due to unreliable depth, resulting in a distinct performance gap between monocular and LiDAR-based approaches. In this paper, we propose a novel domain adaptation based monocular 3D object detection framework named DA-3Ddet, which adapts the feature from unsound image-based pseudo-LiDAR domain to...
Article
RoIPool/RoIAlign is an indispensable process for the typical two-stage object detection algorithm; it is used to rescale the object proposal cropped from the feature pyramid to generate a fixed-size feature map. However, these cropped feature maps of local receptive fields will heavily lose global context information. To tackle this problem, we pro...
Article
Object tracking has been one of the most active research directions in the field of computer vision. In this paper, an effective single-object tracking algorithm based on two-step spatiotemporal feature fusion is proposed, which combines deep learning detection with the kernelized correlation filtering (KCF) tracking algorithm. Deep learning detect...

## Citations

... Single-vehicle perception has made remarkable achievements in object detection [21, 31, 32], segmentation [29, 33], and other tasks with the advent of deep learning. However, single-vehicle perception often suffers from environmental conditions such as occlusion [39, 50] and severe weather [4, 16, 53], making accurate recognition challenging. To overcome these issues, several appealing studies have been devoted to collaborative perception [1, 2, 7, 42, 49], which takes advantage of sharing multiple viewpoints of the same scene through Vehicle-to-Vehicle (V2V) communication [3]. ...
... `norm4 = L2Norm(64)` `self.norm5 = L2Norm(32)` `def forward(self, x, x1, x2, x3, x4, enhancev1, enhance, batch, kdflag=0):` ...
Preprint
Multi-agent collaborative perception (MCP) has recently attracted much attention. It includes three key processes: communication for sharing, collaboration for integration, and reconstruction for different downstream tasks. Existing methods pursue designing the collaboration process alone, ignoring their intrinsic interactions and resulting in suboptimal performance. In contrast, we aim to propose a Unified Collaborative perception framework named UMC, optimizing the communication, collaboration, and reconstruction processes with the Multi-resolution technique. The communication introduces a novel trainable multi-resolution and selective-region (MRSR) mechanism, achieving higher quality and lower bandwidth. Then, a graph-based collaboration is proposed, conducting on each resolution to adapt the MRSR. Finally, the reconstruction integrates the multi-resolution collaborative features for downstream tasks. Since the general metric can not reflect the performance enhancement brought by MCP systematically, we introduce a brand-new evaluation metric that evaluates the MCP from different perspectives. To verify our algorithm, we conducted experiments on the V2X-Sim and OPV2V datasets. Our quantitative and qualitative experiments prove that the proposed UMC greatly outperforms the state-of-the-art collaborative perception approaches.
... For example, there are image classification [56,57] where the target object is overlapped by other objects, and facial expression recognition [58] where some facial parts are occluded by an object. To overcome this, many works tried to model the occlusion during training so that the trained network can robustly perform on its given task [59][60][61][62]. We try to model the lip occluded situation that frequently occurs when the speaker uses a mic or eats some food by using Naturalistic Occlusion Generation (NatOcc) of [61]. ...
Preprint
This paper deals with Audio-Visual Speech Recognition (AVSR) under multimodal input corruption situations where audio inputs and visual inputs are both corrupted, which is not well addressed in previous research directions. Previous studies have focused on how to complement the corrupted audio inputs with the clean visual inputs with the assumption of the availability of clean visual inputs. However, in real life, clean visual inputs are not always accessible and can even be corrupted by occluded lip regions or noises. Thus, we firstly analyze that the previous AVSR models are not indeed robust to the corruption of multimodal input streams, the audio and the visual inputs, compared to uni-modal models. Then, we design multimodal input corruption modeling to develop robust AVSR models. Lastly, we propose a novel AVSR framework, namely Audio-Visual Reliability Scoring module (AV-RelScore), that is robust to the corrupted multimodal inputs. The AV-RelScore can determine which input modal stream is reliable or not for the prediction and also can exploit the more reliable streams in prediction. The effectiveness of the proposed method is evaluated with comprehensive experiments on popular benchmark databases, LRS2 and LRS3. We also show that the reliability scores obtained by AV-RelScore well reflect the degree of corruption and make the proposed model focus on the reliable multimodal representations.
... In contrast with SSD MobileNet and YOLO, Faster RCNN is widely used for high-precision and safety-critical mobile robot applications. However, Faster RCNN is also weak at detecting low-feature or occluded objects and suffers from missed detections and false classifications [17]. Generally, indoor object detection is more challenging than outdoor object detection due to severe occlusions of objects, objects with fewer features, and cluttered backgrounds. ...
... The Soldier-Body Sensor Network (S-BSN) was proposed by Han et al. [33], where the network collects different types of data such as behaviours, physiology, emotions, fatigue, environments, and locations using wearable body sensors and performs multi-level fusion to analyze the soldier's health and raise alerts during extreme events. Wang et al. proposed context-aware compositional nets for detecting objects at different levels of occlusion [17]. The authors segmented the contextual information via bounding box annotations and used the segmented information to train the context-aware CompositionalNet. ...
Article
Hazardous object detection (escalators, stairs, glass doors, etc.) and avoidance are critical functional safety modules for autonomous mobile cleaning robots. Conventional object detectors have less accuracy for detecting low-feature hazardous objects and have miss detection, and the false classification ratio is high when the object is under occlusion. Miss detection or false classification of hazardous objects poses an operational safety issue for mobile robots. This work presents a deep-learning-based context-aware multi-level information fusion framework for autonomous mobile cleaning robots to detect and avoid hazardous objects with a higher confidence level, even if the object is under occlusion. First, the image-level-contextual-encoding module was proposed and incorporated with the Faster RCNN ResNet 50 object detector model to improve the low-featured and occluded hazardous object detection in an indoor environment. Further, a safe-distance-estimation function was proposed to avoid hazardous objects. It computes the distance of the hazardous object from the robot’s position and steers the robot into a safer zone using detection results and object depth data. The proposed framework was trained with a custom image dataset using fine-tuning techniques and tested in real-time with an in-house-developed mobile cleaning robot, BELUGA. The experimental results show that the proposed algorithm detected the low-featured and occluded hazardous object with a higher confidence level than the conventional object detector and scored an average detection accuracy of 88.71%.
... Other approaches focus on explicit faults of the sensor input, such as overexposure of the camera [16], noise [17], or occlusions in the images using compositional convolutional neural networks [18], [19]. Zhang and Wang [20] proposed an adversarial training approach to make object detectors more robust against adversarial attacks. ...
Conference Paper
Multimodal object detection fuses different sensors such as camera or LIDAR to improve the detection performance. However, individual sensor inputs can also be detrimental to a system, for example when sun glare hits a camera. In this work, we propose to monitor each sensor individually to predict when an input would lead to incorrect detections. We first train one detection network for each sensor separately, using only that sensor as input. Then, we record the performance for each single-sensor network and train an introspective performance prediction network for each sensor. Finally, we train a multimodal fusion network where we weight the impact of each sensor with its predicted performance. This allows us to dynamically adapt the fusion to reduce the influence of harmful sensor readings based only on the current data. We apply the proposed concept to the state-of-the-art AVOD architecture and evaluate on the KITTI data set. The proposed sensor monitoring system improves the mean intersection-over-union performance by 4.6%. For inputs with a low predicted performance, the proposed approach outperforms the state of the art by over 10%, demonstrating the potential of using individual sensor monitoring to react to problematic input. The proposed approach can be applied to any fusion network with two or more sensors and could also be used for classification or segmentation tasks.
... In computer vision research, the robustness of human vision has often been praised and regarded as golden standards for designing computer vision models [34,54]. These findings indeed inspire development of robust vision models, such as compositional, recurrent, and occlusion aware models [22,46,47]. In addition to specialty models, much of the idea of using invariant transforms to augment training samples come from the intuition and observation that human vision are robust against these transforms such as object translation, scaling, occlusion, photometric distortions, etc. ...
... As discussed in Section 2, occlusion robustness in both human vision [34,44,54] and computer vision [22,46,47] have been an important property for real world applications of vision models as objects. To assess the effect of soft augmentation on occlusion robustness of computer vision models, ResNet-50 models are tested with occluded ImageNet validation images (Figure 4 and Appendix Figure 7). ...
Preprint
Modern neural networks are over-parameterized and thus rely on strong regularization such as data augmentation and weight decay to reduce overfitting and improve generalization. The dominant form of data augmentation applies invariant transforms, where the learning target of a sample is invariant to the transform applied to that sample. We draw inspiration from human visual classification studies and propose generalizing augmentation with invariant transforms to soft augmentation where the learning target softens non-linearly as a function of the degree of the transform applied to the sample: e.g., more aggressive image crop augmentations produce less confident learning targets. We demonstrate that soft targets allow for more aggressive data augmentation, offer more robust performance boosts, work with other augmentation policies, and interestingly, produce better calibrated models (since they are trained to be less confident on aggressively cropped/occluded examples). Combined with existing aggressive augmentation strategies, soft target 1) doubles the top-1 accuracy boost across Cifar-10, Cifar-100, ImageNet-1K, and ImageNet-V2, 2) improves model occlusion performance by up to $4\times$, and 3) halves the expected calibration error (ECE). Finally, we show that soft augmentation generalizes to self-supervised classification tasks.
... Much of this state of affairs can be attributed to the fact that occlusions are treated as noise that must be overcome by robust measures [16,17,23,36,52,57]. There are several challenges that make this strategy hard to succeed. ...
... The performances of pedestrian and vehicle detection and segmentation improve significantly on all cameras. Like in [32,49,57], we report performances at different levels of occlusion and show that the performance drops more slowly as occlusion increases, compared to methods that do not use longitudinal self-supervision. Because of this, we achieve strong results in detecting and tracking objects as they pass each other, a common failure mode of existing approaches. ...
Conference Paper
Current methods for object detection, segmentation, and tracking fail in the presence of severe occlusions in busy urban environments. Labeled real data of occlusions is scarce (even in large datasets) and synthetic data leaves a domain gap, making it hard to explicitly model and learn occlusions. In this work, we present the best of both the real and synthetic worlds for automatic occlusion supervision using a large readily available source of data: time-lapse imagery from stationary webcams observing street intersections over weeks, months, or even years. We introduce a new dataset, Watch and Learn Time-lapse (WALT), consisting of 12 (4K and 1080p) cameras capturing urban environments over a year. We exploit this real data in a novel way to automatically mine a large set of unoccluded objects and then composite them in the same views to generate occlusions. This longitudinal self-supervision is strong enough for an amodal network to learn object-occluder-occluded layer representations. We show how to speed up the discovery of unoccluded objects and relate the confidence in this discovery to the rate and accuracy of training occluded objects. After watching and automatically learning for several days, this approach shows significant performance improvement in detecting and segmenting occluded people and vehicles, over human-supervised amodal approaches.
... Because of the small size of weapon objects and the distance from the CCTV, the weapons were partially or fully concealed in many frames. In similar research on occluded object detection, Wang et al. [57] created a dataset with nine occlusion levels over two dimensions. Researchers distinguished three degrees of object occlusion: FG-L1 (20-40%), FG-L2 (40-60%), and FG-L3 (60-80%). ...
... Similar research on occluded object detection [57] addressed the challenge of recognizing items under occlusion and found that traditional deep learning algorithms that mix proposal networks with classification networks cannot recognize partially occluded objects robustly, but the experimental results showed that occluded weapon detection can be resolved. While our weapon labels also included some partially occluded weapon objects, the detector was able to recognize more occluded weapons in the Tiling ACF Dataset, where the tiling methods helped in increasing partial weapon object recognition. ...
Article
Thailand, like other countries worldwide, has experienced instability in recent years. If current trends continue, the number of crimes endangering people or property will expand. Closed-circuit television (CCTV) technology is now commonly utilized for surveillance and monitoring to ensure people’s safety. A weapon detection system can help police officers with limited staff minimize their workload through on-screen surveillance. Since CCTV footage captures the entire incident scenario, weapon detection becomes challenging due to the small weapon objects in the footage. Because public datasets provide inadequate information for our scope of interest, weapon detection in CCTV images, an Armed CCTV Footage (ACF) dataset, consisting of self-collected mockup CCTV footage of pedestrians armed with pistols and knives, was collected for different scenarios. This study aimed to present an image tiling-based deep learning approach for small weapon object detection. The experiments were conducted on a public benchmark dataset (Mock Attack) to evaluate the detection performance. The proposed tiling approach achieved an mAP that was 10.22 times better. The image tiling approach was used to train different object detection models to analyze the improvement. On SSD MobileNet V2, the tiling ACF Dataset achieved an mAP of 0.758 on the pistol and knife evaluation. The proposed method for enhancing small weapon detection by using the tiling approach with our ACF Dataset can significantly enhance the performance of weapon detection.
... However, most prior work on 6D pose estimation focused on the "instance-level" task, where exact CAD models of the object instances are available [35,23,19,10,12]. Moreover, the few prior methods on "category-level" 6D pose estimation often either rely on a ground truth depth map [31,20], which are practically hard to obtain in many application areas, or rely on 2D bounding box proposals [38,28], which are not reliable in challenging occlusion scenarios [30] (see also our experimental results). ...
... The core problem of such a rendering-based approach to pose estimation is to search efficiently through the combinatorially large space of the 6D latent parameters, because the iterative rendering process is rather costly compared to simple feed-forward regression approaches. Related work addresses this problem by first estimating 2D object bounding boxes with a proposal network [18,31,12], but these are not reliable under partial occlusion and truncation [30]. Instead, we address this problem by extending neural mesh models with scale-invariant features and a coarse-to-fine render-and-compare optimization strategy, which retains the robustness to partial occlusion (Figure 1). ...
... Our experiments demonstrate that our model outperforms strong object detection and pose estimation baseline models. Our model further demonstrates exceptional robustness to partial occlusion compared to all baseline methods on the Occluded PASCAL3D+ dataset [30]. ...
Preprint
We consider the problem of category-level 6D pose estimation from a single RGB image. Our approach represents an object category as a cuboid mesh and learns a generative model of the neural feature activations at each mesh vertex to perform pose estimation through differentiable rendering. A common problem of rendering-based approaches is that they rely on bounding box proposals, which do not convey information about the 3D rotation of the object and are not reliable when objects are partially occluded. Instead, we introduce a coarse-to-fine optimization strategy that utilizes the rendering process to estimate a sparse set of 6D object proposals, which are subsequently refined with gradient-based optimization. The key to enabling the convergence of our approach is a neural feature representation that is trained to be scale- and rotation-invariant using contrastive learning. Our experiments demonstrate an enhanced category-level 6D pose estimation performance compared to prior work, particularly under strong partial occlusion.
... Hypernet method was proposed to integrate three different depth characteristics: shallow, medium and deep [28,47]. In order to better represent the features of the object, our proposed MSRNet integrates the features of four different depths: shallow, medium, deep and deeper [48]. The MFTRF structure is proposed by combining four different depth features [49][50][51]. ...
Article
Many residual network-based methods have been proposed to perform object detection. However, most of them may lead to overfitting or cannot perform well in small object detection. We propose a multiple spatial residual network (MSRNet) for object detection. In particular, our method is based on a central point detection algorithm. Our proposed MSRNet employs a residual network as the backbone. The resulting features are processed by our proposed residual channel pooling module. We then construct a multi-scale feature transposed residual fusion structure consisting of three overlapping stacked residual convolution modules and a transpose convolution function. Finally, we use the Center structure to process the high-resolution feature image to obtain the final prediction detection result. Experimental results on the PASCAL VOC and COCO datasets confirm that the MSRNet has competitive accuracy compared with several other classical object detection algorithms, while providing a unified framework for training and reasoning. The MSRNet runs on GeForce RTX 2080Ti.
... The occluding image test is shown in Figure 10. In most cases, masking could cause some feature points to disappear; for example, the ankle joint could not be distinguished [33,34]. However, the predicted angle value could be correctly output in most cases. ...
Article
Distance and depth detection plays a crucial role in intelligent robotics. It enables drones to understand their working environment to avoid collisions and accidents immediately and is very important in various AI applications. Image-based distance detection usually relies on the correctness of geometric information. However, the geometric features will be lost when the object is rotated or the camera lens image is distorted. This study proposes a training model based on a convolutional neural network, which uses a single-lens camera to estimate humans' distance in continuous images. We can partially restore depth information loss using built-in camera parameters that do not require additional correction. The normalized skeleton feature unit vector has the same characteristics as time series data and can be classified very well using a 1D convolutional neural network. According to our results, the accuracy for the occluded leg image is over 90% at 2 to 3 m, 80% to 90% at 4 m, and 70% at 5 to 6 m.