# Machine Vision and Applications

Online ISSN: 1432-1769
Print ISSN: 0932-8092
Recent publications
• Qianjin Yuan
• Jing Chang
• Yong Luo
• [...]
• Dongshu Wang
Point cloud segmentation of a substation device with attached cables is the basis of substation identification and reconstruction. However, it is limited by several factors, including the huge volume of point cloud data for a substation device, its irregular shape, and the unclear feature distinction caused by auxiliary point cloud data attached to the main body of the device. As a result, the segmentation efficiency for a substation device is very low. To improve the accuracy and efficiency of point cloud segmentation, this paper proposes a method that segments the attached-cable point cloud of a substation device using the shape features of the point cloud. First, according to the spatial position of the device's point cloud, an octree is used to voxelize the point cloud, and the point cloud is resampled according to the point density of each voxel, so as to reduce the original data and improve computing efficiency. Then the Mean Shift algorithm is used to locate the center axis of the point cloud, and a cylinder growth method is used to initially segment the cable data and locate the end of each cable. Finally, the end points are used as seed points for region growing based on the shape features of the point cloud, achieving effective segmentation of the cable data. In the experiment, 303 point cloud sets of devices are selected, including circuit breakers, voltage transformers, transformers, etc. The final result shows that the successful segmentation rate of this method reaches 95.34%, which demonstrates its feasibility.
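
The voxelization and density-based resampling step can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a flat uniform voxel grid instead of an octree, and the function name `voxel_downsample` and the parameter `max_per_voxel` are hypothetical.

```python
from collections import defaultdict


def voxel_downsample(points, voxel_size, max_per_voxel):
    """Group 3D points into cubic voxels and keep at most max_per_voxel per voxel.

    Assumption: a uniform grid stands in for the octree named in the abstract.
    """
    if voxel_size <= 0:
        raise ValueError("voxel_size must be positive")
    voxels = defaultdict(list)
    for x, y, z in points:
        # Integer voxel index along each axis (floor division also handles negatives).
        key = (int(x // voxel_size), int(y // voxel_size), int(z // voxel_size))
        voxels[key].append((x, y, z))

    kept = []
    for key in sorted(voxels):
        members = voxels[key]
        if len(members) > max_per_voxel:
            # Dense voxel: thin it with an even stride over its points.
            stride = len(members) / max_per_voxel
            members = [members[int(i * stride)] for i in range(max_per_voxel)]
        # Sparse voxels are kept unchanged.
        kept.extend(members)
    return kept


if __name__ == "__main__":
    cloud = [(0.1 * i, 0.0, 0.0) for i in range(10)] + [(5.5, 5.5, 5.5)]
    print(len(voxel_downsample(cloud, voxel_size=1.0, max_per_voxel=3)))
```

Capping the count per voxel reduces data most where the cloud is densest while leaving sparse regions, such as thin cables, untouched.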

Feature-level or pixel-level fusion is a common technique for integrating different modes of information in RGB-T object tracking. A good fusion method between modalities can significantly improve the tracking performance. In this paper, a multi-modal and multi-level fusion model based on a Siamese network (SiamMMF) is proposed. SiamMMF consists of two main subnetworks: a pixel-level fusion network and a feature-level fusion network. The pixel-level fusion network fuses the infrared images and the visible light images by taking the maximum values of the pixels corresponding to the different images, and the combined images are used to replace the visible light images. The infrared images and the visible light images are each input to the backbone with a dual-stream structure for processing. After the extraction of deep features, the visible and infrared features from the two branches are cross-correlated to obtain a fusion result that is sent to the tracking head for tracking. Based on numerous experiments, it was found that the best tracking effect is obtained when the weighting ratio between the visible and infrared modalities is set to 6:4. Nineteen pairs of RGB-T video sequences with different attributes were used to test our model and to compare it with 15 trackers. On both evaluation criteria, success rate and precision rate, our network achieved the best results.
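
The two fusion operations described above can be sketched on plain nested lists. This is only an illustration: the function names are hypothetical, and the abstract does not state where in the network the 6:4 weighting is applied, so the weighted sum over two response maps is an assumption.

```python
def fuse_pixels_max(visible, infrared):
    """Pixel-level fusion: keep the larger intensity at each pixel position."""
    same_rows = len(visible) == len(infrared)
    if not same_rows or any(len(v) != len(r) for v, r in zip(visible, infrared)):
        raise ValueError("images must have the same shape")
    return [
        [max(v, r) for v, r in zip(v_row, r_row)]
        for v_row, r_row in zip(visible, infrared)
    ]


def fuse_weighted(visible_map, infrared_map, w_visible=0.6, w_infrared=0.4):
    """Weighted sum of two response maps.

    Assumption: the 6:4 default mirrors the ratio reported in the abstract;
    the exact place where the paper applies it is not specified there.
    """
    return [
        [w_visible * v + w_infrared * r for v, r in zip(v_row, r_row)]
        for v_row, r_row in zip(visible_map, infrared_map)
    ]


if __name__ == "__main__":
    rgb = [[10, 200], [30, 40]]
    tir = [[50, 100], [30, 90]]
    print(fuse_pixels_max(rgb, tir))  # [[50, 200], [30, 90]]
```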

The coronavirus pandemic has introduced limitations which were previously not a cause for concern. Chief among them are wearing face masks in public and constraints on the physical distance between people as an effective measure to reduce the spread of the virus. Visual surveillance systems, which are common in urban environments and were initially commissioned for security surveillance, can be re-purposed to help limit the spread of COVID-19 and prevent future pandemics. In this work, we propose a novel integration technique for real-time pose estimation and multiple human tracking in a pedestrian setting, primarily for social distancing, using CCTV camera footage. Our technique promises a sizeable increase in processing speed and improved detection in very low-resolution scenarios. Using existing surveillance systems, pedestrian pose estimation, tracking and localization for social distancing (PETL4SD) is proposed for measuring social distancing, which combines the output of multiple neural networks aided by fundamental 2D/3D vision techniques. We leverage state-of-the-art object and pose estimation algorithms, combining their strengths to increase speed and improve detections. These detections are then tracked using a bespoke version of the FASTMOT algorithm. Temporal and analogous estimation techniques are used to deal with occlusions when estimating posture. Projective geometry along with the aforementioned posture tracking is then used to localize the pedestrians. Inter-personal distances are calculated and locally inspected to detect possible violations of the social distancing rules. Furthermore, a “smart violations detector” is employed which estimates if people are together based on their current actions and eliminates false social distancing violations within groups. Finally, distances are intuitively visualized with the right perspective. The entire pipeline runs in real time and is implemented in Python.
Experimental results are provided to validate our proposed method quantitatively and qualitatively on public domain datasets using only a single CCTV camera feed as input. Our results show our technique to outperform the baseline in speed and accuracy in low-resolution scenarios. The code of this work will be made publicly available on GitHub at https://github.com/bilalze/PETL4SD.
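
The localization and distance check can be sketched as follows, assuming a known 3x3 image-to-ground homography and one foot point per pedestrian. The function names, the homography values, and the 2 m threshold are illustrative assumptions, not taken from the PETL4SD code.

```python
import math
from itertools import combinations


def to_ground_plane(H, u, v):
    """Map image point (u, v) to ground-plane coordinates with a 3x3 homography H."""
    x = H[0][0] * u + H[0][1] * v + H[0][2]
    y = H[1][0] * u + H[1][1] * v + H[1][2]
    w = H[2][0] * u + H[2][1] * v + H[2][2]
    if abs(w) < 1e-12:
        raise ValueError("point maps to infinity")
    return (x / w, y / w)


def find_violations(H, foot_points, min_distance):
    """Return index pairs of pedestrians closer than min_distance on the ground."""
    ground = [to_ground_plane(H, u, v) for u, v in foot_points]
    return [
        (i, j)
        for (i, p), (j, q) in combinations(enumerate(ground), 2)
        if math.dist(p, q) < min_distance
    ]


if __name__ == "__main__":
    # Hypothetical homography: a pure scaling of 1 pixel = 1 cm.
    H = [[0.01, 0.0, 0.0], [0.0, 0.01, 0.0], [0.0, 0.0, 1.0]]
    feet = [(100, 100), (150, 100), (500, 100)]
    print(find_violations(H, feet, min_distance=2.0))  # [(0, 1)]
```

Measuring distances on the ground plane rather than in pixels avoids the perspective distortion that makes far-away pedestrians appear closer together in the image.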

This paper proposes a novel method for enhancing the dynamic range of structured-light cameras to solve the problem of highlight that occurs when 3D modeling highly reflective objects using the structured-light method. Our method uses the differences in quantum efficiency between R, G, and B pixels in the color image sensor of a monochromatic laser to obtain structured-light images of an object under test with different luminance values. Our approach sacrifices the resolution of the image sensor to increase the dynamic range of the vision system. Additionally, to enhance our system, we leverage the backgrounds of structured-light stripe pattern images to restore the color information of measured objects, whereas the background is often removed as noise in other 3D reconstruction systems. This reduces the number of cameras required for 3D reconstruction and the matching error between point clouds and color data. We modeled both highly reflective and non-highly reflective objects and achieved satisfactory results.

Clustering approaches based on similarity learning have achieved good results, but they still have the following problems: (1) these approaches generally learn similar expressions on the original data, thereby disregarding the nonlinear structure of the data; (2) these methods generally do not consider the consistency and high-order relevance among multi-view data; and (3) these approaches generally use the learned similarity matrix for clustering, usually not achieving the optimal effect. To resolve the above issues, we present a new approach referred to as consensus similarity learning based on tensor nuclear norm. First, to address the first problem, we map the data of each view to the Hilbert space to discover the nonlinear structure of the data. Second, to address the second problem, we introduce the tensor nuclear norm to constrain the regularization term, and then, the consistency and high-order relevance among multi-view data can be captured. Third, to address the third problem, i.e., to obtain a better clustering effect, we learn a clustering indicator matrix in the kernel space instead of a similarity matrix for clustering by using a consensus representation term. Last, we incorporate these three steps into a unified framework and design the corresponding goal function. In addition, experimental outcomes on some datasets show that our algorithm is superior to certain representative approaches.

The attention mechanism has been extensively employed in the task of person re-identification, as it helps to extract much more discriminative feature representations. However, most existing works either incorporate a single-scale attention module, or the embedded attentions work independently. Though promising results are achieved, they may fail to mine different subtle visual clues. To mitigate this issue, a novel framework called cascaded attention network (CANet) is proposed, which mines diverse clues and integrates them into the final multi-granularity features in a cascaded manner. Specifically, we design a novel hybrid pooling attention module (HPAM) and plug it into the backbone network at different stages. To make the modules work collaboratively, an inter-attention regularization is applied, such that they can localize complementary salient features. Then, CANet extracts global and local features from a part-based pyramidal architecture. For better feature robustness, supervision is applied not only to the pyramidal branches, but also to the intermediate attention modules. Furthermore, within each supervision branch, hybrid pooling with two different strides is executed to enhance feature representation capabilities. Extensive experiments with ablation analysis demonstrate the effectiveness of the proposed method, and state-of-the-art results are achieved on three public benchmark datasets: Market-1501, CUHK03, and DukeMTMC-ReID.

Human pose estimation based on deep learning has attracted increasing attention in the past few years and has shown superior performance on various datasets. Many researchers have increased the number of network layers to improve the accuracy of the model. However, as networks deepen, the parameters and computation of the model also increase, which prevents the model from being deployed on edge devices and mobile terminals with limited computing power, since many intelligent terminals are constrained in volume, power consumption and storage. Inspired by lightweight network design, we propose a human pose estimation model based on a lightweight network to solve these problems. Its lightweight basic block module uses depthwise separable convolution and an inverted bottleneck layer to accelerate network computation and reduce the parameters of the overall network model. Experiments on the COCO and MPII datasets show that this lightweight basic block module can effectively reduce the number of parameters and the amount of computation of the human pose estimation model.
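
The parameter saving can be checked with simple arithmetic, assuming the separable convolution mentioned above is the standard depthwise separable convolution (a per-channel k x k filter followed by a 1 x 1 pointwise convolution). Biases are ignored, and the layer sizes below are illustrative, not taken from the paper.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(k, c_in, c_out):
    """Weights in a depthwise k x k convolution plus a 1 x 1 pointwise convolution."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


if __name__ == "__main__":
    # Illustrative layer: 3 x 3 kernel, 64 input channels, 128 output channels.
    print(standard_conv_params(3, 64, 128))        # 73728
    print(depthwise_separable_params(3, 64, 128))  # 8768
```

For this illustrative layer the separable form needs roughly 8.4 times fewer weights.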

Motion cues are pivotal in moving object analysis, which underlies motion segmentation and detection. These preprocessing tasks are building blocks for several applications such as recognition, matching and estimation. To devise a robust algorithm for motion analysis, it is imperative to have a comprehensive dataset to evaluate an algorithm’s performance. The main limitation in making this kind of dataset is the creation of ground-truth annotation of motion, as each moving object might span multiple frames with changes in size, illumination and angle of view. Besides the optical changes, the object can undergo occlusion by static or moving occluders. The challenge increases when the video is captured by a moving camera. In this paper, we tackle the task of providing ground-truth annotation on motion regions in videos captured from a moving camera. With minimal manual annotation of an object mask, we are able to propagate the label mask to all the frames. Object label correction based on static and moving occluders is also performed by applying occluder mask tracking for a given depth ordering. A motion annotation dataset is also proposed to evaluate algorithm performance. The results show that our cascaded-naive approach provides successful results. All the resources of the annotation tool are publicly available at http://dixie.udg.edu/anntool/.

Scene text with occlusions is common in the real world, and occluded text recognition is important for many machine vision applications. However, corresponding techniques are not well explored as public datasets cannot represent the situation well, and methods designed for occluded text are still scarce. In this work, we discuss different kinds of occlusions and propose an occluded scene text enhancing network to improve recognition performance. The network is based on generative adversarial networks, and we design accretion blocks to help the network generate the occluded image regions. The model is independent of the recognition networks, so it can be readily used in different frameworks and can be easily trained without the annotations of text content. We also refine the training objective to improve the framework. Experiments on several public benchmarks demonstrate that the proposed method effectively enhances occluded text images, improving recognition accuracy by over 10% on several state-of-the-art frameworks. Meanwhile, the network has no severe impact on the text images without occlusions.

Re-identifying objects in a rigid scene across varying viewpoints (object Re-ID) is a challenging task, in particular when similar or even identical objects coexist in the same environment. Discriminative features undoubtedly play an essential role in addressing this challenge, while for practical deployment, real-time performance is another desired attribute. We therefore propose a novel framework, named Fast re-OBJ, that is able to improve both Re-ID accuracy and processing speed via tight coupling between the instance segmentation module and the embedding generation module. The rich object encoding in the instance segmentation backbone is directly shared with the embedding generation module for training a more discriminative representation via a triplet network. Moreover, we create datasets with the segmentation outputs using real-time object detectors to train and evaluate our object embedding module. With extensive experiments, we show that our proposed Fast re-OBJ improves the object Re-ID accuracy by 5% and is 5× faster than the state-of-the-art methods. The dataset and code repository are publicly available at: https://tinyurl.com/bdsb53c4.

Computer-aided diagnosis technology needs to determine whether a bone image is abnormal and to locate the lesions accurately. However, there are few publicly available detection datasets of bone lesions. Therefore, for the first time, this paper proposes the abnormality detection dataset MURA-objects, which is relabeled based on the large-scale bone radiograph dataset Musculoskeletal Radiograph. MURA-objects consists of 8431 images, including 7579 in the training set and 852 in the validation set. There are 8933 metal objects and 740 fracture objects in the training images and 1107 metal objects and 124 fracture objects in the validation images. We also give baseline results using state-of-the-art methods such as Faster R-CNN, SSD, and YOLOv3, which lay a foundation for future research on lesion detection in bone imaging. The MURA-objects dataset can be found at https://github.com/wangxin1216/MURA-Objects.

We introduce a novel deep learning-based group activity recognition approach called the Pose Only Group Activity Recognition System (POGARS), designed to use only tracked poses of people to predict the performed group activity. In contrast to existing approaches for group activity recognition, POGARS uses 1D CNNs to learn spatiotemporal dynamics of individuals involved in a group activity and forgoes learning features from pixel data. The proposed model uses a spatial and temporal attention mechanism to infer person-wise importance and multi-task learning for simultaneously performing group and individual action classification. Experimental results confirm that POGARS achieves highly competitive results compared to state-of-the-art methods on a widely used public volleyball dataset despite only using tracked pose as input. Further, our experiments show that, by using pose only as input, POGARS has better generalization capabilities than methods that use RGB as input.

In this work, we propose a two-stage architecture to perform image inpainting from coarse to fine. The framework extracts advantages from different designs in the literature and integrates them into the inpainting network. We apply region normalization to generate coarse blur results with the correct structure. Then, contextual attention is applied to utilize the texture information of background regions to generate the final result. Although using region normalization can improve the performance and quality of the network, it often results in visible color shifts. To solve this problem, we introduce perceptual color distance in the loss function. In quantitative comparison experiments, the proposed method is superior to the existing similar methods in Inception Score, Fréchet Inception Distance, and perceptual color distance. In qualitative comparison experiments, the proposed method can effectively resolve the problem of color shifts.

In vision-assisted automatic robot welding, splash and arc noise interfere with the useful information in the image, which has a serious impact on the subsequent vision-based weld seam recognition and tracking algorithm. Although various methods have been proposed to achieve noise reduction, they are implemented from the perspective of image processing or intelligent algorithms. In this paper, we propose a pairwise response noise reduction model (PRNM) from the imaging perspective of the seam tracking camera, which provides an image processing-free method to suppress welding splash and arc. An inverse response law and a loss function are established and solved using singular value decomposition, and then the irradiance of the reference scene is restored. A tone mapping approach is proposed based on the advantages of full-range compression and a joint estimator, followed by mapping the obtained irradiance distribution to a high dynamic range (HDR) image. The grayscale correspondences between two reference images and the HDR image are represented in the form of a two-dimensional lookup table (LUT). For welding noise characteristics, an alignment strategy is proposed to further correct the LUT to form a PRNM, which responds to all pairwise inputs with high computational efficiency. A batch of PRNMs has the potential to form imaging solutions serving a welding process library. Experimental results reveal that the proposed model is effective in noise reduction, and its real-time performance is suitable for weld seam tracking.

Multi-focus image fusion, which is the fusion of two or more images focused on different targets into one clear image, is a worthwhile problem in digital image processing. Traditional methods are usually based on the frequency domain or the spatial domain, but they can guarantee neither an accurate activity-level measurement of all image details nor an ideal choice of image fusion rules. Therefore, deep learning, with its strong feature representation ability, has become the mainstream of multi-focus image fusion. However, until now, most deep learning frameworks have not balanced the relationship between the two input features, the shallow features and the feature fusion. To address the shortcomings of previous work, we propose an end-to-end deep network, which includes an encoder and a decoder. The encoder is a pseudo-Siamese network. It extracts common and distinct feature sets using the features of the two encoder branches, then reuses the shallow features and finally forms the code. In the decoder, the code is analyzed and reduced in dimension to generate a high-quality fused image. We carried out extensive experiments. Compared with various recent deep learning-based image fusion methods and traditional multi-focus image fusion methods, our method is slightly better in both objective metric comparison and subjective visual comparison.

The recognition of arrow markings in intelligent transportation systems can be posed as a template matching problem, because their production has to conform to strict national standards. However, appearance variations often arise when capturing the arrow images due to different viewpoints. In this paper, we propose a discriminant distance template matching (DDTM) method for image recognition. It is compatible not only with hand-crafted features, but also with DCNNs for automatic feature learning. DDTM is able to discriminate whether an instance and a template match. Consequently, it can realize arrow recognition through template matching. We also show that some traditional image recognition tasks can be solved by DDTM. Experimental results on an arrow dataset and the MNIST dataset validate its advantages compared to classification-based methods.

Hyperspectral images (HSI) contain rich ground object information, which has great potential in classification. However, the large amount of data and noise also pose a challenge to HSI classification. In this paper, a new framework based on band selection and multi-scale structure features is proposed, which mainly consists of the following steps. Firstly, the spectral dimension of the HSI is reduced with the clustering average method based on information divergence. Secondly, the detailed multi-scale structure features of HSI are extracted by using multi-parameter relative total variation. Thirdly, in order to reduce noise and highlight structural features, bilateral filtering is used to fine-tune the extracted structural features. Finally, an improved quantum particle swarm optimization algorithm is proposed to optimize the parameters of the SVM. Extensive experimental results on two hyperspectral datasets show that the proposed method performs better than several state-of-the-art methods.

Kinship verification from facial images in the wild, based on one-to-one classification, has attracted considerable attention from image processing and computer vision researchers, whereas family classification, based on one-to-many classification, remains a relatively unexplored domain in computer vision. This paper first performs family classification on different family-sets based on the number of family members. Second, we perform kinship verification on different kinship relations covering parent–child and siblings. We present a new kinship database named KinIndian dedicated to these two tasks of family classification and kinship verification. The KinIndian database comprises 1926 images of 813 individuals from 230 unique Indian families with 2–7 members. KinIndian is organized into two levels: the first is family-level for family classification, and the second is photo-level for kinship verification. We propose a novel weighted nearest member metric learning (WNMML) method to evaluate family classification on different family-sets. The proposed WNMML method is based on minimizing intraclass separation by characterizing compactness for positive families and maximizing interclass separation by pushing members of negative families as far away as possible. WNMML achieves competitive accuracy on different family-sets, which shows that it could be used effectively in real-world scenarios. Furthermore, we also perform kinship verification on KinIndian using baseline multimetric learning methods and achieve promising kinship verification accuracy.

We propose an image alignment algorithm based on weak supervision, which aims to identify the correspondence between a pair of reference and target images with no supervision of individual pixels. Since most existing methods have relied on a predefined geometric model such as homography, they often suffer from a lack of model flexibility and generalizability. To tackle the challenge, we propose a novel nonparametric transformation model based on graph convolutional networks without an explicit geometric constraint. The proposed method is generic and flexible in the sense that it is applicable to the image pairs undergoing diverse local and/or global transformations. To make the algorithm more suitable for real-world scenarios having potential noises from moving objects, we disregard those objects with an off-the-shelf semantic segmentation model. The proposed algorithm is evaluated on the Cityscapes dataset with annotated pixel-level correspondences and outperforms baseline methods relying on global parametric transformations.

Precision farming requires tree-canopy information for better management. Stereo vision is a technique for creating a 3D model, and it needs to be set up adequately to avoid excessive data processing and unreliable results. Feature detection is very important, and different parameters affect the features in images. Because 3D accuracy is necessary, this study investigated the effects of various stereo camera baselines on well-known combinations of feature detectors and descriptors, and optimized a stereo vision system for obtaining a 3D model of a tree canopy. The effects of different parameters were also investigated in the RGB and Y color spaces. These parameters were three levels of density, two canopy shapes (conic and ellipse), image rectification and undistortion, metering mode, exposure time and ISO speed. The results showed that the best system was the stereo system with a baseline of 12 cm and the best combination was SURF-BRISK, followed by the SURF-FREAK and SURF-SURF combinations. The precision value was 1 for the SURF-BRISK combination in the system with the 12 cm baseline. Image rectification, metering mode, exposure time and ISO speed affected the performance of the combinations. Images must be rectified before the detector algorithms are applied. Using the pattern mode and the same exposure time and ISO speed for both images of a pair gave better results. The recall values decreased for varying exposure times and ISO speeds. The results of the algorithms were not affected by tree-canopy shape or density, so the results can be used successfully for larger trees and for different shapes and densities.
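
The precision and recall values quoted above can be read with the usual definitions for feature matching. The sketch below assumes those standard definitions (the abstract does not spell them out), and the counts in the usage example are invented for illustration.

```python
def precision_recall(num_correct, num_matches, num_correspondences):
    """Precision and recall of a detector-descriptor combination.

    Assumed definitions:
      precision = correct matches / all matches returned
      recall    = correct matches / ground-truth correspondences
    """
    precision = num_correct / num_matches if num_matches else 0.0
    recall = num_correct / num_correspondences if num_correspondences else 0.0
    return precision, recall


if __name__ == "__main__":
    # Invented counts: every returned match is correct, but many are missed.
    print(precision_recall(num_correct=40, num_matches=40, num_correspondences=100))
```

A precision of 1 therefore means that no returned match was wrong; it says nothing about how many true correspondences were missed, which is what recall captures.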

Robot kinematic calibration is an effective way to reduce the errors of kinematic parameters and improve the positioning accuracy of a robot. This paper presents a cost-effective kinematic calibration method for a hexapod robot that only needs a monocular camera and two planar markers. The markers are attached to the body and the foot-tip of the robot separately, and the robot’s six legs are calibrated one by one. The kinematic model and error model of the robot are established based on the local product of exponential (POE) model, and the calibration task is formulated as a nonlinear least squares problem where 24 unknown parameters are estimated for each leg. The proposed calibration procedure is successfully evaluated on a real hexapod robot, and the experimental results show that the robot can have a better walking performance after calibration.

Due to the complex underwater imaging environment and illumination conditions, underwater images suffer from quality degradation problems, such as low contrast, color distortion, texture blur and uneven illumination, which seriously restrict their application in underwater work. To solve these problems, we propose a multi-scale feature fusion CNN based on the underwater imaging model, called Multi-Scale Convolution Underwater Image Restoration Network (MSCUIR-Net). Unlike most previous models, which estimate the background light and transmittance separately, our model unifies the two parameters into one, predicts the univariate linear physical model through a lightweight CNN, and directly generates clean images end to end. Based on the underwater imaging model, we synthesize an underwater image training set that simulates environments ranging from shallow to deep water. Then, we conduct experiments on synthetic images and real underwater images, and demonstrate the superiority of this method through image evaluation indexes. The experimental results show that MSCUIR-Net is effective for underwater image restoration.

At present, the parametric active contour model is one of the most well-known and widely used image segmentation techniques in image processing and computer vision. However, its evolution computation is slow, which is a great obstacle to some applications such as real-time motion tracking. This paper not only reveals its bottleneck, namely the high computational cost of the matrix inversion and the matrix multiplication in each iteration, but also proposes a novel scheme that transforms these time-consuming matrix operations into vector convolution operations for better performance. As shown by simulation results, the proposed algorithm is always much faster than the conventional algorithm, and the speedup increases with the number of snaxels on the curve, from several times to over 2 orders of magnitude.
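
Why a matrix product can be replaced by a convolution is easy to see for a closed contour: the classical internal-energy matrix is circulant and pentadiagonal, so multiplying by it equals a circular convolution with a five-tap kernel. The sketch below checks only this equivalence, using the textbook stencil with elasticity `alpha` and rigidity `beta`; it does not reproduce the paper's scheme, which also handles the matrix inverse.

```python
def snake_kernel(alpha, beta):
    """Five-tap stencil of the internal-energy operator of a closed snake."""
    return [beta, -alpha - 4 * beta, 2 * alpha + 6 * beta, -alpha - 4 * beta, beta]


def stiffness_matrix(n, alpha, beta):
    """Dense n x n circulant matrix built from the stencil (the costly baseline)."""
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for offset, weight in zip(range(-2, 3), snake_kernel(alpha, beta)):
            A[i][(i + offset) % n] += weight
    return A


def matvec(A, x):
    """Plain matrix-vector product, O(n^2) operations."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]


def circular_convolve(x, alpha, beta):
    """Same result via circular convolution, O(n) operations for a fixed kernel."""
    n = len(x)
    kernel = snake_kernel(alpha, beta)
    return [
        sum(w * x[(i + offset) % n] for offset, w in zip(range(-2, 3), kernel))
        for i in range(n)
    ]


if __name__ == "__main__":
    xs = [float(i * i) for i in range(8)]  # x-coordinates of 8 snaxels
    dense = matvec(stiffness_matrix(8, 0.5, 0.2), xs)
    fast = circular_convolve(xs, 0.5, 0.2)
    print(max(abs(a - b) for a, b in zip(dense, fast)) < 1e-9)  # True
```

The dense product costs a number of operations quadratic in the number of snaxels, while the convolution is linear, which is consistent with a speedup that grows with contour length.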

Graph convolutional networks (GCNs) have been successfully introduced in skeleton-based human action recognition. Both human skeletons and hand skeletons are composed of open-loop chains, and each chain is composed of rigid links (corresponding to bones) and revolving pairs (corresponding to joints). Despite this similarity, there has been no skeleton-based hand action recognition method that represents hand skeletons using GCNs. We first evaluate the effectiveness of traditional spatial–temporal GCNs for skeleton-based hand action recognition. Then, we propose to improve the traditional spatial–temporal GCNs by incorporating the third-order node information (geometric relationships between neighbor connected bones in a hand skeleton), and the geometric relationships are described by a Lie group, including relative translations and rotations. Finally, we study first-person multimodal hand action recognition with hand skeletons, RGB images, and depth maps jointly used as visual input. We propose to fuse the multimodal features by customized long short-term memory (LSTM) units, rather than simply concatenating them as a feature vector. Extensive ablation studies are conducted to demonstrate the improvements due to the use of the third-order node information and the advantages of our multimodal fusion strategy. Our method markedly outperforms recent baselines on a public first-person hand action recognition dataset.

3D human pose estimation has achieved much progress with the development of convolutional neural networks. However, accurately estimating 3D joint locations from single-view images or videos remains challenging due to depth ambiguity and severe occlusion. Motivated by the effectiveness of introducing the vision transformer into computer vision tasks, we present a novel U-shaped spatial–temporal transformer-based network (U-STN) for 3D human pose estimation. The core idea of the proposed method is to process the human joints by designing a multi-scale and multi-level U-shaped transformer model. We construct a multi-scale architecture with three different scales based on the human skeletal topology, in which the local and global features are processed through three different scales with kinematic constraints. Furthermore, a multi-level feature representation is introduced by fusing intermediate features from different depths of the U-shaped network. With skeletal-constrained pooling and unpooling operations devised for U-STN, the network can transform features across different scales and extract meaningful semantic features at all levels. Experiments on two challenging benchmark datasets show that the proposed method achieves good performance on 2D-to-3D pose estimation. The code is available at https://github.com/l-fay/Pose3D.

In this paper, we propose a variational autoencoder (VAE) and a VAE-generative adversarial net (GAN) trained to generate, from 12000 Ising granularity images, new and appropriate images that retain the former's global chaotic structure to some extent. Via VAE, we project high-dimensional Ising granularity images onto a two-dimensional latent space in which some spatial distribution patterns are explored. The particles observed in the latent-space electronic cloud are similar to those of the quantum dynamics integrable pattern. The resulting VAE latent space is a new measurement space to explore both the spatial particle distribution patterns and the structural topology clusters, leading to recognition of new classification/clustering patterns of the physical state/phase, which extend those found via traditional approaches that consider pixels of an image as physical particles. In addition, we propose a multiple-level structural similarity image quality assessment (IQA) scheme to measure inter- and intra-patch similarities on VAE and VAE–GAN generated images when they are split into patches. The results show that this novel IQA scheme can both maximize the distances of the samples among inter-classes and minimize those of the intra-classes, without compromising the image fidelity and features.

This article presents a machine learning approach to automatic land use categorization based on a convolutional artificial neural network architecture. It is intended to support the detection and classification of building facades in order to associate each building with its respective land use. Replacing the time-consuming manual acquisition of images in the field and subsequent interpretation of the data with computer-aided techniques facilitates the creation of useful maps for urban planning. A specific future objective of this study is to monitor the commercial evolution in the city of Vila Velha, Brazil. The initial step is object detection based on a deep network architecture called Faster R-CNN. The model is trained on a collection of street-level photographs of buildings of desired land uses, from a database of annotated images of building facades. Images are extracted from Google Street View scenes. Furthermore, in order to save manual annotation time, a semi-supervised dual pipeline method is proposed that uses a pre-trained predictor model from the Places365 database to learn unannotated images. Several backbones were connected to the Faster R-CNN architecture for comparisons. The experimental results with the VGG backbone show an improvement over published works, with an average accuracy of 86.49%.

Despite recent progress in video-based continuous sign language translation, many deep learning models are difficult to apply to real-time translation under limited computing resources. We present a two-stream lightweight sign transformer network model for recognizing and translating continuous sign language. This lightweight framework captures both static spatial information and the whole-body dynamic features of the signer, and its transformer-style decoder architecture translates sentences in real time from the spatio-temporal context around the signer. Additionally, its attention mechanism focuses on the moving hands and mouth of the signer, which are often crucial for the semantic understanding of sign language. In this paper, we introduce a Chinese sign language corpus of the business scene (CSLBS), which consists of 3080 high-quality videos. This corpus provides considerable impetus for further research on Chinese sign language translation. Experiments are carried out on PHOENIX-Weather 2014T (Camgoz et al, in: Proceedings of IEEE/CVF conference on computer vision and pattern recognition (CVPR 2018), pp 7784–7793, 2018), the Chinese Sign Language dataset (Huang et al, in: The thirty-second AAAI conference on artificial intelligence (AAAI-18), pp 2257–2264, 2018) and our CSLBS; the proposed model outperforms the state of the art in inference time and accuracy using only raw RGB and RGB difference frames as input.

We present CrossInfoMobileNet, a hand pose estimation convolutional neural network based on CrossInfoNet, specifically tuned to mobile phone processors through the optimization, modification, and replacement of computationally critical CrossInfoNet components. By introducing a state-of-the-art MobileNetV3 network as a feature extractor and refiner and replacing the ReLU activation with the better-performing H-Swish activation function, we have achieved a network that requires 2.37 times fewer multiply-add operations and 2.22 times fewer parameters than the CrossInfoNet network, while maintaining the same error on the state-of-the-art datasets. This reduction of multiply-add operations resulted in an average 1.56 times faster real-world performance on both desktop and mobile devices, making it more suitable for embedded applications. The full source code of CrossInfoMobileNet, including the sample dataset and its evaluation, is available online through Code Ocean.
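
As an illustration of the activation swap mentioned above, the following minimal sketch compares ReLU with H-Swish on scalar inputs. It uses the commonly cited definition from MobileNetV3, h_swish(x) = x · ReLU6(x + 3) / 6; the function names are illustrative only and are not taken from the CrossInfoMobileNet source code.

```python
def relu(x: float) -> float:
    """Standard rectified linear unit."""
    return max(x, 0.0)


def relu6(x: float) -> float:
    """ReLU clamped to the range [0, 6]."""
    return min(max(x, 0.0), 6.0)


def h_swish(x: float) -> float:
    """Hard-swish: x * ReLU6(x + 3) / 6, a piecewise approximation of swish."""
    return x * relu6(x + 3.0) / 6.0


if __name__ == "__main__":
    # Print both activations side by side for a few sample inputs.
    for x in (-4.0, -1.0, 0.0, 1.0, 4.0):
        print(f"x={x:+.1f}  relu={relu(x):.3f}  h_swish={h_swish(x):.3f}")
```

Unlike ReLU, H-Swish is slightly negative for inputs between -3 and 0 and equals the identity for inputs of 3 or more, while still needing only comparisons, one multiplication and one division.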

Video captioning is an important problem involved in many applications. It aims to generate descriptions of the content of a video. Most existing methods for video captioning are based on deep encoder–decoder models, particularly attention-based models (e.g., Transformer). However, the existing Transformer-based models may not fully exploit the semantic context, that is, they only use the left-to-right style of context and ignore the right-to-left counterpart. In this paper, we introduce a bidirectional (forward-backward) decoder to exploit both the left-to-right and right-to-left styles of context for the Transformer-based video captioning model. Thus, our model is called bidirectional Transformer (dubbed BiTransformer). Specifically, in the bridge between the encoder and the forward decoder (aiming to capture the left-to-right context) used in the existing Transformer-based models, we plug in a backward decoder to capture the right-to-left context. Equipped with such a bidirectional decoder, the semantic context of videos is more fully exploited, resulting in better video captions. The effectiveness of our model is demonstrated on two benchmark datasets, i.e., MSVD and MSR-VTT, through comparison with the state-of-the-art methods. In particular, in terms of the important evaluation metric CIDEr, the proposed model outperforms the state-of-the-art models with improvements of 1.2% on both datasets.

Small object detection techniques have been developed for decades, but one of the key remaining open challenges is detecting tiny objects in wild or natural scenes. While recent work on deep learning techniques has shown a promising direction for common object detection in the wild, their accuracy and robustness on tiny object detection in the wild are still unsatisfactory. In this paper, we study the problem of tiny pest detection in the wild and propose a new effective deep learning approach. It builds a global activated feature pyramid network on a convolutional neural network backbone for detecting tiny pests across a large range of scales over both positions and pyramid levels. The network enables retrieving the depth and spatial intension information over different levels in the feature pyramid. It makes variations in spatial or depth-sensitive features in tiny pest images more visible. In addition, a hard example enhancement strategy is proposed to implement fast and efficient training in this approach. The approach is evaluated on our newly built large-scale wide tiny pest dataset containing 27.8K images with 145.6K manually labelled pest objects. The results show that our approach performs well on pest detection with over 71% mAP, which outperforms other state-of-the-art object detection methods.

In the multi-focus image fusion task, how to better balance the clear region information of the original image with different focus positions is the key. In this paper, a multi-focus image fusion model based on unsupervised learning is designed, and the image fusion task is carried out by two-stage processing. In the training phase, the encoder–decoder structure is adopted and the multi-scale structural similarity is introduced as the loss function for image reconstruction. In the fusion stage, the trained encoder is used to encode the feature of the original image. The spatial frequency is used to distinguish the clear area of the image from the two scales of channel and space, and the pixels with inconsistent discrimination are checked and processed to generate the initial decision diagram. The final image fusion task is carried out after mathematical morphology optimization. The experimental results show that this method has good effect on preserving the texture details and edge information of the focused area of the original image. Compared with the five advanced fusion algorithms, the proposed algorithm has achieved preferential fusion performance.
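
The spatial frequency measure used above to tell focused from defocused regions can be sketched as follows. This is a minimal illustration on a single 2D grey-level array, assuming the common definition SF = sqrt(RF² + CF²) with row and column differences normalised by the number of pixels; the abstract applies the measure to encoder features across channel and space, which this sketch does not reproduce, and the helper `sharper_patch` is hypothetical.

```python
import math
from typing import Sequence

Image = Sequence[Sequence[float]]


def spatial_frequency(img: Image) -> float:
    """Spatial frequency of a 2D array: sqrt(row_freq^2 + col_freq^2)."""
    rows, cols = len(img), len(img[0])
    n = rows * cols
    # Squared horizontal differences (row frequency term).
    rf2 = sum((img[i][j] - img[i][j - 1]) ** 2
              for i in range(rows) for j in range(1, cols)) / n
    # Squared vertical differences (column frequency term).
    cf2 = sum((img[i][j] - img[i - 1][j]) ** 2
              for i in range(1, rows) for j in range(cols)) / n
    return math.sqrt(rf2 + cf2)


def sharper_patch(a: Image, b: Image) -> str:
    """Hypothetical decision rule: the patch with the higher SF is 'in focus'."""
    return "a" if spatial_frequency(a) >= spatial_frequency(b) else "b"


if __name__ == "__main__":
    checker = [[0.0, 1.0], [1.0, 0.0]]   # high local contrast
    flat = [[0.5, 0.5], [0.5, 0.5]]      # no detail at all
    print(spatial_frequency(checker), spatial_frequency(flat))
    print(sharper_patch(checker, flat))
```

Comparing the measure patch by patch between two source images gives a binary decision map of the kind that the abstract then refines with consistency checks and morphological operations.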

Micro-cracks are often generated on the concrete structures of long-distance water conveyance projects. Without early detection and timely maintenance, micro-cracks may expand and deteriorate continuously, leading to major structural failure and disastrous results. However, due to the complexity of the underwater environment, many vision-based methods for concrete crack detection cannot be directly applied to the interior surface of water conveyance structures. In view of this, this paper proposes a three-step method to automatically detect concrete micro-cracks of underwater structures during the operation period. First, underwater optical images were preprocessed by a series of algorithms such as global illumination balance, image color correction, and detail enhancement. Second, the preprocessed images were sliced to image patches, which are sent to a convolutional neural network for crack recognition and crack boundary localization. Finally, the image patches containing cracks were segmented by the Otsu algorithm to localize the cracks precisely. The proposed method can overcome issues such as uneven illumination, color distortion, and detail blurring, and can effectively detect and localize cracks in underwater optical images with low illumination, low signal-to-noise ratio and low contrast. The experimental results show that this method can achieve a true positive rate of 93.9% for crack classification, and the identification accuracy of the crack width can reach 0.2 mm.
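
The final segmentation step can be illustrated with a small, dependency-free version of Otsu's method, which picks the grey level that maximises the between-class variance of the histogram. This is a sketch under stated assumptions: patches are 8-bit grey-level arrays, cracks are darker than the surrounding concrete, and the names `otsu_threshold` and `crack_mask` are illustrative rather than taken from the paper.

```python
from typing import List, Sequence


def otsu_threshold(pixels: Sequence[int]) -> int:
    """Return the 8-bit threshold t that maximises between-class variance.

    Pixels with value <= t form class 0, the rest form class 1.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    w0 = 0          # number of pixels in class 0
    sum0 = 0        # sum of grey levels in class 0
    best_t, best_var = 0, -1.0
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, variance undefined
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        between = w0 * w1 * (mu0 - mu1) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t


def crack_mask(patch: Sequence[Sequence[int]]) -> List[List[bool]]:
    """Mark pixels at or below the Otsu threshold (assumed: dark = crack)."""
    t = otsu_threshold([p for row in patch for p in row])
    return [[p <= t for p in row] for row in patch]


if __name__ == "__main__":
    patch = [[200, 200, 200],
             [200, 20, 200],
             [200, 25, 200]]
    for row in crack_mask(patch):
        print(row)
```

Because the threshold is recomputed per patch, the split adapts to local brightness, which is one reason a global fixed threshold tends to be less suitable for unevenly lit underwater images.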

Co-salient object detection aims to find common salient objects from an image group, which is a branch of salient object detection. This paper proposes a global-guided cross-reference network. The cross-reference module is designed to enhance the multi-level features from two perspectives. From the spatial perspective, the location information of objects with similar appearances must be highlighted. From the channel perspective, more attention must be assigned to channels that indicate the same object category. After spatial and channel cross-reference, the features are enhanced to possess the consensus representation of image group. Next, a global co-semantic guidance module is built to provide hierarchical features with the location information of co-salient objects. Compared with state-of-the-art co-salient object detection methods, our proposed method extracts collaborative information and obtains better co-saliency maps on several challenging co-saliency detection datasets.

Traditional dehazing methods based on restoration are prone to color distortion and noise amplification when dealing with hazy image with large sky area. To improve dehazing effect, we propose a dehazing algorithm based on image structure–texture decomposition and reconstruction. Hazy image is decomposed into high-frequency texture layer and low-frequency structure layer by total variation. Discrete cosine transform is used to generate an image mask to separate sky area and non-sky area. The texture layer is denoised by the mask, and the structure layer is dehazed by dark channel prior. The media transmission is corrected by color attenuation prior. Finally, the denoised texture layer and the dehazed structure layer are reconstructed to obtain the dehazed image. A no-reference image quality assessment is also proposed to evaluate the dehazed images. Experiment results show that, compared with the state-of-the-art methods, our algorithm has better dehazing effect on non-sky area, and the sky area after dehazing is smooth without color distortion and noise.
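
The dark channel prior step applied to the structure layer can be sketched as below. The code follows the widely used formulation in which the dark channel is the minimum over colour channels and over a local window, and the transmission is estimated as t = 1 − ω · dark(I / A). The window radius, ω = 0.95 and the given atmospheric light are assumptions for illustration; the sky mask, total variation decomposition and color attenuation correction described in the abstract are not reproduced here.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[float, float, float]           # (r, g, b) in [0, 1]
ColorImage = Sequence[Sequence[Pixel]]


def dark_channel(img: ColorImage, radius: int = 1) -> List[List[float]]:
    """Minimum over RGB, then minimum over a (2*radius+1)^2 window."""
    h, w = len(img), len(img[0])
    per_pixel_min = [[min(px) for px in row] for row in img]
    dark = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            y0, y1 = max(0, y - radius), min(h, y + radius + 1)
            x0, x1 = max(0, x - radius), min(w, x + radius + 1)
            dark[y][x] = min(per_pixel_min[yy][xx]
                             for yy in range(y0, y1)
                             for xx in range(x0, x1))
    return dark


def transmission(img: ColorImage, airlight: Pixel,
                 omega: float = 0.95, radius: int = 1) -> List[List[float]]:
    """Estimate transmission t = 1 - omega * dark_channel(I / A)."""
    normalised = [[(px[0] / airlight[0], px[1] / airlight[1], px[2] / airlight[2])
                   for px in row] for row in img]
    return [[1.0 - omega * d for d in row]
            for row in dark_channel(normalised, radius)]


if __name__ == "__main__":
    hazy = [[(0.2, 0.5, 0.9), (0.8, 0.8, 0.8)],
            [(1.0, 1.0, 1.0), (0.1, 0.4, 0.3)]]
    print(dark_channel(hazy, radius=0))
    print(transmission(hazy, airlight=(1.0, 1.0, 1.0), radius=0))
```

Bright, low-saturation regions such as sky have a large dark channel and therefore a very small estimated transmission, which is consistent with the colour distortion in sky areas that the abstract sets out to avoid.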

Mineral segmentation in ceramic thin sections containing different minerals, in which there are no evident and close boundaries, is a rather complex process. The results of such a process are used in archaeology for analyzing the origin and manufacturing techniques of ancient ceramics. In this paper we present a methodology for the segmentation and analysis of thin sections of material segments and reaching some conclusions in a fully automatic way. We employ machine learning and computer vision techniques to analyze a video of the thin section sample, acquired under an optical microscope. When examined under polarized light, the color of segments may vary during sample rotation. This variation is due to the optical properties of the materials and it provides valuable information about the material inclusions in the sample. Using the video as our input, we perform an entire-video segmentation. To accomplish this task, we developed a hierarchical categorical mean-shift-based algorithm. Using the entire-video segmentation we examine the detected segments and gather statistical information about their sizes, shapes and colors and present an overall report about the sample. We tested the algorithm on nine specimens of ancient ceramics, taken from three different Mediterranean sites. The results show clear differences between the sites in the amounts, sizes and shapes of the segments present in the specimens.

A noise-enhanced super-resolution generative adversarial network plus (nESRGAN+) was proposed to improve the enhanced super-resolution GAN (ESRGAN). The contributions of nESRGAN+ generate an impressive reconstructed image with more texture details and greater sharpness. However, the perceptual quality of the output lacks hallucinated details and undesirable artifacts and takes a long time to converge. To address these problems, we propose four types of parametric regularization algorithms as loss functions of the model to enable the iterative weight adjustment of the network gradient. Several experiments were conducted to confirm that the generator can achieve a better-quality reconstructed image, including restoring the unseen texture. Our method accomplished the average peak signal-to-noise ratio (PSNR) of the reconstructed image at 27.96 dB, the average Structural Similarity Index Measure (SSIM) at 0.8303, and the average Learned Perceptual Image Patch Similarity (LPIPS) at 0.1949. It took seven times less training time than the state of the art. In addition to the better visual quality of the reconstructed result, the proposed loss functions allow the generator to converge faster.
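
For readers unfamiliar with the first of the reported metrics, PSNR is derived from the mean squared error between the reconstructed and reference images as PSNR = 10 · log10(MAX² / MSE). The sketch below assumes 8-bit grey-level images stored as nested lists; SSIM and LPIPS are more involved and are not shown.

```python
import math
from typing import Sequence

Image = Sequence[Sequence[float]]


def mse(a: Image, b: Image) -> float:
    """Mean squared error between two images of identical size."""
    n = len(a) * len(a[0])
    return sum((pa - pb) ** 2
               for row_a, row_b in zip(a, b)
               for pa, pb in zip(row_a, row_b)) / n


def psnr(a: Image, b: Image, max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    err = mse(a, b)
    if err == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / err)


if __name__ == "__main__":
    reference = [[10, 20], [30, 40]]
    restored = [[11, 19], [31, 39]]   # every pixel is off by one grey level
    print(f"{psnr(reference, restored):.2f} dB")
```

Higher PSNR means a smaller pixel-wise error, but it does not necessarily track perceived quality, which is why perceptual metrics such as LPIPS (where lower is better) are reported alongside it.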

For the image fusion method using sparse representation, the adaptive dictionary and fusion rule have a great influence on the multi-modality image fusion, and the maximum $L_{1}$ norm fusion rule may cause gray inconsistency in the fusion result. In order to solve this problem, we proposed an improved multi-modality image fusion method by combining the joint patch clustering-based adaptive dictionary and sparse representation in this study. First, we used a Gaussian filter to separate the high- and low-frequency information. Second, we adopted the local energy-weighted strategy to complete the low-frequency fusion. Third, we used the joint patch clustering algorithm to reconstruct an over-complete adaptive learning dictionary, designed a hybrid fusion rule depending on the similarity of multi-norm of sparse representation coefficients, and completed the high-frequency fusion. Last, we obtained the fusion result by transforming the frequency domain into the spatial domain. We adopted the fusion metrics to evaluate the fusion results quantitatively and proved the superiority of the proposed method by comparing the state-of-the-art image fusion methods. The results showed that this method has the highest fusion metrics in average gradient, general image quality, and edge preservation. The results also showed that this method has the best performance in subjective vision. We demonstrated that this method has strong robustness by analyzing the parameter’s influence on the fusion result and consuming time. We extended this method to the infrared and visible image fusion and multi-focus image fusion perfectly. In summary, this method has the advantages of good robustness and wide application.

Recently, single image super-resolution methods with deep and complex convolutional neural network structures have achieved remarkable performance. However, those approaches improve performance at the cost of higher memory occupation, which makes them difficult to apply to some resource-constrained devices. With the goal of minimizing parameters, an effective and efficient operator named involution is introduced in our proposed model, delivering enhanced performance at reduced cost compared to convolution-based counterparts. On the basis of involution, we propose two building blocks named RMFDB (Residual Mixed Feature Distillation Block) and CICB (Conv-Invo-Conv Block) for the main module and the reconstruction module, respectively. RMFDB has a structure similar to that of the RFDB but with our involution layers. This block is much more lightweight and efficient than conventional convolution-based blocks. CICB combines nearest-neighbor upsampling, convolution and involution layers. The final reconstruction quality is improved with little parameter cost. Experimental results demonstrate the effectiveness of the proposed model against the state-of-the-art (SOTA) SR methods. Our final model achieves performance similar to that of the lightweight networks RFDN and PAN, but with only 224K parameters and 64.2G Multi-Adds at a scale factor of 2. The effectiveness of each proposed component is also validated by an ablation study.

Semantic segmentation is a structured prediction problem that heavily relies on expensive annotated image data to train supervised models. Unsupervised domain adaptation has been successful in leveraging synthetic (source) images to build models that generalize well to real (target) image data without annotations. However, previous methods mainly utilize source ground truth for segmentation loss and do not fully utilize them for learning segmentation output structures to guide the target domain. In this work, we exploit similar output structures across domains in order to better segment the target images. Toward this end, we devise an adversarial structured prediction by utilizing a regularizer. This regularizer outputs structured predictions on provided image features. Using an adversarial training setup, we make the structured predictions follow the spatial layout learned from the source ground truth. As a result, even without an explicit alignment between source and target features, our proposed method can adapt well from a source to a target domain. We evaluate our method on different challenging synthetic-2-real benchmarks and validate the effectiveness of the proposed method when compared with the state of the arts.

Recent advances in deep learning have led to tremendous achievements in computer vision applications. Specifically for the tasks of automated human age estimation and nudity detection, modern machine learning can predict whether or not an image contains nudity or the presence of a minor with startling accuracy. Fusing separate models makes it possible to identify instances of child pornography without ever coming into contact with the illicit material during model training. In this paper, a novel framework for automatically identifying Sexually Exploitative Imagery of Children (SEIC) is introduced. It is a synthesis of models for apparent human age estimation and nudity detection. The performance of this approach is thoroughly evaluated on several widely used age estimation and nudity detection datasets. Additionally, preliminary tests were conducted with the help of a local law enforcement agency on a private dataset of SEIC taken from real-world cases, with up to 97% accuracy of SEIC video classification.

While domain adaptation has been used to improve the performance of object detectors when the training and test data follow different distributions, previous work has mostly focused on two-stage detectors. This is because their use of region proposals makes it possible to perform local adaptation, which has been shown to significantly improve the adaptation effectiveness. Here, by contrast, we target single-stage architectures, which are better suited to resource-constrained detection than two-stage ones but do not provide region proposals. To nonetheless benefit from the strength of local adaptation, we introduce an attention mechanism that lets us identify the important regions on which adaptation should focus. Our method gradually adapts the features from global, image level to local, instance level. Our approach is generic and can be integrated into any Single-Shot Detector. We demonstrate this on standard benchmark datasets by applying it to both the single-shot detector (SSD) and a recent variant of the You Only Look Once detector (YOLOv5). Furthermore, for equivalent single-stage architectures, our method outperforms the state-of-the-art domain adaptation techniques even though they were designed for specific detectors.

Salient object detection (SOD) aims at highlighting important foreground objects automatically from the background. Most existing SOD methods only employ visible images (RGB images) for salient detection, which limits the performance of real-life applications when encountering challenging scenarios such as low illumination, haze, and smog. In this paper, we take advantage of the RGB and thermal images and propose an Enhancement and Aggregation–Feedback Network (EAF-Net) for SOD. Specifically, to achieve effective complementation between modalities and prevent the interference from noises, we first treat RGB and thermal images equally in the Feature Enhancement Block (FEB), and further, the Global Context Module expands receptive field to obtain the global features and the Top-Feature Enhancement Module suppresses the redundant information that may destroy the original features from the top layer. Subsequently, we embed several Cross Feature Aggregation Modules (CFAMs) into the Aggregation-and-Feedback Decoder to fuse different level features and compensation features for further obtaining comprehensive feature expression. Moreover, a feedback mechanism is adopted to propagate these fused features back into previous layers for refinement and generate saliency maps to decode features in a progressive way. Comprehensive experiments on RGB-T datasets demonstrate that EAF-Net achieves outstanding performance against the state-of-the-art models.

Existing edge detection methods are based on fixed logic and are not intelligent enough to distinguish useful edges from useless or noise edges. Recent ellipse detection work has produced excellent algorithms that can still detect ellipses when a large number of noise edges exist. However, these algorithms involve compromises that sacrifice some precision and recall. This paper proposes a deep learning model that can intelligently distinguish useful edges from useless edges, so that high-quality edge maps with low noise can be obtained. An arc-growing-based ellipse detection method is also proposed to take full advantage of the high-quality edge maps. Experiments are performed to reveal the mechanism of the deep learning model and to verify the performance of the proposed method. The experimental results demonstrate that the proposed method performs far better than the state of the art in terms of precision, recall and the F-measure on industrial images and performs slightly better on natural images.

Object detection based on deep learning has developed enormously in recent years. However, applying detectors trained on a label-rich domain to an unseen domain results in a performance drop due to the domain shift. To deal with this problem, we propose a novel unsupervised domain adaptation method to adapt from a labeled source domain to an unlabeled target domain. Recent approaches based on adversarial learning show some effect in aligning the feature distributions of different domains, but the decision boundary would be strongly source-biased for the complex detection task when merely training with source labels and aligning the entire feature distribution. In this paper, we suggest utilizing image translation to generate translated images of the source and target domains to fill in the large domain gap and facilitate a paired adaptation. We propose a hierarchical contrastive adaptation method between the original and translated domains to encourage the detectors to learn domain-invariant but discriminative features. To attach importance to foreground instances and tackle the noise of translated images, we further propose foreground attention reweighting for instance-aware adaptation. Experiments are carried out on 3 cross-domain detection scenarios, and we achieve state-of-the-art results against other approaches, showing the effectiveness of our proposed method.

Pedestrian detection is a critical problem in many areas, such as smart cities, surveillance, monitoring, autonomous driving, and robotics. AI-based methods have made tremendous progress in the field in the last few years, but good performance is limited to data that match the training datasets. We present a multi-camera 3D pedestrian detection method that does not need to be trained using data from the target scene. The core idea of our approach consists in formulating consistency in multiple views as a graph clique cover problem. We estimate pedestrian ground location on the image plane using a novel method based on human body poses and person’s bounding boxes from an off-the-shelf monocular detector. We then project these locations onto the ground plane and fuse them with a new formulation of a clique cover problem from graph theory. We propose a new vertex ordering strategy to define fusion priority based on both detection distance and vertex degree. We also propose an optional step for exploiting pedestrian appearance during fusion by using a domain-generalizable person re-identification model. Finally, we compute the final 3D ground coordinates of each detected pedestrian with a method based on keypoint triangulation. We evaluated the proposed approach on the challenging WILDTRACK and MultiviewX datasets. Our proposed method significantly outperformed state of the art in terms of generalizability. It obtained a MODA that was approximately 15% and 2% better than the best existing generalizable detection technique on WILDTRACK and MultiviewX, respectively.
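
The multi-view fusion idea above, treating consistent detections as cliques in a graph, can be sketched with a simple greedy clique cover. Everything specific here is an assumption made for illustration: detections are `(camera_id, x, y)` tuples already projected onto the ground plane, two detections are connected when they come from different cameras and lie within `max_dist` of each other, and vertices are ordered by degree only. The paper's actual ordering also uses detection distance, and its optional re-identification step is not shown.

```python
import math
from typing import List, Set, Tuple

Detection = Tuple[int, float, float]  # (camera id, ground x, ground y)


def build_graph(dets: List[Detection], max_dist: float) -> List[Set[int]]:
    """Connect detections from different cameras that are close on the ground."""
    adj: List[Set[int]] = [set() for _ in dets]
    for i in range(len(dets)):
        for j in range(i + 1, len(dets)):
            ci, xi, yi = dets[i]
            cj, xj, yj = dets[j]
            if ci != cj and math.hypot(xi - xj, yi - yj) <= max_dist:
                adj[i].add(j)
                adj[j].add(i)
    return adj


def greedy_clique_cover(dets: List[Detection], max_dist: float) -> List[List[int]]:
    """Partition detections into cliques; each clique is one fused pedestrian."""
    adj = build_graph(dets, max_dist)
    # Assumed ordering: highest degree first, index as tie-breaker.
    order = sorted(range(len(dets)), key=lambda v: (-len(adj[v]), v))
    assigned: Set[int] = set()
    cliques: List[List[int]] = []
    for v in order:
        if v in assigned:
            continue
        clique = [v]
        for u in order:
            if u == v or u in assigned:
                continue
            # u joins only if it is adjacent to every current member.
            if all(u in adj[m] for m in clique):
                clique.append(u)
        assigned.update(clique)
        cliques.append(sorted(clique))
    return cliques


if __name__ == "__main__":
    detections = [(0, 0.0, 0.0), (1, 0.1, 0.0), (2, 0.0, 0.1),
                  (0, 5.0, 5.0), (1, 5.1, 5.0)]
    print(greedy_clique_cover(detections, max_dist=0.5))
```

Because edges only link different cameras, every clique contains at most one detection per camera, so each clique can be read as one pedestrian seen from several views.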

Conditional computation for deep neural networks reduces overall computational load and improves model accuracy by running a subset of the network. In this work, we present a runtime dynamically throttleable neural network (DTNN) that can self-regulate its own performance target and computing resources by dynamically activating neurons in response to a single control signal, called utilization. We describe a generic formulation of throttleable neural networks (TNNs) by grouping and gating partial neural modules with various gating strategies. To directly optimize arbitrary application-level performance metrics and model complexity, a controller network is trained separately to predict a context-aware utilization via deep contextual bandits. Extensive experiments and comparisons on image classification and object detection tasks show that TNNs can be effectively throttled across a wide range of utilization settings, while having peak accuracy and lower cost that are comparable to corresponding vanilla architectures such as VGG, ResNet, ResNeXt, and DenseNet. We further demonstrate the effectiveness of the controller network on throttleable 3D convolutional networks (C3D) for video-based hand gesture recognition, which outperforms the vanilla C3D and all fixed utilization settings.

This paper describes the development of a Lévy flights-based ViBe algorithm for foreground detection. It is based on a novel approach, using a particular class of the generalized random walk known as Lévy flights, to improve the spatial update mechanism. This mechanism originally used the uniform probability distribution, and it is responsible for handling the new objects that appear in the scene. The proposed approach speeds up the inclusion process of ghost regions in the background model and makes it faster than the inclusion of real static foreground objects while maintaining the classification performance. The developed detection algorithm was evaluated using inclusion speed and classification tests, the results showing the efficacy of using Lévy flights with ViBe’s updating mechanism. Experimental tests were also undertaken on the proposed algorithm to validate its ability with real images, obtained through a series of experiments performed using a multi-spectral Kinect laser imaging sensor, and also from a public dataset. The experimental results show the high adaptation capability of this algorithm against the background modification and validate its ability to deal with multi-spectral real images. The developed algorithm achieved a better performance in comparison with traditional ViBe algorithms when extracting background and detecting foreground objects.
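
To make the difference from a uniform neighbour choice concrete, the sketch below draws heavy-tailed pixel offsets with Mantegna's algorithm, a common way to approximate Lévy-stable step lengths. This technique is swapped in for illustration: the abstract does not state how the Lévy steps are generated, and the stability index `beta = 1.5`, the rounding to pixel offsets and the clamping to the image border are all assumptions.

```python
import math
import random
from typing import Tuple


def mantegna_sigma(beta: float) -> float:
    """Scale of the numerator Gaussian in Mantegna's algorithm."""
    num = math.gamma(1.0 + beta) * math.sin(math.pi * beta / 2.0)
    den = math.gamma((1.0 + beta) / 2.0) * beta * 2.0 ** ((beta - 1.0) / 2.0)
    return (num / den) ** (1.0 / beta)


def levy_step(rng: random.Random, beta: float = 1.5) -> float:
    """One signed, heavy-tailed step: u / |v|^(1/beta)."""
    u = rng.gauss(0.0, mantegna_sigma(beta))
    v = rng.gauss(0.0, 1.0)
    return u / max(abs(v), 1e-12) ** (1.0 / beta)  # guard against v == 0


def levy_neighbour(x: int, y: int, width: int, height: int,
                   rng: random.Random, beta: float = 1.5) -> Tuple[int, int]:
    """Pick the pixel whose background model gets updated (hypothetical rule)."""
    dx = round(levy_step(rng, beta))
    dy = round(levy_step(rng, beta))
    nx = min(max(x + dx, 0), width - 1)   # clamp to image bounds
    ny = min(max(y + dy, 0), height - 1)
    return nx, ny


if __name__ == "__main__":
    rng = random.Random(0)
    picks = [levy_neighbour(50, 50, 100, 100, rng) for _ in range(10)]
    print(picks)
```

Most draws stay close to the source pixel, but occasional long jumps let background samples spread further in a single update, which is one plausible intuition for why ghost regions could be absorbed into the background model faster than with a uniform choice among adjacent pixels.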

We present a semi-supervised method for panoptic segmentation based on ConsInstancy regularisation, a novel strategy for semi-supervised learning. It leverages completely unlabelled data by enforcing consistency between predicted instance representations and semantic segmentations during training in order to improve the segmentation performance. To this end, we also propose new types of instance representations that can be predicted by one simple forward path through a fully convolutional network (FCN), delivering a convenient and simple-to-train framework for panoptic segmentation. More specifically, we propose the prediction of a three-dimensional instance orientation map as intermediate representation and two complementary distance transform maps as final representation, providing unique instance representations for a panoptic segmentation. We test our method on two challenging data sets, one of hardened and one of fresh concrete, the latter being introduced in this paper. The results demonstrate the effectiveness of our approach, which outperforms state-of-the-art methods for semi-supervised segmentation. In particular, we are able to show that by leveraging completely unlabelled data in our semi-supervised approach the achieved overall accuracy (OA) is increased by up to 5% compared to an entirely supervised training using only labelled data. Furthermore, we exceed the OA achieved by state-of-the-art semi-supervised methods by up to 1.5%.

To support the ongoing size reduction in integrated circuits, the need for accurate depth measurements of on-chip structures becomes increasingly important. Unfortunately, present metrology tools do not offer a practical solution. In the semiconductor industry, critical dimension scanning electron microscopes (CD-SEMs) are predominantly used for 2D imaging at a local scale. The main objective of this work is to investigate whether sufficient 3D information is present in a single SEM image for accurate surface reconstruction of the device topology. In this work, we present a method that is able to produce depth maps from synthetic and experimental SEM images. We demonstrate that the proposed neural network architecture, together with a tailored training procedure, leads to accurate depth predictions. The training procedure includes a weakly supervised domain adaptation step, which is further referred to as pixel-wise fine-tuning. This step employs scatterometry data to address the ground-truth scarcity problem. We have tested this method first on a synthetic contact hole dataset, where a mean relative error smaller than 6.2% is achieved at realistic noise levels. Additionally, it is shown that this method is well suited for other important semiconductor metrics, such as top critical dimension (CD), bottom CD and sidewall angle. To the extent of our knowledge, we are the first to achieve accurate depth estimation results on real experimental data, by combining data from SEM and scatterometry measurements. An experiment on a dense line space dataset yields a mean relative error smaller than 1%.
