# EURASIP Journal on Image and Video Processing

Online ISSN: 1687-5281
Print ISSN: 1687-5176
Recent publications
Storage of digital data is becoming challenging for humanity due to the relatively short life-span of storage devices. Furthermore, the exponential increase in the generation of digital data is creating the need for constantly constructing new resources to handle the storage of this data volume. Recent studies suggest the use of the DNA molecule as a promising novel candidate which can hold 500 Gbyte/mm3\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$^3$$\end{document} (1000 times more than HDD drives). Any digital information can be synthesized into DNA in vitro and stored in special tiny storage capsules that can promise reliability for hundreds of years. The stored DNA sequence can be retrieved whenever needed using special machines that are called sequencers. This whole process is very challenging, as the process of DNA synthesis is expensive in terms of money and sequencing is prone to errors. However, studies have shown that when respecting several rules in the encoding, the probability of sequencing error is reduced. Consequently, the encoding of digital information is not trivial, and the input data need to be efficiently compressed before encoding so that the high synthesis cost is reduced. In this paper, we present a survey on the storage of digital data in synthetic DNA, explaining the problem which is tackled by this novel field of study, present the main processes included in the storage workflow as well as the history of different studies and the most well-known algorithms that have been proposed in the bibliography on DNA data storage.

Digital raw images obtained from the data set of various organizations require authentication, copyright protection, and security with simple processing. New Euclidean space point’s algorithm is proposed to authenticate the images by embedding binary logos in the digital images in the spatial domain. Diffie–Hellman key exchange protocol is implemented along with the Euclidean space axioms to maintain security for the proposed work. The proposed watermarking methodology is tested on the standard set of raw grayscale and RGB color images. The watermarked images are sent in the email, WhatsApp, and Facebook and analyzed. Standard watermarking attacks are also applied to the watermarked images and analyzed. The finding shows that there are no image distortions in the communication medium of email and WhatsApp. But in the Facebook platform, raw images experience compression and observed exponential noise on the digital images. The authentication and copyright protection are tested from the processed Facebook images. It is found that the embedded logo could be recovered and seen with added noise distortions. So the proposed method offers authentication and security with compression attacks. Similarly, it is found that the proposed methodology is robust to JPEG compression, image tampering attacks like collage attack, image cropping, rotation, salt-and-pepper noise, sharpening filter, semi-robust to Gaussian filtering, and image resizing, and fragile to other geometrical attacks. The receiver operating characteristics (ROC) curve is drawn and found that the area under the curve is approximately equal to unity and restoration accuracy of [67 to 100]% for various attacks.

Bone age assessment (BAA) evaluates individual skeletal maturity by comparing the characteristics of skeletal development to the standard in a specific population. The X-ray image examination for bone age is tedious and subjective, and it requires high professional skills. Therefore, AI techniques are desired to innovate and improve BAA methods. Most of the BAA method use the whole X-ray image in an end-to-end model directly. Such whole-image-based approaches fail to characterize local changes and provide limited aid for diagnosis and understanding disease progress. To address these issues, we collected and curated a dataset of 2129 cases for the study of BAA with fine-grained skeletal maturity level labels of the 13 ROIs in hand bone based on the expert knowledge from TW method. We designed a four-stage automatic BAA model based on recursive feature pyramid network. Firstly, the palm region was segmented using U-Net, followed by the extraction of multi-target ROIs of hand bone using a recursive feature pyramid network. Given the extracted ROIs, we employed a transfer learning model with attention mechanism to predict the skeletal maturity level of each ROI. Finally, the bone age is assessed based on the percentile curve of bone maturity. The proposed BAA model can automate the BAA. In addition, it provides the detection result of the 13 ROIs and their ROI-level skeletal maturity. The MAE can reach 0.61 years on the dataset with the labeling precision of one year. All the data and annotations used in this paper are released publicly.

Palpation localization is essential for detecting physiological parameters of the radial artery for pulse diagnosis of Traditional Chinese Medicine (TCM). Detecting signal or applying pressure at the wrong location can seriously affect the measurement of pulse waves and result in misdiagnosis. In this paper, we propose an effective and high accuracy regression model using 3-dimensional convolution neural networks (CNN) processing near-infrared picture sequences to locate radial artery upon radius at the wrist. Comparing with early studies using 2-dimensional models, 3Dcnn introduces temporal features with the third dimension to leverage pulsation rhythms, and had achieved superior performance accuracy as 0.87 within 50 pixels at testing resolution of 1024 × 544. Model visualization shows that the additional dimension of the temporal convolution highlights dynamic changes within image sequences. This study presents the great potential of our constructed model to be applied in real wrist palpation location scenarios to bring the key convenience for pulse diagnosis.

This study proposes a novel network model for video action tube detection. This model is based on a location-interactive weakly supervised spatial–temporal attention mechanism driven by multiple loss functions. It is especially costly and time consuming to annotate every target location in video frames. Thus, we first propose a cross-domain weakly supervised learning method with a spatial–temporal attention mechanism for action tube detection. In source domain, we trained a newly designed multi-loss spatial–temporal attention–convolution network on the source data set, which has both object location and classification annotations. In target domain, we introduced internal tracking loss and neighbor-consistency loss; we trained the network with the pre-trained model on the target data set, which only has inaccurate action temporal positions. Although this is a location-unsupervised method, its performance outperforms typical weakly supervised methods, and even shows comparable results with some recent fully supervised methods. We also visualize the activation maps, which reveal the intrinsic reason behind the higher performance of the proposed method.

In recent years, deep learning, especially deep convolutional neural networks (DCNN), has made great progress. Many researchers use different DCNN models to detect remote sensing targets. Different DCNN models have different advantages and disadvantages. In this paper, we use YoloV4 as the detector to “fine-tune” various mainstream deep convolutional neural networks on two large public remote sensing data sets−LEVIR data set and DOTA data set to compare the advantages of various networks. This paper analyzes the reasons why the effect of “fine-tuning” convolutional neural networks is sometimes not good, and points out the difficulties of object detection in optical remote sensing images. To improve the detection accuracy of optical remote sensing targets, in addition to “fine-tuning” convolutional neural network, we also provide a variety of adaptive multi-scale feature fusion methods to improve the detection accuracy. In addition, for the large number of parameters generated by deep convolutional neural network, we provide a method to save storage space.

Accurate and automated identification of the deceased victims with dental radiographs plays a significant role in forensic dentistry. The image processing techniques such as segmentation and feature extraction play a crucial role in image retrieval in accordance with the matching image. The raw image undergoes segmentation, feature extraction and distance-based image retrieval. The ultimate goal of the proposed work is the automated quality enhancement of the image by providing advanced enhancement techniques, segmentation techniques, feature extraction, and matching techniques. In this paper, multi-orientation local ternary pattern-based feature extraction is proposed for feature extraction. The grey level difference method (GLDM) is adopted to extract the texture and shape features that are considered for better results. The image retrieval is done by the computation of similarity score using distances such as Manhattan, Euclidean, vector cosine angle, and histogram intersection distance to obtain the optimal match from the database. The manually picked dataset of 200 images is considered for performance analysis. By extracting both the shape features and texture features, the proposed approach achieved maximum accuracy, precision, recall, F-measure, sensitivity, and specificity and lower false-positive and negative values.

Recently, inspired by the growing power of deep convolutional neural networks (CNNs) and generative adversarial networks (GANs), facial image editing has received increasing attention and has produced a series of wide-ranging applications. In this paper, we propose a new and effective approach to a challenging task: synthesizing face images based on key facial parts. The proposed approach is a novel deep generative network that can automatically align facial parts with the precise positions in a face image and then output an entire facial image conditioned on the well-aligned parts. Specifically, three loss functions are introduced in this approach, which are the key to making the synthesized realistic facial image: a reconstruction loss to generate image content in an unknown region, a perceptual loss to enhance the network's ability to model high-level semantic structures and an adversarial loss to ensure that the synthesized images are visually realistic. In this approach, the three components cooperate well to form an effective framework for parts-based high-quality facial image synthesis. Finally, extensive experiments demonstrate the superior performance of this method to existing solutions.

With the rapid development of 3D coding and display technologies, numerous applications are emerging to target human immersive entertainments. To achieve a prime 3D visual experience, high accuracy depth maps play a crucial role. However, depth maps retrieved from most devices still suffer inaccuracies at object boundaries. Therefore, a depth enhancement system is usually needed to correct the error. Recent developments by applying deep learning to deep enhancement have shown their promising improvement. In this paper, we propose a deep depth enhancement network system that effectively corrects the inaccurate depth using color images as a guide. The proposed network contains both depth and image branches, where we combine a new set of features from the image branch with those from the depth branch. Experimental results show that the proposed system achieves a better depth correction performance than state of the art advanced networks. The ablation study reveals that the proposed loss functions in use of image information can enhance depth map accuracy effectively.

Bosniak renal cyst classification has been widely used in determining the complexity of a renal cyst. However, it turns out that about half of patients undergoing surgery for Bosniak category III, take surgical risks that reward them with no clinical benefit at all. This is because their pathological results reveal that the cysts are actually benign not malignant. This problem inspires us to use recently popular deep learning techniques and study alternative analytics methods for precise binary classification (benign or malignant tumor) on Computerized Tomography (CT) images. To achieve our goal, two consecutive steps are required–segmenting kidney organs or lesions from CT images then classifying the segmented kidneys. In this paper, we propose a study of kidney segmentation using 2.5D ResUNet and 2.5D DenseUNet for efficiently extracting intra-slice and inter-slice features. Our models are trained and validated on the public data set from Kidney Tumor Segmentation (KiTS19) challenge in two different training environments. As a result, all experimental models achieve high mean kidney Dice scores of at least 95% on the KiTS19 validation set consisting of 60 patients. Apart from the KiTS19 data set, we also conduct separate experiments on abdomen CT images of four Thai patients. Based on the four Thai patients, our experimental models show a drop in performance, where the best mean kidney Dice score is 87.60%.

Despite the impressive performance of correlation filter-based trackers in terms of robustness and accuracy, the trackers have room for improvement. The majority of existing trackers use a single feature or fixed fusion weights, which makes it possible for tracking to fail in the case of deformation or severe occlusion. In this paper, we propose a multi-feature response map adaptive fusion strategy based on the consistency of individual features and fused feature. It is able to improve the robustness and accuracy by building the better object appearance model. Moreover, since the response map has multiple local peaks when the target is occluded, we propose an anti-occlusion mechanism. Specifically, if the nonmaximal local peak is satisfied with our proposed conditions, we generate a new response map which is obtained by moving the center of the region of interest to the nonmaximal local peak position of the response map and re-extracting features. We then select the response map with the largest response value as the final response map. This proposed anti-occlusion mechanism can effectively cope with the problem of tracking failure caused by occlusion. Finally, by adjusting the learning rate in different scenes, we designed a high-confidence model update strategy to deal with the problem of model pollution. Besides, we conducted experiments on OTB2013, OTB2015, TC128 and UAV123 datasets and compared them with the current state-of-the-art algorithms, and the proposed algorithms have impressive advantages in terms of accuracy and robustness.

The paper presents a novel approach for designing the CNN structure of improved generalization capability in the presence of a small population of learning data. Unlike the classical methods for building CNN, we propose to introduce some randomness in the choice of layers with a different type of nonlinear activation function. The image processing in these layers is performed using either the ReLU or the softplus function. This choice is random. The randomness introduced in the network structure can be interpreted as a special form of regularization. Experiments performed on the detection of images belonging to either melanoma or non-melanoma cases have shown a significant improvement in average quality measures such as accuracy, sensitivity, precision, and area under the ROC curve.

Examining the authenticity of images has become increasingly important as manipulation tools become more accessible and advanced. Recent work has shown that while CNN-based image manipulation detectors can successfully identify manipulations, they are also vulnerable to adversarial attacks, ranging from simple double JPEG compression to advanced pixel-based perturbation. In this paper we explore another method of highly plausible attack: printing and scanning. We demonstrate the vulnerability of two state-of-the-art models to this type of attack. We also propose a new machine learning model that performs comparably to these state-of-the-art models when trained and validated on printed and scanned images. Of the three models, our proposed model outperforms the others when trained and validated on images from a single printer. To facilitate this exploration, we create a data set of over 6000 printed and scanned image blocks. Further analysis suggests that variation between images produced from different printers is significant, large enough that good validation accuracy on images from one printer does not imply similar validation accuracy on identical images from a different printer.

With the growing demand for image and video-based applications, the requirements of consistent quality assessment metrics of image and video have increased. Different approaches have been proposed in the literature to estimate the perceptual quality of images and videos. These approaches can be divided into three main categories; full reference (FR), reduced reference (RR) and no-reference (NR). In RR methods, instead of providing the original image or video as a reference, we need to provide certain features (i.e., texture, edges, etc.) of the original image or video for quality assessment. During the last decade, RR-based quality assessment has been a popular research area for a variety of applications such as social media, online games, and video streaming. In this paper, we present review and classification of the latest research work on RR-based image and video quality assessment. We have also summarized different databases used in the field of 2D and 3D image and video quality assessment. This paper would be helpful for specialists and researchers to stay well-informed about recent progress of RR-based image and video quality assessment. The review and classification presented in this paper will also be useful to gain understanding of multimedia quality assessment and state-of-the-art approaches used for the analysis. In addition, it will help the reader select appropriate quality assessment methods and parameters for their respective applications.

Perceptual video hashing represents video perceptual content by compact hash. The binary hash is sensitive to content distortion manipulations, but robust to perceptual content preserving operations. Currently, boundary between sensitivity and robustness is often ambiguous and it is decided by an empirically defined threshold. This may result in large false positive rates when received video is to be judged similar or dissimilar in some circumstances, e.g., video content authentication. In this paper, we propose a novel perceptual hashing method for video content authentication based on maximized robustness. The developed idea of maximized robustness means that robustness is maximized on condition that security requirement of hash is first met. We formulate the video hashing as a constrained optimization problem, in which coefficients of features offset and robustness are to be learned. Then we adopt a stochastic optimization method to solve the optimization. Experimental results show that the proposed hashing is quite suitable for video content authentication in terms of security and robustness.

Conventional surveillance systems for preventing accidents and incidents do not identify 95% thereof after 22 min when one person monitors a plurality of closed circuit televisions (CCTV). To address this issue, while computer-based intelligent video surveillance systems have been studied to notify users of abnormal situations when they happen, it is not commonly used in real environment because of weakness of personal information leaks and high power consumption. To address this issue, intelligent video surveillance systems based on small devices have been studied. This paper suggests implement an intelligent video surveillance system based on embedded modules for intruder detection based on information learning, fire detection based on color and motion information, and loitering and fall detection based on human body motion. Moreover, an algorithm and an embedded module optimization method are applied for real-time processing. The implemented algorithm showed performance of 88.51% for intruder detection, 92.63% for fire detection, 80% for loitering detection and 93.54% for fall detection. The result of comparison before and after optimization about the algorithm processing time showed 50.53% of decrease, implying potential real-time driving of the intelligent image monitoring system based on embedded modules.

Accurate segmentation and classification of pulmonary nodules are of great significance to early detection and diagnosis of lung diseases, which can reduce the risk of developing lung cancer and improve patient survival rate. In this paper, we propose an effective network for pulmonary nodule segmentation and classification at one time based on adversarial training scheme. The segmentation network consists of a High-Resolution network with Multi-scale Progressive Fusion (HR-MPF) and a proposed Progressive Decoding Module (PDM) recovering final pixel-wise prediction results. Specifically, the proposed HR-MPF firstly incorporates boosted module to High-Resolution Network (HRNet) in a progressive feature fusion manner. In this case, feature communication is augmented among all levels in this high-resolution network. Then, downstream classification module would identify benign and malignant pulmonary nodules based on feature map from PDM. In the adversarial training scheme, a discriminator is set to optimize HR-MPF and PDM through back propagation. Meanwhile, a reasonably designed multi-task loss function optimizes performance of segmentation and classification overall. To improve the accuracy of boundary prediction crucial to nodule segmentation, a boundary consistency constraint is designed and incorporated in the segmentation loss function. Experiments on publicly available LUNA16 dataset show that the framework outperforms relevant advanced methods in quantitative evaluation and visual perception.

To accurately identify fatigued driving, establishing a monitoring system is one of the important guarantees of improving traffic safety and reducing traffic accidents. Among many research methods, electrooculogram signal (EOG) has unique advantages. This paper presents a systematic literature review of these technologies and summarizes a basic framework of fatigue driving monitoring system based on EOGs. Then we summarize the advantages and disadvantages of existing technologies. In addition, 80 primary references published during the last decade were identified. The multi-feature fusion technique based on EOGs performs better than other traditional methods due to its low cost, low power consumption and low intrusion, while its application is still limited which needs more efforts to obtain good and generalizable results. And then, an overview of the literature on technology is given, revealing a premier and unbiased survey of the existing empirical research of classification techniques that have been applied to fatigue driving analysis. Finally, this paper adds value to the current literature by investigating the application of EOG signals in fatigued driving and the design of related systems, future guidelines have been provided to practitioners and researchers to grasp the major contributions and challenges in the state-of-the-art research.

Out-of-home audience measurement aims to count and characterize the people exposed to advertising content in the physical world. While audience measurement solutions based on computer vision are of increasing interest, no commonly accepted benchmark exists to evaluate and compare their performance. In this paper, we propose the first benchmark for digital out-of-home audience measurement that evaluates the vision-based tasks of audience localization and counting, and audience demographics. The benchmark is composed of a novel, dataset captured at multiple locations and a set of performance measures. Using the benchmark, we present an in-depth comparison of eight open-source algorithms on four hardware platforms with GPU and CPU-optimized inferences and of two commercial off-the-shelf solutions for localization, count, age, and gender estimation. This benchmark and related open-source codes are available at http://ava.eecs.qmul.ac.uk .

Quick Response (QR) codes are designed for information storage and high-speed reading applications. To store additional information, Two-Level QR (2LQR) codes replace black modules in standard QR codes with specific texture patterns. When the 2LQR code is printed, texture patterns are blurred and their sizes are smaller than $$0.5{\mathrm{cm}}^{2}$$ 0.5 cm 2 . Recognizing small-sized blurred texture patterns is challenging. In original 2LQR literature, recognition of texture patterns is based on maximizing the correlation between print-and-scanned texture patterns and the original digital ones. When employing desktop printers with large pixel extensions and low-resolution capture devices, the recognition accuracy of texture patterns greatly reduces. To improve the recognition accuracy under this situation, our work presents a dictionary learning based scheme to recognize printed texture patterns. To our best knowledge, it is the first attempt to use dictionary learning to promote the recognition accuracy of printed texture patterns. In our scheme, dictionaries for all kinds of texture patterns are learned from print-and-scanned texture modules in the training stage. And these learned dictionaries are employed to represent each texture module in the testing stage (extracting process) to recognize their texture pattern. Experimental results show that our proposed algorithm significantly reduces the recognition error of small-sized printed texture patterns.

Multiple-license plate recognition is gaining popularity in the Intelligent Transport System (ITS) applications for security monitoring and surveillance. Advancements in acquisition devices have increased the availability of high definition (HD) images, which can capture images of multiple vehicles. Since license plate (LP) occupies a relatively small portion of an image, therefore, detection of LP in an image is considered a challenging task. Moreover, the overall performance deteriorates when the aforementioned factor combines with varying illumination conditions, such as night, dusk, and rainy. As it is difficult to locate a small object in an entire image, this paper proposes a two-step approach for plate localization in challenging conditions. In the first step, the Faster-Region-based Convolutional Neural Network algorithm (Faster R-CNN) is used to detect all the vehicles in an image, which results in scaled information to locate plates. In the second step, morphological operations are employed to reduce non-plate regions. Meanwhile, geometric properties are used to localize plates in the HSI color space. This approach increases accuracy and reduces processing time. For character recognition, the look-up table (LUT) classifier using adaptive boosting with modified census transform (MCT) as a feature extractor is used. Both proposed plate detection and character recognition methods have significantly outperformed conventional approaches in terms of precision and recall for multiple plate recognition.

With regard to video streaming services under wireless networks, how to improve the quality of experience (QoE) has always been a challenging task. Especially after the arrival of the 5G era, more attention has been paid to analyze the experience quality of video streaming in more complex network scenarios (such as 5G-powered drone video transmission). Insufficient buffer in the video stream transmission process will cause the playback to freeze [1]. In order to cope with this defect, this paper proposes a buffer starvation evaluation model based on deep learning and a video stream scheduling model based on reinforcement learning. This approach uses the method of machine learning to extract the correlation between the buffer starvation probability distribution and the traffic load, thereby obtaining the explicit evaluation results of buffer starvation events and a series of resource allocation strategies that optimize long-term QoE. In order to deal with the noise problem caused by the random environment, the model introduces an internal reward mechanism in the scheduling process, so that the agent can fully explore the environment. Experiments have proved that our framework can effectively evaluate and improve the video service quality of 5G-powered UAV.

A stable enhanced superresolution generative adversarial network (SESRGAN) algorithm was proposed in this study to address the low-resolution and blurred texture details in ancient murals. This algorithm makes improvements on the basis of GANs, which use dense residual blocks to extract image features. After two upsampling steps, the feature information of the image is input into the high-resolution (HR) image space to realize an improvement in resolution, and the reconstructed HR image is finally generated. The discriminator network uses VGG as its basic framework to judge the authenticity of the input image. This study further optimized the details of the network model. In addition, three loss optimization models, i.e., the perceptual loss, content loss, and adversarial loss models, were integrated into the proposed algorithm. The Wasserstein GAN-gradient penalty (WGAN-GP) theory was used to optimize the adversarial loss of the model when calculating the perceptual loss and when using the preactivation feature information for calculation purposes. In addition, public data sets were used to pretrain the generative network model to achieve a high-quality initialization. The simulation experiment results showed that the proposed algorithm outperforms other related superresolution algorithms in terms of both objective and subjective evaluation indicators. A subjective perception evaluation was also conducted, and the reconstructed images produced by our algorithm were more in line with the general public’s visual perception than those produced by the other compared algorithms.

In this paper, we propose a pansharpening method based on a convolutional autoencoder. The convolutional autoencoder is a sort of convolutional neural network (CNN) and objective to scale down the input dimension and typify image features with high exactness. First, the autoencoder network is trained to reduce the difference between the degraded panchromatic image patches and reconstruction output original panchromatic image patches. The intensity component, which is developed by adaptive intensity-hue-saturation (AIHS), is then delivered into the trained convolutional autoencoder network to generate an enhanced intensity component of the multi-spectral image. The pansharpening is accomplished by improving the panchromatic image from the enhanced intensity component using a multi-scale guided filter; then, the semantic detail is injected into the upsampled multi-spectral image. Real and degraded datasets are utilized for the experiments, which exhibit that the proposed technique has the ability to preserve the high spatial details and high spectral characteristics simultaneously. Furthermore, experimental results demonstrated that the proposed study performs state-of-the-art results in terms of subjective and objective assessments on remote sensing data.

Robust vision-based hand pose estimation is highly sought but still remains a challenging task, due to its inherent difficulty partially caused by self-occlusion among hand fingers. In this paper, an innovative framework for real-time static hand gesture recognition is introduced, based on an optimized shape representation build from multiple shape cues. The framework incorporates a specific module for hand pose estimation based on depth map data, where the hand silhouette is first extracted from the extremely detailed and accurate depth map captured by a time-of-flight (ToF) depth sensor. A hybrid multi-modal descriptor that integrates multiple affine-invariant boundary-based and region-based features is created from the hand silhouette to obtain a reliable and representative description of individual gestures. Finally, an ensemble of one-vs.-all support vector machines (SVMs) is independently trained on each of these learned feature representations to perform gesture classification. When evaluated on a publicly available dataset incorporating a relatively large and diverse collection of egocentric hand gestures, the approach yields encouraging results that agree very favorably with those reported in the literature, while maintaining real-time operation.

In lately published video coding standard Versatile Video Coding (VVC/ H.266), the intra sub-partitions (ISP) coding mode is proposed. It is efficient for frames with rich texture, but less efficient for frames that are very flat or constant. In this paper, by comparing and analyzing the rate distortion cost (RD-cost) of coding unit (CU) with different texture features for using and not using ISP(No-ISP) coding mode, it is observed that CUs with simple texture can skip ISP coding mode. Based on this observation, a fast ISP coding mode optimization algorithm based on CU texture complexity is proposed, which aims to determine whether a CU needs to use ISP coding mode in advance by calculating CU texture complexity, so as to reduce the computation complexity of ISP. The experimental results show that under All Intra (AI) configuration, the coding time can be reduced by 7%, while the BD rate only increase by 0.09%.

In human anatomy, the central nervous system (CNS) acts as a significant processing hub. CNS is clinically divided into two major parts: the brain and the spinal cord. The spinal cord assists the overall communication network of the human anatomy through the brain. The mobility of body and the structure of the whole skeleton is also balanced with the help of the spinal bone, along with reflex control. According to the Global Burden of Disease 2010, worldwide, back pain issues are the leading cause of disability. The clinical specialists in the field estimate almost 80% of the population with experience of back issues. The segmentation of the vertebrae is considered a difficult procedure through imaging. The problem has been catered by different researchers using diverse hand-crafted features like Harris corner, template matching, active shape models, and Hough transform. Existing methods do not handle the illumination changes and shape-based variations. The low-contrast and unclear view of the vertebrae also makes it difficult to get good results. In recent times, convolutional nnural Network (CNN) has taken the research to the next level, producing high-accuracy results. Different architectures of CNN such as UNet, FCN, and ResNet have been used for segmentation and deformity analysis. The aim of this review article is to give a comprehensive overview of how different authors in different times have addressed these issues and proposed different mythologies for the localization and analysis of curvature deformity of the vertebrae in the spinal cord.

Recent years have witnessed a substantial increase in the deep learning (DL) architectures proposed for visual recognition tasks like person re-identification, where individuals must be recognized over multiple distributed cameras. Although these architectures have greatly improved the state-of-the-art accuracy, the computational complexity of the convolutional neural networks (CNNs) commonly used for feature extraction remains an issue, hindering their deployment on platforms with limited resources, or in applications with real-time constraints. There is an obvious advantage to accelerating and compressing DL models without significantly decreasing their accuracy. However, the source (pruning) domain differs from operational (target) domains, and the domain shift between image data captured with different non-overlapping camera viewpoints leads to lower recognition accuracy. In this paper, we investigate the prunability of these architectures under different design scenarios. This paper first revisits pruning techniques that are suitable for reducing the computational complexity of deep CNN networks applied to person re-identification. Then, these techniques are analyzed according to their pruning criteria and strategy and according to different scenarios for exploiting pruning methods to fine-tuning networks to target domains. Experimental results obtained using DL models with ResNet feature extractors, and multiple benchmarks re-identification datasets, indicate that pruning can considerably reduce network complexity while maintaining a high level of accuracy. In scenarios where pruning is performed with large pretraining or fine-tuning datasets, the number of FLOPS required by ResNet architectures is reduced by half, while maintaining a comparable rank-1 accuracy (within 1% of the original model). Pruning while training a larger CNNs can also provide a significantly better performance than fine-tuning smaller ones.

With the increasing application of computer vision technology in autonomous driving, robot, and other mobile devices, more and more attention has been paid to the implementation of target detection and tracking algorithms on embedded platforms. The real-time performance and robustness of algorithms are two hot research topics and challenges in this field. In order to solve the problems of poor real-time tracking performance of embedded systems using convolutional neural networks and low robustness of tracking algorithms for complex scenes, this paper proposes a fast and accurate real-time video detection and tracking algorithm suitable for embedded systems. The algorithm combines the object detection model of single-shot multibox detection in deep convolution networks and the kernel correlation filters tracking algorithm, what is more, it accelerates the single-shot multibox detection model using field-programmable gate arrays, which satisfies the real-time performance of the algorithm on the embedded platform. To solve the problem of model contamination after the kernel correlation filters algorithm fails to track in complex scenes, an improvement in the validity detection mechanism of tracking results is proposed that solves the problem of the traditional kernel correlation filters algorithm not being able to robustly track for a long time. In order to solve the problem that the missed rate of the single-shot multibox detection model is high under the conditions of motion blur or illumination variation, a strategy to reduce missed rate is proposed that effectively reduces the missed detection. The experimental results on the embedded platform show that the algorithm can achieve real-time tracking of the object in the video and can automatically reposition the object to continue tracking after the object tracking fails.

Single-frame image super-resolution (SISR) technology in remote sensing is improving fast from a performance point of view. Deep learning methods have been widely used in SISR to improve the details of rebuilt images and speed up network training. However, these supervised techniques usually tend to overfit quickly due to the models’ complexity and the lack of training data. In this paper, an Improved Deep Recursive Residual Network (IDRRN) super-resolution model is proposed to decrease the difficulty of network training. The deep recursive structure is configured to control the model parameter number while increasing the network depth. At the same time, the short-path recursive connections are used to alleviate the gradient disappearance and enhance the feature propagation. Comprehensive experiments show that IDRRN has a better improvement in both quantitation and visual perception.

Computer vision is an interdisciplinary domain for object detection. Object detection relay is a vital part in assisting surveillance, vehicle detection and pose estimation. In this work, we proposed a novel deep you only look once (deep YOLO V3) approach to detect the multi-object. This approach looks at the entire frame during the training and test phase. It followed a regression-based technique that used a probabilistic model to locate objects. In this, we construct 106 convolution layers followed by 2 fully connected layers and 812 × 812 × 3 input size to detect the drones with small size. We pre-train the convolution layers for classification at half the resolution and then double the resolution for detection. The number of filters of each layer will be set to 16. The number of filters of the last scale layer is more than 16 to improve the small object detection. This construction uses up-sampling techniques to improve undesired spectral images into the existing signal and rescaling the features in specific locations. It clearly reveals that the up-sampling detects small objects. It actually improves the sampling rate. This YOLO architecture is preferred because it considers less memory resource and computation cost rather than more number of filters. The proposed system is designed and trained to perform a single type of class called drone and the object detection and tracking is performed with the embedded system-based deep YOLO. The proposed YOLO approach predicts the multiple bounding boxes per grid cell with better accuracy. The proposed model has been trained with a large number of small drones with different conditions like open field, and marine environment with complex background.

Inspired by quaternion algebra and the idea of fractional-order transformation, we propose a new set of quaternion fractional-order generalized Laguerre orthogonal moments (QFr-GLMs) based on fractional-order generalized Laguerre polynomials. Firstly, the proposed QFr-GLMs are directly constructed in Cartesian coordinate space, avoiding the need for conversion between Cartesian and polar coordinates; therefore, they are better image descriptors than circularly orthogonal moments constructed in polar coordinates. Moreover, unlike the latest Zernike moments based on quaternion and fractional-order transformations, which extract only the global features from color images, our proposed QFr-GLMs can extract both the global and local color features. This paper also derives a new set of invariant color-image descriptors by QFr-GLMs, enabling geometric-invariant pattern recognition in color images. Finally, the performances of our proposed QFr-GLMs and moment invariants were evaluated in simulation experiments of correlated color images. Both theoretical analysis and experimental results demonstrate the value of the proposed QFr-GLMs and their geometric invariants in the representation and recognition of color images.

In this paper, based on the parametric model of the matrix of discrete cosine transform (DCT), and using an exhaustive search of the parameters’ space, we seek for the best approximations of 8-point DCT at the given computational complexities by taking into account three different scenarios of practical usage. The possible parameter values are selected in such a way that the resulting transforms are only multiplierless approximations, i.e., only additions and bit-shift operations are required. The considered usage scenarios include such cases where approximation of DCT is used: (i) at the data compression stage, (ii) at the decompression stage, and (iii) both at the compression and decompression stages. The obtained results in effectiveness of generated approximations are compared with results of popular known approximations of 8-point DCT of the same class (i.e., multiplierless approximations). In addition, we perform a series of experiments in lossy compression of natural images using popular JPEG standard. The obtained results are presented and discussed. It should be noted, that in the overwhelming number of cases the generated approximations are better than the known ones, e.g., in asymmetric scenarios even by more than 3 dB starting from entropy of 2 bits per pixel. In the last part of the paper, we investigate the possibility of hardware implementation of generated approximations in Field-Programmable Gate Array (FPGA) circuits. The results in the form of resource and energy consumption are presented and commented. The experiment outcomes confirm the assumption that the considered class of transformations is characterized by low resource utilization.

In this paper, we propose a hybrid super-resolution method by combining global and local dictionary training in the sparse domain. In order to present and differentiate the feature mapping in different scales, a global dictionary set is trained in multiple structure scales, and a non-linear function is used to choose the appropriate dictionary to initially reconstruct the HR image. In addition, we introduce the Gaussian blur to the LR images to eliminate a widely used but inappropriate assumption that the low resolution (LR) images are generated by bicubic interpolation from high-resolution (HR) images. In order to deal with Gaussian blur, a local dictionary is generated and iteratively updated by K -means principal component analysis (K-PCA) and gradient decent (GD) to model the blur effect during the down-sampling. Compared with the state-of-the-art SR algorithms, the experimental results reveal that the proposed method can produce sharper boundaries and suppress undesired artifacts with the present of Gaussian blur. It implies that our method could be more effect in real applications and that the HR-LR mapping relation is more complicated than bicubic interpolation.

Methods using salient facial patches (SFPs) play a significant role in research on facial expression recognition. However, most SFP methods use only frontal face images or videos for recognition, and they do not consider head position variations. We contend that SFP can be an effective approach for recognizing facial expressions under different head rotations. Accordingly, we propose an algorithm, called profile salient facial patches (PSFP), to achieve this objective. First, to detect facial landmarks and estimate head poses from profile face images, a tree-structured part model is used for pose-free landmark localization. Second, to obtain the salient facial patches from profile face images, the facial patches are selected using the detected facial landmarks while avoiding their overlap or the transcending of the actual face range. To analyze the PSFP recognition performance, three classical approaches for local feature extraction, specifically the histogram of oriented gradients (HOG), local binary pattern, and Gabor, were applied to extract profile facial expression features. Experimental results on the Radboud Faces Database show that PSFP with HOG features can achieve higher accuracies under most head rotations.

Binarization plays an important role in document analysis and recognition (DAR) systems. In this paper, we present our winning algorithm in ICFHR 2018 competition on handwritten document image binarization (H-DIBCO 2018), which is based on background estimation and energy minimization. First, we adopt mathematical morphological operations to estimate and compensate the document background. It uses a disk-shaped structuring element, whose radius is computed by the minimum entropy-based stroke width transform (SWT). Second, we perform Laplacian energy-based segmentation on the compensated document images. Finally, we implement post-processing to preserve text stroke connectivity and eliminate isolated noise. Experimental results indicate that the proposed method outperforms other state-of-the-art techniques on several public available benchmark datasets.

Agriculture plays a critical role in the economy of several countries, by providing the main sources of income, employment, and food to their rural population. However, in recent years, it has been observed that plants and fruits are widely damaged by different diseases which cause a huge loss to the farmers, although this loss can be minimized by detecting plants’ diseases at their earlier stages using pattern recognition (PR) and machine learning (ML) techniques. In this article, an automated system is proposed for the identification and recognition of fruit diseases. Our approach is distinctive in a way, it overcomes the challenges like convex edges, inconsistency between colors, irregularity, visibility, scale, and origin. The proposed approach incorporates five primary steps including preprocessing,Standard instruction requires city and country for affiliations. Hence, please check if the provided information for each affiliation with missing data is correct and amend if deemed necessary. disease identification through segmentation, feature extraction and fusion, feature selection, and classification. The infection regions are extracted using the proposed adaptive and quartile deviation-based segmentation approach and fused resultant binary images by employing the weighted coefficient of correlation (CoC). Then the most appropriate features are selected using a novel framework of entropy and rank-based correlation (EaRbC). Finally, selected features are classified using multi-class support vector machine (MC-SCM). A PlantVillage dataset is utilized for the evaluation of the proposed system to achieving an average segmentation and classification accuracy of 93.74% and 97.7%, respectively. From the set of statistical measure, we sincerely believe that our proposed method outperforms existing method with greater accuracy.

The autoimmune disorders such as rheumatoid, arthritis, and scleroderma are connective tissue diseases (CTD). Autoimmune diseases are generally diagnosed using the antinuclear antibody (ANA) blood test. This test uses indirect immune fluorescence (IIf) image analysis to detect the presence of liquid substance antibodies at intervals the blood, which is responsible for CTDs. Typically human alveolar epithelial cells type 2 (HEp2) are utilized as the substrate for the microscope slides. The various fluorescence antibody patterns on HEp-2 cells permits the differential designation-diagnosis. The segmentation of HEp-2 cells of IIf images is therefore a crucial step in the ANA test. However, not only this task is extremely challenging, but physicians also often have a considerable number of IIf images to examine.In this study, we propose a new methodology for HEp2 segmentation from IIf images by maximum modified quantum entropy. Besides, we have used a new criterion with a flexible representation of the quantum image(FRQI). The proposed methodology determines the optimum threshold based on the quantum entropy measure, by maximizing the measure of class separability for the obtained classes over all the gray levels. We tested the suggested algorithm over all images of the MIVIA HEp 2 image data set.To objectively assess the proposed methodology, segmentation accuracy (SA), Jaccard similarity (JS), the F1-measure,the Matthews correlation coefficient(MCC), and the peak signal-to-noise ratio (PSNR) were used to evaluate performance. We have compared the proposed methodology with quantum entropy, Kapur and Otsu algorithms, respectively.The results show that the proposed algorithm is better than quantum entropy and Kapur methods. In addition, it overcomes the limitations of the Otsu method concerning the images which has positive skew histogram.This study can contribute to create a computer-aided decision (CAD) framework for the diagnosis of immune system diseases

Depth is essential information for autonomous robotics applications that need environmental depth values. The depth could be acquired by finding the matching pixels between stereo image pairs. Depth information is an inference from a matching cost volume that is composed of the distances between the possible pixel points on the pre-aligned horizontal axis of stereo images. Most approaches use matching costs to identify matches between stereo images and obtain depth information. Recently, researchers have been using convolutional neural network-based solutions to handle this matching problem. In this paper, a novel method has been proposed for the refinement of matching costs by using recurrent neural networks. Our motivation is to enhance the depth values obtained from matching costs. For this purpose, to attain an enhanced disparity map by utilizing the sequential information of matching costs in the horizontal space, recurrent neural networks are used. Exploiting this sequential information, we aimed to determine the position of the correct matching point by using recurrent neural networks, as in the case of speech processing problems. We used existing stereo algorithms to obtain the initial matching costs and then improved the results by utilizing recurrent neural networks. The results are evaluated on the KITTI 2012 and KITTI 2015 datasets. The results show that the matching cost three-pixel error is decreased by an average of 14.5% in both datasets.

Perfect image compositing can harmonize the appearance between the foreground and background effectively so that the composite result looks seamless and natural. However, the traditional convolutional neural network (CNN)-based methods often fail to yield highly realistic composite results due to overdependence on scene parsing while ignoring the coherence of semantic and structural between foreground and background. In this paper, we propose a framework to solve this problem by training a stacked generative adversarial network with attention guidance, which can efficiently create a high-resolution, realistic-looking composite. To this end, we develop a diverse adversarial loss in addition to perceptual and guidance loss to train the proposed generative network. Moreover, we construct a multi-scenario dataset for high-resolution image compositing, which contains high-quality images with different styles and object masks. Experiments on the synthesized and real images demonstrate the efficiency and effectiveness of our network in producing seamless, natural, and realistic results. Ablation studies show that our proposed network can improve the visual performance of composite results compared with the application of existing methods.

Top-cited authors
• University of Oulu
• University of Campinas
• Idiap Research Institute
• University of Oulu
• Blekinge Institute of Technology