Article

Very Deep Convolutional Networks for Large-Scale Image Recognition

Authors: Karen Simonyan, Andrew Zisserman

Abstract

In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively.


... Moreover, the training of multi-column convolutional networks is difficult and time-consuming, which cannot meet the requirements of practical applications, as analyzed in CSRNet [5]. CSRNet uses the deep convolutional neural network VGG-16 [14], with the fully connected layers removed, as the feature extractor, followed by seven dilated convolution layers as the regression back end, which expand the receptive field of the network to obtain sufficient spatial context information. However, CSRNet cannot ensure that the counting network obtains appropriate spatial context information, because it cannot assess the final contribution of different receptive-field information to the counting model. ...
... In response to the shortcomings of the above research methods, this paper designs a new network, SCFFNet, which combines a multiscale context feature fusion module and a channel spatial attention-aware module to understand highly crowded scenes and count accurately. SCFFNet is a trainable end-to-end deep network architecture, mainly consisting of a front-end feature extraction network, a multiscale context feature fusion module (MCFFM), a channel spatial attention-aware module (CSAM), and a back-end network. The front-end feature extraction network uses VGG-16 [14] with the fully connected layers removed to extract feature information, and the back-end network consists of 2D dilated convolution layers [5]. SCFFNet can adaptively fuse multiscale context features to accommodate rapid scale changes and suppress background noise, generating high-quality density maps. ...
... Network. Similar to some previous research work on crowd counting [5, 40, 45], we use the first 10 convolutional layers of the VGG-16 network [14] as the front-end feature extraction network because of its strong transfer learning ability. Given an image I_e, the feature V_e extracted by the VGG-16 backbone is calculated as follows: ...
Article
Full-text available
Accurate counting in dense scenes can effectively prevent the occurrence of abnormal events, which is crucial for flow management, traffic control, and urban safety. In recent years, the application of deep learning technology to counting tasks has significantly improved the performance of models, but it still faces many challenges, including the diversity of target distribution between image and background, drastic changes in target scale, and serious occlusion. To solve these problems, this paper proposes a spatial context feature fusion network, abbreviated as SCFFNet, to understand highly congested scenes, perform accurate counts, and produce high-quality estimated density maps. SCFFNet first uses rich convolutions with different scales to calculate scale-aware features, adaptively encoding the scale of the contextual information needed to accurately estimate density maps, and then calibrates and refines the fused feature maps through a channel spatial attention-aware module, which improves the model's ability to suppress background and focus on main features. Finally, the estimated density map is generated by a dilated convolution module. We conduct experiments on five public crowd datasets, UCF_CC_50, WorldExpo'10, ShanghaiTech, Mall, and Beijing BRT, and the results show that our method achieves lower counting errors than existing state-of-the-art methods. In addition, we extend SCFFNet to count other objects, such as vehicles in the vehicle dataset HBR_YD, and the experimental results show that our proposed method significantly improves output quality with higher accuracy than previous methods.
... This is a natural consequence of the outstanding performance achieved by deep learning methods in image classification [37]. The most popular paradigm of deep learning technology in medical imaging is transfer learning, which is based on using pretrained CNNs such as Inception V3 [38], AlexNet [39], VGG19/VGG16 [40], ResNet50 [41], and GoogLeNet [42]. The pretrained CNNs have already been trained on natural images such as ImageNet. ...
... In this work, we have utilized the convolutional and pooling layers of pretrained CNNs to extract the significant features from benign/malignant breast cancer images; however, the FC layer has been replaced by traditional classifiers. AlexNet [39], VGG [40], and GoogLeNet [42] are some of the most popular pretrained CNN models for image classification. Abdelhafiz et al. [68] surveyed many research articles demonstrating their effectiveness in breast cancer classification. ...
... VGG [40] was introduced in the ImageNet Challenge 2014 by Karen Simonyan and Andrew Zisserman. Using VGG, it was shown that the depth of the network has a significant impact on a CNN's accuracy. ...
Article
Full-text available
One of the most promising research areas in the healthcare industry and the scientific community is focusing on AI-based applications for real medical challenges, such as building computer-aided diagnosis (CAD) systems for breast cancer. Transfer learning is one of the recently emerging AI-based techniques that allows rapid learning progress and improves medical imaging diagnosis performance. Although deep learning classification for breast cancer has been widely covered, certain obstacles remain in investigating the independence among the extracted high-level deep features. This work tackles two challenges that still exist when designing effective CAD systems for breast lesion classification from mammograms. The first challenge is to enrich the input information of the deep learning models by generating pseudo-colored images instead of using only the original grayscale images. To achieve this goal, two different image preprocessing techniques are used in parallel: contrast-limited adaptive histogram equalization (CLAHE) and pixel-wise intensity adjustment. The original image is preserved in the first channel, while the other two channels receive the processed images, respectively. The generated three-channel pseudo-colored images are fed directly into the input layer of the backbone CNNs to generate more powerful high-level deep features. The second challenge is to overcome the multicollinearity problem that occurs among the highly correlated deep features generated from deep learning models. A new hybrid processing technique based on Logistic Regression (LR) and Principal Components Analysis (PCA), called LR-PCA, is presented. This process helps select the significant principal components (PCs) for further use in classification. The proposed CAD system has been examined using two public benchmark datasets, INbreast and mini-MIAS, and achieved the highest performance accuracies of 98.60% and 98.80%, respectively. Such a CAD system appears useful and reliable for breast cancer diagnosis.
... Vedaldi et al. (2015) define the softmax function using Equation (9). The Keras package contains many DL architectures; the Keras VGG16 model is based on the model proposed by Simonyan and Zisserman (2014) [56]. Max pooling was used to achieve the best results in crack detection [57]. ...
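For reference, the softmax function mentioned in the excerpt maps a vector of scores to a probability distribution. A minimal, numerically stable version:

```python
import numpy as np

# Numerically stable softmax: subtracting the max before exponentiating
# avoids overflow without changing the result.
def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

p = softmax(np.array([1.0, 2.0, 3.0]))
print(p.sum())   # 1.0, so p is a valid probability distribution
```

The max subtraction is the standard trick used by deep learning frameworks, since raw `exp` overflows for large logits.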
Article
Full-text available
Infrastructure, such as buildings, bridges, pavement, etc., needs to be examined periodically to maintain its reliability and structural health. Visual signs of cracks and depressions indicate stress and wear and tear over time, leading to failure/collapse if these cracks are located at critical locations, such as in load-bearing joints. Manual inspection is carried out by experienced inspectors who require long inspection times and rely on their empirical and subjective knowledge. This lengthy process results in delays that further compromise the infrastructure’s structural integrity. To address this limitation, this study proposes a deep learning (DL)-based autonomous crack detection method using the convolutional neural network (CNN) technique. To improve the CNN classification performance for enhanced pixel segmentation, 40,000 RGB images were processed before training a pretrained VGG16 architecture to create different CNN models. The chosen methods (grayscale, thresholding, and edge detection) have been used in image processing (IP) for crack detection, but not in DL. The study found that the grayscale models (F1 score for 10 epochs: 99.331%, 20 epochs: 99.549%) had a similar performance to the RGB models (F1 score for 10 epochs: 99.432%, 20 epochs: 99.533%), with the performance increasing at a greater rate with more training (grayscale: +2 TP, +11 TN images; RGB: +2 TP, +4 TN images). The thresholding and edge-detection models had reduced performance compared to the RGB models (20-epoch F1 score to RGB: thresholding −0.723%, edge detection −0.402%). This suggests that DL crack detection does not rely on colour. Hence, the model has implications for the automated crack detection of concrete infrastructures and the enhanced reliability of the gathered information.
... In Eq. 7, Φ_i is the i-th layer of a pre-trained VGG-16 or VGG-19 network (Simonyan and Zisserman, 2015), and Φ_i(x) is the feature map of input image x. In the actual data flow, the shape of the feature map is the same as mentioned previously: B × C × H × W. N is the total number of VGG network layers. ...
Article
Full-text available
Long-term live-cell imaging technology has emerged in the study of cell culture and development, and it is expected to elucidate the differentiation or reprogramming morphology of cells and the dynamic process of interaction between cells. There are some advantages to this technique: it is noninvasive, high-throughput, low-cost, and it can help researchers explore phenomena that are otherwise difficult to observe. Many challenges arise in the real-time process, for example, low-quality micrographs are often obtained due to unavoidable human factors or technical factors in the long-term experimental period. Moreover, some core dynamics in the developmental process are rare and fleeting in imaging observation and difficult to recapture again. Therefore, this study proposes a deep learning method for microscope cell image enhancement to reconstruct sharp images. We combine generative adversarial nets and various loss functions to make blurry images sharp again, which is much more convenient for researchers to carry out further analysis. This technology can not only make up the blurry images of critical moments of the development process through image enhancement but also allows long-term live-cell imaging to find a balance between imaging speed and image quality. Furthermore, the scalability of this technology makes the methods perform well in fluorescence image enhancement. Finally, the method is tested in long-term live-cell imaging of human-induced pluripotent stem cell-derived cardiomyocyte differentiation experiments, and it can greatly improve the image space resolution ratio.
... Various pre-trained models were employed in the construction of the model during the experiment to compare the accuracy of each model. VGG-19 (Simonyan and Zisserman 2015) was chosen for comparison with DenseNet-169. Both models were used with the same fully connected layers mentioned above, and all hyperparameters were kept similar. ...
Article
Full-text available
Image-based inspection is a growing area with large scope for automation. Automatic classification of vehicle damage would make insurance claims much faster and more efficient, effectively reducing the claim cost. This paper presents an image classification model using an adapted version of pre-trained convolutional neural networks, VGG-19 and DenseNet-169. The proposed model is a pipeline established with fully connected layers for additional damage classification, which improves the feature extraction process. The dataset had a class imbalance problem, so a weighted loss function was used to address it. The model employed binary cross-entropy as the loss function, and sigmoid activation was applied to the output layers as independent layers. The model is thus a multi-label classifier, where one image may be assigned many labels. It classifies vehicle damage into five classes: broken glass, broken headlights, broken taillights, scratches, and dents. A four-layer neural network was employed for the classification, along with several regularization approaches to handle the overfitting problem. The final results showed that DenseNet-169 had a better accuracy of 81%, whereas VGG-19 achieved 78%. A further approach mixing transfer and ensemble learning achieved an accuracy of 85.5% and an F1-score of 0.855.
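The multi-label head described in the abstract (independent sigmoid outputs with a class-weighted binary cross-entropy for the imbalance) could be sketched as below; the feature size, weights, and data are illustrative, not the paper's values.

```python
import torch

# Multi-label damage classifier sketch: one sigmoid output per label, so an
# image may carry several labels at once; pos_weight counters class imbalance.
num_labels = 5   # broken glass, headlights, taillights, scratches, dents
head = torch.nn.Linear(128, num_labels)   # on top of pooled CNN features

pos_weight = torch.tensor([2.0, 3.0, 3.0, 1.0, 1.0])   # hypothetical weights
loss_fn = torch.nn.BCEWithLogitsLoss(pos_weight=pos_weight)

feats = torch.randn(4, 128)                       # stand-in CNN features
targets = torch.randint(0, 2, (4, num_labels)).float()

logits = head(feats)
loss = loss_fn(logits, targets)                   # weighted binary cross-entropy
preds = (torch.sigmoid(logits) > 0.5).int()       # independent per-label decisions
print(tuple(preds.shape))                         # (4, 5)
```

Unlike softmax classification, each sigmoid output is thresholded independently, which is what allows a single image to be tagged with both "scratches" and "dents".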
... Semantic segmentation has been widely studied in various fields, such as autonomous driving [2,3,9,17] and environment modeling [7,25,28]. Most existing methods are specially designed for RGB images [1,4,24,27]. However, methods based on RGB images struggle to achieve satisfactory performance because RGB images contain very little usable information in dim or dark conditions. ...
Article
Full-text available
Semantic segmentation is a basic task in computer vision, widely used in fields such as autonomous driving, detection, and augmented reality. Recent advances in deep learning have achieved commendable results in semantic segmentation for visible (RGB) images. However, the performance of these methods declines precipitously in dark or visually degraded environments. To overcome this problem, thermal images are introduced into semantic segmentation tasks because they can be captured in all weather conditions. To make full use of the information in both modalities, we propose a novel cross-guided fusion attention network for RGB-T semantic segmentation, which uses an attention mechanism to extract the weights of the two modalities and let them guide each other. We then extract global information and add it to the decoding process. In addition, we propose a dual decoder to decode the features of the different modalities; the two decoders predict three prediction maps, which are used to train the network jointly. We conducted experiments on two public datasets, and the results demonstrate the effectiveness of the proposed method.
... In recent years, deep learning algorithms based on convolutional neural networks (CNNs) have been widely used in the field of machine vision, which not only solves the problem of slow computation but also significantly improves accuracy. Xie et al. [7] proposed Holistically-Nested Edge Detection (HED), which fuses the side outputs of the five stages of VGGNet [21] through up-sampling, demonstrating the efficiency of deep learning models in edge detection. To increase the utilization of information, Liu et al. [22] proposed Richer Convolutional Features for Edge Detection (RCF) based on HED. ...
Article
Full-text available
Edge detection is a key step in various image processing tasks. Edge detection based on deep learning is usually composed of encoding and decoding networks. Encoding networks are usually built on classifiers (e.g., VGG16), while research focuses on the construction of decoding networks. In this paper, an encoding-decoding network is proposed by simulating the visual pathway through the retina, the lateral geniculate nucleus (LGN), the primary visual cortex (V1), V2, V4, and the inferior temporal cortex (IT). The Bio-inspired Feature Cascade Network (BFCN) was designed to simulate the transmission modes of feedforward propagation, horizontal connection, and feedback propagation among neurons in the IT, which enhances the feature analysis ability of the decoding network. First, to simulate the information processing model of feedforward propagation, a Feedforward Propagation Network (FPNet) is designed to fully fuse the underlying information. Second, to simulate the information processing model of horizontal connections between neurons, the Inter-Layer Information module (ILI) is designed to process the inter-layer information of FPNet, which enhances the feature extraction ability. Finally, to simulate feedback propagation, the Proximity Combination Network (PCNet) is designed to integrate the feature predictions of each stage and strengthen the generalization ability of the network. Experimental results show that the proposed contour detection model outperforms current similar models.
... The feature extractor reduces the dimensionality of the image, discarding redundant information and keeping only what is useful. ResNet-18 [6] was used for this purpose, since it gave the best results compared to the VGG [15] network. The output consists of three feature vectors corresponding to the three images fed into the network. ...
Chapter
Facial expressions play an important role in human communication since they enrich spoken information and help convey additional sentiments e.g. mood. Among others, they non-verbally express a partner’s agreement or disagreement to spoken information. Further, together with the audio signal, humans can even detect nuances of changes in a person’s mood. However, facial expressions remain inaccessible to the blind and visually impaired, and also the voice signal alone might not carry enough mood information. Emotion recognition research mainly focused on detecting one of seven emotion classes. Such emotions are too detailed, and having an overall impression of primary emotional states such as positive, negative, or neutral is more beneficial for the visually impaired person in a lively discussion within a team. Thus, this paper introduces an emotion recognition system that allows a real-time detection of the emotions “agree”, “neutral”, and “disagree”, which are seen as the most important ones during a lively discussion. The proposed system relies on a combination of neural networks that allow extracting emotional states while leveraging the temporal information from videos.
... An acclaimed example of a large dataset that fostered further development in this area is ImageNet (Deng et al. 2009), around which a competition called the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) was organized. This competition enabled tracking the performance of various CNN models over time and spurred the development of several renowned architectures such as VGG (Simonyan and Zisserman 2015), GoogLeNet (Szegedy et al. 2015) and ResNet (He et al. 2016a). Over the years one could observe the classification top-5 error plummet from more than 25% in 2010-2011, when shallow methods were used, to less than 5% in 2015 with the use of DL. ...
Article
Full-text available
Deep Convolutional Neural Networks have made an incredible progress in many Computer Vision tasks. This progress, however, often relies on the availability of large amounts of the training data, required to prevent over-fitting, which in many domains entails significant cost of manual data labeling. An alternative approach is application of data augmentation (DA) techniques that aim at model regularization by creating additional observations from the available ones. This survey focuses on two DA research streams: image mixing and automated selection of augmentation strategies. First, the presented methods are briefly described, and then qualitatively compared with respect to their key characteristics. Various quantitative comparisons are also included based on the results reported in recent DA literature. This review mainly covers the methods published in the materials of top-tier conferences and in leading journals in the years 2017–2021.
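As an example from the image-mixing stream the survey covers, mixup forms convex combinations of image pairs and their one-hot labels; a minimal sketch with random stand-in images:

```python
import numpy as np

# mixup sketch: blend two images and their labels with a Beta-distributed
# coefficient; the blended label remains a valid distribution.
rng = np.random.default_rng(0)

def mixup(x1, y1, x2, y2, alpha=0.2):
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

img_a, img_b = rng.random((32, 32, 3)), rng.random((32, 32, 3))
lab_a, lab_b = np.array([1.0, 0.0]), np.array([0.0, 1.0])

x, y = mixup(img_a, lab_a, img_b, lab_b)
print(x.shape, round(float(y.sum()), 6))   # (32, 32, 3) 1.0
```

Because the soft label matches the mixing coefficient, the model is regularized toward linear behavior between training examples, which is the effect such DA methods exploit.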
... Bottom right: Gaussian local region weighting from an attention map. The number over each layer represents the number of channels (Simonyan & Zisserman, 2015) ...
Article
Full-text available
In this paper, we propose a generative inpainting-based method to detect anomalous images in human monitoring via self-supervised multi-task learning. Our previous methods, in which a deep captioning model is employed to find salient regions in an image and exploit caption information for each of them, detect anomalies in human monitoring at the region level by considering the relations of overlapping regions. Here, we focus on image-level detection, which is preferable when humans want an immediate alert and to handle anomalies themselves. In such a setting, however, those methods show their deficiencies due to their reliance on salient regions and their neglect of non-overlapping regions. Moreover, they treat all regions as equally important, which makes performance easily influenced by unimportant regions. To alleviate these problems in image-level detection, we first employ inpainting techniques with a designed local and global loss to better capture the relation between a region and its surrounding area in an image. We then propose an attention-based Gaussian weighting anomaly score that combines all regions according to their importance, mitigating the influence of unimportant regions. The attention mechanism exploits multi-task learning for higher accuracy. Extensive experiments on two real-world datasets demonstrate the superiority of our method in terms of AUROC, precision, and recall over the baselines. The AUROC improved from 0.933 to 0.989 and from 0.911 to 0.953 compared with the best baseline on the two datasets.
Chapter
The performance of deep learning in the field of computer vision is better than that of traditional machine learning techniques, and image classification is one of the most prominent research topics. Computer vision methods are widely used in industrial production. We propose an image classification model that segments the image at different scales on the basis of Deep-ViT to obtain image information at different scales. When this model is applied to the classification of tube head shapes, the accuracy rate reaches more than 90%, and for the classification of tube head materials, the accuracy is about 98%.
Article
Deploying deep learning models in time-critical applications with limited computational resources, for instance in edge computing systems and IoT networks, is a challenging task that often relies on dynamic inference methods such as early exiting. In this paper, we introduce a novel architecture for early exiting based on the vision transformer architecture, as well as a fine-tuning strategy that significantly increases the accuracy of early exit branches compared to conventional approaches while introducing less overhead. Through extensive experiments on image and audio classification as well as audiovisual crowd counting, we show that our method works for both classification and regression problems, in both single- and multi-modal settings. Additionally, we introduce a novel method for integrating the audio and visual modalities within early exits in audiovisual data analysis, which can lead to more fine-grained dynamic inference.
Article
Full-text available
Accidental loss of radioactive sources will pose a great threat to social security and the national economy, and may cause mass casualties and serious social panic. This paper explores a localization method of multiple unknown radioactive sources based on the convolutional neural network (CNN) algorithm to search for lost or unknown radioactive sources. The energy deposition distribution of multiple gamma (γ) radioactive sources in an area is obtained by Geant4 simulation. Additionally, the energy deposition values in the area are randomly collected by using the self-avoiding walk (SAW) algorithm, and three datasets of training, validation and test are constructed. The collected data can be trained and analyzed by a convolutional neural network to determine the locations of radioactive sources. Further, in order to verify the performance of the algorithm, group experiments are carried out, including the existence of obstacles in the area, the length of obstacles, the number of radioactive sources, etc. The experimental results show that the respective locations of two radioactive sources can be predicted with 89% accuracy by only collecting 10 energy deposition values in the area. In the case of complex obstacles, the accuracy can reach 68%. Besides, the respective locations of three radioactive sources can be predicted with at least 86% accuracy in a restricted area. The experimental results show the feasibility of the proposed method for the localization of multiple unknown radioactive sources.
Article
Full-text available
As opposed to macro-expressions, micro-expressions are subtle and not easily detectable emotional expressions, often containing rich information about mental activities. Practical recognition of micro-expressions is essential in interrogation and healthcare. Neural networks are currently one of the most common approaches to micro-expression recognition, but they often grow in complexity as accuracy improves, and overly large networks place extremely high hardware requirements on the running equipment. In recent years, vision transformers based on self-attention mechanisms have achieved accuracy in image recognition and classification no lower than that of neural networks; the drawback is that, without the image-specific inductive biases inherent to convolutional networks, the cost of improving accuracy is an exponential increase in the number of parameters. This approach trains a facial expression feature extractor by transfer learning and then fine-tunes and optimizes the MobileViT model to perform the micro-expression recognition task. First, the CASME II, SAMM, and SMIC datasets are combined into a compound dataset, and macro-expression samples are extracted from three macro-expression datasets. Each macro-expression and micro-expression sample is pre-processed identically to make them similar. Second, the macro-expression samples were used to train the MobileNetV2 block in MobileViT as a facial expression feature extractor, saving the weights at the highest accuracy. Finally, some hyperparameters of the MobileViT model are determined by grid search, and the micro-expression samples are fed in for training. The samples are classified using an SVM classifier. In the experiments, the proposed method obtained an accuracy of 84.27%, and the time to process an individual sample was only 35.4 ms. Comparative experiments show that the proposed method is comparable to state-of-the-art methods in accuracy while improving recognition efficiency.
Article
Full-text available
Magnetic Resonance Imaging (MRI) is one of the most frequently used diagnostic tools to detect and classify abnormalities in the brain. Automatic identification and extraction of infected tumor areas from MRI scans is a difficult and complex task for a radiologist or clinical practitioner. Further, classifying the type of tumor from Magnetic Resonance (MR) images is also a vital part of the diagnostic system. Factors like the size, shape, and position of a tumor vary from patient to patient. Many efforts have been made toward image detection and classification, but accurate automated techniques have required high computational time. Motivated by these difficulties, this paper proposes a Discrete Orthonormal S-Transform (DOST)-based segmentation technique to improve the performance of the detection process; the DOST is also utilized for feature extraction. Further, a Sine Cosine Algorithm (SCA)-based Deep Convolutional Neural Network (DCNN) model has been developed for classifying brain tumors into malignant (cancerous) and benign (noncancerous) categories. The SCA is utilized for weight optimization in the fully connected layer of the DCNN model. Different categories of hidden neuron functions at the hidden layer have also been tested with the new hybrid SCA-DCNN model, and comparison results are presented. This research work aims to improve the accuracy of the diagnosis process.
Article
We propose and analyze a framework to detect and identify the mitotic staining patterns among different non-mitotic (interphase) patterns in HEp-2 cell substrate specimen images. This is considered a principal task in computer-aided diagnosis (CAD) of autoimmune disorders. Due to the rare appearance of mitotic patterns in whole slide/specimen images, the sample skew between mitotic and non-mitotic patterns is an important consideration. We suggest applying effective sample-skew balancing strategies for the task of classifying mitotic vs. interphase patterns. Another aspect of this study is to consider the morphology- and texture-based differences between the two classes, which can be incorporated through effective morphology and texture descriptors, including the Gabor and LM (Leung-Malik) filter banks, as well as contemporary filter banks derived from convolutional neural networks (CNNs). The proposed framework is evaluated on a public dataset and demonstrates good performance (0.99 or 1 Matthews correlation coefficient (MCC) in many cases) across various experiments. The study also compares hand-engineered and CNN-based feature representations, along with comparisons to state-of-the-art approaches. Hence, the framework proves to be a good solution for the mentioned skewed classification problem.
Article
Drill core lithology is an important indicator reflecting the geological conditions of the drilling area. Traditional lithology identification usually relies on manual visual inspection, which is time-consuming and professionally demanding. In recent years, the rapid development of convolutional neural networks has provided an innovative way to automatically predict lithology from drill core images. In this work, a core dataset containing 10 common lithology categories in underground engineering was constructed. The ResNeSt-50 model we adopted combines channel-wise attention with a multi-path network to achieve cross-channel feature correlations, which significantly improves model accuracy without high model complexity. Transfer learning was used to initialize the model parameters and extract features from core images more efficiently. The model achieved superior performance on test images compared with the other CNN models discussed: the average Precision, Recall, and F1-score across lithology categories are 99.62%, 99.62%, and 99.59%, respectively, and the prediction accuracy is 99.60%. The test results show that the proposed method is effective for automatic lithology classification of borehole cores.
Article
The detection of surface defects on automotive engine parts is an important part of automobile manufacturing quality assurance. Traditional detection methods rely on manual inspection and can be inaccurate and inefficient, while existing deep learning-based methods, such as the Mask R-CNN detection method, have insufficient precision for detecting minor defects, since their anchor-scale design does not consider small defects. To overcome these shortcomings, this paper proposes an IA-Mask R-CNN detection method with an improved anchor-scale design. First, an image dataset containing 560 pictures of surface defects of automotive engine parts is established using a 1080P HDMI high-definition digital microscope capable of recording three million real pixels, and labeled manually. Then, the anchor scales suitable for the surface defect detection of automotive engine parts are determined by analysis of the labeled data and used to improve the anchor design in Mask R-CNN. Finally, the proposed method is compared experimentally with Faster R-CNN and Mask R-CNN, and qualitative and quantitative analyses are conducted. The experimental results show that, without increasing the number of parameters or the training time of Mask R-CNN, the proposed method performed better than the other detection methods in detecting both minor and larger defects.
Article
To obtain the motion state of a school of fish in marine cage culture and improve the automatic detection capability for large areas of marine culture, this paper proposes a fish-motion vector-field extraction method for marine cage culture, which primarily comprises cage region extraction and motion vector calculation. The proposed method uses an unmanned aerial vehicle (UAV) as the carrier, and the interference of the marine environment is fully considered. First, blurred images are discarded, and the culture area is obtained by a holistically-nested edge detection (HED) network. Second, the validity of the current frame is determined from the local binary patterns (LBPs) of the feature image. Third, the images are transformed, and the vector-field information of the fish swarm is obtained using the particle image velocimetry (PIV) analysis method. This study integrates traditional image analysis and deep network technology while fully considering the image distortion caused by UAV flight attitude changes in the actual detection process. Finally, images of Takifugu rubripes net cages were used to verify the method. The results show that the proposed method has strong practicability and applicability.
Article
Moving object detection is the foundation of research in many computer vision fields. In recent decades, a number of detection methods have been proposed. Relevant surveys mainly focused on the detection accuracy, while different practical detection tasks were not considered. However, in different application tasks, training modes and requirements are completely different. The purpose of this survey is to classify and evaluate recent moving object detection methods from a practical perspective. Two main types of practical application tasks are considered: the detection of seen scenes and the detection of unseen scenes. In the survey, two practical application tasks are defined, corresponding recent moving object detection technologies are reviewed, and future directions are suggested to provide references for researchers and technicians when choosing suitable algorithms in practical work.
Article
Siamese network trackers based on pre-trained deep features have achieved good performance in recent years. However, pre-trained deep features are trained in advance on large-scale datasets that contain feature information for a large number of objects, which may introduce interfering and redundant information for a single tracking target. To learn more accurate target feature information, this paper proposes a lightweight target-aware attention learning network that learns the most effective channel features of the target online. The lightweight network uses a designed attention learning loss function to learn a series of weighted channel features online, without complex parameters. Compared with the pre-trained features, the weighted channel features represent the target more accurately. Finally, the lightweight target-aware attention learning network is unified into a Siamese tracking network framework to implement target tracking effectively. Experiments on several datasets demonstrate that the proposed tracker has good performance.
Article
Secure multi-party computation (MPC) allows a set of parties to jointly compute a function on their private inputs while revealing nothing but the output of the function. In the last decade, MPC has rapidly moved from a purely theoretical study to an object of practical interest, with growing attention to applications such as privacy-preserving machine learning (PPML). In this paper, we comprehensively survey existing work on concretely efficient MPC protocols with both semi-honest and malicious security, in both dishonest-majority and honest-majority settings. We focus on the notion of security with abort, meaning that corrupted parties could prevent honest parties from receiving output after they themselves receive it. We present the high-level ideas of the basic and key approaches for designing different styles of MPC protocols and the crucial building blocks of MPC. For MPC applications, we compare the known PPML protocols built on MPC, and describe the efficiency of private inference and training for the state-of-the-art PPML protocols. Furthermore, we summarize several challenges and open problems for breaking through the efficiency limits of MPC protocols, as well as some interesting future work worth addressing. This survey aims to present the recent development and key approaches of MPC to researchers who are interested in knowing, improving, and applying concretely efficient MPC protocols.
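The additive secret sharing that underlies many of the surveyed protocols can be sketched in a few lines. The modulus and party count below are illustrative assumptions; real protocols additionally involve communication rounds, correlated randomness, and malicious-security checks.

```python
import random

P = 2**61 - 1  # public modulus; an illustrative choice, not from any specific protocol

def share(secret, n_parties):
    """Split a secret into n additive shares modulo P; any n-1 shares look random."""
    shares = [random.randrange(P) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % P)
    return shares

def reconstruct(shares):
    """Only the sum of all shares reveals the secret."""
    return sum(shares) % P

def add_shared(shares_x, shares_y):
    """Each party adds its own shares locally; addition needs no communication."""
    return [(sx + sy) % P for sx, sy in zip(shares_x, shares_y)]

sx = share(20, 3)
sy = share(22, 3)
assert reconstruct(sx) == 20
assert reconstruct(add_shared(sx, sy)) == 42
```

Multiplication of shared values is where the protocol styles surveyed above diverge (e.g. Beaver triples vs. replicated sharing), since it cannot be done locally.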
Article
Full-text available
Edge detection is one of the most important and fundamental problems in the field of computer vision and image processing. Edge contours extracted from images are widely used as critical cues for various image understanding tasks such as image segmentation, object detection, image retrieval, and corner detection. The purpose of this paper is to review the latest developments on image edge detection. Firstly, the definition and properties of edges are introduced. Secondly, the existing edge detection methods are classified and introduced in detail. Thirdly, the existing widely used datasets and evaluation criteria for edge detection methods are summarized. Finally, future research directions for edge detection are elaborated.
Conference Paper
Full-text available
Melanoma, one of the most common and deadly forms of skin cancer, is responsible for 75% of all skin cancer fatalities. The likelihood of survival is significantly increased by early identification of melanoma, and segmentation is a crucial and essential step in its accurate diagnosis. Numerous earlier efforts based on common segmentation algorithms and deep learning techniques have been presented for high-resolution photos. Because of the intrinsic visual complexity and ambiguity among various skin conditions, automatic melanoma segmentation remains a challenging problem for current algorithms. Among these techniques, deep learning approaches have recently attracted the most attention, thanks to their strong performance when training an end-to-end framework that requires no human interaction. U-net is a popular deep learning model for medical image segmentation. In this study, we demonstrate effective skin lesion segmentation based on an enhanced U-net model. According to tests on the 2016 ISIC Challenge melanoma dataset, the proposed method reaches state-of-the-art performance on the skin lesion segmentation problem.
Chapter
The existence of deep learning's "black box" makes it difficult to understand how the algorithms analyze patterns and make image-level predictions; representing the pixels that contribute most to the algorithm's classification requires new insights. Classification methods for neurodegenerative ocular disorders have been developed using machine learning and image processing techniques with considerable efficacy. However, the robustness and transferability of these techniques remain uncertain. We have developed a new classification method based upon the information bottleneck to analyze the attribution of each feature and the information each input area provides for the network's prediction, for a clearer understanding of how a black-box model is affected. In this research, we apply the information bottleneck for attribution, limiting the information flow by injecting noise into intermediate feature maps and assessing, in bits, the amount of information each image area produces. Our studies on a publicly available dataset indicate that the information bottleneck for attribution (IBA) increases model interpretability and gives a more reliable estimation.
Article
With the rapid development of remote sensing with small, lightweight unmanned aerial vehicles (UAVs), efficient and accurate crop spike counting and yield estimation methods based on deep learning (DL) have begun to emerge, greatly reducing labor costs and enabling fast, accurate counting of sorghum spikes. However, there has not been a systematic, comprehensive evaluation of their applicability to cereal crop spike identification in UAV images, especially sorghum head counting. To this end, this paper conducts a comparative study of the performance of three common DL algorithms, EfficientDet, Single Shot MultiBox Detector (SSD), and You Only Look Once (YOLOv4), for sorghum head detection based on lightweight UAV remote sensing data. The paper explores the effects of the overlap ratio, confidence, and intersection over union (IoU) parameters, using the evaluation metrics of precision P, recall R, average precision AP, F1 score, computational efficiency, and the number of detected positive/negative samples (objects detected consistent/inconsistent with real samples). The experimental results show the following. (1) The detection results of the three methods under dense coverage conditions were better than those under medium and sparse conditions. YOLOv4 had the most accurate detection under different coverage conditions; EfficientDet, on the contrary, was the worst. While SSD obtained better detection results under dense conditions, the number of over-detections was larger. (2) Although EfficientDet had a good positive-sample detection rate, it detected the fewest samples, had the smallest R and F1, and its actual precision was poor; although its training time was medium, it had the lowest detection efficiency, with a detection time per image 2.82 times that of SSD. SSD had medium values for P, AP, and the number of detected samples, but the highest training and detection efficiency. YOLOv4 detected the largest number of positive samples, and its values for R, AP, and F1 were the highest among the three methods. Although its training was the slowest, its detection efficiency was better than EfficientDet's. (3) As the overlap ratio increased, both positive and negative samples tended to increase; at a threshold value of 0.3, all three methods had better detection results. As the confidence value increased, the numbers of positive and negative samples significantly decreased; a threshold value of 0.3 balanced the number of detected samples against detection accuracy. An increase in IoU was accompanied by a gradual decrease in the number of positive samples and a gradual increase in the number of negative samples; a threshold value of 0.3 achieved better detection. The research findings can provide a methodological basis for accurately detecting and counting sorghum heads using UAVs.
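The metrics compared above can be made concrete with a small sketch. The box coordinates and detection counts below are made-up illustrative values, not numbers from the study:

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)  # zero if boxes don't overlap
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def precision_recall_f1(tp, fp, fn):
    """P, R, F1 from true-positive, false-positive, false-negative counts."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)

# Two boxes overlapping in a unit square out of a union of 7 units:
assert abs(iou((0, 0, 2, 2), (1, 1, 3, 3)) - 1 / 7) < 1e-9
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
assert abs(p - 0.8) < 1e-12 and abs(r - 0.8) < 1e-12
```

A detection counts as a positive sample when its IoU with a ground-truth box exceeds the chosen threshold (0.3 in the study's best setting).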
Article
Rooftop solar photovoltaic (PV) retrofitting can greatly reduce the emissions of greenhouse gases, thus contributing to carbon neutrality. Effective assessment of carbon emission reduction has become an urgent challenge for the government and for business enterprises. In this study, we propose a method to assess accurately the potential reduction of long-term carbon emission by installing solar PV on rooftops. This is achieved using the joint action of GF-2 satellite images, Point of Interest (POI) data, and meteorological data. Firstly, we introduce a building extraction method that extends the DeepLabv3+ by fusing the contextual information of building rooftops in GF-2 images through multi-sensory fields. Secondly, a ridgeline detection algorithm for rooftop classification is proposed, based on the Hough transform and Canny edge detection. POI semantic information is used to calculate the usable area under different subsidy policies. Finally, a multilayer perceptron (MLP) is constructed for long-term PV electricity generation series with regional meteorological data, and carbon emission reduction is estimated for three scenarios: the best, the general, and the worst. Experiments were conducted with GF-2 satellite images collected in Daxing District, Beijing, China in 2021. Final results showed that: (1) The building rooftop recognition method achieved overall accuracy of 95.56%; (2) The best, the general and the worst amount of annual carbon emission reductions in the study area were 7,705,100 tons, 6,031,400 tons, and 632,300 tons, respectively; (3) Multi-source data, such as POIs and climate factors play an indispensable role for long-term estimation of carbon emission reduction. The method and conclusions provide a feasible approach for quantitative assessment of carbon reduction and policy evaluation.
Article
Autonomous landing is a fundamental aspect of drone operations that the industry is focusing on, with ever-increasing demands on safety. As drones are likely to become indispensable vehicles in the near future, they are expected to succeed in automatically recognizing a landing spot from nearby points, maneuvering toward it, and ultimately performing a safe landing. Accordingly, this paper investigates the idea of vision-based location detection on the ground for an automated emergency response system that can continuously monitor the environment and spot safe places when needed. A convolutional neural network that learns from image-based feature representations at multiple scales is introduced. The model takes ground images, assigns significance to various aspects in them, and recognizes landing spots. The results support the model, with accurate classification of ground images according to their visual content. They also demonstrate the feasibility of a computationally inexpensive implementation of the model on a small computer that can be easily embedded on a drone.
Article
Full-text available
To support the ongoing size reduction in integrated circuits, the need for accurate depth measurements of on-chip structures becomes increasingly important. Unfortunately, present metrology tools do not offer a practical solution. In the semiconductor industry, critical dimension scanning electron microscopes (CD-SEMs) are predominantly used for 2D imaging at a local scale. The main objective of this work is to investigate whether sufficient 3D information is present in a single SEM image for accurate surface reconstruction of the device topology. In this work, we present a method that is able to produce depth maps from synthetic and experimental SEM images. We demonstrate that the proposed neural network architecture, together with a tailored training procedure, leads to accurate depth predictions. The training procedure includes a weakly supervised domain adaptation step, which is further referred to as pixel-wise fine-tuning. This step employs scatterometry data to address the ground-truth scarcity problem. We have tested this method first on a synthetic contact hole dataset, where a mean relative error smaller than 6.2% is achieved at realistic noise levels. Additionally, it is shown that this method is well suited for other important semiconductor metrics, such as top critical dimension (CD), bottom CD and sidewall angle. To the extent of our knowledge, we are the first to achieve accurate depth estimation results on real experimental data, by combining data from SEM and scatterometry measurements. An experiment on a dense line space dataset yields a mean relative error smaller than 1%.
Article
Full-text available
Recent work in unsupervised feature learning and deep learning has shown that being able to train large models can dramatically improve performance. In this paper, we consider the problem of training a deep network with billions of parameters using tens of thousands of CPU cores. We have developed a software framework called DistBelief that can utilize computing clusters with thousands of machines to train large models. Within this framework, we have developed two algorithms for large-scale distributed training: (i) Downpour SGD, an asynchronous stochastic gradient descent procedure supporting a large number of model replicas, and (ii) Sandblaster, a framework that supports a variety of distributed batch optimization procedures, including a distributed implementation of L-BFGS. Downpour SGD and Sandblaster L-BFGS both increase the scale and speed of deep network training. We have successfully used our system to train a deep network 30x larger than previously reported in the literature, achieving state-of-the-art performance on ImageNet, a visual object recognition task with 16 million images and 21k categories. We show that these same techniques dramatically accelerate the training of a more modestly sized deep network for a commercial speech recognition service. Although we focus on and report the performance of these methods as applied to training large neural networks, the underlying algorithms are applicable to any gradient-based machine learning algorithm.
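The Downpour idea, with model replicas pushing gradient updates to a shared parameter store without coordination, can be sketched as a deliberately simplified single-process simulation. The one-parameter least-squares problem, learning rate, and shard layout below are illustrative assumptions; the real system runs across many machines, where parameter fetches can be stale:

```python
import random

random.seed(0)

# Shared "parameter server" state: fit w in y = w * x on noisy data.
params = {"w": 0.0}
data = [(x, 3.0 * x + random.gauss(0, 0.1)) for x in range(1, 21)]

def worker_step(shard, lr=0.001):
    """One model replica: fetch params, compute a gradient on its own data
    shard, and push the update back without any locking or coordination."""
    w = params["w"]  # fetch (may be stale relative to other replicas)
    grad = sum(2 * (w * x - y) * x for x, y in shard) / len(shard)
    params["w"] -= lr * grad  # push update to the shared store

shards = [data[i::4] for i in range(4)]  # four replicas, disjoint data shards
for step in range(200):
    worker_step(shards[step % 4])  # interleaved, uncoordinated updates

assert abs(params["w"] - 3.0) < 0.1  # converges despite the interleaving
```

The convergence here despite out-of-order updates is the practical observation behind Downpour SGD; Sandblaster-style batch methods instead coordinate replicas around a synchronous optimizer such as L-BFGS.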
Article
Full-text available
The ImageNet Large Scale Visual Recognition Challenge is a benchmark in object category classification and detection on hundreds of object categories and millions of images. The challenge has been run annually from 2010 to present, attracting participation from more than fifty institutions. This paper describes the creation of this benchmark dataset and the advances in object recognition that have been possible as a result. We discuss the challenges of collecting large-scale ground truth annotation, highlight key breakthroughs in categorical object recognition, provide a detailed analysis of the current state of the field of large-scale image classification and object detection, and compare the state-of-the-art computer vision accuracy with human accuracy. We conclude with lessons learned in the five years of the challenge, and propose future directions and improvements.
Article
Full-text available
Caffe provides multimedia scientists and practitioners with a clean and modifiable framework for state-of-the-art deep learning algorithms and a collection of reference models. The framework is a BSD-licensed C++ library with Python and MATLAB bindings for training and deploying general-purpose convolutional neural networks and other deep models efficiently on commodity architectures. Caffe fits industry and internet-scale media needs by CUDA GPU computation, processing over 40 million images a day on a single K40 or Titan GPU (approx 2 ms per image). By separating model representation from actual implementation, Caffe allows experimentation and seamless switching among platforms for ease of development and deployment from prototyping machines to cloud environments. Caffe is maintained and developed by the Berkeley Vision and Learning Center (BVLC) with the help of an active community of contributors on GitHub. It powers ongoing research projects, large-scale industrial applications, and startup prototypes in vision, speech, and multimedia.
Article
Full-text available
We present an integrated framework for using Convolutional Networks for classification, localization and detection. We show how a multiscale and sliding window approach can be efficiently implemented within a ConvNet. We also introduce a novel deep learning approach to localization by learning to predict object boundaries. Bounding boxes are then accumulated rather than suppressed in order to increase detection confidence. We show that different tasks can be learnt simultaneously using a single shared network. This integrated framework is the winner of the localization task of the ImageNet Large Scale Visual Recognition Challenge 2013 (ILSVRC2013), and produced near-state-of-the-art results for the detection and classification tasks. Finally, we release a feature extractor from our best model called OverFeat.
Article
Full-text available
Recognizing arbitrary multi-character text in unconstrained natural photographs is a hard problem. In this paper, we address an equally hard sub-problem in this domain, viz. recognizing arbitrary multi-digit numbers from Street View imagery. Traditional approaches to solve this problem typically separate out the localization, segmentation, and recognition steps. In this paper we propose a unified approach that integrates these three steps via the use of a deep convolutional neural network that operates directly off of the image pixels. This model is configured with 11 hidden layers, all with feedforward connections. We employ the DistBelief implementation of deep neural networks to scale our computations over this network. We have evaluated this approach on the publicly available SVHN dataset and achieve over 96% accuracy in recognizing street numbers. We show that on a per-digit recognition task, we improve upon the state of the art and achieve 97.84% accuracy. We also evaluated this approach on an even more challenging dataset generated from Street View imagery containing several tens of millions of street number annotations, and achieve over 90% accuracy. Our evaluations further indicate that at specific operating thresholds, the performance of the proposed system is comparable to that of human operators and has to date helped us extract close to 100 million street numbers from Street View imagery worldwide.
Article
Full-text available
We investigate multiple techniques to improve upon the current state-of-the-art deep convolutional neural network based image classification pipeline. The techniques include adding more image transformations to training data, adding more transformations to generate additional predictions at test time and using complementary models applied to higher resolution images. This paper summarizes our entry in the Imagenet Large Scale Visual Recognition Challenge 2013. Our system achieved a top-5 classification error rate of 13.55% using no external data, which is over a 20% relative improvement on the previous year's winner.
Conference Paper
Full-text available
The explosion of image data on the Internet has the potential to foster more sophisticated and robust models and algorithms to index, retrieve, organize and interact with images and multimedia data. But exactly how such data can be harnessed and organized remains a critical problem. We introduce here a new database called "ImageNet", a large-scale ontology of images built upon the backbone of the WordNet structure. ImageNet aims to populate the majority of the 80,000 synsets of WordNet with an average of 500-1000 clean and full resolution images. This will result in tens of millions of annotated images organized by the semantic hierarchy of WordNet. This paper offers a detailed analysis of ImageNet in its current state: 12 subtrees with 5247 synsets and 3.2 million images in total. We show that ImageNet is much larger in scale and diversity and much more accurate than the current image datasets. Constructing such a large-scale database is a challenging task. We describe the data collection scheme with Amazon Mechanical Turk. Lastly, we illustrate the usefulness of ImageNet through three simple applications in object recognition, image classification and automatic object clustering. We hope that the scale, accuracy, diversity and hierarchical structure of ImageNet can offer unparalleled opportunities to researchers in the computer vision community and beyond.
Conference Paper
Full-text available
The Fisher kernel (FK) is a generic framework which combines the benefits of generative and discriminative approaches. In the context of image classification the FK was shown to extend the popular bag-of-visual-words (BOV) by going beyond count statistics. However, in practice, this enriched representation has not yet shown its superiority over the BOV. In the first part we show that with several well-motivated modifications over the original framework we can boost the accuracy of the FK. On PASCAL VOC 2007 we increase the Average Precision (AP) from 47.9% to 58.3%. Similarly, we demonstrate state-of-the-art accuracy on CalTech 256. A major advantage is that these results are obtained using only SIFT descriptors and costless linear classifiers. Equipped with this representation, we can now explore image classification on a larger scale. In the second part, as an application, we compare two abundant resources of labeled images to learn classifiers: ImageNet and Flickr groups. In an evaluation involving hundreds of thousands of training images we show that classifiers learned on Flickr groups perform surprisingly well (although they were not intended for this purpose) and that they can complement classifiers learned on more carefully annotated datasets.
Conference Paper
Traditional methods of computer vision and machine learning cannot match human performance on tasks such as the recognition of handwritten digits or traffic signs. Our biologically plausible, wide and deep artificial neural network architectures can. Small (often minimal) receptive fields of convolutional winner-take-all neurons yield large network depth, resulting in roughly as many sparsely connected neural layers as found in mammals between retina and visual cortex. Only winner neurons are trained. Several deep neural columns become experts on inputs preprocessed in different ways; their predictions are averaged. Graphics cards allow for fast training. On the very competitive MNIST handwriting benchmark, our method is the first to achieve near-human performance. On a traffic sign recognition benchmark it outperforms humans by a factor of two. We also improve the state-of-the-art on a plethora of common image classification benchmarks.
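The averaging of predictions across several deep neural columns can be sketched as follows; the raw class scores for three hypothetical "columns" (each trained on a differently preprocessed input) are made-up numbers for illustration:

```python
import math

def softmax(scores):
    """Turn raw class scores into probabilities (numerically stable form)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Raw 3-class scores from three hypothetical columns for one input image.
column_scores = [
    [2.0, 1.0, 0.1],
    [1.5, 1.8, 0.2],
    [2.2, 0.5, 0.3],
]

# Average the columns' probability vectors, then take the argmax.
probs = [softmax(s) for s in column_scores]
averaged = [sum(col) / len(probs) for col in zip(*probs)]
prediction = averaged.index(max(averaged))
assert prediction == 0  # two of three columns favor class 0
```

Averaging smooths out disagreements caused by the different preprocessings, which is the ensembling effect the multi-column architecture relies on.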
Conference Paper
Despite the widespread practical success of deep learning methods, our theoretical understanding of the dynamics of learning in deep neural networks remains quite sparse. We attempt to bridge the gap between the theory and practice of deep learning by systematically analyzing learning dynamics for the restricted case of deep linear neural networks. Despite the linearity of their input-output map, such networks have nonlinear gradient descent dynamics on weights that change with the addition of each new hidden layer. We show that deep linear networks exhibit nonlinear learning phenomena similar to those seen in simulations of nonlinear networks, including long plateaus followed by rapid transitions to lower error solutions, and faster convergence from greedy unsupervised pretraining initial conditions than from random initial conditions. We provide an analytical description of these phenomena by finding new exact solutions to the nonlinear dynamics of deep learning. Our theoretical analysis also reveals the surprising finding that as the depth of a network approaches infinity, learning speed can nevertheless remain finite: for a special class of initial conditions on the weights, very deep networks incur only a finite, depth independent, delay in learning speed relative to shallow networks. We show that, under certain conditions on the training data, unsupervised pretraining can find this special class of initial conditions, while scaled random Gaussian initializations cannot. We further exhibit a new class of random orthogonal initial conditions on weights that, like unsupervised pre-training, enjoys depth independent learning times. We further show that these initial conditions also lead to faithful propagation of gradients even in deep nonlinear networks, as long as they operate in a special regime known as the edge of chaos.
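The learning dynamics analyzed above can be written explicitly for the two-layer case. The notation below is a sketch following the standard formulation of this setting (the paper's exact symbols may differ): $\tau$ is a time constant set by the learning rate, and the $\Sigma$ matrices are the input and input-output correlations of the training data.

```latex
% Two-layer linear network \hat{y} = W_2 W_1 x trained on squared error.
% Continuous-time gradient descent (gradient flow) gives the coupled dynamics
\tau \frac{dW_1}{dt} = W_2^{\top}\!\left(\Sigma^{yx} - W_2 W_1 \Sigma^{x}\right),
\qquad
\tau \frac{dW_2}{dt} = \left(\Sigma^{yx} - W_2 W_1 \Sigma^{x}\right) W_1^{\top},
% with \Sigma^{x} = \mathbb{E}[x x^{\top}] and \Sigma^{yx} = \mathbb{E}[y x^{\top}].
```

Although the network's input-output map is linear, these equations are cubic in the weights, which is the source of the nonlinear phenomena (plateaus, rapid transitions) described above.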
Conference Paper
We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called dropout that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
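The dropout regularization mentioned above can be sketched in its common "inverted" form; this is a generic illustration of the technique, not the authors' implementation:

```python
import random

def dropout(activations, p, training=True, rng=random):
    """Inverted dropout: zero each unit with probability p during training and
    scale survivors by 1/(1-p), so no rescaling is needed at test time."""
    if not training or p == 0.0:
        return list(activations)
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

acts = [0.5, 1.0, 1.5, 2.0]
# At test time the layer is the identity:
assert dropout(acts, p=0.5, training=False) == acts
# During training each unit is either zeroed or scaled by 1/(1-p) = 2:
out = dropout(acts, p=0.5)
assert all(v == 0.0 or abs(v - 2 * a) < 1e-12 for v, a in zip(out, acts))
```

Randomly silencing units prevents co-adaptation between them, which is why the paper applies it in the large fully-connected layers where overfitting is worst.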
Article
Existing deep convolutional neural networks (CNNs) require a fixed-size (e.g. 224×224) input image. This requirement is “artificial” and may hurt the recognition accuracy for the images or sub-images of an arbitrary size/scale. In this work, we equip the networks with a more principled pooling strategy, “spatial pyramid pooling”, to eliminate the above requirement. The new network structure, called SPP-net, can generate a fixed-length representation regardless of image size/scale. By removing the fixed-size limitation, we can improve all CNN-based image classification methods in general. Our SPP-net achieves state-of-the-art accuracy on the datasets of ImageNet 2012, Pascal VOC 2007, and Caltech101. The power of SPP-net is more significant in object detection. Using SPP-net, we compute the feature maps from the entire image only once, and then pool features in arbitrary regions (sub-images) to generate fixed-length representations for training the detectors. This method avoids repeatedly computing the convolutional features. In processing test images, our method computes convolutional features 30-170× faster than the recent leading method R-CNN (and 24-64× faster overall), while achieving better or comparable accuracy on Pascal VOC 2007.
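The fixed-length pooling idea can be illustrated with a minimal sketch. The pure-Python single-channel "feature maps" and the two-level (1x1 plus 2x2) pyramid below are illustrative simplifications of the paper's multi-level max pooling over convolutional feature maps:

```python
def spatial_pyramid_pool(feature_map, levels=(1, 2)):
    """Max-pool a 2D feature map into fixed bin grids (1x1, 2x2, ...),
    yielding a fixed-length vector regardless of the input's size."""
    h, w = len(feature_map), len(feature_map[0])
    pooled = []
    for n in levels:
        for bi in range(n):
            for bj in range(n):
                # Bin boundaries scale with the input; each bin is non-empty.
                r0, r1 = bi * h // n, max((bi + 1) * h // n, bi * h // n + 1)
                c0, c1 = bj * w // n, max((bj + 1) * w // n, bj * w // n + 1)
                pooled.append(max(feature_map[r][c]
                                  for r in range(r0, r1)
                                  for c in range(c0, c1)))
    return pooled

small = [[1, 2], [3, 4]]
large = [[float(r * 5 + c) for c in range(5)] for r in range(4)]
# Different input sizes, identical output length (1 + 4 = 5 bins per channel):
assert len(spatial_pyramid_pool(small)) == len(spatial_pyramid_pool(large)) == 5
```

Because the bin grid, not the bin size, is fixed, the layer after it always sees the same input dimension, which is what removes the fixed-size input requirement.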
Article
We investigate architectures of discriminatively trained deep Convolutional Networks (ConvNets) for action recognition in video. The challenge is to capture the complementary information on appearance from still frames and motion between frames. We also aim to incorporate into the network design aspects of the best performing hand-crafted features. Our contribution is three-fold. First, we propose a two-stream ConvNet architecture which incorporates spatial and temporal networks. Second, we demonstrate that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data. Finally, we show that multi-task learning, applied to two different action classification datasets, can be used to increase the amount of training data and improve the performance on both. Our architecture is trained and evaluated on the standard video actions benchmarks of UCF-101 and HMDB-51, where it matches the state of the art. It also exceeds by a large margin previous attempts to use deep nets for video classification.
Article
I present a new way to parallelize the training of convolutional neural networks across multiple GPUs. The method scales significantly better than all alternatives when applied to modern convolutional neural networks.
Article
This paper addresses the visualisation of image classification models, learnt using deep Convolutional Networks (ConvNets). We consider two visualisation techniques, based on computing the gradient of the class score with respect to the input image. The first one generates an image, which maximises the class score [Erhan et al., 2009], thus visualising the notion of the class, captured by a ConvNet. The second technique computes a class saliency map, specific to a given image and class. We show that such maps can be employed for weakly supervised object segmentation using classification ConvNets. Finally, we establish the connection between the gradient-based ConvNet visualisation methods and deconvolutional networks [Zeiler et al., 2013].
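For a differentiable score function, the second technique amounts to taking the gradient of the class score with respect to the input and visualising its magnitude. The finite-difference stand-in below is a generic sketch (real ConvNet saliency uses a single backpropagation pass, not finite differences), and the linear score function is a made-up example:

```python
def saliency(score_fn, x, eps=1e-6):
    """Approximate |d score / d x_i| per input dimension by central finite
    differences; a stand-in for the backprop gradient used for saliency maps."""
    sal = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += eps
        xm[i] -= eps
        sal.append(abs((score_fn(xp) - score_fn(xm)) / (2 * eps)))
    return sal

# Illustrative linear "class score": the gradient is just the weight vector,
# so the saliency map should be |w|.
w = [0.2, -1.5, 0.0, 0.7]
score = lambda x: sum(wi * xi for wi, xi in zip(w, x))
sal = saliency(score, [1.0, 2.0, 3.0, 4.0])
assert abs(sal[1] - 1.5) < 1e-4 and sal.index(max(sal)) == 1
```

Dimensions (pixels) with large gradient magnitude are those whose perturbation most changes the class score, which is what makes the map usable as a weak localisation cue.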
Article
The ability of learning networks to generalize can be greatly enhanced by providing constraints from the task domain. This paper demonstrates how such constraints can be integrated into a backpropagation network through the architecture of the network. This approach has been successfully applied to the recognition of handwritten zip code digits provided by the U.S. Postal Service. A single network learns the entire recognition operation, going from the normalized image of the character to the final classification.
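The architectural constraint the paper builds in is, in essence, convolutional weight sharing: the same small kernel is applied at every position. A minimal one-dimensional sketch (a generic illustration, not the paper's network):

```python
def conv1d(signal, kernel):
    """Valid-mode 1D convolution: one shared kernel slides over the signal,
    so the layer has len(kernel) weights instead of one weight per position."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

# A difference kernel responds to local change wherever it occurs:
assert conv1d([1, 2, 3, 4], [1, 0, -1]) == [-2, -2]
```

Sharing weights both encodes translation invariance and drastically reduces the number of free parameters, which is exactly the generalization-enhancing constraint the abstract describes.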