Article

# MatConvNet - Convolutional Neural Networks for MATLAB

## Abstract

MatConvNet is an implementation of Convolutional Neural Networks (CNNs) for MATLAB. The toolbox is designed with an emphasis on simplicity and flexibility. It exposes the building blocks of CNNs as easy-to-use MATLAB functions, providing routines for computing linear convolutions with filter banks, feature pooling, and more. In this manner, MatConvNet allows fast prototyping of new CNN architectures; at the same time, it supports efficient computation on CPU and GPU, allowing complex models to be trained on large datasets such as ImageNet ILSVRC. This document provides an overview of CNNs and how they are implemented in MatConvNet and gives the technical details of each computational block in the toolbox.

... , 10}. The MatConvNet [61] implementation of the ResNet-152 network is used for evaluation. Unless otherwise specified, the experiments are conducted on a PC with 16GB memory and an i7 CPU and without GPU acceleration. ...
... We should also note that, in the above experiments, the classification accuracy of the original ResNet-152 network for the occlusion-free query images is slightly worse than the result reported in [8]. This may be caused by the inaccuracy in the re-implementation of the ResNet-152 network in MatConvNet [61]. Nevertheless, such a small difference does not devalue the merits of the proposed SDBE-based classification scheme. ...
Preprint
Classification of partially occluded images is a highly challenging computer vision problem even for cutting-edge deep learning technologies. To achieve robust image classification for occluded images, this paper proposes a novel scheme using subspace decomposition based estimation (SDBE). The proposed SDBE-based classification scheme first employs a base convolutional neural network to extract the deep feature vector (DFV) and then utilizes the SDBE to compute the DFV of the original occlusion-free image for classification. The SDBE is performed by projecting the DFV of the occluded image onto the linear span of a class dictionary (CD) along the linear span of an occlusion error dictionary (OED). The CD and OED are constructed respectively by concatenating the DFVs of a training set and the occlusion error vectors of an extra set of image pairs. Two implementations of the SDBE are studied in this paper: the $l_1$-norm and the squared $l_2$-norm regularized least-squares estimates. By employing the ResNet-152, pre-trained on the ILSVRC2012 training set, as the base network, the proposed SDBE-based classification scheme is extensively evaluated on the Caltech-101 and ILSVRC2012 datasets. Extensive experimental results demonstrate that the proposed SDBE-based scheme dramatically boosts the classification accuracy for occluded images, and achieves around $22.25\%$ increase in classification accuracy under $20\%$ occlusion on the ILSVRC2012 dataset.
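One way to read the squared $l_2$-norm variant described in this abstract is as a ridge-regularized least-squares problem. The notation here is an assumption made for illustration, since the abstract gives no symbols: $v$ is the DFV of the occluded image, $D$ is the class dictionary, $B$ is the occlusion error dictionary, $\alpha$ and $\beta$ are coefficient vectors, and $\lambda > 0$ is a regularization weight.

$$
(\hat{\alpha}, \hat{\beta}) = \arg\min_{\alpha,\,\beta} \; \lVert v - D\alpha - B\beta \rVert_2^2 + \lambda \left( \lVert \alpha \rVert_2^2 + \lVert \beta \rVert_2^2 \right), \qquad \hat{v}_0 = D\hat{\alpha}
$$

Under this reading, $\hat{v}_0$ is the estimated DFV of the occlusion-free image: the component of $v$ explained by the span of $D$ is kept, while the component $B\hat{\beta}$ attributed to occlusion is discarded. The $l_1$-norm variant would replace the squared $l_2$ penalty with $\lambda \left( \lVert \alpha \rVert_1 + \lVert \beta \rVert_1 \right)$. The paper's exact formulation may differ from this sketch.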
... The proposed model is implemented using the MatConvNet toolbox [21]. A Gaussian distribution [22] with standard deviation 2/N is used to initialize the weights of the network, and all the biases are initialized to zero. ...
Conference Paper
Full-text available
Light field (LF) imaging, which can capture spatial and angular information of light rays in one shot, has received increasing attention. However, the well-known LF spatio-angular trade-off problem has restricted many applications of LF imaging. In order to alleviate this problem, this paper puts forward a dual-level LF reconstruction network to improve LF angular resolution with sparsely-sampled LF inputs. Instead of using a 2D or 3D LF representation in the reconstruction process, this paper proposes an LF directional EPI volume representation to synthesize the full LF. The proposed LF representation can encourage an interaction of spatial-angular dimensions in the convolutional operation, which is beneficial for recovering the lost texture details in synthesized sub-aperture images (SAIs). In order to extract the high-dimensional geometric features of the angular mapping from low angular resolution inputs to high angular full LF, a dual-level deep network is introduced. The proposed deep network consists of an SAI synthesis sub-network and a detail refinement sub-network, which allows LF reconstruction in a dual-level constraint (i.e., from coarse to fine). Our network model is evaluated on several real-world LF scene datasets, and extensive experiments validate that the proposed model outperforms the state of the art and also achieves better perceptual quality in the reconstructed SAIs.
... Matlab is used as the main environment to construct and evaluate the proposed classification model. VLFeat libraries and MatConvNet toolbox [54] are used to build the FV-SUFR-HOG and CNN classifiers. The dataset is collected from the web and organized in seven categories that represent the as presented in our work [47]. ...
... By leveraging this property, and by mapping multiple CSI series into a matrix, we are able to apply a similar technique from the field of image recognition and sentence classification [21][22][23], i.e., convolutional NN (CNN) to detect the pattern of CSI variation. CNNs treat the feature extraction and the classification identically; in particular, feature extraction is implemented by convolution layers and classification is approached by full-connection layers [24,25]. As the shared weights in convolution layers and the weights in full-connection layers are trained together, the total classification error of a well-designed CNN can be significantly minimized [26]. ...
Article
Full-text available
To support the ever increasing number of devices in massive multiple-input multiple-output (mMIMO) systems, an excessive amount of overhead is required for conventional orthogonal pilot-based channel estimation schemes. To circumvent this fundamental constraint, we design a machine learning (ML)-based time-division duplex scheme in which channel state information (CSI) can be obtained by leveraging the temporal channel correlation. The presence of the temporal channel correlation is due to the stationarity of the propagation environment across time. The proposed ML-based predictors involve a pattern extraction implemented via a convolutional neural network, and a CSI predictor realized by an autoregressive (AR) predictor or an autoregressive network with exogenous inputs recurrent neural network. Closed-form expressions for the user uplink and downlink achievable spectral efficiency and average per-user throughput are provided for the ML-based time division duplex schemes. Our numerical results demonstrate that the proposed ML-based predictors can remarkably improve the prediction quality for both low and high mobility scenarios, and offer great performance gains on the per-user achievable throughput.
... The momentum is set to 0.9 and weight decay 0.0005. In addition, we realise the 1D CNNs and SVM based on MatConvNet with the version 1.0-beta20 [44] and liblinear with the version 1.96 [45], respectively. Fig. 7 shows some colour examples of real face and 3D face mask. ...
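The momentum and weight-decay values quoted in this excerpt can be read against a generic momentum-SGD step. The sketch below is a minimal illustration, not the cited authors' code and not MatConvNet's internal update: the function name `sgd_momentum_step`, the learning rate of 0.01, and the choice to fold weight decay into the gradient are all assumptions made for the example.

```python
def sgd_momentum_step(weights, grads, velocity,
                      lr=0.01, momentum=0.9, weight_decay=0.0005):
    """One SGD step with momentum and L2 weight decay (hypothetical helper).

    lr=0.01 is an assumed value; momentum and weight_decay follow the excerpt.
    """
    new_weights, new_velocity = [], []
    for w, g, v in zip(weights, grads, velocity):
        # Weight decay adds a pull towards zero proportional to the weight.
        g_total = g + weight_decay * w
        # Momentum keeps a decaying running sum of past updates.
        v_next = momentum * v - lr * g_total
        new_velocity.append(v_next)
        new_weights.append(w + v_next)
    return new_weights, new_velocity


if __name__ == "__main__":
    w, v = [1.0, -2.0], [0.0, 0.0]
    for _ in range(3):
        # Constant gradients, purely for demonstration.
        w, v = sgd_momentum_step(w, [0.5, -0.25], v)
    print(w)
```

With momentum 0.9, each step retains 90% of the previous update, so a constant gradient produces steps that grow over the first few iterations before levelling off.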
Article
Full-text available
Face presentation attacks have become a major threat against face recognition systems and many countermeasures have been proposed over the past decade. However, most of them are devoted to 2D face presentation attack detection, rather than 3D face masks. Unlike the real face, the 3D face mask is usually made of resin materials and has a smooth surface, resulting in reflectance differences. Therefore, in this study, the authors propose a novel 3D face mask presentation attack detection method based on analysis of image reflectance. In the proposed method, the face image is first processed with intrinsic image decomposition algorithm to compute its reflectance image. Then, the intensity distribution histograms are extracted from three orthogonal planes to represent the intensity differences of reflectance images between the real face and 3D face mask. After that, given that the reflectance image of a smooth surface is more sensitive to illumination changes, 1D convolutional neural network is used to characterise how different materials or surfaces react differently to illumination changes. Extensive experiments with the public available 3DMAD database demonstrate the effectiveness of the proposed method for distinguishing a face mask from the real one and show that the detection performance outperforms other state‐of‐the‐art methods.
... The size of the block depends on the input size of the deep learning model, typically are 256, 128 or 32. For example, in [22], the authors use MatConvNet [23] to classify the input pavement 256 × 256 images into defective and non-defective. In [24], Shengyuan et al. modify the structure of GoogleNet [25] to do the classification. ...
Article
Full-text available
Industrial product surface defect detection is very important to guarantee high product quality and production efficiency. In this work, we propose a regression and classification based framework for generic industrial defect detection. Specifically, the framework consists of four modules: deep regression based detection model, pixel-level false positive reduction, connected component analysis and deep network for defect type classification. To train the detection model, we propose a high performance deep network structure and an algorithm to generate label data to capture the defect severity information from data annotation. We have tested the method on two public benchmark datasets, AigleRN and DAGM2007, and an in-house capacitor image dataset. The results have shown that our method can achieve the state-of-the-art performance in terms of detection accuracy and efficiency.
... The patches rotation and flip were used for data augmentation. We accomplished the training and test procedure on a NVIDIA RTX 2080Ti GPU with a package named MatConvNet [41]. ...
Preprint
Full-text available
Networks with large receptive field (RF) have shown advanced fitting ability in recent years. In this work, we utilize the short-term residual learning method to improve the performance and robustness of networks for image denoising tasks. Here, we choose a multi-wavelet convolutional neural network (MWCNN), one of the state-of-art networks with large RF, as the backbone, and insert residual dense blocks (RDBs) in its each layer. We call this scheme multi-wavelet residual dense convolutional neural network (MWRDCNN). Compared with other RDB-based networks, it can extract more features of the object from adjacent layers, preserve the large RF, and boost the computing efficiency. Meanwhile, this approach also provides a possibility of absorbing advantages of multiple architectures in a single network without conflicts. The performance of the proposed method has been demonstrated in extensive experiments with a comparison with existing techniques.
... This is the motivation behind choosing 1D-CNN for our supervised classification. We have developed our 1D-CNN using MATLAB R2019a [11]. The architecture of a classical CNN is given in Fig. 3. ...
... In the work of [44], Li et al. modified GoogLeNet [45] to classify image blocks and realized crack detection on real pavement using a smartphone. In [46], Cha et al. used MatConvNet [47] to classify the input pavement 256 × 256 images. Similarly, in [43], the authors generated image patches of 99 × 99 from original pavement images, where the patch is defective if its center pixel is within 5 pixels of the crack center. ...
Article
Full-text available
Road pavement cracks detection has been a hot research topic for quite a long time due to the practical importance of crack detection for road maintenance and traffic safety. Many methods have been proposed to solve this problem. This paper reviews the three major types of methods used in road cracks detection: image processing, machine learning and 3D imaging based methods. Image processing algorithms mainly include threshold segmentation, edge detection and region growing methods, which are used to process images and identify crack features. Crack detection based traditional machine learning methods such as neural network and support vector machine still relies on hand-crafted features using image processing techniques. Deep learning methods have fundamentally changed the way of crack detection and greatly improved the detection performance. In this work, we review and compare the deep learning neural networks proposed in crack detection in three ways, classification based, object detection based and segmentation based. We also cover the performance evaluation metrics and the performance of these methods on commonly-used benchmark datasets. With the maturity of 3D technology, crack detection using 3D data is a new line of research and application. We compare the three types of 3D data representations and study the corresponding performance of the deep neural networks for 3D object detection. Traditional and deep learning based crack detection methods using 3D data are also reviewed in detail.
... In the experiment, we adopt dense SIFT local features and CNN local feature, respectively, to validate our method. All the algorithms are implemented by the VLFeat [48] and Matconvnet toolbox [49]. The setting for the CNN local feature will be discussed in Section 4.5. ...
Article
Full-text available
In the standard bag‐of‐visual‐words model, the relationship between visual words and geometric structure information embedding in Voronoi cells is important for expressing the topology of the feature space. However, this information is usually ignored by recent works. To overcome it, the authors proposed a hybrid heterogeneous structure model (HHSM), where local hyperspheres and local structure subspaces are applied to simulate the intrinsic structure of the feature space. Firstly, the local hypersphere is formed by choosing some links between parts of visual words, with the use of a proposed decision strategy derived from k‐dense neighbour algorithm. In order to capture the geometric structure information around the visual word, they then construct the local structure subspace with the transformed PCA principal vectors of the visual features within a Voronoi cell. Finally, this study introduces a novel feature encoding method based on the HHSM. Experiments are conducted on 15‐Scenes, Pascal VOC2007, Caltech101, Caltech256 and MIT Indoor 67 datasets, which include 4485, 9963, 9146, 30607 and 15620 images, respectively. The results demonstrate the effectiveness of the proposed method in improving the accuracy of the classification. In addition, the proposed method achieves comparable performance when combined with CNN local features.
... The algorithm is implemented in MATLAB and uses MatConvNet library [31]. We measure it on a desktop machine with Intel Core i7-4770 processor at 3.70 GHz, 8 GB RAM and a 12 GB NVIDIA GTX TITAN X GPU. ...
Article
Full-text available
Deep learning has been widely used in many visual recognition tasks owing to its powerful representation ability. However, online learning is a bottleneck to obstruct the application of deep learning in visual tracking. Although many algorithms have discarded the process of online learning during tracking, they demonstrate poor robustness to the online adaptation to appearance changes of the target. In this study, the authors design a tree structure specifically for online learning, which enables the appearance model to be updated smoothly. Once the target appearance has changed severely, a new branch is generated to avoid the fuzzy boundary of classification. In addition, active learning technique and artificial data are employed in the update to make the best of the limited knowledge about the interesting object during the tracking process. The proposed algorithm is evaluated on OTB2013 and VOT2017 benchmark and outperforms many state‐of‐the‐art methods.
... The AlexNet is selected because it is flexible for modification, able to reduce overfitting by using a dropout layer, and capable of training faster by using a rectified linear unit (ReLU). The AlexNet model (Fig. 2) that has been used is a pre-trained network from the MatConvNet toolbox [27][28][29]. It consists of 25 layers with weights. ...
... 2) Development Environment: All the numerical experiments are performed on an AMAX workstation, which has 2 Intel Xeon E5-2640Wv4 CPUs with 10 cores, 2 NVIDIA Titan X GPUs, and a 128-GB memory. And MatConvNet [55] acts as our development platform. ...
Article
Full-text available
In high-resolution remote sensing image retrieval (HRRSIR), convolutional neural networks (CNNs) have an absolute performance advantage over the traditional hand-crafted features. However, some CNN-based HRRSIR models are classification-oriented, they pay no attention to similarity, which is critical to image retrieval; whereas others concentrate on learning similarity, failing to take full advantage of information about class labels. To address these issues, we propose a novel model called classification-similarity network (CSN), which aims for image classification and similarity prediction at the same time. In order to further improve performance, we build and train two CSNs, and two kinds of information from them, i.e., deep features and similarity scores, are consolidated to measure the final similarity between two images. Besides, the optimal fusion theorem in biometric authentication, which gives a theoretical scheme to make sure that fusion will definitely lead to a better performance, is used to conduct score fusion. Extensive experiments are carried out over publicly available datasets, demonstrating that CSNs are distinctly superior to usual CNNs and our proposed "two CSNs + feature fusion + score fusion" method outperforms the state-of-the-art models.
... The learning rate, batch size, and the number of epochs are set to 0.0001, 10, and 120 for DLRSD dataset, and set to 0.001, 32, and 120 for WHDLD dataset. Our FCN is implemented by using MatConvNet [31] as the deep learning framework, and trained with stochastic gradient descent as the optimizer. It is worth noting that we perform 420 queries for DLRSD dataset and 988 queries for WHDLD dataset, and the query image is also regarded as a similar image during one query. ...
Article
Full-text available
Conventional remote sensing image retrieval (RSIR) system usually performs single-label retrieval where each image is annotated by a single label representing the most significant semantic content of the image. In this scenario, however, the scene complexity of remote sensing images is ignored, where an image might have multiple classes (i.e., multiple labels), resulting in poor retrieval performance. We therefore propose a novel multilabel RSIR approach based on fully convolutional network (FCN). Specifically, FCN is first trained to predict segmentation map of each image in the considered image archive. We then obtain multilabel vector and extract region convolutional features of each image based on its segmentation map. The extracted region features are finally used to perform region-based multilabel retrieval. The experimental results show that our approach achieves state-of-the-art performance in contrast to handcrafted and convolutional neural network features.
... The configuration of the computer used in the experiment is as follows: Intel(R) Core(TM) i5-6500 3.20GHz is the CPU; NVIDIA GTX 1080 with 8GB memory is the GPU. In the course of the experiment, we use the MatConvNet deep learning framework [34] and the Matlab version is R2016b. We logarithmically change the learning rate from 0.01 to 0.001, choose the batch size, patch size, momentum, number of epochs and the gradient clipping value to be 1, 256, 0.99, 151 and $10^{-2}$, respectively. ...
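A logarithmic learning-rate schedule of the kind mentioned in this excerpt can be sketched as follows. This is an illustration under assumptions rather than the cited authors' code: the helper name `log_spaced_learning_rates` is hypothetical, and assigning exactly one rate per epoch across the 151 epochs is an assumed reading of the excerpt.

```python
def log_spaced_learning_rates(start=0.01, end=0.001, num_epochs=151):
    """Return one learning rate per epoch, evenly spaced on a log scale.

    Hypothetical helper: the endpoints and epoch count follow the excerpt,
    the per-epoch mapping is an assumption.
    """
    if num_epochs < 2:
        return [start]
    ratio = end / start
    # Equal steps in the exponent give equal steps in log(learning rate).
    return [start * ratio ** (epoch / (num_epochs - 1))
            for epoch in range(num_epochs)]


if __name__ == "__main__":
    rates = log_spaced_learning_rates()
    print(len(rates), rates[0], rates[-1])
```

In MATLAB, the same sequence of values would typically be produced with `logspace(-2, -3, 151)`.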
Article
Full-text available
As a low-end computed tomography (CT) system, translational CT (TCT) is in urgent demand in developing countries. Under some circumstances, in order to reduce the scan time, decrease the X-ray radiation or scan long objects, furthermore, to avoid the inconsistency of the detector for the large angle scanning, we use the limited-angle TCT scanning mode to scan an object within a limited angular range. However, this scanning mode introduces some additional noise and limited-angle artifacts that seriously degrade the imaging quality and affect the diagnosis accuracy. To reconstruct a high-quality image for the limited-angle TCT scanning mode, we develop a limited-angle TCT image reconstruction algorithm based on a U-net convolutional neural network (CNN). First, we use the SART method to the limited-angle TCT projection data, then we import the image reconstructed by SART method to a well-trained CNN which can suppress the artifacts and preserve the structures to obtain a better reconstructed image. Some simulation experiments are implemented to demonstrate the performance of the developed algorithm for the limited-angle TCT scanning mode. Compared with some state-of-the-art methods, the developed algorithm can effectively suppress the noise and the limited-angle artifacts while preserving the image structures.
... As all of the three CNNs were designed for a 1,000-class classification task, the deep representations are extracted from the activations of the second fully connected layer fc7. Notably, the pre-trained AlexNet is obtained from MATLAB R2017a, and the VGG-16 and VGG-19 models are from MatConvNet [200]. Next, the extracted features are fed into (B)GRU-RNNs with 120 and 60 neurons respectively with a tanh activation function. ...
Thesis
Full-text available
Automatically recognising audio signals plays a crucial role in the development of intelligent computer audition systems. Particularly, audio signal classification, which aims to predict a label for an audio wave, has promoted many real-life applications. Amounts of efforts have been made to develop effective audio signal classification systems in the real world. However, several challenges in deep learning techniques for audio signal classification remain to be addressed. For instance, training a deep neural network (DNN) from scratch is time-consuming to extracting high-level deep representations. Furthermore, DNNs have not been well explained to construct the trust between humans and machines, and facilitate developing realistic intelligent systems. Moreover, most DNNs are vulnerable to adversarial attacks, resulting in many misclassifications. To deal with these challenges, this thesis proposes and presents a set of deep-learning-based approaches for audio signal classification. In particular, to tackle the challenge of extracting high-level deep representations, the transfer learning frameworks, benefiting from pre-trained models on large-scale image datasets, are introduced to produce effective deep spectrum representations. Furthermore, the attention mechanisms at both the frame level and the time-frequency level are proposed to explain the DNNs by respectively estimating the contributions of each frame and each time-frequency bin to the predictions. Likewise, the convolutional neural networks (CNNs) with an attention mechanism at the time-frequency level is extended to atrous CNNs with attention, aiming to explain the CNNs by visualising high-resolution attention tensors. Additionally, to interpret the CNNs evaluated on multi-device datasets, the atrous CNNs with attention are trained in the conditional training frameworks. Moreover, to improve the robustness of the DNNs against adversarial attacks, models are trained in the adversarial training frameworks. 
Besides, the transferability of adversarial attacks is enhanced by a lifelong learning framework. Finally, the experiments conducted with various datasets demonstrate that these presented approaches are effective to address the challenges.
... The network parameters are initialized based on the method in [40]. It takes about 9.5 hours to train our model with the MatConvNet package [41] on a Nvidia GeForce GTX 1080 GPU. The source code and test results will be released after the publication of this work. ...
Article
Full-text available
Despite the significant advances in convolutional neural network (CNN) based image denoising, the existing methods still cannot consistently outperform non-local self-similarity (NSS) based methods, especially on images with many repetitive structures. Although several studies have attempted to incorporate NSS priors with CNN-based denoising, their improvement is generally insignificant when compared with the state-of-the-art CNN-based denoisers. In this paper, we suggest combining CNN and NSS based methods for improved image denoising, resulting in an NSS-UNet architecture. Motivated by gradient descent inference of TNRD, both the current estimate and noisy observation are considered as the inputs to the CNN. To take the NSS prior into account, the result by NSS (e.g., BM3D or WNNM) is adopted as the initial estimate, and a modified UNet is presented for exploiting the multi-scale information. We evaluate the proposed method on three common testing datasets. The results clearly show that NSS-UNet outperforms the existing CNN and NSS based methods in terms of both PSNR index and visual quality.
Article
Full-text available
Age-invariant face recognition is one of the most crucial computer vision problems, e.g., in passport verification, surveillance systems, and missing individuals identification. The extraction of robust face features is a challenge since the facial characteristics change over age progression. In this paper, an age-invariant face recognition system is proposed, which includes four stages: preprocessing, feature extraction, feature fusion, and classification. Preprocessing stage detects faces using Viola–Jones algorithm and frontal face alignment. Feature extraction is achieved using a CNN architecture using VGG-Face model to extract compact face features. Extracted features are fused using the real-time feature-level multi-discriminant correlation analysis, which significantly reduces feature dimensions and results in the most relevant features to age-invariant face recognition. Finally, K-nearest neighbor and support vector machine are investigated for classification. Our experiments are performed on two standard face-aging datasets, namely FGNET and MORPH. Rank-1 recognition accuracy of the proposed system is 81.5% on FGNET and 96.5% on MORPH. Experimental results outperform the current state-of-the-art techniques on same data. These preliminary results show the promise of the proposed system for personal identification despite aging process.
Article
Full-text available
This paper presents a novel unmanned aerial vehicle tracking framework. First, hierarchical convolutional neural network features are used to track the object independently in a correlation filter tracking framework. Second, a stability criterion is proposed, which is based on the variance of tracking results of each layer. Next, tracking result is adaptively fused via the variance. Meanwhile, the criterion can be used to measure the quality of tracking results. A saliency detection method is utilized to generate candidate regions when tracking failure occurs. By virtue of this method, our tracking algorithm can robustly cope with appearance changes and prevent drifting issues. Experimental results show that our proposed tracking algorithm performs favorably against state-of-the-art methods on two benchmark datasets.
Article
Full-text available
In fluorescence microscopy imaging, noise is a very usual phenomenon. To some extent, it can be suppressed by increasing the amount of the photon exposure; however, it is not preferable since this may not be tolerated by the subjected specimen. Thus, a sophisticated computational method is needed to denoise each acquired micrograph, so that they become more adequate for further feature extraction and image analysis. However, apart from the difficulties of the denoising problem itself, one main challenge is that the absence of the ground-truth images makes the data-driven techniques less applicable. In order to tackle this challenge, we suggest to tailor a dataset by handpicking images from unrelated source datasets. Our tailoring strategy involves exploring some low-level view-based features of the candidate images, and their similarities to those of the fluorescence microscopy images. We pretrain and fine-tune the well-known feed-forward denoising convolutional neural networks (DnCNNs) on our tailored dataset and a very limited amount of fluorescence images, respectively to ensure both the diversity and the content-awareness. The quantitative and visual experimentation show that our approach is able to curate a dataset, which is significantly superior to the arbitrarily chosen source images, and well-approximates to the fluorescence images. Moreover, the combination of the tailored dataset with a few fluorescence data through the use of fine tuning offers a good balance between the generalization capability and the content awareness, on the majority of considered scenarios.
Article
This paper proposes a general conversion theory to reveal the relations between convolutional neural network (CNN) and spiking convolutional neural network (spiking CNN) from structure to information processing. Based on the conversion theory and the statistical features of the activations distribution in CNN, we establish a deterministic conversion rule to convert CNNs into spiking CNNs with definite conversion procedure and the optimal setting of all parameters. Included in conversion rule, we propose a novel “n-scaling” weight mapping method to realize high-accuracy, low-latency and power efficient object classification on hardware. For the first time, the minimum dynamic range of spiking neuron’s membrane potential is studied to help to balance the trade-off between representation range and precise of the data type adopted by dedicated hardware when spiking CNNs run on it. The simulation results demonstrate that the converted spiking CNNs perform well on MNIST, SVHN and CIFAR-10 datasets. The accuracy loss over three datasets is no more than 0.4%. 39% of processing time is shortened at best, and less power consumption is benefited from lower latency achieved by our conversion rule. Furthermore, the results of noise robustness experiments indicate that spiking CNN inherits the robustness from its corresponding CNN.
Article
Hashing has been drawing increasing attention in the task of large-scale image retrieval owing to its storage and computation efficiency, especially the recent asymmetric deep hashing methods. These approaches treat the query and database in an asymmetric way and can take full advantage of the whole training data. Though they achieve state-of-the-art performance, asymmetric deep hashing methods still suffer from large quantization error and efficiency problems on large-scale datasets due to the tight coupling between query and database. In this paper, we propose a novel asymmetric hashing method, called Deep Uncoupled Discrete Hashing (DUDH), for large-scale approximate nearest neighbor search. Instead of directly preserving the similarity between query and database, DUDH first exploits a small similarity-transfer image set to transfer the underlying semantic structures from database to query, and implicitly keeps the desired similarity. As a result, the large similarity matrix is decomposed into two relatively small ones and query is decoupled from database. Then both database codes and similarity-transfer codes are directly learned during optimization. The quantization error of DUDH only exists in the process of preserving similarity between query and similarity-transfer set. By uncoupling query from database, the training cost of optimizing CNN model for query is no longer related to the size of database. Besides, to further accelerate the training process, we propose to optimize the similarity-transfer codes with a constant-approximation solution. In doing so, the training cost of optimizing similarity-transfer codes can be almost ignored. Extensive experiments on four widely used image retrieval benchmarks demonstrate that DUDH can achieve state-of-the-art retrieval performance with remarkable training cost reduction ($30\%$-$50\%$ relative).
Article
Human falls are a critical health issue, especially for elderly and disabled people living alone. The elderly population is increasing steadily worldwide. Therefore, human fall detection is becoming an effective technique for assistive living for those people. Deep learning and computer vision have been widely used for assistive living. In this review article, we discuss state-of-the-art deep learning (DL)-based non-intrusive (vision-based) fall detection techniques. We also present a survey of fall detection benchmark datasets. For a clear understanding, we briefly discuss the different metrics used to evaluate the performance of fall detection systems. This article also outlines future directions for vision-based human fall detection techniques.
Article
Representational Similarity Analysis (RSA) has emerged as a popular method for relating representational spaces from human brain activity, behavioral data, and computational models. RSA is based on the comparison of representational (dis-)similarity matrices (RDM or RSM), which characterize the pairwise (dis-)similarities of all conditions across all features (e.g. fMRI voxels or units of a model). However, classical RSA treats each feature as equally important. This ‘equal weights’ assumption contrasts with the flexibility of multivariate decoding, which reweights individual features for predicting a target variable. As a consequence, classical RSA may lead researchers to underestimate the correspondence between a model and a brain region and, in case of model comparison, may lead them to select an inferior model. The aim of this work is twofold: First, we sought to broadly test feature-reweighted RSA (FR-RSA) applied to computational models and reveal the extent to which reweighting model features improves RSM correspondence and affects model selection. Previous work suggested that reweighting can improve model selection in RSA but it has remained unclear to what extent these results generalize across datasets and data modalities. To draw more general conclusions, we utilized a range of publicly available datasets and three popular deep neural networks (DNNs). Second, we propose voxel-reweighted RSA, a novel use case of FR-RSA that reweights fMRI voxels, mirroring the rationale of multivariate decoding of optimally combining voxel activity patterns. We found that reweighting individual model units markedly improved the fit between model RSMs and target RSMs derived from several fMRI and behavioral datasets and affected model selection, highlighting the importance of considering FR-RSA. For voxel-reweighted RSA, improvements in RSM correspondence were even more pronounced, demonstrating the utility of this novel approach. 
We additionally show that classical noise ceilings can be exceeded when FR-RSA is applied and propose an updated approach for their computation. Taken together, our results broadly validate the use of FR-RSA for improving the fit between computational models, brain, and behavioral data, possibly allowing us to better adjudicate between competing computational models. Further, our results suggest that FR-RSA applied to brain measurement channels could become an important new method to assess the correspondence between representational spaces.
Article
Distributed acoustic sensing (DAS), a new geophone for effective acquisition of vertical seismic profiles, has several advantages, including low cost, high precision, and high temperature resistance. However, it is susceptible to various noises that can contaminate the desired weak signals; thus denoising is an essential and independent procedure in DAS processing. Therefore, we propose a transform learning method to train a denoising model in the time-frequency domain, considering that DAS noise comes in multiple types and behaves in complex ways in the time domain. Initially, the synchrosqueezing transform is used to generate an effective sparse representation of the DAS data in the frequency domain, where the data remain more centered and uniform. Further, a redesigned convolutional neural network based on residual learning is built to extract different features between the noise and signal components, thus separating them. Finally, the denoised results are obtained by inversely transforming the denoised components into the time domain. Three factors improve the denoising performance of our method: 1. It can directly provide high-dimensional features for training, reducing data dependency to an extent. 2. The designed network calculates a powerful nonlinear mapping between the additive noise and the input noisy component, significantly reducing the training difficulties. 3. A new objective function in the synchrosqueezing frequency domain is designed for nonlinear mapping optimization. Both synthetic and real examples demonstrate the method's effectiveness in denoising and improving the signal-to-noise ratio of DAS. Further, the denoised results can contribute to a more accurate velocity analysis.
Article
Scene parsing is the problem of densely labeling every pixel in an image with a meaningful class label. Driven by powerful methods, remarkable progress has been achieved in scene parsing over a short period of time. With growing data, non-parametric scene parsing or label transfer approach has emerged as an exciting and rapidly growing research area within Computer Vision. This paper constitutes a first survey examining label transfer methods through the lens of non-parametric, data-driven philosophy. We provide insights on non-parametric system design and its working stages, i.e. algorithmic components such as scene retrieval, scene correspondence, contextual smoothing, etc. We propose a synthetic categorization of all the major existing methods, discuss the necessary background, the design choices, followed by an overview of the shortcomings and challenges for a better understanding of label transfer. In addition, we introduce the existing standard benchmark datasets, the evaluation metrics, and the comparisons of model-based and data-driven methods. Finally, we provide our recommendations and discuss the current challenges and promising research directions in the field.
Article
Correlation Filters (CFs) have shown outstanding performance in tracking, but are subject to unwanted boundary effects. Spatial regularization (SR) is widely used as an efficient method to alleviate the boundary effects. However, spatial regularization is mostly handcrafted and fixed during the tracking process, so it cannot handle the diversity of objects and the complexity of motion. Furthermore, the rich spatio-temporal correlations among multiple targets of interest cannot be fully exploited. Herein, we propose a spatio-temporal Gaussian scale mixture model (ST-GSM) for correlation-filter-based visual tracking. In our Gaussian scale mixture (GSM) model, each correlation filter coefficient is decomposed into the product of a positive scalar multiplier with sparsity and a Gaussian random variable. The reliable components of the Gaussian random variable can be adaptively selected based on the positive multipliers, aiming at alleviating the notorious boundary effects. To exploit the temporal consistency between adjacent frames, nonzero-mean GSM models are developed to characterize the temporal correlations. Specifically, the filter coefficient obtained in the previous frame is used as the mean prior for the current frame. The spatial correlations among filter coefficients are considered in the structured GSM model, thereby further improving the tracking performance. Experimental results show that the proposed model can significantly improve the performance of CF-based trackers.
Article
In recent years, computed tomography (CT) has been widely used in various clinical diagnoses. Given the potential health risks brought by X-ray radiation, the major objective of current research is to achieve high-quality CT imaging while reducing X-ray radiation. However, most existing studies on low-dose CT image super-resolution reconstruction do not focus on the interaction between the denoising task and the super-resolution task. In this paper, we propose a dual-channel joint learning framework to accurately reconstruct high-resolution CT images from low-resolution CT images. Unlike previous cascaded models, which directly combine the denoising network and the super-resolution network, our method can process the denoising reconstruction and the super-resolution reconstruction in parallel. Additionally, we design a filter gate module that can filter features from the denoising branch and highlight important features that benefit the super-resolution task. We evaluate the performance of our method in medical image enhancement by testing on the 2016 Low-Dose CT Grand Challenge dataset and the piglet dataset. The experimental results show that the proposed network is superior to other state-of-the-art methods in terms of both peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM). We also demonstrate that our method can better remove noise and recover details. Furthermore, the method achieves competitive results not only for super-resolution reconstruction of low-dose CT, but also for super-resolution reconstruction of sparse-view CT.
Article
Deep learning (DL) technologies have recently shown great potential in emotion recognition based on electroencephalography (EEG). However, existing DL-based EEG emotion recognition methods are built on single-task learning, i.e., learning arousal, valence, and dominance individually, which may ignore the complementary information of different tasks. In addition, single-task learning involves a new round of training every time a new task appears, which is time consuming. To this end, we propose a novel method for EEG-based emotion recognition based on multi-task learning with a capsule network (CapsNet) and an attention mechanism. First, multi-task learning can learn multiple tasks simultaneously while exploiting commonalities and differences across tasks; it can also obtain more data from different tasks, which can improve generalization and robustness. Second, the innovative structure of the CapsNet enables it to effectively characterize the intrinsic relationship among various EEG channels. Finally, the attention mechanism can change the weight of different channels to extract important information. On the DEAP dataset, the average accuracy reached 97.25%, 97.41%, and 98.35% on arousal, valence, and dominance, respectively. On the DREAMER dataset, the average accuracy reached 94.96%, 95.54%, and 95.52% on arousal, valence, and dominance, respectively. Experimental results demonstrate the efficiency of the proposed method for EEG emotion recognition.
Chapter
Time-resolved imaging is becoming popular in radiotherapy because it significantly reduces blurring artifacts in volumetric images reconstructed from a set of 2D X-ray projection data. We aim to develop a neural network (NN) based machine learning algorithm that allows an instantaneous image to be reconstructed from a single projection. In our approach, each volumetric image is represented as a deformation of a chosen reference image, in which the deformation is modeled as a linear combination of a few basis functions obtained through principal component analysis (PCA). Based on this PCA deformation model, we train an ensemble of neural networks to find a mapping from a projection image to PCA coefficients. For image reconstruction, we apply the learned mapping to an instantaneous projection image to obtain the PCA coefficients, and thus a deformation. Then, a volumetric image can be reconstructed by applying the deformation to the reference image. Experimentally, we show promising results on a set of simulated data.
Article
Stimuli are represented in the brain by the collective population responses of sensory neurons, and an object presented under varying conditions gives rise to a collection of neural population responses called an ‘object manifold’. Changes in the object representation along a hierarchical sensory system are associated with changes in the geometry of those manifolds, and recent theoretical progress connects this geometry with ‘classification capacity’, a quantitative measure of the ability to support object classification. Deep neural networks trained on object classification tasks are a natural testbed for the applicability of this relation. We show how classification capacity improves along the hierarchies of deep neural networks with different architectures. We demonstrate that changes in the geometry of the associated object manifolds underlie this improved capacity, and shed light on the functional roles different levels in the hierarchy play to achieve it, through orchestrated reduction of manifolds’ radius, dimensionality and inter-manifold correlations.
Article
Conventional discriminative-correlation-filter-based (DCF-based) visual tracking methods always update the model at a fixed frequency and learning rate. Without an evaluation of the tracking confidence scores, the response map generated by the filter is the only evidence for localization. Thus, most existing DCF-based methods suffer from model contamination caused by drastic appearance variations, which leads to tracking drift or even failure. Excessively frequent updates also increase computational redundancy and the risk of over-fitting. In addition, these methods cannot recover the target after heavy occlusion. Based on the observation that the shape of the response map reflects the degree of matching between filter and target, we design and train a small-scale binary network, named the response map analysis network (RAN), to evaluate the confidence scores of filters. Further, we propose to learn multiple filters to exploit different kinds of features, and to adaptively adjust the update parameters according to the corresponding confidence scores. Moreover, we build a simple occlusion event model to detect heavy occlusion and recover the target. Extensive experimental results validate the effectiveness of RAN and demonstrate that the proposed tracker performs favorably against other state-of-the-art (SOTA) DCF-based trackers in terms of precision, overlap rate and efficiency.
Article
Caffe provides multimedia scientists and practitioners with a clean and modifiable framework for state-of-the-art deep learning algorithms and a collection of reference models. The framework is a BSD-licensed C++ library with Python and MATLAB bindings for training and deploying general-purpose convolutional neural networks and other deep models efficiently on commodity architectures. Caffe fits industry and internet-scale media needs by CUDA GPU computation, processing over 40 million images a day on a single K40 or Titan GPU (approx 2 ms per image). By separating model representation from actual implementation, Caffe allows experimentation and seamless switching among platforms for ease of development and deployment from prototyping machines to cloud environments. Caffe is maintained and developed by the Berkeley Vision and Learning Center (BVLC) with the help of an active community of contributors on GitHub. It powers ongoing research projects, large-scale industrial applications, and startup prototypes in vision, speech, and multimedia.
Conference Paper
Torch7 is a versatile numeric computing framework and machine learning library that extends Lua. Its goal is to provide a flexible environment to design and train learning machines. Flexibility is obtained via Lua, an extremely lightweight scripting language. High performance is obtained via efficient OpenMP/SSE and CUDA implementations of low-level numeric routines. Torch7 can easily be interfaced to third-party software thanks to Lua's light interface.
Conference Paper
We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called "dropout" that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
Article
The latest generation of Convolutional Neural Networks (CNNs) has achieved impressive results in challenging benchmarks on image recognition and object detection, significantly raising the interest of the community in these methods. Nevertheless, it is still unclear how different CNN methods compare with each other and with previous state-of-the-art shallow representations such as the Bag-of-Visual-Words and the Improved Fisher Vector. This paper conducts a rigorous evaluation of these new techniques, exploring different deep architectures and comparing them on a common ground, identifying and disclosing important implementation details. We identify several useful properties of CNN-based representations, including the fact that the dimensionality of the CNN output layer can be reduced significantly without having an adverse effect on performance. We also identify aspects of deep and shallow methods that can be successfully shared. A particularly significant one is data augmentation, which achieves a boost in performance in shallow methods analogous to that observed with CNN-based methods. Finally, we are planning to provide the configurations and code that achieve the state-of-the-art performance on the PASCAL VOC Classification challenge, along with alternative configurations trading off performance, computation speed and compactness.
Article
The matrix differential calculus is applied for the first time to a quantum chemical problem via new matrix derivations of integral formulas and gradients for Hamiltonian matrix elements in a basis of correlated Gaussian functions. Requisite mathematical background material on Kronecker products, Hadamard products, the vec and vech operators, linear structures, and matrix differential calculus is presented. New matrix forms for the kinetic and potential energy operators are presented. Integrals for overlap, kinetic energy, and potential energy matrix elements are derived in matrix form using matrix calculus. The gradient of the energy functional with respect to the correlated Gaussian exponent matrices is derived. Burdensome summation notation is entirely replaced with a compact matrix notation that is both theoretically and computationally insightful. © 1996 John Wiley & Sons, Inc.
Theano: a CPU and GPU math expression compiler
James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio. Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy), 2010.