Article

A guide to convolution arithmetic for deep learning

Authors:
To read the full-text of this research, you can request a copy directly from the authors.

Abstract

We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.
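As a rough illustration of the relationships the abstract mentions, the sketch below computes the output size along one axis of a convolutional or pooling layer from the input size, kernel size, zero padding and stride. It uses the standard floor-based relation o = floor((i + 2p - k) / s) + 1. The function and parameter names are chosen here for illustration and are not taken from the guide.

```python
def conv_output_size(i: int, k: int, p: int = 0, s: int = 1) -> int:
    """Output size along one axis for input i, kernel k, zero padding p, stride s."""
    if i + 2 * p < k:
        raise ValueError("kernel does not fit inside the padded input")
    return (i + 2 * p - k) // s + 1


# No padding, unit stride: a 3-wide kernel on a 5-wide input gives 3 outputs.
print(conv_output_size(5, 3))            # 3
# Padding of 1 with a 3-wide kernel keeps the size unchanged at unit stride.
print(conv_output_size(5, 3, p=1))       # 5
# Same padding, but the kernel now moves two positions at a time.
print(conv_output_size(5, 3, p=1, s=2))  # 3
```

The same function covers pooling layers, which normally use p = 0.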

No full-text available


... In a 2D convolution layer, a filter or kernel is applied to a 2D image, performing a dot product at each position [33]. The kernel size (k) impacts the detail level captured, whereas the stride (s) affects the kernel shift amount. Padding (p), set to "same" in this study, ensures that the output feature map matches the input image dimensions, permitting edge-based convolution operations. ...
... 2D max pooling is a feature map reduction method where a rectangular kernel selects maximum values within regions, creating a smaller feature map [33]. The kernel size (k) defines the sliding window size over the input, and the stride (s), set equal to k in our study to simplify the computation, controls the window's movement. For a given position (i, j), the topmost and leftmost pixels from the previous layer's TRF can be accessed from the T^(d−1) tensor at the index (i · k, j · k), whereas the bottom-most and rightmost pixels can be accessed at the index (i · k + k − 1, j · k + k − 1). ...
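The index arithmetic described for non-overlapping pooling (stride equal to kernel size) can be sketched as follows. The helper name `pool_window` is hypothetical. It returns the top-left and bottom-right input indices covered by output position (i, j), and a small max-pooling function applies the same indices to a plain nested list.

```python
def pool_window(i: int, j: int, k: int):
    """Input index range covered by output cell (i, j) when stride == kernel == k."""
    top_left = (i * k, j * k)
    bottom_right = (i * k + k - 1, j * k + k - 1)
    return top_left, bottom_right


def max_pool2d(x, k: int):
    """Non-overlapping k x k max pooling over a 2D list (stride assumed equal to k)."""
    rows, cols = len(x) // k, len(x[0]) // k
    out = []
    for i in range(rows):
        row = []
        for j in range(cols):
            (r0, c0), (r1, c1) = pool_window(i, j, k)
            row.append(max(x[r][c]
                           for r in range(r0, r1 + 1)
                           for c in range(c0, c1 + 1)))
        out.append(row)
    return out


x = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
print(pool_window(1, 1, 2))  # ((2, 2), (3, 3))
print(max_pool2d(x, 2))      # [[6, 8], [14, 16]]
```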
... The stride s determines the amount of shift in the output feature map for each kernel application [33]. When the stride is set to k, the size of the output feature map is equal to the size of the input feature map multiplied by the stride. However, when the stride s is different from the kernel size k, there may be overlaps in the values of the output feature map. ...
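A minimal sketch of the size relation described above, assuming the statement refers to a transposed convolution with no output padding, for which the output size along one axis is o = s(i - 1) + k - 2p. The function names are illustrative. With s = k and p = 0 the output is exactly s times the input, and with s < k neighbouring kernel placements share k - s output positions.

```python
def transposed_conv_output_size(i: int, k: int, s: int = 1, p: int = 0) -> int:
    """Output size along one axis of a transposed convolution (no output padding)."""
    return s * (i - 1) + k - 2 * p


def overlap(k: int, s: int) -> int:
    """Number of output positions shared by two neighbouring kernel placements."""
    return max(k - s, 0)


# Stride equal to kernel size: the output is the input size times the stride.
print(transposed_conv_output_size(4, k=2, s=2))  # 8
print(overlap(2, 2))                             # 0
# Stride smaller than kernel size: placements overlap by one position.
print(transposed_conv_output_size(4, k=3, s=2))  # 9
print(overlap(3, 2))                             # 1
```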
Article
Purpose: Medical image segmentation is a critical task in healthcare applications, and U-Nets have demonstrated promising results in this domain. We delve into the understudied aspect of receptive field (RF) size and its impact on the U-Net and attention U-Net architectures used for medical imaging segmentation. Approach: We explore several critical elements including the relationship among RF size, characteristics of the region of interest, and model performance, as well as the balance between RF size and computational costs for U-Net and attention U-Net methods for different datasets. We also propose a mathematical notation for representing the theoretical receptive field (TRF) of a given layer in a network and propose two new metrics, namely, the effective receptive field (ERF) rate and the object rate, to quantify the fraction of significantly contributing pixels within the ERF against the TRF area and assessing the relative size of the segmentation object compared with the TRF size, respectively. Results: The results demonstrate that there exists an optimal TRF size that successfully strikes a balance between capturing a wider global context and maintaining computational efficiency, thereby optimizing model performance. Interestingly, a distinct correlation is observed between the data complexity and the required TRF size; segmentation based solely on contrast achieved peak performance even with smaller TRF sizes, whereas more complex segmentation tasks necessitated larger TRFs. Attention U-Net models consistently outperformed their U-Net counterparts, highlighting the value of attention mechanisms regardless of TRF size. Conclusions: These insights present an invaluable resource for developing more efficient U-Net-based architectures for medical imaging and pave the way for future exploration of other segmentation architectures. 
A tool is also developed, which calculates the TRF for a U-Net (and attention U-Net) model and also suggests an appropriate TRF size for a given model and dataset.
... We introduce here a connection between convolutional operators and pattern extraction by generalizing the concept of a window to that of a receptive field. In convolutional neural networks theory, the receptive field refers to the spatial extent of input data that influences the activation of a particular neuron in the network [37]. Applied to our setting, a receptive field can be viewed as a generalized time series subsequence. ...
... By introducing gaps between observations, dilation expands the receptive field and facilitates the incorporation of distant information. By iterating over the signal x, v receptive fields can be extracted [37], where: ...
... given that for each cell of M, we have to aggregate q elements. The highest number of receptive fields we can have is when d = 1 and s = 1 [37]. Therefore, the complexity simplifies to O((m − w + 1) · w). ...
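The count of receptive fields can be illustrated with standard dilated-window arithmetic. The citing text's own formula is elided above, so the sketch below is an assumption rather than the authors' definition: a window of w observations with dilation d spans (w - 1)d + 1 time steps, and start positions advance by the stride s. For d = 1 and s = 1 this gives m - w + 1 windows, which agrees with the complexity quoted above.

```python
def receptive_field_starts(m: int, w: int, d: int = 1, s: int = 1):
    """Start indices of all windows of w points (dilation d, stride s) in a series of length m."""
    span = (w - 1) * d + 1  # number of time steps one dilated window covers
    if span > m:
        return []
    return list(range(0, m - span + 1, s))


def extract_receptive_fields(x, w: int, d: int = 1, s: int = 1):
    """Collect the dilated, strided windows themselves."""
    return [[x[start + t * d] for t in range(w)]
            for start in receptive_field_starts(len(x), w, d, s)]


x = list(range(10))
print(len(extract_receptive_fields(x, w=3)))       # 8, i.e. m - w + 1
print(extract_receptive_fields(x, w=3, d=2, s=3))  # [[0, 2, 4], [3, 5, 7]]
```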
Article
Full-text available
The current trend in the literature on Time Series Classification is to develop increasingly accurate algorithms by combining multiple models in ensemble hybrids, representing time series in complex and expressive feature spaces, and extracting features from different representations of the same time series. As a consequence of this focus on predictive performance, the best time series classifiers are black-box models, which are not understandable from a human standpoint. Even the approaches that are regarded as interpretable, such as shapelet-based ones, rely on randomization to maintain computational efficiency. This poses challenges for interpretability, as the explanation can change from run to run. Given these limitations, we propose the Bag-Of-Receptive-Field (BORF), a fast, interpretable, and deterministic time series transform. Building upon the classical Bag-Of-Patterns, we bridge the gap between convolutional operators and discretization, enhancing the Symbolic Aggregate Approximation (SAX) with dilation and stride, which can more effectively capture temporal patterns at multiple scales. We propose an algorithmic speedup that reduces the time complexity associated with SAX-based classifiers, allowing the extension of the Bag-Of-Patterns to the more flexible Bag-Of-Receptive-Fields, represented as a sparse multivariate tensor. The empirical results from testing our proposal on more than 150 univariate and multivariate classification datasets demonstrate good accuracy and great computational efficiency compared to traditional SAX-based methods and state-of-the-art time series classifiers, while providing easy-to-understand explanations.
... Adversarial samples are a type of attack data targeting machine learning models, especially in convolutional neural networks (CNNs) [41] for multi-class image classification. This attack method has a high degree of threat, as the attacker adds visually imperceptible interference to the original image, which may lead to image classification failure. ...
... In this research, we employed six distinct image classification network architectures: convolutional neural network (CNN) [41], deep neural network (DNN) [46], VGG16 [47], ResNet50 [48], EfficientNet [49,50], and WideResNet [51]. Figure 1 illustrates the architectural layout of the CNN. A convolutional neural network (CNN) comprises three convolutional layers and a single fully connected layer. ...
... The experiment was conducted on a machine running the Windows 11 operating system, equipped with a 14-core 2.30 GHz CPU, NVIDIA GeForce RTX 3060 Laptop GPU, and 16 GB memory. On the MNIST dataset, we trained two different classifiers: convolutional neural network (CNN) [41] and deep neural network (DNN) [46]. For the parameter adjustment of each layer, we adopted the Adam optimization algorithm and a fixed learning rate of 0.01. ...
Article
Full-text available
The security and privacy of a system are urgent issues in achieving secure and efficient learning-based systems. Recent studies have shown that these systems are susceptible to subtle adversarial perturbations applied to inputs. Although these perturbations are difficult for humans to detect, they can easily mislead deep learning classifiers. Noise injection, as a defense mechanism, can offer a provable defense against adversarial attacks by reducing sensitivity to subtle input changes. However, these methods face issues of computational complexity and limited adaptability. We propose a multilayer filter defense model, drawing inspiration from filter-based image denoising techniques. This model inserts a filtering layer after the input layer and before the convolutional layer, and incorporates noise injection techniques during the training process. This model substantially enhances the resilience of image classification systems to adversarial attacks. We also investigated the impact of various filter combinations, filter area sizes, standard deviations, and filter layers on the effectiveness of defense. The experimental results indicate that, across the MNIST, CIFAR10, and CIFAR100 datasets, the multilayer filter defense model achieves the highest average accuracy when employing a double-layer Gaussian filter (filter area size of 3×3, standard deviation of 1). We compared our method with two filter-based defense models, and the experimental results demonstrated that our method attained an average accuracy of 71.9%, effectively enhancing the robustness of the image recognition classifier against adversarial attacks. This method not only performs well on small-scale datasets but also exhibits robustness on large-scale datasets (miniImageNet) and modern models (EfficientNet and WideResNet).
... The U-net and DeeplabV3plus algorithms use the ResNet18 model as the feature extraction backbone to enhance the extraction of feature information. Transposed convolution was chosen for the upsampling process in semantic segmentation, which could effectively merge low-level local features with high-level abstract features, aiding in the recovery of more precise object boundaries and internal structure [59]. ...
... could effectively merge low-level local features with high-level abstract features, aiding in the recovery of more precise object boundaries and internal structure [59]. In the subsequent step, the semantic segmentation results were used as a mask of the CHM and then the WST algorithm was applied to further distinguish the connected tree crowns and extract the tree crown boundaries. ...
Article
Full-text available
Natural secondary forests play a crucial role in global ecological security, climate change mitigation, and biodiversity conservation. However, accurately delineating individual tree crowns and identifying tree species in dense natural secondary forests remains a challenge. This study combines deep learning with traditional image segmentation methods to improve individual tree crown detection and species classification. The approach utilizes hyperspectral, unmanned aerial vehicle laser scanning data, and ground survey data from Maoershan Forest Farm in Heilongjiang Province, China. The study consists of two main processes: (1) combining semantic segmentation algorithms (U-Net and Deeplab V3 Plus) with watershed transform (WTS) for tree crown detection (U-WTS and D-WTS algorithms); (2) resampling the original images to different pixel densities (16 × 16, 32 × 32, and 64 × 64 pixels) and inputting them into five 3D-CNN models (ResNet10, ResNet18, ResNet34, ResNet50, VGG16). For tree species classification, the MSFB combined with the CNN models were used. The results show that the U-WTS algorithm achieved a recall of 0.809, precision of 0.885, and an F-score of 0.845. ResNet18 with a pixel density of 64 × 64 pixels achieved the highest overall accuracy (OA) of 0.916, an improvement of 0.049 over the original images. After incorporating MSFB, the OA improved by approximately 0.04 across all models, with only a 6% increase in model parameters. Notably, the floating-point operations (FLOPs) of ResNet18 + MSFB were only one-eighth of those of ResNet18 with 64 × 64 pixels, while achieving similar accuracy (OA: 0.912 vs. 0.916). This framework offers a scalable solution for large-scale tree species distribution mapping and forest resource inventories.
... with norm being a batch normalization operation [38], TrConv2D being a transposed convolution operation [39] and concat being a concatenation of both feature vectors. The neighbourhood distance vector p′_i^l is computed ...
... The downsampling and upsampling operations are based on convolution and transposed convolution [39] layers followed by a batch normalization layer [38]. ...
Article
Full-text available
Quality inspection is an industrial field with a growing interest in anomaly detection research. An anomaly in an image can either be structural or logical. While structural anomalies lie on the image objects, challenging logical anomalies are hidden in the global relations between the image components. The proposed approach, Vision Graph based Logical Anomaly Detection (ViGLAD), uses the graph representation of an image for logical anomaly detection. Defining an image as a structure of nodes and edges leverages new possibilities for detecting hidden logical anomalies by introducing vision graph autoencoders. Our experiments on public datasets show that using vision graphs enhances the performance of state-of-the-art teacher-student-autoencoder neural networks in logical anomaly detection while achieving robust results in structural anomaly detection.
... Stride is used to make the computation more efficient especially in large feature maps, while padding enables the perimeter pixels to be used in training. Figure 3, which is a good visualization of convolutional operations from Dumoulin and Visin (2018), shows how padding and strides are done. The input feature map, colored in blue, is surrounded by padded cells, usually valued at 0, which is shown as broken lines. ...
... Visualization of Padding and Strides (Dumoulin and Visin, 2018) ...
... Downsampling in the U-shaped network is performed using a three-dimensional maximum pooling layer, whereas upsampling is achieved using a three-dimensional transposed convolution layer. In contrast to conventional upsampling approaches like nearest neighbor interpolation and bilinear interpolation, the three-dimensional transposed convolution layer is equipped with learnable parameters (Dumoulin and Visin, 2016) and can dynamically adjust the upsampling parameters based on the network. Lastly, the output feature cube from the deep feature extraction module is transformed into a feature cube with 31 channels using a three-dimensional convolution operation. ...
Article
Full-text available
Pine wilt disease (PWD) is a highly destructive infectious disease that affects pine forests. Therefore, an accurate and effective method to monitor PWD infection is crucial. However, the majority of existing technologies can detect PWD only in the later stages. To curb the spread of PWD, it is imperative to develop an efficient method for early detection. We present an early-stage detection method for PWD utilizing UAV remote sensing, hyperspectral image reconstruction, and SVM classification. Initially, a UAV was employed to capture RGB remote sensing images of pine forests, and infected plants were labeled using these images. Hyperspectral reconstruction networks, including HSCNN+, HRNet, MST++, and a self-built DW3D network, were employed to reconstruct hyperspectral images from the RGB images obtained by remote sensing. This resulted in hyperspectral images in the 400-700 nm range, which were used as the dataset for early PWD detection in pine forests. Spectral reflectance curves of infected and uninfected plants were extracted. SVM algorithms with various kernel functions were then employed to detect early pine wilt disease. The results showed that using SVM for early detection of PWD infection based on reconstructed hyperspectral images achieved the highest accuracy, enabling the detection of PWD in its early stage. Among the experiments, MST++, DW3D, HRNet, and HSCNN+ combined with a Poly kernel SVM performed the best in terms of cross-validation accuracy, achieving 0.77, 0.74, 0.71, and 0.70, respectively. Regarding the reconstruction network parameters, the DW3D network had only 0.61M parameters, significantly fewer than the MST++ network, which had the highest reconstruction accuracy with 1.6M parameters. The accuracy was improved by 27% compared to the detection results obtained using RGB images. This paper demonstrated that the hyperspectral reconstruction-poly SVM model could effectively detect the early stage of PWD. In comparison to UAV hyperspectral remote sensing methods, the proposed method offers the same precision but higher operational efficiency and cost-effectiveness. It also enables the detection of PWD at an earlier stage compared to RGB remote sensing, yielding more accurate and reliable results.
... The standard convolution layer used in DNNs [28] utilizes multiple channels and kernels, which can also be thought of as FIR filters. Time-domain PCM audio signals are typically treated as 1-D time series signals, i.e., the kernels are one-dimensional. ...
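The FIR-filter view of a one-dimensional kernel can be sketched as below, for a single channel and under two assumptions that are not stated in the snippet: the layer computes a cross-correlation without padding, as deep learning frameworks commonly do, and the FIR filter is causal with zero initial state. Under those assumptions the layer output equals the output of an FIR filter whose taps are the kernel in reverse order, once the filter has seen a full window of samples. All names are illustrative.

```python
def conv1d_layer(x, w):
    """Single-channel 1D layer output: cross-correlation, no padding, unit stride."""
    k = len(w)
    return [sum(w[j] * x[n + j] for j in range(k)) for n in range(len(x) - k + 1)]


def fir_filter(x, h):
    """Causal FIR filter y[n] = sum_j h[j] * x[n - j], taking x[n] = 0 for n < 0."""
    return [sum(h[j] * x[n - j] for j in range(len(h)) if n - j >= 0)
            for n in range(len(x))]


x = [1.0, 2.0, 4.0, 8.0, 16.0]
w = [0.5, 0.25, -1.0]

layer_out = conv1d_layer(x, w)
fir_out = fir_filter(x, list(reversed(w)))

print(layer_out)             # [-3.0, -6.0, -12.0]
# Skip the first len(w) - 1 outputs, where the filter is still filling up.
print(fir_out[len(w) - 1:])  # [-3.0, -6.0, -12.0]
```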
Preprint
Full-text available
A multichannel extension to the RVQGAN neural coding method is proposed, and realized for data-driven compression of third-order Ambisonics audio. The input- and output layers of the generator and discriminator models are modified to accept multiple (16) channels without increasing the model bitrate. We also propose a loss function for accounting for spatial perception in immersive reproduction, and transfer learning from single-channel models. Listening test results with 7.1.4 immersive playback show that the proposed extension is suitable for coding scene-based, 16-channel Ambisonics content with good quality at 16 kbit/s.
... Consequently, the reordered representation has a shape of 1 × 1 × 8 × 8 × 512. This step is followed by the application of a transposed convolution operation [55], designed to increase the height and width of the representation to match those of the input data. More specifically, for the ViT and MLP-Mixer models trained on the CIFAR-10 dataset, the initial patch size is 4 × 4. Therefore, the output of the transposed convolution operation is shaped as ...
Article
Full-text available
Deep learning has made significant strides, driving advances in areas like computer vision, natural language processing, and autonomous systems. In this paper, we further investigate the implications of the role of additive shortcut connections, focusing on models such as ResNet, Vision Transformers (ViTs), and MLP-Mixers, given that they are essential in enabling efficient information flow and mitigating optimization challenges such as vanishing gradients. In particular, capitalizing on our recent information bottleneck approach, we analyze how additive shortcuts influence the fitting and compression phases of training, crucial for generalization. We leverage Z-X and Z-Y measures as practical alternatives to mutual information for observing these dynamics in high-dimensional spaces. Our empirical results demonstrate that models with identity shortcuts (ISs) often skip the initial fitting phase and move directly into the compression phase, while non-identity shortcut (NIS) models follow the conventional two-phase process. Furthermore, we explore how IS models are still able to compress effectively, maintaining their generalization capacity despite bypassing the early fitting stages. These findings offer new insights into the dynamics of shortcut connections in neural networks, contributing to the optimization of modern deep learning architectures.
... The Unet model's decoder involved a series of upsampling stages (s5 to s8), each equipped with a ReLU activation (ref. 64) and a transposed 4 × 4 convolutional layer with a stride of 2 (ref. 65) followed by batch normalization. This 4 × 4 convolutional process effectively doubled the spatial dimensions of the feature map while halving the channel count. ...
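Why a kernel of size 4 with stride 2 doubles the spatial size can be seen in a one-dimensional sketch. The snippet does not state the padding, so a padding of 1 is assumed here, which is the value that makes the output exactly twice the input: 2(i - 1) + 4 - 2 = 2i. The function below is an illustrative single-channel implementation, not the cited model's code, and it does not show the channel halving, which depends only on the number of output channels.

```python
def transposed_conv1d(x, kernel, stride=1, padding=0):
    """1D transposed convolution: each input value scatters a scaled copy of the kernel."""
    k = len(kernel)
    full = [0.0] * (stride * (len(x) - 1) + k)
    for n, value in enumerate(x):
        for m, weight in enumerate(kernel):
            full[n * stride + m] += value * weight
    # Padding in a transposed convolution crops `padding` entries from each end.
    return full[padding:len(full) - padding]


x = [1.0, 2.0, 3.0]
y = transposed_conv1d(x, kernel=[1.0, 1.0, 1.0, 1.0], stride=2, padding=1)
print(len(x), len(y))  # 3 6
print(y)               # [1.0, 3.0, 3.0, 5.0, 5.0, 3.0]
```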
Article
Full-text available
Important progress has been made in micromagnetics, driven by its wide-ranging applications in magnetic storage design. Numerical simulation, a cornerstone of micromagnetics research, relies on first-principles rules to compute the dynamic evolution of micromagnetic systems using the renowned Landau–Lifshitz–Gilbert equation, named after Landau, Lifshitz and Gilbert. However, these simulations are often hindered by their slow speeds. Although fast Fourier transformation calculations reduce the computational complexity to O(Nlog(N)), it remains impractical for large-scale simulations. Here we introduce NeuralMAG, a deep learning approach to micromagnetic simulation. Our approach follows the Landau–Lifshitz–Gilbert iterative framework but accelerates computation of demagnetizing fields by employing a U-shaped neural network. This neural network architecture comprises an encoder that extracts aggregated spins at various scales and learns the local interaction at each scale, followed by a decoder that accumulates the local interactions at different scales to approximate the global convolution. This divide-and-accumulate scheme achieves a time complexity of O(N), notably enhancing the speed and feasibility of large-scale simulations. Unlike existing neural methods, NeuralMAG concentrates on the core computation—rather than an end-to-end approximation for a specific task—making it inherently generalizable. To validate the new approach, we trained a single model and evaluated it on two micromagnetics tasks with various sample sizes, shapes and material settings.
... Multiple variants have been derived from the basic convolution. TranslatedConv [56] improved the spatial resolution by learning kernel weights. AtrousConv [57] broadened the receptive fields for multi-scale feature representation. ...
Article
The infrared thermal imaging method can effectively identify and locate leakage sites using the temperature characteristics of pneumatic system leakage. However, it is limited by the tiny dimensions of the objects, color shifts of varying background temperatures, and indistinctness of feature details. To address these issues, we integrate Omni-Dimensional Dynamic Convolution (ODDC), Squeeze-and-Excitation (SE) attention module, and Normalized Gaussian Wasserstein Distance (NWD) with You Only Look Once (YOLO) into a framework, named ODSW-YOLO, for the precise localization of leakage. Specifically, ODDC is employed to adjust the kernel parameters for improving the accuracy of detecting small targets. To reduce the influence of varying background temperatures, SE attention is adopted to extract key features. NWD is used to enhance the feature details by checking the small feature changes within the detection window. Finally, extensive experiments show that ODSW-YOLO improves the detection accuracy from 0.677 to 0.755, which surpasses the baseline model (YOLOv5).
... Early semantic segmentation models based on CNN architecture exhibited low accuracy. However, fully convolutional networks (FCN) with decoder architecture [5] overcame the spatial information loss and improved semantic segmentation accuracy through transposed convolutional operations. Semantic segmentation for autonomous driving demands high resolution, fast inference, and accurate results. ...
Article
Full-text available
Semantic segmentation is crucial in autonomous driving because of its accurate identification and segmentation of objects and regions. However, there is a conflict between segmentation accuracy and real-time performance on embedded devices. We propose an efficient lightweight semantic segmentation network (DRMNet) to solve these problems. Employing a streamlined bilateral structure, the model encodes semantic and spatial paths, cross-fusing features during encoding, and incorporates unique skip connections to coordinate upsampling within the semantic pathway. We design a new self-calibrated aggregate pyramid pooling module (SAPPM) at the end of the semantic branch to capture more comprehensive multi-scale semantic information and balance its extraction and inference speed. Furthermore, we designed a new feature fusion module, which guides the fusion of detail features and semantic features through attention perception, alleviating the problem of semantic information quickly covering spatial detail information. Experimental results on the CityScapes, CamVid, and NightCity datasets demonstrate the effectiveness of DRMNet. On a 2080Ti GPU, DRMNet achieves 78.6% mIoU at 88.3 FPS on the CityScapes dataset, 78.9% mIoU at 149 FPS on the CamVid dataset, and 53.5% mIoU at 160.4 FPS on the NightCity dataset. These results highlight the model’s ability to balance accuracy and real-time performance better, making it suitable for embedded devices in autonomous driving applications.
... • Attack on a single gradient-based explanation. To invert the target model M_t, a transposed convolutional neural network (TCNN) [114] is devised to reconstruct a two-dimensional image x_r from the one-dimensional prediction vector y_t provided by M_t. The TCNN minimizes the mean squared error (MSE) loss to approximate the original image. ...
Article
Full-text available
As the adoption of explainable AI (XAI) continues to expand, the urgency to address its privacy implications intensifies. Despite a growing corpus of research in AI privacy and explainability, there is little attention on privacy-preserving model explanations. This article presents the first thorough survey about privacy attacks on model explanations and their countermeasures. Our contribution to this field comprises a thorough analysis of research papers with a connected taxonomy that facilitates the categorization of privacy attacks and countermeasures based on the targeted explanations. This work also includes an initial investigation into the causes of privacy leaks. Finally, we discuss unresolved issues and prospective research directions uncovered in our analysis. This survey aims to be a valuable resource for the research community and offers clear insights for those new to this domain. To support ongoing research, we have established an online resource repository, which will be continuously updated with new and relevant findings.
... Here (i, j) is a tuple of valid indices for the result; the output shape of I * K can be computed based on a number of parameters (Dumoulin and Visin 2018). ...
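To make the notion of valid indices concrete, the sketch below slides a kernel K over an input I with no padding and unit stride, so the output has shape (H - h + 1, W - w + 1) for an H × W input and an h × w kernel. It computes a cross-correlation, meaning the kernel is not flipped. Whether the citing text flips the kernel is not stated, so that is an assumption, though the output shape is the same either way. The function name is illustrative.

```python
def cross_correlate2d_valid(image, kernel):
    """'Valid' 2D cross-correlation of nested lists: no padding, unit stride."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = ih - kh + 1, iw - kw + 1  # only positions where the kernel fits entirely
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(ow)
        ]
        for i in range(oh)
    ]


image = [[1, 2, 3, 0],
         [4, 5, 6, 1],
         [7, 8, 9, 2]]   # 3 x 4 input
kernel = [[1, 0],
          [0, -1]]       # 2 x 2 kernel

out = cross_correlate2d_valid(image, kernel)
print(len(out), len(out[0]))  # 2 3
print(out)                    # [[-4, -4, 2], [-4, -4, 4]]
```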
Preprint
Full-text available
We develop a method for the efficient verification of neural networks against convolutional perturbations such as blurring or sharpening. To define input perturbations we use well-known camera shake, box blur and sharpen kernels. We demonstrate that these kernels can be linearly parameterised in a way that allows for a variation of the perturbation strength while preserving desired kernel properties. To facilitate their use in neural network verification, we develop an efficient way of convolving a given input with these parameterised kernels. The result of this convolution can be used to encode the perturbation in a verification setting by prepending a linear layer to a given network. This leads to tight bounds and a high effectiveness in the resulting verification step. We add further precision by employing input splitting as a branch and bound strategy. We demonstrate that we are able to verify robustness on a number of standard benchmarks where the baseline is unable to provide any safety certificates. To the best of our knowledge, this is the first solution for verifying robustness against specific convolutional perturbations such as camera shake.
... Due to the relatively small dimensions (9 × 9) of the 3D features after 3D mapping, applying conventional convolution operations as used in image processing directly would result in an excessively large receptive field. Such approaches have failed to capture enough fine-grained features (Dumoulin and Visin, 2016; O'Shea and Nash, 2015). There could be two potential solutions to this problem. ...
Article
Full-text available
Introduction: EEG-based emotion recognition has gradually become a new research direction, known as affective Brain-Computer Interface (aBCI), which has huge application potential in human-computer interaction and neuroscience. However, how to extract spatio-temporal fusion features from complex EEG signals and build a learning method with high recognition accuracy and strong interpretability is still challenging. Methods: In this paper, we propose a hybrid attention spatio-temporal feature fusion network for EEG-based emotion recognition. First, we designed a spatial attention feature extractor capable of merging shallow and deep features to extract spatial information and adaptively select crucial features under different emotional states. Then, the temporal feature extractor based on the multi-head attention mechanism is integrated to perform spatio-temporal feature fusion to achieve emotion recognition. Finally, we visualize the extracted spatial attention features using feature maps, further analyzing key channels corresponding to different emotions and subjects. Results: Our method outperforms the current state-of-the-art methods on two public datasets, SEED and DEAP. The recognition accuracies are 99.12% ± 1.25% (SEED), 98.93% ± 1.45% (DEAP-arousal), and 98.57% ± 2.60% (DEAP-valence). We also conduct ablation experiments, using statistical methods to analyze the impact of each module on the final result. The spatial attention features reveal that emotion-related neural patterns indeed exist, which is consistent with conclusions in the field of neurology. Discussion: The experimental results show that our method can effectively extract and fuse spatial and temporal information. It has excellent recognition performance, and also possesses strong robustness, performing stably across different datasets and experimental environments for emotion recognition.
... As summarized in Table 8, the residual block architecture utilizes pre-layer normalization [4] and linear attention [58] components. The downsampling path uses convolutional layers, while the upsampling path employs transposed convolutional layers [18]. The network processing Raw images before the diffusion model shares a similar architecture to the denoising U-Net, except for three variations: 4 initial input channels, 3 final output channels, and channels 16 times smaller than the denoising U-Net, without incorporating time embedding. ...
Preprint
Full-text available
Most existing super-resolution methods and datasets have been developed to improve the image quality in well-lighted conditions. However, these methods do not work well in real-world low-light conditions, as the images captured in such conditions lose most important information and contain significant unknown noise. To solve this problem, we propose the SRRIIE dataset together with an efficient method based on conditional diffusion probabilistic models. The proposed dataset contains 4800 paired low-high quality images. To ensure that the dataset is able to model the real-world image degradation in low-illumination environments, we capture images using an ILDC camera and an optical zoom lens with exposure levels ranging from -6 EV to 0 EV and ISO levels ranging from 50 to 12800. We comprehensively evaluate with various reconstruction and perceptual metrics and demonstrate the practicability of the SRRIIE dataset for deep learning-based methods. We show that most existing methods are less effective in preserving the structures and sharpness of images restored from complicated noise. To overcome this problem, we revise the condition for Raw sensor data and propose a novel time-melding condition for the diffusion probabilistic model. Comprehensive quantitative and qualitative experimental results on the real-world benchmark datasets demonstrate the feasibility and effectiveness of the proposed conditional diffusion probabilistic model on Raw sensor data. Code and dataset will be available at https://github.com/Yaofang-Liu/Super-Resolving
... In super-resolution tasks, where low-resolution inputs of 28 × 28 pixels are upscaled to match the 224 × 224 dimensions required by the pretrained model, many existing methods employ techniques such as interpolation (Keys 1981), transposed convolution (Dumoulin and Visin 2016), and pixel shuffling (Shi et al. 2016). To ensure minimal modifications and maintain model efficiency, we implemented bicubic interpolation only at the initial stage, resizing the input images to the required dimensions without adding untrained parameters. ...
... A prime example is that of fully convolutional network (FCN) based methods that are currently the state-of-the-art in semantic image segmentation [7]. Most of these methods share the encoder-decoder design, which partitions the network into two symmetrical, opposing paths: the encoder transforms the input image into a low-dimensional representation through a sequence of convolution and pooling operators, while the decoder up-samples the compact feature representation back into a full-sized image, again through the use of convolutions, with transposed/fractional convolutions [53] in place of the pooling layers (see Fig. 8 for an example). Decoder architectures are often applied as generators within the generative adversarial network (GAN) framework. ...
Preprint
Instance segmentation is a core computer vision task with great practical significance. Recent advances, driven by large-scale benchmark datasets, have yielded good general-purpose Convolutional Neural Network (CNN)-based methods. Natural Resource Monitoring (NRM) utilizes remote sensing imagery with generally known scale and containing multiple overlapping instances of the same class, wherein the object contours are jagged and highly irregular. This is in stark contrast with the regular man-made objects found in classic benchmark datasets. We address this problem and propose a novel instance segmentation method geared towards NRM imagery. We formulate the problem as Bayesian maximum a posteriori inference which, in learning the individual object contours, incorporates shape, location, and position priors from state-of-the-art CNN architectures, driving a simultaneous level-set evolution of multiple object contours. We employ loose coupling between the CNNs that supply the priors and the active contour process, allowing a drop-in replacement of new network architectures. Moreover, we introduce a novel prior for contour shape, namely, a class of Deep Shape Models based on architectures from Generative Adversarial Networks (GANs). These Deep Shape Models are in essence a non-linear generalization of the classic Eigenshape formulation. In experiments, we tackle the challenging, real-world problem of segmenting individual dead tree crowns and delineating precise contours. We compare our method to two leading general-purpose instance segmentation methods - Mask R-CNN and K-net - on color infrared aerial imagery. Results show our approach to significantly outperform both methods in terms of reconstruction quality of tree crown contours. Furthermore, use of the GAN-based deep shape model prior yields significant improvement of all results over the vanilla Eigenshape prior.
... Moreover, the last layer is utilized to reconstruct the original image by exploiting three deconvolutional layers, followed by the ReLU and sigmoid activation functions. The deconvolutional layers inherit the input image's spatial resolution, enabling the capsule network to produce a more detailed reconstructed image [22]. The output of the last layer is reshaped to the dimensions of the input image (224 × 224 × 3 = 150528). ...
Article
Potato tuber is crucial as a primary vegetable food crop across the globe. However, its quality and quantity are constantly threatened by fungal blight diseases, particularly late and early blight, posing a significant risk to food security. To identify blight diseases, agrarians visually inspect potato leaf color variations, which is time-consuming and computationally expensive. To address this challenge, various advanced technologies, such as image processing, machine learning, and deep neural networks, have been widely applied in agricultural domains, especially for autonomous disease identification and classification. However, there is an urgent need to develop computational models that can rapidly and objectively detect these diseases, even in their early stages. Therefore, this paper introduces the "PotCapsNet" method that utilizes parallel atrous convolutional layers with varying dilation rates for multi-scale feature extraction, a shuffled convolutional block attention module for effective feature selection, and a capsule network for early disease classification. The effectiveness of the proposed model was validated using the publicly available PlantVillage and PLD datasets for multi-class potato leaf disease identification. The experimental findings show that the proposed method outperforms other methods in terms of accuracy, specificity, F1 score, and sensitivity. The proposed method attained average recognition accuracy of 97.81% and 97.95% on the PlantVillage and PLD datasets, respectively, the best among all the considered methods. The code and datasets of the proposed method are available at https://github.com/ersachingupta11/PotCapsNet.
... The decoder uses transpose convolutions (deconvolutions) to increase the image resolution and further refine the output segmentation mask [72]. The final output is a pixel-wise segmentation mask, where each pixel is assigned a class label based on the features extracted by the encoder and decoder. ...
Thesis
This dissertation applies computer vision techniques to enhance the diagnosis of canine eye diseases in veterinary ophthalmology. Leveraging deep learning models—specifically U-Net and GPT-4o—for image segmentation and diagnostic interpretation, we utilized the DogEyeSeg4 dataset of real-world clinical images, augmented with synthetic images to improve model robustness and generalization. U-Net(RSD), trained on this combined dataset, was employed for precise segmentation of canine eye symptoms such as corneal cloudiness, scleral redness, excessive tearing, and colored mass protrusion. We also trained individual binary segmentation models for each symptom using heatmaps from SSD eye detection to reduce false positives. While these binary models improved symptom isolation, they faced challenges with overlapping conditions and complexity. Ultimately, the multiclass U-Net(RSD) model provided better performance and efficiency. GPT-4o interpreted the segmented images, outperforming other large language models in generating accurate diagnostic suggestions, particularly when using segmentation masks from the adjusted U-Net with a ResNet backbone. Despite promising results, challenges remain in diagnosing subtle conditions like corneal ulcers. Future work includes expanding the dataset and symptom range, refining models, and integrating multimodal data for comprehensive diagnostics. These findings demonstrate the potential of AI-driven tools to revolutionize veterinary ophthalmology, offering more accurate and efficient diagnostic processes to improve animal care.
... Each experiment was run using two NVIDIA A100 GPUs with 40 GB RAM. We used the convolutional neural network (Dumoulin & Visin, 2018) architecture described in Elgammal et al. (2017) for the CAN but had to use more/fewer layers to produce higher/lower dimension images. The generator takes a 1 × 100 Gaussian noise vector $z \in \mathbb{R}^{100}$, $z \sim \mathcal{N}(0, I)$, and maps it to a 4 × 4 × 2048 latent space via a transposed convolutional layer with kernel size = 4 and stride = 1, followed by 6 transposed convolutional layers, each upscaling the height and width dimensions by two and halving the channel dimension (for example, one of these layers would map $\mathbb{R}^{4 \times 4 \times 2048} \to \mathbb{R}^{8 \times 8 \times 1024}$), followed by batch normalization (Ioffe & Szegedy, 2015) and Leaky ReLU (Maas et al., 2013), and then one final transposed convolutional layer with output channels = 3 and tanh (Dubey et al., 2022) activation function. ...
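The shape progression described in this excerpt can be checked with a few lines of arithmetic. The following is a minimal sketch, not the cited authors' code: it assumes the noise vector is treated as a 1 × 1 spatial map, that each doubling layer uses kernel size 4, stride 2 and padding 1 (the excerpt does not state these), and that the final layer preserves spatial size with kernel size 3, stride 1 and padding 1. The function names are hypothetical.

```python
def conv_transpose_size(i, k, s, p=0):
    """Output size of a transposed convolution along one axis: s * (i - 1) + k - 2 * p."""
    return s * (i - 1) + k - 2 * p


def generator_shapes(latent_channels=2048, n_upsample=6):
    """Return the (height, width, channels) shape after each generator layer."""
    # First layer: kernel size 4, stride 1 (from the excerpt) maps a 1x1 input to 4x4.
    size = conv_transpose_size(1, k=4, s=1)
    channels = latent_channels
    shapes = [(size, size, channels)]
    for _ in range(n_upsample):
        # ASSUMPTION: k=4, s=2, p=1 is one common way to double the spatial size.
        size = conv_transpose_size(size, k=4, s=2, p=1)
        channels //= 2  # the excerpt states that the channel dimension is halved
        shapes.append((size, size, channels))
    # ASSUMPTION: the final layer keeps the spatial size (k=3, s=1, p=1), 3 output channels.
    size = conv_transpose_size(size, k=3, s=1, p=1)
    shapes.append((size, size, 3))
    return shapes


if __name__ == "__main__":
    for shape in generator_shapes():
        print(shape)
```

Under these assumptions the six doubling layers take the 4 × 4 × 2048 latent map to 256 × 256 × 32, which is consistent with the excerpt's remark that adding or removing layers changes the output resolution.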
Preprint
Teaching text-to-image models to be creative involves using style ambiguity loss. In this work, we explore using the style ambiguity training objective, used to approximate creativity, on a diffusion model. We then experiment with forms of style ambiguity loss that do not require training a classifier or a labeled dataset, and find that the models trained with style ambiguity loss can generate better images than the baseline diffusion models and GANs. Code is available at https://github.com/jamesBaker361/clipcreate.
... The receptive field is defined as the region in the input space that a particular feature of a convolutional neural network is examining [41]. It is a two-dimensional concept, so the receptive field is also two-dimensional. ...
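As an illustration of this definition, the receptive field of a stack of layers can be computed along one axis with a standard recurrence; for a two-dimensional input the same calculation is applied to height and width separately. This is a minimal sketch under the assumption of no dilation, and the function name and the example layer stacks are hypothetical rather than taken from the cited work.

```python
def receptive_field(layers):
    """Receptive field size along one axis for a stack of conv/pool layers.

    `layers` is a list of (kernel_size, stride) pairs ordered from input to output.
    Dilation is assumed to be 1 for every layer.
    """
    size = 1  # a single input pixel sees only itself
    jump = 1  # distance in input pixels between adjacent features of the current layer
    for kernel_size, stride in layers:
        size += (kernel_size - 1) * jump
        jump *= stride
    return size


if __name__ == "__main__":
    # Three stacked 3x3 convolutions with stride 1 cover a 7-pixel-wide region.
    print(receptive_field([(3, 1), (3, 1), (3, 1)]))
    # A 3x3 convolution, a 2x2 pooling with stride 2, then another 3x3 convolution.
    print(receptive_field([(3, 1), (2, 2), (3, 1)]))
```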
Article
Machine learning systems, particularly in the domain of image recognition, are susceptible to adversarial perturbations applied to input data. These perturbations, while imperceptible to humans, have the capacity to easily deceive deep learning classifiers. Current defense methods for image recognition focus on using diffusion models and their variants. Due to the depth of diffusion models and the large amount of computation required during each inference process, the demands on the device's GPU and storage performance are extremely high. To address this problem, we propose a new defense-based non-overlapping image compression filter for image recognition classifiers against adversarial attacks. This method inserts a non-overlapping image compression filter before the classifier to make the results of the classifier invariant under subtle changes in images. This method does not weaken the adversarial robustness of the model and can reduce the computational cost during the training process of the image classification model. In addition, our method can be easily integrated with existing image classification training frameworks with only some minor adjustments. We validate our results by performing a series of experiments under three different convolutional neural network architectures (VGG16, ResNet34, and Inception-ResNet-v2) and on different datasets (CIFAR10 and CIFAR100). The experimental results show that under the Inception-ResNet-v2 architecture, our method achieves an average accuracy of up to 81.15% on the CIFAR10 dataset, fully demonstrating its effectiveness in mitigating adversarial attacks. In addition, under the WRN-28-10 architecture, our method achieves not only 91.28% standard accuracy on the CIFAR10 dataset but also 76.46% average robust accuracy. The test experiment on the model training time consumption shows that our defense method has an advantage in time cost, proving that our defense method is a lightweight and efficient defense strategy.
... These networks have been applied for many different tasks, such as watermark production, pre-processing, embedding location recognition, choosing the right embedding strength, and watermark embedding and extraction. Among deep-learning models widely used in watermarking techniques, we can mention the convolutional neural network (CNN) [73], the recurrent neural network (RNN) [74], the generative adversarial network (GAN) [75], and the Autoencoder [76]. ...
Article
Digital video watermarking is a crucial method for protecting intellectual property, verifying authenticity, and ensuring data integrity in an era where video content is widely shared and distributed. With the advent of deep learning, video watermarking techniques have witnessed significant advancements, offering improved robustness and invisibility. This survey paper aims to provide a comprehensive overview of the latest developments in video watermarking with a focus on deep learning-based approaches. We categorize the surveyed techniques based on network architectures and embedding domains and thoroughly examine and compare the most widely used deep learning-based video watermarking techniques. Moreover, we discuss practical challenges and potential future directions to inspire further research in this dynamic field.
... To improve segmentation accuracy, information from several earlier layers is added to the coarse output. According to [13], the upsampling operation can be performed using a transposed convolution. FCN-8 sums the 2× upsampled conv7 layer with the pool4 layer. ...
Article
The article discusses a method for automatic diagnostics of a railway track. It consists in automatic evaluation of the technical condition of selected track elements, such as rails, wooden and concrete sleepers, fasteners and turnouts. It was carried out on the basis of analysis of video images of railroad track elements recorded by two line cameras placed on the diagnostic carriage. The selected FCN-8 deep learning neural network was used to assess the technical condition of the surveyed elements, and the effectiveness of the applied algorithm was determined on the basis of such measures as IoU, Precision, Recall. Conclusions on the application of the FCN-8 network in the automatic classification of features of selected railroad track elements are presented. The results obtained were compared with other methods used in vision diagnostics.
... where $\mathrm{MaxPool}_{n \times n}$ denotes the max pooling operation with kernel size $n \times n$, $\mathrm{ConvT}$ denotes the transposed convolution [52], and $F_{\mathrm{DB1\text{-}VH}}$, $F_{\mathrm{DB2\text{-}VH}}$, $F_{\mathrm{DB4\text{-}VH}}$, and $F_{\mathrm{DB5\text{-}VH}}$ represent the feature maps obtained by the downsampling and upsampling operations, respectively, which are then concatenated using the $\mathrm{Concat}$ operation, transformed by the convolution block $f_{1 \times 1}$ comprising sequential operations of 1 × 1 convolution, batch normalization (BN) [53], and ReLU [36], and finally added in residual to derive the fused feature maps $F'_{\mathrm{DB3\text{-}VH}}$. Additionally, as mentioned above, the output feature maps $F'_{\mathrm{DB3\text{-}VH}}$ and $F'_{\mathrm{DB4\text{-}VH}}$ are further fused. ...
Article
With the rapid development of the modern world, it is imperative to achieve effective and efficient monitoring for territories of interest, especially for the broad ocean area. For surveillance of ship targets at sea, a common and powerful approach is to take advantage of satellite synthetic aperture radar (SAR) systems. Currently, using satellite SAR images for ship classification is a challenging issue due to complex sea situations and the imaging variances of ships. Fortunately, the emergence of advanced satellite SAR sensors has shed much light on the SAR ship automatic target recognition (ATR) task, e.g., utilizing dual-polarization (dual-pol) information to boost the performance of SAR ship classification. Therefore, in this paper we have developed a novel cross-polarimetric interaction network (CPINet) to explore the abundant polarization information of dual-pol SAR images with the help of deep learning strategies, leading to an effective solution for high-performance ship classification. First, we establish a novel multiscale deep feature extraction framework to fully mine the characteristics of dual-pol SAR images in a coarse-to-fine manner. Second, to further leverage the complementary information of dual-pol SAR images, we propose a mixed-order squeeze–excitation (MO-SE) attention mechanism, in which the first- and second-order statistics of the deep features from one single-polarized SAR image are extracted to guide the learning of another polarized one. Then, the intermediate multiscale fused and MO-SE augmented dual-polarized deep feature maps are respectively aggregated by the factorized bilinear coding (FBC) pooling method. Meanwhile, the last multiscale fused deep feature maps for each single-polarized SAR image are also individually aggregated by the FBC. Finally, four kinds of highly discriminative deep representations are obtained for loss computation and category prediction. 
For better network training, the gradient normalization (GradNorm) method for multitask networks is extended to adaptively balance the contribution of each loss component. Extensive experiments on the three- and five-category dual-pol SAR ship classification dataset collected from the open and free OpenSARShip database demonstrate the superiority and robustness of CPINet compared with state-of-the-art methods for the dual-polarized SAR ship classification task.
... Reconstruction of images into patches is one of the most efficient and reliable data augmentation mechanisms [14]. Hence, one of the key features added to our model for data augmentation and the detection of narrow water bodies is the Patch Adaptive Network (PAN), inspired by the kernel filter of the CNN architecture [22]. To address the inherent class imbalance in deep learning, we applied Class Pixel-level Balancing (CPB) to our PAN, together with the Tversky loss function [23]. ...
Article
This paper presents the SARSNet architecture, developed to address the growing challenges in Synthetic Aperture Radar (SAR) deep learning-based automatic water body extraction. Such a task is riddled with significant challenges, encompassing issues like cloud interference, scarcity of annotated dataset, and the intricacies associated with varied topography. Recent strides in Convolutional Neural Networks (CNNs) and multispectral segmentation techniques offer a promising avenue to address these predicaments. In our research, we propose a series of solutions to elevate the process of water body segmentation. Our proposed solutions span several domains, including image resolution enhancement, refined extraction techniques tailored for narrow water bodies, self-balancing of the class pixel level, and minority class-influenced loss function, all aimed at amplifying prediction precision and streamlining computational complexity inherent in deep neural networks. The framework of our approach includes the introduction of a multichannel Data-Fusion Register, the incorporation of a CNN-based Patch Adaptive Network augmentation method, and the integration of class pixel level balancing and the Tversky loss function. We evaluated the performance of the model using the Sentinel-1 SAR electromagnetic signal dataset from the Earth flood water body extraction competition organized by the artificial intelligence department of Microsoft. In our analysis, our suggested SARSNet was compared to well-known semantic segmentation models, and a comprehensive assessment demonstrates that SARSNet consistently outperforms these models in all data subsets, including training, validation, and testing sets.
... Therefore, appropriately designing the parameters for the transposed convolution is vital for autoencoder design. The output dimensions $H_{out}$ of a convolutional layer can be calculated using the following formula (Dumoulin & Visin 2018): ...
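The formula itself is elided in this excerpt. As a minimal sketch, assuming the commonly stated relationships for a single axis with no dilation, the output size of a convolution and of the matching transposed convolution can be computed as below. The function names and the `output_padding` parameter are illustrative and are not taken from the citing paper.

```python
def conv_output_size(i, k, s=1, p=0):
    """Convolution output size along one axis: floor((i + 2p - k) / s) + 1."""
    if i + 2 * p < k:
        raise ValueError("kernel is larger than the padded input")
    return (i + 2 * p - k) // s + 1


def conv_transpose_output_size(i, k, s=1, p=0, output_padding=0):
    """Transposed convolution output size along one axis.

    Computes s * (i - 1) + k - 2p + output_padding, where output_padding
    resolves the ambiguity introduced by the floor in the forward formula.
    """
    return s * (i - 1) + k - 2 * p + output_padding


if __name__ == "__main__":
    # Encoder step: input 28, kernel 3, stride 2, padding 1.
    h = conv_output_size(28, k=3, s=2, p=1)
    print(h)
    # Decoder step: the same hyperparameters plus output_padding=1 recover the input size.
    print(conv_transpose_output_size(h, k=3, s=2, p=1, output_padding=1))
```

Because the forward formula floors the division, several input sizes can map to the same output size when the stride is greater than 1, which is why an autoencoder's decoder parameters need to be chosen with care to restore the original dimensions.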
Article
In recent years, technological advances and widespread monitoring equipment have led to significant time-series data in the mining industry, especially for hydraulic support pressure. This data is vital for miner safety and predicting pressure cycles. However, the harsh conditions of coal mines often degrade data quality, making anomaly detection essential. This study introduces an anomaly detection model based on Dynamic Time Warping (DTW) and convolutional autoencoders to identify anomalies in hydraulic support data. The model’s encoder consists of three convolutional layers and two pooling layers, while the decoder comprises five transposed convolutional layers, compressing sequence length to one-fourth of the original. By setting a 5-min sampling interval, with each sample containing 288 time steps and using a sliding window with a stride of 72, an optimal dataset is generated. Training results indicate that the model successfully detects anomaly points and subsequences, accurately learning and simulating normal operational patterns of hydraulic supports, achieving early anomaly detection. The model performs stably on both training and validation sets, with the reconstruction error (MSE) reduced to 0.001. For anomaly detection in the test set, we used the sum of the mean and standard deviation of the reconstruction error from the validation set as the detection threshold (0.0041). The results show that the mode of the number of anomaly points in the test samples is 8, with an average of approximately 10. Furthermore, this study analyzes the model's limitations under specific conditions and proposes improvements to enhance accuracy and robustness.
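The windowing scheme in this abstract (288 time steps per sample, stride 72) can be sketched as follows. At a 5-minute sampling interval, 288 steps correspond to one day and a stride of 72 to six hours. This is an illustrative sketch only: the function names are hypothetical, and the pooling size of 2 used to reproduce the one-fourth compression is an assumption, since the abstract does not give the pooling parameters.

```python
def sliding_windows(series, window=288, stride=72):
    """Split a sequence into overlapping windows of fixed length.

    Returns an empty list when the series is shorter than one window.
    """
    last_start = len(series) - window
    return [series[start:start + window] for start in range(0, last_start + 1, stride)]


def pooled_length(length, n_pool=2, pool_size=2):
    """Sequence length after n_pool non-overlapping pooling layers.

    ASSUMPTION: each pooling layer has size and stride equal to pool_size.
    """
    for _ in range(n_pool):
        length //= pool_size
    return length


if __name__ == "__main__":
    two_days = list(range(2 * 288))      # two days of 5-minute readings (placeholder values)
    windows = sliding_windows(two_days)
    print(len(windows))                  # number of samples generated
    print(pooled_length(288))            # encoder output length under the stated assumption
```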
... A filter with a particular size (e.g., FW: filter width, FH: filter height) travels throughout the input image [38]. In order to reduce loss during feature extraction, padding inserts zero values around the input image [17]. In convolutions, the stride specifies the filter's step size [2]. In order to create a feature map, the convolution layer applies a convolution filter to the two-dimensional input data. ...
Article
Background The facial landmark annotation of 3D facial images is crucial in clinical orthodontics and orthognathic surgeries for accurate diagnosis and treatment planning. While manual landmarking has traditionally been the gold standard, it is labour-intensive and prone to variability. Objective This study presents a framework for automated landmark detection in 3D facial images within a clinical context, using convolutional neural networks (CNNs), and it assesses its accuracy in comparison to that of ground-truth data. Material and methods Initially, an in-house dataset of 408 3D facial images, each annotated with 37 landmarks by an expert, was constructed. Subsequently, a 2.5D patch-based CNN architecture was trained using this dataset to detect the same set of landmarks automatically. Results The developed CNN model demonstrated high accuracy, with an overall mean localization error of 0.83 ± 0.49 mm. The majority of the landmarks had low localization errors, with 95% exhibiting a mean error of less than 1 mm across all axes. Moreover, the method achieved a high success detection rate, with 88% of detections having an error below 1.5 mm and 94% below 2 mm. Conclusion The automated method used in this study demonstrated accuracy comparable to that achieved with manual annotations within clinical settings. In addition, the proposed framework for automatic landmark localization exhibited improved accuracy over existing models in the literature. Despite these advancements, it is important to acknowledge the limitations of this research, such as that it was based on a single-centre study and a single annotator. Future work should address computational time challenges to achieve further enhancements. This approach has significant potential to improve the efficiency and accuracy of orthodontic and orthognathic procedures.
Article
Accurate estimations of near-surface S-wave velocity (Vs) models hold particular significance in geological and engineering investigations. On the one hand, the popular Multichannel Analysis of Surface Waves (MASW) is limited to the 1D and the plane wave assumptions. On the other hand, the more advanced and computationally expensive full-waveform inversion (FWI) approach is often solved within a deterministic framework that hampers an accurate uncertainty assessment and makes the final predictions heavily reliant on the starting model. Here we combine deep learning with Discrete Cosine Transforms (DCT) to solve the FWI of surface waves and to efficiently estimate the inversion uncertainties. Our neural network approach effectively learns the inverse non-linear mapping between DCT-compressed seismograms and DCT-compressed S-velocity models. The incorporation of DCT into the deep learning framework provides several advantages: it notably reduces parameter space dimensionality and alleviates the ill-conditioning of the problem. Additionally, it decreases the complexity of the network architecture and the computational cost for the training phase compared to training in the full domain. A Monte Carlo simulation is also used to propagate the uncertainties from the data to the model space. We first test the implemented inversion method on synthetic data to showcase the generalization capabilities of the trained network and to explore the implications of incorrect noise assumptions in the recorded seismograms and inaccurate wavelet estimations. Further, we demonstrate the applicability of the implemented method to field data. In this case, available borehole information is used to validate our predictions. In both the synthetic and field applications, the predictions provided by the proposed method are compared with those of a deterministic FWI and the outcomes of a network trained in the full data and model spaces. 
Our experiments confirm that the implemented deep-learning inversion efficiently and successfully solves the FWI problem and yields more accurate and stable results than a network trained without the DCT compression. This opens the possibility to efficiently train a neural network that provides accurate instantaneous predictions of Vs near-surface models and related uncertainties.
Article
Convolutional neural networks (CNNs) are omnipresent in modern computer vision models and are also widely used in other tasks such as voice recognition, time series analysis, machine translation, etc. In the present paper, we introduce a novel architecture of CNNs using dynamic convolutions in which the kernels are generated based on the input data. We apply this architecture to the image matching problem and develop a two-branch network in which one branch generates kernels used in convolutional layers of the other branch. We test our model on a canonical MNIST benchmark and demonstrate that it shows faster learning and better performance than the baseline model with standard convolutions. Potential applications of our architecture include numerous problems in image analysis, time series forecasting, physical-informed machine learning, etc.
Article
Advancements in space exploration and computing have accelerated progress in remote sensing studies, where imaging satellites capture extensive datasets globally, particularly in identifying green areas on agricultural lands essential for monitoring natural resources, promoting sustainable agriculture, and mitigating climate change. Large-volume satellite images from 2020 were obtained from https://tile.kayseri.bel.tr/tilecache/Cache/2020UYDU3857/z/x/y.jpeg. The images are stored on the server address of Kayseri Metropolitan Municipality. Traditional techniques struggle with classifying big data from these satellite views, necessitating innovative approaches like DGAG (Detect Green Areas with Geolocation), a novel method that combines interdisciplinary techniques to detect and geographically delineate green areas on agricultural lands globally. DGAG utilizes map-based open-source software to convert large-scale satellite views into processable images with spatial information, employing segmentation-based deep learning techniques such as You Only Look Once version 5 (YOLOv5) and Mask Region-based Convolutional Neural Network (Mask R-CNN) to extract green areas and determine pixel boundaries. The pixel borders are then transformed into spatial polygon data, providing hectare-level spatial information. Testing on actual satellite views of Kayseri province yielded promising results, with DGAG YOLOv5 and Mask R-CNN achieving F1 scores of 0.917 and 0.922, respectively. Notably, DGAG Mask R-CNN outperformed YOLOv5 by detecting 834626.42 square meters more green area. In terms of runtime, DGAG Mask R-CNN detected green areas in approximately 0.031 s, while DGAG YOLOv5 operated roughly twice as fast, detecting green areas in about 0.015 s.
Article
Intelligent vehicles detection (IVD) provides information to manage traffic efficiently, drive autonomous vehicles and feed data to intelligent traffic management systems (ITMS). IVD is a challenging task for the close proximity vehicles in lane-less traffic and heterogeneous systems. Most vehicle detection models are complex and limited to multi-scale feature extraction due to the involvement of existing feature extraction backbones. Also, they do not include heterogeneous traffic vehicles, usually present in developing countries. Therefore, this paper proposes a multi-class vehicle detection (MCVD) model to detect vehicles in heterogeneous traffic using a realistic traffic dataset from a developing country. MCVD is a deep learning (DL) model that consists of a convolutional neural network backbone called VDnet, a light fusion bi-directional feature pyramid network (LFBFPN) and a modified vehicle detection head (MVDH). VDnet extracts multi-scale features from the traffic input images using feature reuse methods. LFBFPN combines these features bi-directionally and provides robust feature maps. Finally, MVDH is applied to detect multi-class vehicles and classify them into respective categories. The proposed model achieves 91.45% mean average precision (mAP) on the heterogeneous traffic labeled dataset (HTLD). The proposed MCVD is tested over Nvidia Jetson TX2 edge computing boards to verify the real-time performance. It achieves 17 frames per second (FPS) on TX2. The performance evaluation results indicate that the proposed MCVD model is fast, accurate and better than the existing works.
Chapter
This chapter investigates the potential of graph neural networks (GNNs) in multi-image super-resolution (MISR), positioning them against current state-of-the-art techniques. It introduces foundational concepts of super-resolution reconstruction (SRR), with a focus on the distinctions between single-image super-resolution (SISR) and MISR, and reviews key models in the field. Theoretical underpinnings of graph theory relevant to SRR are explored, alongside detailed descriptions of the architecture and data preparation methodologies for two GNN models designed specifically for MISR tasks. An evaluation conducted using benchmark datasets assesses these models through a training procedure and a comprehensive set of performance metrics. Findings reveal both the strengths and limitations of GNNs in enhancing image resolution, showcasing their effectiveness, and identifying areas where improvements are needed compared to leading methods. The chapter concludes with a discussion of the implications of these results, highlighting the promising yet challenging role of GNNs in MISR and suggesting directions for future research to advance this field.
Article
Background Energy systems, as critical infrastructures (CI), constitute Cyber-Physical-Social Systems (CPSS). Due to their inherent complexity and the importance of service continuity of CIs, digitization in this context encounters significant practical challenges. Digital Twins (DT) have emerged over recent years as a promising solution for managing CPSSs by facilitating real-time interaction, synchronization, and control of physical assets. The selection of an appropriate architectural framework is crucial in constructing a DT, to ensure integration of enabling technologies and data from diverse sources. Objectives This study proposes a Systematic Literature Review (SLR) to examine technological enablers, design choices, management strategies and Computational Challenges of DTs in Smart Energy Systems (SES) by also analyzing existing architectures and identifying key components. Methods The SLR follows a rigorous workflow exploiting a multi-database search with predefined eligibility criteria, accompanied by advanced searching techniques, such as manual screening of results and a documented search strategy, in order to ensure its comprehensiveness and reliability. More specifically, research questions are first defined and then submitted as queries to scientific digital libraries (i.e., IEEE Xplore, Scopus, and WoS) selected due to their coverage and reliability (Google Scholar was excluded for the presence of grey literature and non-peer-reviewed material). Then, inclusion and exclusion criteria are established to filter the results and shortlist the significant publications. Subsequently, relevant data are extracted, summarized, and categorized in order to identify common themes, existing gaps, and future research directions, with the aim of providing a comprehensive overview of the current state of DTs for SESs. 
Results From the proposed DT-based solutions described in the selected publications, the adopted architectures are examined and categorized depending on their logical building blocks, microservices, enabling technologies, human–machine interfaces (HMI), artificial intelligence and machine learning (AI/ML) implementations, data flow and data persistence choices, and Internet-of-Things (IoT) components involved. Additionally, the integration of edge-cloud computing and IoT technologies in the literature is studied and discussed. Finally, gaps, opportunities, future study lines, and challenges of implementing DTs are thoroughly addressed. The results achieved also pave the way for a forthcoming design pattern catalog for DTs in CPSSs capable of supporting the engineering and research communities, by offering practical insights on implementation and integration aspects. Conclusion The proposed SLR provides a valuable resource for designing and implementing DTs of CPSSs in general and of SESs in particular. Furthermore, it highlights the potential benefits of adopting DTs to manage complex energy systems and it identifies areas for future research.
Article
Gatys et al. (2015) showed that optimizing pixels to match features in a convolutional network with respect to reference image features is a way to render images of high visual quality. We show that unrolling this gradient-based optimization yields a recurrent computation that creates images by incrementally adding onto a visual "canvas". We propose a recurrent generative model inspired by this view, and show that it can be trained using adversarial training to generate very good image samples. We also propose a way to quantitatively compare adversarial networks by having the generators and discriminators of these networks compete against each other.
Article
TensorFlow is an interface for expressing machine learning algorithms, and an implementation for executing such algorithms. A computation expressed using TensorFlow can be executed with little or no change on a wide variety of heterogeneous systems, ranging from mobile devices such as phones and tablets up to large-scale distributed systems of hundreds of machines and thousands of computational devices such as GPU cards. The system is flexible and can be used to express a wide variety of algorithms, including training and inference algorithms for deep neural network models, and it has been used for conducting research and for deploying machine learning systems into production across more than a dozen areas of computer science and other fields, including speech recognition, computer vision, robotics, information retrieval, natural language processing, geographic information extraction, and computational drug discovery. This paper describes the TensorFlow interface and an implementation of that interface that we have built at Google. The TensorFlow API and a reference implementation were released as an open-source package under the Apache 2.0 license in November, 2015 and are available at www.tensorflow.org.
Article
We propose a structured prediction architecture for images centered around deep recurrent neural networks. The proposed network, called ReSeg, is based on the recently introduced ReNet model for object classification. We modify and extend it to perform object segmentation, noting that the avoidance of pooling can greatly simplify pixel-wise tasks for images. The ReSeg layer is composed of four recurrent neural networks that sweep the image horizontally and vertically in both directions, along with a final layer that expands the prediction back to the original image size. ReSeg combines multiple ReSeg layers with several possible input layers as well as a final layer which expands the prediction back to the original image size, making it suitable for a variety of structured prediction tasks. We evaluate ReSeg on the specific task of object segmentation with three widely-used image segmentation datasets, namely Weizmann Horse, Fashionista and Oxford Flower. The results suggest that ReSeg can challenge the state of the art in object segmentation, and may have further applications in structured prediction at large.
Article
Caffe provides multimedia scientists and practitioners with a clean and modifiable framework for state-of-the-art deep learning algorithms and a collection of reference models. The framework is a BSD-licensed C++ library with Python and MATLAB bindings for training and deploying general-purpose convolutional neural networks and other deep models efficiently on commodity architectures. Caffe fits industry and internet-scale media needs by CUDA GPU computation, processing over 40 million images a day on a single K40 or Titan GPU (approx 2 ms per image). By separating model representation from actual implementation, Caffe allows experimentation and seamless switching among platforms for ease of development and deployment from prototyping machines to cloud environments. Caffe is maintained and developed by the Berkeley Vision and Learning Center (BVLC) with the help of an active community of contributors on GitHub. It powers ongoing research projects, large-scale industrial applications, and startup prototypes in vision, speech, and multimedia.
Article
Theano is a linear algebra compiler that optimizes a user's symbolically-specified mathematical computations to produce efficient low-level implementations. In this paper, we present new features and efficiency improvements to Theano, and benchmarks demonstrating Theano's performance relative to Torch7, a recently introduced machine learning library, and to RNNLM, a C++ library targeted at recurrent neural networks.
Article
Theano is a compiler for mathematical expressions in Python that combines the convenience of NumPy's syntax with the speed of optimized native machine language. The user composes mathematical expressions in a high-level description that mimics NumPy's syntax and semantics, while being statically typed and functional (as opposed to imperative). These expressions allow Theano to provide symbolic differentiation. Before performing computation, Theano optimizes the choice of expressions, translates them into C++ (or CUDA for GPU), and compiles them into dynamically loaded Python modules, all automatically. Common machine learning algorithms implemented with Theano are from 1.6× to 7.5× faster than competitive alternatives (including those implemented with C/C++, NumPy/SciPy and MATLAB) when compiled for the CPU and between 6.5× and 44× faster when compiled for the GPU. This paper illustrates how to use Theano, outlines the scope of the compiler, provides benchmarks on both CPU and GPU processors, and explains its overall design.
Conference Paper
Many successful models for scene or object recognition transform low-level descriptors (such as Gabor filter responses, or SIFT descriptors) into richer representations of intermediate complexity. This process can often be broken down into two steps: (1) a coding step, which performs a pointwise transformation of the descriptors into a representation better adapted to the task, and (2) a pooling step, which summarizes the coded features over larger neighborhoods. Several combinations of coding and pooling schemes have been proposed in the literature. The goal of this paper is threefold. We seek to establish the relative importance of each step of mid-level feature extraction through a comprehensive cross evaluation of several types of coding modules (hard and soft vector quantization, sparse coding) and pooling schemes (by taking the average, or the maximum), which obtains state-of-the-art performance or better on several recognition benchmarks. We show how to improve the best performing coding scheme by learning a supervised discriminative dictionary for sparse coding. We provide theoretical and empirical insight into the remarkable performance of max pooling. By teasing apart components shared by modern mid-level feature extractors, our approach aims to facilitate the design of better recognition architectures.
Conference Paper
Invariant representations in object recognition systems are generally obtained by pooling feature vectors over spatially local neighborhoods. But pooling is not local in the feature vector space, so that widely dissimilar features may be pooled together if they are in nearby locations. Recent approaches rely on sophisticated encoding methods and more specialized codebooks (or dictionaries), e.g., learned on subsets of descriptors which are close in feature space, to circumvent this problem. In this work, we argue that a common trait found in much recent work in image recognition or retrieval is that it leverages locality in feature space on top of purely spatial locality. We propose to apply this idea in its simplest form to an object recognition system based on the spatial pyramid framework, to increase the performance of small dictionaries with very little added engineering. State-of-the-art results on several object recognition benchmarks show the promise of this approach.
Conference Paper
We propose a new machine learning paradigm called multilayer graph transformer network that extends the applicability of gradient-based learning algorithms to systems composed of modules that take graphs as input and produce graphs as output. A complete check reading system based on this concept is described. The system combines convolutional neural network character recognizers with graph-based stochastic models trained cooperatively at the document level. It is deployed commercially and reads millions of business and personal checks per month with record accuracy.
Conference Paper
Torch7 is a versatile numeric computing framework and machine learning library that extends Lua. Its goal is to provide a flexible environment to design and train learning machines. Flexibility is obtained via Lua, an extremely lightweight scripting language. High performance is obtained via efficient OpenMP/SSE and CUDA implementations of low-level numeric routines. Torch7 can easily be interfaced to third-party software thanks to Lua's light interface.
Conference Paper
We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called dropout that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
Article
Convolutional networks are powerful visual models that yield hierarchies of features. We show that convolutional networks by themselves, trained end-to-end, pixels-to-pixels, exceed the state-of-the-art in semantic segmentation. Our key insight is to build "fully convolutional" networks that take input of arbitrary size and produce correspondingly-sized output with efficient inference and learning. We define and detail the space of fully convolutional networks, explain their application to spatially dense prediction tasks, and draw connections to prior models. We adapt contemporary classification networks (AlexNet, the VGG net, and GoogLeNet) into fully convolutional networks and transfer their learned representations by fine-tuning to the segmentation task. We then define a novel architecture that combines semantic information from a deep, coarse layer with appearance information from a shallow, fine layer to produce accurate and detailed segmentations. Our fully convolutional network achieves state-of-the-art segmentation of PASCAL VOC (20% relative improvement to 62.2% mean IU on 2012), NYUDv2, and SIFT Flow, while inference takes one third of a second for a typical image.
Conference Paper
In recent years, supervised learning with convolutional networks (CNNs) has seen huge adoption in computer vision applications. Comparatively, unsupervised learning with CNNs has received less attention. In this work we hope to help bridge the gap between the success of CNNs for supervised learning and unsupervised learning. We introduce a class of CNNs called deep convolutional generative adversarial networks (DCGANs), that have certain architectural constraints, and demonstrate that they are a strong candidate for unsupervised learning. Training on various image datasets, we show convincing evidence that our deep convolutional adversarial pair learns a hierarchy of representations from object parts to scenes in both the generator and discriminator. Additionally, we use the learned features for novel tasks - demonstrating their applicability as general image representations.
Conference Paper
Large Convolutional Neural Network models have recently demonstrated impressive classification performance on the ImageNet benchmark (Krizhevsky et al., 2012). However there is no clear understanding of why they perform so well, or how they might be improved. In this paper we address both issues. We introduce a novel visualization technique that gives insight into the function of intermediate feature layers and the operation of the classifier. We also perform an ablation study to discover the performance contribution from different model layers. This enables us to find model architectures that outperform Krizhevsky et al. on the ImageNet classification benchmark. We show our ImageNet model generalizes well to other datasets: when the softmax classifier is retrained, it convincingly beats the current state-of-the-art results on Caltech-101 and Caltech-256 datasets.
Conference Paper
Recently two anomalous results in the literature have shown that certain feature learning architectures can yield useful features for object recognition tasks even with untrained, random weights. In this paper we pose the question: why do random weights sometimes do so well? Our answer is that certain convolutional pooling architectures can be inherently frequency selective and translation invariant, even with random weights. Based on this we demonstrate the viability of extremely fast architecture search by using random weights to evaluate candidate architectures, thereby sidestepping the time-consuming learning process. We then show that a surprising fraction of the performance of certain state-of-the-art methods can be attributed to the architecture alone.
Conference Paper
We present a hierarchical model that learns image decompositions via alternating layers of convolutional sparse coding and max pooling. When trained on natural images, the layers of our model capture image information in a variety of forms: low-level edges, mid-level edge junctions, high-level object parts and complete objects. To build our model we rely on a novel inference scheme that ensures each layer reconstructs the input, rather than just the output of the layer directly beneath, as is common with existing hierarchical approaches. This makes it possible to learn multiple layers of representation and we show models with 4 layers, trained on images from the Caltech-101 and 256 datasets. When combined with a standard classifier, features extracted from these models outperform SIFT, as well as representations from other feature learning methods.
Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep learning. Book in preparation for MIT Press.
Bastien, F., Lamblin, P., Pascanu, R., Bergstra, J., Goodfellow, I., Bergeron, A., Bouchard, N., Warde-Farley, D., and Bengio, Y. (2012). Theano: new features and speed improvements. arXiv preprint arXiv:1211.5590.
Im, D. J., Kim, C. D., Jiang, H., and Memisevic, R. (2016). Generating images with recurrent adversarial networks. arXiv preprint arXiv:1602.05110.
Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., and Darrell, T. (2014). Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the ACM International Conference on Multimedia, pages 675-678. ACM.