Conference Paper

Convolutional Networks for Images, Speech, and Time-Series

Authors: Yann LeCun, Yoshua Bengio

Abstract and Figures

INTRODUCTION The ability of multilayer back-propagation networks to learn complex, high-dimensional, nonlinear mappings from large collections of examples makes them obvious candidates for image recognition or speech recognition tasks (see PATTERN RECOGNITION AND NEURAL NETWORKS). In the traditional model of pattern recognition, a hand-designed feature extractor gathers relevant information from the input and eliminates irrelevant variabilities. A trainable classifier then categorizes the...
[Figure: LeNet-style convolutional network. INPUT 28x28 → convolution → 4@24x24 feature maps → subsampling → 4@12x12 feature maps → convolution → 12@8x8 feature maps → subsampling → 12@4x4 feature maps → convolution → OUTPUT 26@1x1. A second panel shows an SDNN built from the single-character recognizer.]
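To connect the figure to code, here is a minimal PyTorch sketch that matches the feature-map sizes above; the 5x5 kernels, average-pooling subsampling, and tanh nonlinearities are assumptions consistent with LeNet-style networks, not a reproduction of the paper's exact layers.

```python
# Minimal sketch of the figure's layer sizes (kernel sizes and nonlinearities are assumptions).
import torch
import torch.nn as nn

class SmallConvNet(nn.Module):
    def __init__(self, n_classes=26):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 4, kernel_size=5),           # 1@28x28 -> 4@24x24
            nn.Tanh(),
            nn.AvgPool2d(2),                          # 4@24x24 -> 4@12x12 (subsampling)
            nn.Conv2d(4, 12, kernel_size=5),          # 4@12x12 -> 12@8x8
            nn.Tanh(),
            nn.AvgPool2d(2),                          # 12@8x8  -> 12@4x4 (subsampling)
            nn.Conv2d(12, n_classes, kernel_size=4),  # 12@4x4  -> 26@1x1
        )

    def forward(self, x):
        return self.features(x).flatten(1)            # (batch, 26) class scores

scores = SmallConvNet()(torch.randn(1, 1, 28, 28))
print(scores.shape)   # torch.Size([1, 26])
```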
... Certain studies incorporate macroeconomic factors and volatility measures to capture broader market sentiment [13]. Moreover, feature extractors like multi-layer perceptrons (MLPs), Convolutional Neural Networks (CNNs) [22], and LSTM networks have been used to detect patterns in these time-series data, enabling RL agents to make informed decisions about market conditions [5]. The choice of feature extractor plays a critical role in shaping the agent's capacity to detect trends and forecast market behavior [13]. ...
... The CNN [22] employs a different architecture optimized for processing structured data, such as images or time series, where local patterns and correlations are important. Instead of fully connecting each layer, CNNs apply convolutional filters to local regions of the input, allowing the network to extract hierarchical features. ...
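As a concrete illustration of filters applied to local regions of a time series (a toy sketch with an assumed window length, not code from the cited work):

```python
# Sliding a small filter over a 1D series: each output depends only on a local window.
import numpy as np

def conv1d(series, kernel):
    k = len(kernel)
    return np.array([np.dot(series[i:i + k], kernel)
                     for i in range(len(series) - k + 1)])

prices = np.array([10.0, 10.2, 10.1, 10.6, 11.0, 10.8, 11.3])
edge_filter = np.array([-1.0, 0.0, 1.0])   # responds to local upward or downward moves
print(conv1d(prices, edge_filter))         # [0.1 0.4 0.9 0.2 0.3]
```

Stacking such layers, each operating on the outputs of the previous one, is what yields the hierarchical features described above.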
Article
Full-text available
This paper presents a systematic exploration of deep reinforcement learning (RL) for portfolio optimization and compares various agent architectures, such as the DQN, DDPG, PPO, and SAC. We evaluate these agents’ performance across multiple market signals, including OHLC price data and technical indicators, while incorporating different rebalancing frequencies and historical window lengths. This study uses six major financial indices and a risk-free asset as the core instruments. Our results show that CNN-based feature extractors, particularly with longer lookback periods, significantly outperform MLP models, providing superior risk-adjusted returns. DQN and DDPG agents consistently surpass market benchmarks, such as the S&P 500, in annualized returns. However, continuous rebalancing leads to higher transaction costs and slippage, making periodic rebalancing a more efficient approach to managing risk. This research offers valuable insights into the adaptability of RL agents to dynamic market conditions, proposing a robust framework for future advancements in financial machine learning.
... The output layer in this study is responsible for classifying whether electricity theft has been detected [41]. ...
Article
Full-text available
In a smart power grid, consumers can hack their smart meters to report low electricity consumption readings and reduce their bills, launching electricity theft cyberattacks. This study investigates a Trojan attack in federated learning of a detector for electricity theft. In this attack, dishonest consumers train the detector on false data to later bypass detection, without degrading the detector's overall performance. We propose three defense strategies: Redundancy, Med-Selection, and Combined-Selection. In the Redundancy approach, redundant consumers with similar consumption patterns are included in the federated learning process, so their correct data offsets the attackers' false data when the local models are aggregated. Med-Selection selects the median model parameters of consumers with similar usage patterns to reduce outlier influence. In Combined-Selection, we compare gradients from consumers with the same consumption patterns to the median of all local models, leveraging the fact that honest consumers' gradients are closer to the median while malicious ones deviate. Our experiments using real-world data show the Trojan attack's success rate can reach 90%. However, our defense methods reduce the attack success rate to about 7%, 4%, and 3.3% for Redundancy, Med-Selection, and Combined-Selection, respectively, when 10% of consumers are malicious.
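A minimal sketch of the median-style aggregation idea behind Med-Selection, assuming each local model is a flat parameter vector (the grouping of consumers by consumption pattern is omitted here):

```python
# Coordinate-wise median aggregation: outlier (poisoned) updates have little pull.
import numpy as np

def median_aggregate(local_models):
    """local_models: list of 1-D parameter vectors from consumers with similar patterns."""
    return np.median(np.stack(local_models), axis=0)

honest = [np.array([0.9, 1.1, 1.0]), np.array([1.0, 1.0, 0.9]), np.array([1.1, 0.9, 1.1])]
poisoned = [np.array([5.0, -4.0, 6.0])]          # attacker's false update
print(median_aggregate(honest + poisoned))       # stays near [1.0, 1.0, 1.0]
```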
... In order to predict the optimal motor commands to maximize grasp success, we use convolutional neural networks (CNNs) trained on grasp success prediction. Although the technology behind CNNs has been known for decades (LeCun & Bengio, 1995), they have achieved remarkable success in recent years on a wide range of challenging computer vision benchmarks (Krizhevsky et al., 2012), becoming the de facto standard for computer vision systems. However, applications of CNNs to robotic control problems have been less prevalent than applications to passive perception tasks such as object recognition (Krizhevsky et al., 2012; Wohlhart & Lepetit, 2015), localization (Girshick et al., 2014), and segmentation (Chen et al., 2014). ...
Preprint
We describe a learning-based approach to hand-eye coordination for robotic grasping from monocular images. To learn hand-eye coordination for grasping, we trained a large convolutional neural network to predict the probability that task-space motion of the gripper will result in successful grasps, using only monocular camera images and independently of camera calibration or the current robot pose. This requires the network to observe the spatial relationship between the gripper and objects in the scene, thus learning hand-eye coordination. We then use this network to servo the gripper in real time to achieve successful grasps. To train our network, we collected over 800,000 grasp attempts over the course of two months, using between 6 and 14 robotic manipulators at any given time, with differences in camera placement and hardware. Our experimental evaluation demonstrates that our method achieves effective real-time control, can successfully grasp novel objects, and corrects mistakes by continuous servoing.
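As a rough illustration of the servoing idea, here is a hypothetical sketch: `grasp_success_prob` stands in for the trained network, and the random-search optimizer over candidate motions is an assumption, not the paper's exact procedure.

```python
# Pick the candidate gripper motion the network scores highest, then command it.
import numpy as np

def grasp_success_prob(image, motion):
    """Placeholder for the trained CNN: returns P(success | image, motion)."""
    return np.exp(-np.linalg.norm(motion - 0.3))   # toy stand-in, not a real model

def servo_step(image, n_candidates=64, rng=np.random.default_rng(0)):
    candidates = rng.uniform(-1.0, 1.0, size=(n_candidates, 3))   # task-space motions
    scores = [grasp_success_prob(image, m) for m in candidates]
    return candidates[int(np.argmax(scores))]      # motion to execute next

print(servo_step(image=None))
```

Repeating this selection as new images arrive gives the continuous servoing behavior the abstract describes.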
... Let X = (x_1, ..., x_i, ..., x_n) denote the raw waveform of an utterance, where each x_i is a sampled data point. Self-supervised learning methods first extract representations Z = (z_1, ..., z_t, ..., z_T) for each frame t using a function h, which in general can be either a human-designed function, like an MFCC extractor [44], or a learned deep network, like a convolutional neural network [45]. A special case arises when X represents human-designed spectral features. ...
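A small sketch of the X -> Z framing step, assuming a hypothetical feature function h applied to fixed-length overlapping windows (frame and hop lengths are illustrative):

```python
# Split a raw waveform into frames and map each frame to a representation z_t = h(frame).
import numpy as np

def extract_representations(x, h, frame_len=400, hop=160):
    frames = [x[s:s + frame_len] for s in range(0, len(x) - frame_len + 1, hop)]
    return np.stack([h(f) for f in frames])          # shape (T, feature_dim)

h = lambda frame: np.array([frame.mean(), frame.std(), np.abs(frame).max()])  # toy h
waveform = np.random.default_rng(0).standard_normal(16000)   # 1 s of audio at 16 kHz
Z = extract_representations(waveform, h)
print(Z.shape)   # (T, 3)
```

In practice h would be an MFCC extractor or a learned convolutional network rather than the toy statistics used here.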
Preprint
Recently, self-supervised learning (SSL) from unlabelled speech data has gained increased attention in the automatic speech recognition (ASR) community. Typical SSL methods include autoregressive predictive coding (APC), Wav2vec2.0, and hidden unit BERT (HuBERT). However, SSL models are biased to the pretraining data. When SSL models are finetuned with data from another domain, domain shifting occurs and might cause limited knowledge transfer for downstream tasks. In this paper, we propose a novel framework, domain responsible adaptation and finetuning (DRAFT), to reduce domain shifting in pretrained speech models, and evaluate it for a causal and non-causal transformer. For the causal transformer, an extension of APC (E-APC) is proposed to learn richer information from unlabelled data by using multiple temporally-shifted sequences to perform prediction. For the non-causal transformer, various solutions for using the bidirectional APC (Bi-APC) are investigated. In addition, the DRAFT framework is examined for Wav2vec2.0 and HuBERT methods, which use non-causal transformers as the backbone. The experiments are conducted on child ASR (using the OGI and MyST databases) using SSL models trained with unlabelled adult speech data from Librispeech. The relative WER improvements of up to 19.7% on the two child tasks are observed when compared to the pretrained models without adaptation. With the proposed methods (E-APC and DRAFT), the relative WER improvements are even larger (30% and 19% on the OGI and MyST data, respectively) when compared to the models without using pretraining methods.
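A rough sketch of the multi-shift predictive idea behind APC-style pretraining, under assumed notation (a toy loss, not the E-APC implementation): given frame features z_1..z_T and a causal predictor, a regression loss is summed over several temporal shifts n.

```python
# Toy multi-shift predictive loss: predict the frame n steps ahead, summed over shifts.
import numpy as np

def multi_shift_loss(Z, predictor, shifts=(1, 2, 3)):
    total = 0.0
    for n in shifts:
        pred = np.array([predictor(Z[:t + 1]) for t in range(len(Z) - n)])
        total += np.mean((pred - Z[n:]) ** 2)        # regress against the shifted targets
    return total

Z = np.random.default_rng(0).standard_normal((50, 8))   # (T, feature_dim) frame features
predictor = lambda history: history[-1]                 # toy causal predictor
print(multi_shift_loss(Z, predictor))
```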
... The architecture of an Artificial Neural Network (ANN) strongly influences its performance [17,11,26]. However, designing the structure of an artificial neural network is a complex task requiring expert knowledge and extensive experimentation. ...
Preprint
Artificial Neural Networks have shown impressive success in very different application cases. Choosing a proper network architecture is a critical decision for a network's success, usually done in a manual manner. As a straightforward strategy, large, mostly fully connected architectures are selected, thereby relying on a good optimization strategy to find proper weights while at the same time avoiding overfitting. However, large parts of the final network are redundant. In the best case, large parts of the network become simply irrelevant for later inferencing. In the worst case, highly parameterized architectures hinder proper optimization and allow the easy creation of adversarial examples fooling the network. A first step in removing irrelevant architectural parts lies in identifying those parts, which requires measuring the contribution of individual components such as neurons. In previous work, heuristics based on using the weight distribution of a neuron as a contribution measure have shown some success, but do not provide a proper theoretical understanding. Therefore, in our work we investigate game theoretic measures, namely the Shapley value (SV), in order to separate relevant from irrelevant parts of an artificial neural network. We begin by designing a coalitional game for an artificial neural network, where neurons form coalitions and the average contributions of neurons to coalitions yield the Shapley value. In order to measure how well the Shapley value measures the contribution of individual neurons, we remove low-contributing neurons and measure the impact on network performance. In our experiments we show that the Shapley value outperforms other heuristics for measuring the contribution of neurons.
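A minimal Monte-Carlo sketch of estimating a neuron's Shapley value from random orderings, assuming a hypothetical performance(coalition) evaluation function (not the paper's experimental setup):

```python
# Monte-Carlo Shapley estimate: average a neuron's marginal contribution over random orderings.
import numpy as np

def shapley_estimate(neuron, all_neurons, performance, n_samples=200,
                     rng=np.random.default_rng(0)):
    total = 0.0
    for _ in range(n_samples):
        order = list(rng.permutation(all_neurons))
        before = set(order[:order.index(neuron)])          # coalition formed before `neuron`
        total += performance(before | {neuron}) - performance(before)
    return total / n_samples

# Toy "network performance": neurons 0 and 1 are redundant, neuron 2 is useless.
performance = lambda coalition: 1.0 if (0 in coalition or 1 in coalition) else 0.0
neurons = [0, 1, 2]
print([round(shapley_estimate(n, neurons, performance), 2) for n in neurons])
# roughly [0.5, 0.5, 0.0]: the redundant pair split the credit, the useless neuron gets none
```

Removing the neurons with the lowest estimated values and re-measuring accuracy is then the pruning test the abstract describes.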
... [32,33] The key advantage of a DCNN is the ability of the convolutional layers to 'learn' abstract features that are mostly independent of position and scale, allowing them to be useful in predictions [34]. Interestingly, whether a DCNN can be used in the determination of crystal structure is unknown, because in this case the features (diffraction spots) are identical, but the classification hinges on their relative positions and angles. In theory, this is a poorly suited task for fully convolutional neural networks, given that each layer receives input only from the layer immediately preceding it. ...
Preprint
Understanding transformations under electron beam irradiation requires mapping the structural phases and their evolution in real time. To date, this has mostly been a manual endeavor comprising difficult frame-by-frame analysis that is simultaneously tedious and prone to error. Here, we turn towards the use of deep convolutional neural networks (DCNN) to automatically determine the Bravais lattice symmetry present in atomically-resolved images. A DCNN is trained to identify the Bravais lattice class given a 2D fast Fourier transform of the input image. Monte-Carlo dropout is used for determining the prediction probability, and results are shown for both simulated and real atomically-resolved images from scanning tunneling microscopy and scanning transmission electron microscopy. A reduced representation of the final layer output allows visualization of the separation of classes in the DCNN and agrees with physical intuition. We then apply the trained network to electron beam-induced transformations in WS2, which allows tracking and determination of the growth rate of voids. These results are novel in two ways: (1) they show that DCNNs can be trained to recognize diffraction patterns, which is markedly different from the typical "real image" cases, and (2) they provide a method with in-built uncertainty quantification, allowing the real-time analysis of phases present in atomically resolved images.
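A small sketch of the Monte-Carlo dropout step described above, assuming a hypothetical classifier with dropout layers: the 2D FFT magnitude of the image is fed through the network many times with dropout active, and the spread of the predictions serves as an uncertainty estimate (the architecture and class count are illustrative).

```python
# MC dropout: keep dropout active at inference and average softmax outputs over many passes.
import torch
import torch.nn as nn

net = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 128), nn.ReLU(),
                    nn.Dropout(0.5), nn.Linear(128, 5))     # class count is illustrative

def mc_dropout_predict(net, image, n_passes=50):
    fft_mag = torch.fft.fft2(image).abs().log1p()            # 2D FFT magnitude of the image
    net.train()                                              # keeps Dropout stochastic
    with torch.no_grad():
        probs = torch.stack([torch.softmax(net(fft_mag[None]), dim=-1)
                             for _ in range(n_passes)])
    return probs.mean(0), probs.std(0)                       # mean class probabilities + spread

mean_p, std_p = mc_dropout_predict(net, torch.randn(64, 64))
print(mean_p.shape, std_p.shape)
```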
... Sparseness is desirable and full connectivity is unnecessary. In fact, Han et al. (2015) have shown that many weak connections in the fully-connected layers of Convolutional Neural Networks (CNNs) (LeCun and Bengio, 1995) can be pruned without incurring any accuracy loss. The convolutional layers of CNNs are sparse, and this sparsity is considered one of the key factors behind the success of CNNs. ...
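A minimal sketch of the magnitude-based pruning idea referenced above (the keep-fraction threshold is an assumption about the criterion, and the retraining step Han et al. use after pruning is omitted):

```python
# Zero out the smallest-magnitude weights of a fully-connected layer.
import numpy as np

def prune_by_magnitude(W, keep_fraction=0.1):
    k = int(W.size * keep_fraction)
    threshold = np.sort(np.abs(W), axis=None)[-k]   # magnitude of the k-th largest weight
    return np.where(np.abs(W) >= threshold, W, 0.0)

W = np.random.default_rng(0).standard_normal((256, 128))
W_sparse = prune_by_magnitude(W, keep_fraction=0.1)
print(np.mean(W_sparse != 0))   # ~0.1 of the connections survive
```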
Preprint
Latent tree analysis seeks to model the correlations among a set of random variables using a tree of latent variables. It was proposed as an improvement to latent class analysis --- a method widely used in social sciences and medicine to identify homogeneous subgroups in a population. It provides new and fruitful perspectives on a number of machine learning areas, including cluster analysis, topic detection, and deep probabilistic modeling. This paper gives an overview of the research on latent tree analysis and various ways it is used in practice.
Article
Full-text available
Transfer learning techniques for structural health monitoring in bridge-type structures are investigated, focusing on model generalizability and domain adaptation challenges. Finite element models of bridge-type structures with varying geometry were simulated using the OpenSeesPy platform. Different levels of damage states were introduced at the midspans of these models, and Gaussian-based load time histories were applied at mid-span for dynamic time-history analysis to calculate acceleration data. Then, this acceleration time-history series was transformed into grayscale images, serving as inputs for a Convolutional Neural Network developed to detect and classify structural damage states. Initially, it was trained and tested on datasets derived from a Single-Source Domain structure, achieving perfect accuracy (1.0) in a ten-label multi-class classification task. However, this accuracy significantly decreased when the model was sequentially tested on structures with different geometry without retraining. To address this challenge, it is proposed that transfer learning be employed via feature extraction and joint training. The model showed a reduction in accuracy percentage when adapting from a Single-Source Domain to Multiple-Target Domains, revealing potential issues with non-homogeneous data distribution and catastrophic forgetting. Conversely, joint training, which involves training on all datasets except the specific Target Domain, generated a generalized network that effectively mitigated these issues and maintained high accuracy in predicting unseen class labels. This study highlights the integration of simulation data into the Deep Learning-based SHM framework, demonstrating that a generalized model created via Joint Learning utilizing FEM can potentially reduce the consequences of modeling errors and operational uncertainties unavoidable in real-world applications.
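A small sketch of the step that turns an acceleration time history into a grayscale image for the CNN input; the reshaping and min-max normalization used here are assumptions, not the paper's exact procedure.

```python
# Normalize an acceleration record to [0, 255] and reshape it into a square grayscale image.
import numpy as np

def series_to_grayscale(acc, side=64):
    acc = acc[:side * side]                                     # crop to a square sample count
    img = (acc - acc.min()) / (acc.max() - acc.min() + 1e-12)   # scale to [0, 1]
    return (img.reshape(side, side) * 255).astype(np.uint8)

acceleration = np.random.default_rng(0).standard_normal(5000)   # simulated mid-span response
image = series_to_grayscale(acceleration)
print(image.shape, image.dtype)   # (64, 64) uint8
```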
Article
Full-text available
Former experiments have shown the benefit of using specific multi-layer architectures, the so-called time delay neural networks, for phoneme recognition (Waibel, Hanazawa, Hinton, Shikano, & Lang, 1988). Similar experiments on a speaker-independent task were also performed on a small set of minimal pairs (Bottou, 1988). In this paper we focus on a speaker-independent, global word recognition task with time delay networks. We first describe these networks as a way for learning feature extractors by constrained back-propagation. Such a time-delay network is shown to be capable of dealing with a near real-sized problem: French digit recognition. The results are discussed and compared, on the same data sets, with those obtained with a classical time warping system.
Conference Paper
Full-text available
This paper describes the use of a convolutional neural network to perform address block location on machine-printed mail pieces. Locating the address block is a difficult object recognition problem because there is often a large amount of extraneous printing on a mail piece and because address blocks vary dramatically in size and shape. We used a convolutional locator network with four outputs, each trained to find a different corner of the address block. A simple set of rules was used to generate ABL candidates from the network output. The system performs very well: when allowed five guesses, the network will tightly bound the address delivery information in 98.2% of the cases.
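A rough sketch of how per-corner outputs might be turned into address-block candidates; this is hypothetical (only the top-left and bottom-right maps are used, and the paper's actual rule set is not reproduced here).

```python
# Combine peak responses from corner maps into box candidates and keep the best-scoring ones.
import numpy as np
from itertools import product

def block_candidates(corner_maps, top_k=3):
    """corner_maps: dict with 'tl', 'tr', 'bl', 'br' heat maps of equal shape."""
    peaks = {}
    for name, cmap in corner_maps.items():
        idx = np.argsort(cmap, axis=None)[-top_k:]               # top-k responses per corner
        peaks[name] = [np.unravel_index(i, cmap.shape) for i in idx]
    boxes = []
    for tl, br in product(peaks['tl'], peaks['br']):
        if br[0] > tl[0] and br[1] > tl[1]:                      # keep geometrically valid boxes
            score = corner_maps['tl'][tl] + corner_maps['br'][br]
            boxes.append((score, tl, br))
    return sorted(boxes, reverse=True)[:5]                       # five best guesses

rng = np.random.default_rng(0)
maps = {k: rng.random((32, 32)) for k in ('tl', 'tr', 'bl', 'br')}
print(block_candidates(maps)[:1])
```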
Chapter
Threshold functions and related operators are widely used as basic elements of adaptive and associative networks [Nakano 72, Amari 72, Hopfield 82]. There exist numerous learning rules for finding a set of weights to achieve a particular correspondence between input-output pairs. But early works in the field have shown that the number of threshold functions (or linearly separable functions) in N binary variables is small compared to the number of all possible boolean mappings in N variables, especially if N is large. This problem is one of the main limitations of most neural network models where the state is fully specified by the environment during learning: they can only learn linearly separable functions of their inputs. Moreover, a learning procedure which requires the outside world to specify the state of every neuron during the learning session can hardly be considered a general learning rule because in real-world conditions, only partial information on the "ideal" network state for each task is available from the environment. It is possible to use a set of so-called "hidden units" [Hinton, Sejnowski, Ackley 84], without direct interaction with the environment, which can compute intermediate predicates. Unfortunately, the global response depends on the output of a particular hidden unit in a highly non-linear way; moreover, the nature of this dependence is influenced by the states of the other cells.
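A concrete instance of the limitation: XOR is not linearly separable, so no single threshold unit computes it, while adding one hidden threshold unit (here computing AND as an intermediate predicate) makes it expressible. The specific weights below are illustrative.

```python
# XOR cannot be computed by one threshold unit, but it can with a single hidden AND unit.
from itertools import product

step = lambda v: 1 if v > 0 else 0            # threshold unit

def xor_with_hidden(x1, x2):
    h = step(x1 + x2 - 1.5)                   # hidden unit computes AND(x1, x2)
    return step(x1 + x2 - 2 * h - 0.5)        # OR(x1, x2) suppressed in the AND case = XOR

for x1, x2 in product([0, 1], repeat=2):
    print(x1, x2, "->", xor_with_hidden(x1, x2))

# No single unit step(w1*x1 + w2*x2 + b) can do this: it would need b <= 0, w1 + b > 0,
# w2 + b > 0 and w1 + w2 + b <= 0, but adding the two strict inequalities gives
# w1 + w2 + 2b > 0, i.e. w1 + w2 + b > -b >= 0, a contradiction.
```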
Article
We describe a new learning procedure, back-propagation, for networks of neurone-like units. The procedure repeatedly adjusts the weights of the connections in the network so as to minimize a measure of the difference between the actual output vector of the net and the desired output vector. As a result of the weight adjustments, internal 'hidden' units which are not part of the input or output come to represent important features of the task domain, and the regularities in the task are captured by the interactions of these units. The ability to create useful new features distinguishes back-propagation from earlier, simpler methods such as the perceptron-convergence procedure.
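A compact numerical sketch of the weight-adjustment loop the abstract describes, for a tiny one-hidden-layer sigmoid network trained on XOR with squared error; the layer sizes and learning rate are illustrative choices, not the paper's setup.

```python
# Backpropagation on a 2-3-1 sigmoid network: repeatedly nudge weights down the error gradient.
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0.0], [1.0], [1.0], [0.0]])
W1, b1 = rng.standard_normal((2, 3)), np.zeros(3)     # 2 inputs -> 3 hidden units
W2, b2 = rng.standard_normal((3, 1)), np.zeros(1)     # 3 hidden units -> 1 output
sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))

for _ in range(20000):
    h = sigmoid(X @ W1 + b1)                  # forward pass
    out = sigmoid(h @ W2 + b2)
    d_out = (out - y) * out * (1 - out)       # backward pass: output-layer error signal
    d_h = (d_out @ W2.T) * h * (1 - h)        # propagate the error to the hidden layer
    W2 -= 0.5 * h.T @ d_out; b2 -= 0.5 * d_out.sum(0)
    W1 -= 0.5 * X.T @ d_h;   b1 -= 0.5 * d_h.sum(0)

print(out.round(2).ravel())                   # typically close to [0, 1, 1, 0]
```

The hidden units end up encoding intermediate features of the task, which is exactly the ability the abstract highlights.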
Conference Paper
The backpropagation algorithm can be used for both recognition and generation of time trajectories. When used as a recognizer, it has been shown that the performance of a network can be greatly improved by adding structure to the architecture. The same is true in trajectory generation. In particular, a new architecture corresponding to a "reversed" TDNN is proposed. Results show dramatic improvement of performance in the generation of hand-written characters. A combination of TDNN and...
Article
We describe a system which can recognize digits and uppercase letters handprinted on a touch terminal. A character is input as a sequence of [x(t), y(t)] coordinates, subjected to very simple preprocessing, and then classified by a trainable neural network. The classifier is analogous to “time delay neural networks” previously applied to speech recognition. The network was trained on a set of 12,000 digits and uppercase letters, from approximately 250 different writers, and tested on 2500 such characters from other writers. Classification accuracy exceeded 96% on the test examples.