Article

Enhancing parcel singulation efficiency through transformer-based position attention and state space augmentation

... Object detection is a fundamental and challenging task in the computer vision field [21]. Its objective is to find regions of interest in an image and return their classification and localization results. ...
... Consequently, the key approach involves establishing a visual memory of a series of SEM images and classifying the SEM image at the center of each region of interest. Typically, embedding a series of images is accomplished via methodologies based on implicit memory [19], such as Recurrent Neural Networks (RNN [20]), Long Short-Term Memory (LSTM [21]), and Gated Recurrent Units (GRU [22]). However, these techniques often face optimization challenges when handling a succession of SEM images. ...
Article
Lithography stands as a critical step in the manufacturing of integrated circuits, where precise control of the focus and exposure dose parameters is vital for optimal results. Conventional methodologies for defining lithography process windows often struggle with managing measurement errors, detecting printed defects, and exploiting visual features from Scanning Electron Microscope (SEM) images. This paper proposes LithoPW, a novel framework that utilizes visual features of SEM images to determine process windows. The approach comprises a denoising module, a Transformer-based visual memory encoder, and a defect-aware process window optimization module. The denoising module incorporates a Transformer architecture to mitigate the impact of noise, thereby helping downstream tasks exploit the information embedded within SEM images. The Transformer-based visual memory encoder treats each SEM image as a Query and maintains neighbouring SEM images in memory as Key and Value elements, thereby facilitating precise lithography quality classification for the query image. The defect-aware process window optimization module improves the reliability of the results by adjusting the process window according to the defects identified within the SEM images. Experimental results confirm the efficacy of our framework, highlighting its promising application in lithography production for accurate process window determination.
Article
Full-text available
Despite some successful applications of goal-driven navigation, existing deep reinforcement learning (DRL)-based approaches notoriously suffer from poor data efficiency. One of the reasons is that the goal information is decoupled from the perception module and directly introduced as a condition of decision-making, so that goal-irrelevant features of the scene representation play an adversarial role during the learning process. In light of this, we present a novel Goal-guided Transformer-enabled reinforcement learning (GTRL) approach that treats the physical goal states as an input to the scene encoder, guiding the scene representation to couple with the goal information and enabling efficient autonomous navigation. More specifically, we propose a novel variant of the Vision Transformer as the backbone of the perception system, namely the Goal-guided Transformer (GoT), and pre-train it with expert priors to boost data efficiency. Subsequently, a reinforcement learning algorithm is instantiated for the decision-making system, taking the goal-oriented scene representation from the GoT as input and generating decision commands. As a result, our approach encourages the scene representation to concentrate mainly on goal-relevant features, which substantially enhances the data efficiency of the DRL learning process, leading to superior navigation performance. Both simulation and real-world experimental results manifest the superiority of our approach in terms of data efficiency, performance, robustness, and sim-to-real generalization, compared with other state-of-the-art (SOTA) baselines. The demonstration video (https://www.youtube.com/watch?v=aqJCHcsj4w0) and the source code (https://github.com/OscarHuangWind/DRL-Transformer-SimtoReal-Navigation) are also provided.
Article
Full-text available
Reinforcement Learning (RL) can be considered as a sequence modeling task, where an agent employs a sequence of past state-action-reward experiences to predict a sequence of future actions. In this work, we propose the State-Action-Reward Transformer (StARformer), a Transformer architecture for robot learning with image inputs, which explicitly models short-term state-action-reward representations (StAR-representations), essentially introducing a Markovian-like inductive bias to improve long-term modeling. StARformer first extracts StAR-representations using self-attending patches of image states, action, and reward tokens within a short temporal window. These StAR-representations are combined with pure image state representations, extracted as convolutional features, to perform self-attention over the whole sequence. Our experimental results show that StARformer outperforms the state-of-the-art Transformer-based method on image-based Atari and DeepMind Control Suite benchmarks, under both offline-RL and imitation learning settings. We find that models can benefit from our combination of patch-wise and convolutional image embeddings. StARformer also handles longer input sequences better than the baseline method. Finally, we demonstrate how StARformer can be successfully applied to a real-world robot imitation learning setting via a human-following task.
Article
Full-text available
Safety is critical to broadening the real-world use of reinforcement learning. Modeling the safety aspects using a safety-cost signal separate from the reward and bounding the expected safety cost is becoming standard practice, since it avoids the problem of finding a good balance between safety and performance. However, it can be risky to set constraints only on the expectation while neglecting the tail of the distribution, which might take prohibitively large values. In this paper, we propose a method called Worst-Case Soft Actor Critic for safe RL that approximates the distribution of accumulated safety costs to achieve risk control. More specifically, a certain level of conditional Value-at-Risk from the distribution is regarded as a safety constraint, which guides the change of adaptive safety weights to achieve a trade-off between reward and safety. As a result, we can compute policies whose worst-case performance satisfies the constraints. We investigate two ways to estimate the safety-cost distribution, namely a Gaussian approximation and a quantile regression algorithm. The Gaussian approximation is simple and easy to implement but may underestimate the safety cost, whereas quantile regression leads to more conservative behavior. The empirical analysis shows that the quantile regression method achieves excellent results in complex safety-constrained environments, showing good risk control.
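The tail-risk constraint described above can be illustrated with a short sketch: given sampled (or quantile-estimated) accumulated safety costs, the conditional Value-at-Risk at level α is the mean of the worst α-fraction, and an adaptive weight grows whenever that CVaR exceeds the budget. This is a minimal illustration of the general idea, not the authors' algorithm; the variable names, the synthetic costs, and the simple dual-ascent update are assumptions.

```python
import numpy as np

def cvar(costs, alpha=0.1):
    """Conditional Value-at-Risk: mean of the worst alpha-fraction of costs."""
    costs = np.sort(np.asarray(costs))
    k = max(1, int(np.ceil(alpha * len(costs))))
    return costs[-k:].mean()

# Hypothetical adaptive safety weight (Lagrange-style dual ascent).
safety_weight, budget, lr = 1.0, 25.0, 0.05
episode_costs = np.random.gamma(shape=2.0, scale=10.0, size=256)  # stand-in rollout costs
violation = cvar(episode_costs, alpha=0.1) - budget
safety_weight = max(0.0, safety_weight + lr * violation)  # raise the weight if tail risk is too high
print(f"CVaR_0.1 = {cvar(episode_costs, 0.1):.2f}, safety weight -> {safety_weight:.3f}")
```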
Article
Full-text available
Background: Cardiac MRI is limited by long acquisition times, yet faster acquisition of smaller-matrix images reduces spatial detail. Deep learning (DL) might enable both faster acquisition and higher spatial detail via super-resolution.
Purpose: To explore the feasibility of using DL to enhance spatial detail from small-matrix MRI acquisitions and evaluate its performance against that of conventional image upscaling methods.
Materials and Methods: Short-axis cine cardiac MRI examinations performed between January 2012 and December 2018 at one institution were retrospectively collected for algorithm development and testing. Convolutional neural networks (CNNs), a form of DL, were trained to perform super-resolution in image space by using synthetically generated low-resolution data. There were 70%, 20%, and 10% of examinations allocated to training, validation, and test sets, respectively. CNNs were compared against bicubic interpolation and Fourier-based zero padding by calculating the structural similarity index (SSIM) between high-resolution ground truth and each upscaling method. Means and standard deviations of the SSIM were reported, and statistical significance was determined by using the Wilcoxon signed-rank test. For evaluation of clinical performance, left ventricular volumes were measured, and statistical significance was determined by using the paired Student t test.
Results: For CNN training and retrospective analysis, 400 MRI scans from 367 patients (mean age, 48 years ± 18; 214 men) were included. All CNNs outperformed zero padding and bicubic interpolation at upsampling factors from two to 64 (P < .001). CNNs outperformed zero padding on more than 99.2% of slices (9828 of 9907). In addition, 10 patients (mean age, 51 years ± 22; seven men) were prospectively recruited for super-resolution MRI. Super-resolved low-resolution images yielded left ventricular volumes comparable to those from full-resolution images (P > .05), and super-resolved full-resolution images appeared to further enhance anatomic detail.
Conclusion: Deep learning outperformed conventional upscaling methods and recovered high-frequency spatial information. Although training was performed only on short-axis cardiac MRI examinations, the proposed strategy appeared to improve quality in other imaging planes. © RSNA, 2020. Online supplemental material is available for this article.
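The evaluation protocol above, comparing an upscaling method against high-resolution ground truth with SSIM, can be sketched generically as follows. This uses scikit-image, synthetic data, and bicubic interpolation as a placeholder for a trained super-resolution CNN; it is not the authors' pipeline or data.

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim
from skimage.transform import resize

# Synthetic stand-in for a high-resolution ground-truth slice.
hr = np.random.rand(256, 256).astype(np.float32)

# Simulate a low-resolution acquisition, then upscale back with bicubic interpolation (order=3).
lr = resize(hr, (64, 64), order=3, anti_aliasing=True)
bicubic_up = resize(lr, hr.shape, order=3, anti_aliasing=False)

# A trained super-resolution CNN would replace this call; here bicubic is reused as a placeholder.
cnn_up = bicubic_up

print("SSIM (bicubic):", ssim(hr, bicubic_up, data_range=1.0))
print("SSIM (CNN placeholder):", ssim(hr, cnn_up, data_range=1.0))
```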
Article
Full-text available
Unmanned aerial vehicles (UAVs) have the potential to deliver internet of things (IoT) services from a great height, creating an airborne domain of the IoT. In this work, we address the problem of autonomous UAV navigation in large-scale complex environments by formulating it as a Markov decision process with sparse rewards, and propose an algorithm named deep reinforcement Learning with non-expert Helpers (LwH). In contrast to prior reinforcement learning-based methods that put huge effort into reward shaping, we adopt a sparse reward scheme, i.e., a UAV is rewarded if and only if it completes the navigation task. Using the sparse reward scheme ensures that the solution is not biased towards potentially sub-optimal directions. However, having no intermediate rewards hinders efficient learning, since informative states are rarely encountered. To handle this challenge, we assume that a prior policy (non-expert helper), which may perform poorly, is available to the learning agent. The prior policy guides the agent in exploring the state space by reshaping the behavior policy used for environment interaction. It also assists the agent in achieving goals by setting dynamic learning objectives of increasing difficulty. To evaluate our proposed method, we construct a simulator for UAV navigation in large-scale complex environments and compare our algorithm with several baselines. Experimental results demonstrate that LwH significantly outperforms the state-of-the-art algorithms handling sparse rewards, and yields impressive navigation policies comparable to those learned in the environment with dense rewards.
Article
Full-text available
A long-standing goal of artificial intelligence is an algorithm that learns, tabula rasa, superhuman proficiency in challenging domains. Recently, AlphaGo became the first program to defeat a world champion in the game of Go. The tree search in AlphaGo evaluated positions and selected moves using deep neural networks. These neural networks were trained by supervised learning from human expert moves, and by reinforcement learning from self-play. Here we introduce an algorithm based solely on reinforcement learning, without human data, guidance or domain knowledge beyond game rules. AlphaGo becomes its own teacher: a neural network is trained to predict AlphaGo's own move selections and also the winner of AlphaGo's games. This neural network improves the strength of the tree search, resulting in higher quality move selection and stronger self-play in the next iteration. Starting tabula rasa, our new program AlphaGo Zero achieved superhuman performance, winning 100-0 against the previously published, champion-defeating AlphaGo. © 2017 Macmillan Publishers Limited, part of Springer Nature. All rights reserved.
Article
Full-text available
In recent years, significant progress has been made in solving challenging problems across various domains using deep reinforcement learning (RL). Reproducing existing work and accurately judging the improvements offered by novel methods is vital to maintaining this rapid progress. Unfortunately, reproducing results for state-of-the-art deep RL methods is seldom straightforward. In particular, non-determinism in standard benchmark environments, combined with variance intrinsic to the methods, can make reported results difficult to interpret. Without significance metrics and tighter standardization of experimental reporting, it is difficult to determine whether improvements over the prior state-of-the-art are meaningful. In this paper, we investigate challenges posed by reproducibility, proper experimental techniques, and reporting procedures. We illustrate the variability in reported metrics and results when comparing against common baselines, and suggest guidelines to make future results in deep RL more reproducible. We aim to spur discussion about how to ensure continued progress in the field, by minimizing wasted effort stemming from results that are non-reproducible and easily misinterpreted.
Article
Full-text available
In this paper, we study the problem of designing objective functions for machine learning problems defined on finite sets. In contrast to traditional objective functions defined for machine learning problems operating on finite-dimensional vectors, the new objective functions we propose operate on finite sets and are invariant to permutations. Such problems are widespread, ranging from estimation of population statistics, via anomaly detection in piezometer data of embankment dams, to cosmology. Our main theorem characterizes the permutation-invariant objective functions and provides a family of functions to which any permutation-invariant objective function must belong. This family of functions has a special structure which enables us to design a deep network architecture that can operate on sets and which can be deployed in a variety of scenarios including both unsupervised and supervised learning tasks. We demonstrate the applicability of our method on population statistic estimation, point cloud classification, set expansion, and image tagging.
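The permutation-invariant family characterized in that work has the form ρ(Σᵢ φ(xᵢ)): embed each set element, pool with a permutation-invariant sum, then transform. Below is a minimal PyTorch sketch of such a network; the layer sizes and input dimensions are arbitrary assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn

class DeepSet(nn.Module):
    """Permutation-invariant set function of the form rho(sum_i phi(x_i))."""
    def __init__(self, in_dim=3, hidden=64, out_dim=1):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU())
        self.rho = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, out_dim))

    def forward(self, x):                      # x: (batch, set_size, in_dim)
        pooled = self.phi(x).sum(dim=1)        # sum over set elements -> permutation invariant
        return self.rho(pooled)

net = DeepSet()
points = torch.randn(2, 100, 3)                # e.g. two point clouds of 100 points each
perm = points[:, torch.randperm(100), :]       # permute the elements of each set
assert torch.allclose(net(points), net(perm), atol=1e-5)  # output is unchanged under permutation
```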
Article
Full-text available
The game of Go has long been viewed as the most challenging of classic games for artificial intelligence owing to its enormous search space and the difficulty of evaluating board positions and moves. Here we introduce a new approach to computer Go that uses ‘value networks’ to evaluate board positions and ‘policy networks’ to select moves. These deep neural networks are trained by a novel combination of supervised learning from human expert games, and reinforcement learning from games of self-play. Without any lookahead search, the neural networks play Go at the level of state-of-the-art Monte Carlo tree search programs that simulate thousands of random games of self-play. We also introduce a new search algorithm that combines Monte Carlo simulation with value and policy networks. Using this search algorithm, our program AlphaGo achieved a 99.8% winning rate against other Go programs, and defeated the human European Go champion by 5 games to 0. This is the first time that a computer program has defeated a human professional player in the full-sized game of Go, a feat previously thought to be at least a decade away.
Article
Full-text available
We adapt the ideas underlying the success of Deep Q-Learning to the continuous action domain. We present an actor-critic, model-free algorithm based on the deterministic policy gradient that can operate over continuous action spaces. Using the same learning algorithm, network architecture and hyper-parameters, our algorithm robustly solves more than 20 simulated physics tasks, including classic problems such as cartpole swing-up, dexterous manipulation, legged locomotion and car driving. Our algorithm is able to find policies whose performance is competitive with those found by a planning algorithm with full access to the dynamics of the domain and its derivatives. We further demonstrate that for many of the tasks the algorithm can learn policies end-to-end: directly from raw pixel inputs.
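A compressed sketch of the actor-critic update described above (deterministic actor, Q-critic) is given below. It is schematic only: target networks, the replay buffer, and exploration noise are omitted, and all shapes and hyperparameters are assumptions rather than the published configuration.

```python
import torch
import torch.nn as nn

obs_dim, act_dim = 8, 2
actor  = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))
actor_opt  = torch.optim.Adam(actor.parameters(),  lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
gamma = 0.99

def ddpg_step(s, a, r, s2, done):
    # Critic: regress Q(s, a) toward r + gamma * Q(s', actor(s')) (target networks omitted here).
    with torch.no_grad():
        target = r + gamma * (1 - done) * critic(torch.cat([s2, actor(s2)], dim=-1))
    q = critic(torch.cat([s, a], dim=-1))
    critic_loss = ((q - target) ** 2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: ascend the critic, i.e. minimize -Q(s, actor(s)).
    actor_loss = -critic(torch.cat([s, actor(s)], dim=-1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

batch = 32
ddpg_step(torch.randn(batch, obs_dim), torch.rand(batch, act_dim) * 2 - 1,
          torch.randn(batch, 1), torch.randn(batch, obs_dim), torch.zeros(batch, 1))
```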
Article
Full-text available
This paper investigates domain generalization: How to take knowledge acquired from an arbitrary number of related domains and apply it to previously unseen domains? We propose Domain-Invariant Component Analysis (DICA), a kernel-based optimization algorithm that learns an invariant transformation by minimizing the dissimilarity across domains, whilst preserving the functional relationship between input and output variables. A learning-theoretic analysis shows that reducing dissimilarity improves the expected generalization ability of classifiers on new domains, motivating the proposed algorithm. Experimental results on synthetic and real-world datasets demonstrate that DICA successfully learns invariant features and improves classifier performance in practice.
Conference Paper
Full-text available
We introduce the first temporal-difference learning algorithms that converge with smooth value function approximators, such as neural networks. Conventional temporal-difference (TD) methods, such as TD(lambda), Q-learning and Sarsa have been used successfully with function approximation in many applications. However, it is well known that off-policy sampling, as well as non-linear function approximation, can cause these algorithms to become unstable (i.e., the parameters of the approximator may diverge). Sutton et al (2009a,b) solved the problem of off-policy learning with linear TD algorithms by introducing a new objective function, related to the Bellman-error, and algorithms that perform stochastic gradient-descent on this function. In this paper, we generalize their work to non-linear function approximation. We present a Bellman error objective function and two gradient-descent TD algorithms that optimize it. We prove the asymptotic almost-sure convergence of both algorithms for any finite Markov decision process and any smooth value function approximator, under usual stochastic approximation conditions. The computational complexity per iteration scales linearly with the number of parameters of the approximator. The algorithms are incremental and are guaranteed to converge to locally optimal solutions.
Article
Interactive visual navigation (IVN) involves tasks where embodied agents learn to interact with the objects in the environment to reach their goals. Current approaches exploit visual features to train a reinforcement learning (RL) navigation control policy network. However, RL-based methods continue to struggle at IVN tasks, as they are inefficient at learning a good representation of the unknown environment in partially observable settings. In this work, we introduce prediction of task-related latents (PTRL), a flexible self-supervised RL framework for IVN tasks. PTRL learns latent structured information about environment dynamics and leverages multistep representations of the sequential observations. Specifically, PTRL trains its representation by explicitly predicting the next pose of the agent conditioned on the actions. Moreover, an attention and memory module is employed to associate the learned representation with each action and exploit spatiotemporal dependencies. Furthermore, a state value boost module is introduced to adapt the model to previously unseen environments by leveraging input perturbations and regularizing the value function. Sample efficiency in the training of RL networks is enhanced by modular training and hierarchical decomposition. Extensive evaluations demonstrate the superiority of the proposed method in accuracy and generalization capacity.
Article
The efficiency of a robotic system is primarily determined by its ability to navigate complex and interactive environments. In real-world scenarios, cluttered surroundings are common, requiring a robot to navigate diverse spaces and displace objects to pave a path towards its objective. Consequently, “Visual Interactive Navigation” presents several challenges, including how to retain historical exploration information from partially observable visual signals, and how to utilize sparse rewards in reinforcement learning to simultaneously learn a latent representation and a control policy. Addressing these challenges, we introduce a Transformer-based Visual Memory Encoder (VME-Transformer), capable of embedding both recent and long-term exploration information into memory. Additionally, we explicitly estimate the robot's next pose, conditioned on the impending action, to bootstrap the learning process of the high-capacity VME-Transformer. We further regularize the value function by introducing input perturbations, thereby enhancing its generalization capabilities in previously unseen environments. In the Visual Interactive Navigation tasks within the iGibson environment, the VME-Transformer demonstrates superior performance compared to state-of-the-art methods, underlining its effectiveness.
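The auxiliary objective mentioned above, predicting the robot's next pose conditioned on the impending action, can be illustrated with a small regression head on top of the memory encoder's features. The sketch below is one plausible form under assumed dimensions (encoder feature, 2-D action, pose as x, y, yaw); it is not the authors' network.

```python
import torch
import torch.nn as nn

feat_dim, act_dim, pose_dim = 128, 2, 3   # assumed sizes: encoder feature, action, (x, y, yaw)

class PosePredictionHead(nn.Module):
    """Auxiliary head: predict the next pose from the current memory feature and the action."""
    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(feat_dim + act_dim, 128), nn.ReLU(),
                                 nn.Linear(128, pose_dim))

    def forward(self, feature, action):
        return self.mlp(torch.cat([feature, action], dim=-1))

head = PosePredictionHead()
feature = torch.randn(16, feat_dim)        # stand-in for the visual memory encoder's output
action = torch.randn(16, act_dim)
next_pose_gt = torch.randn(16, pose_dim)   # ground-truth next pose from the simulator
aux_loss = nn.functional.mse_loss(head(feature, action), next_pose_gt)
aux_loss.backward()                         # this gradient bootstraps the encoder's representation
```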
Article
This paper presents a data-driven approach that adaptively tunes the parameters of a virtual synchronous generator to achieve optimal frequency response against disturbances. In the proposed approach, the control variables, namely, the virtual moment of inertia and damping factor, are transformed into actions of a reinforcement learning agent. Different from the state-of-the-art methods, the proposed study introduces the settling time parameter as one of the observations in addition to the frequency and rate of change of frequency (RoCoF). In the reward function, preset indices are considered to simultaneously ensure bounded frequency deviation, low RoCoF, fast response, and quick settling time. To maximize the reward, this study employs the Twin-Delayed Deep Deterministic Policy Gradient (TD3) algorithm. TD3 has an exceptional capacity for learning optimal policies and is free of overestimation bias, which may lead to suboptimal policies. Finally, numerical validation in MATLAB/Simulink and real-time simulation using RTDS confirm the superiority of the proposed method over other adaptive tuning methods.
Article
This paper introduces a novel reference tracking control approach implemented using a combination of the Actor-Critic Reinforcement Learning (RL) framework and the Grey Wolf Optimizer (GWO) algorithm. The classical neural network (NN)-based implementation of the Critic, optimized with the Gradient Descent (GD) algorithm, is replaced with the GWO algorithm, aiming to eliminate the main drawbacks of the GD algorithm, i.e., slow convergence and the tendency to get stuck in local optimal values. The combined effort from multiple search agents and the random values involved in the search process make the GWO algorithm very efficient in exploring the solution space and finding global optimal solutions. The main objective of the proposed approach is to build a NN-based controller capable of solving an optimal reference tracking control problem on nonlinear servo system laboratory equipment. The training data needed to build the controller is collected while the actor learns how to control the servo system, using the GWO-based critic to monitor the process and step in to correct the actor when needed. A comparison study is performed across three online RL-based control approaches, namely the novel approach using GWO to implement the Critic in the Actor-Critic RL framework, the traditional approach using NNs with GD for optimization and another approach using a metaheuristic algorithm called Particle Swarm Optimization (PSO). The experimental results illustrate the superiority of the proposed approach over the competing ones.
Article
Substantial progress has been achieved in embodied visual navigation based on reinforcement learning (RL). These studies presume that the environment is stationary and all obstacles are static. However, in real cluttered scenes, interactable objects (e.g. shoes and boxes) blocking the way of robots make the environment non-stationary. Accordingly, the ego-centric visual agent easily gets stuck in the dilemma of finding the next waypoint as it struggles to decide whether to push the obstacles ahead. To handle this predicament, we formulate interactive visual navigation as a Partially Observable Markov Decision Process (POMDP). As the transformer encoder has demonstrated a superior ability to capture spatial-temporal dependencies in natural language processing, we propose a transformer-based memory that empowers the agent to utilize historical interaction information. However, directly leveraging the transformer architecture in RL settings is highly unstable. We further propose a surrogate objective that predicts the next waypoint as an auxiliary task, which facilitates representation learning and bootstraps the RL. We demonstrate our method in the iGibson environment, and experimental results show a significant improvement over the interactive Gibson benchmark and the related recurrent RL policy, both in the validation seen scenes and the test unseen scenes.
Article
End-to-end approaches are among the most promising solutions for autonomous vehicle (AV) decision-making. However, the deployment of these technologies is usually constrained by the high computational burden. To alleviate this problem, we propose a lightweight transformer-based end-to-end model with risk awareness for AV decision-making. Specifically, a lightweight network with depth-wise separable convolution and transformer modules is first proposed for image semantic extraction from time sequences of trajectory data. Then, we assess driving risk with a probabilistic model that accounts for position uncertainty. This model is integrated into deep reinforcement learning (DRL) to find strategies with minimum expected risk. Finally, the proposed method is evaluated in three lane-change scenarios to validate its superiority.
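The depth-wise separable convolution used in such lightweight image branches factorizes a standard convolution into a per-channel spatial filter followed by a 1x1 pointwise mixing layer, cutting the parameter count substantially. The PyTorch sketch below is a generic block of this kind, not the authors' exact module; channel sizes are assumptions.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise (groups=in_channels) convolution followed by a 1x1 pointwise convolution."""
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size, stride,
                                   padding=kernel_size // 2, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

x = torch.randn(1, 32, 64, 64)
y = DepthwiseSeparableConv(32, 64)(x)   # ~9*32 + 32*64 weights vs 9*32*64 for a standard 3x3 conv
print(y.shape)                          # torch.Size([1, 64, 64, 64])
```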
Article
Pickup and delivery problems with late penalties can be used to model a wide range of practical situations in transportation and logistics. However, the restrictions on the multiple vehicles' service sequences and the non-linearity caused by the late penalties make this problem time-consuming to solve. To overcome this difficulty, we propose a novel reinforcement learning framework, inspired by the transformer architecture, that generates tours instantly after offline training. This framework, trained through the policy gradient method, consists of an information encoder process, which extracts the coupling relationships among the pickup and delivery customers, and a decoder process with a multi-vehicle attention network, which allocates reasonable orders to each vehicle. Validated on the Sioux Falls network, the proposed method yields an improvement of 2.4%-8.0% in solution quality compared with Google OR-Tools and several heuristic algorithms. Notably, the baselines require dozens of minutes to reach an inferior result on the case with 100 customers, while the well-trained model based on our method can be deployed to provide a high-quality solution within seconds. Furthermore, the proposed model also shows good generalization ability across scenarios of various scales, and the obtained results are robust to fluctuations in travel time.
Article
One program to rule them all. Computers can beat humans at increasingly complex games, including chess and Go. However, these programs are typically constructed for a particular game, exploiting its properties, such as the symmetries of the board on which it is played. Silver et al. developed a program called AlphaZero, which taught itself to play Go, chess, and shogi (a Japanese version of chess) (see the Editorial and the Perspective by Campbell). AlphaZero managed to beat state-of-the-art programs specializing in these three games. The ability of AlphaZero to adapt to various game rules is a notable step toward achieving a general game-playing system. Science, this issue p. 1140; see also pp. 1087 and 1118.
Conference Paper
Dynamic treatment recommendation systems based on large-scale electronic health records (EHRs) have become key to improving practical clinical outcomes. Prior studies recommend treatments using either supervised learning (e.g., matching the indicator signal that denotes doctor prescriptions) or reinforcement learning (e.g., maximizing an evaluation signal that indicates cumulative reward from survival rates). However, none of these studies has considered combining the benefits of supervised learning and reinforcement learning. In this paper, we propose Supervised Reinforcement Learning with Recurrent Neural Network (SRL-RNN), which fuses them into a synergistic learning framework. Specifically, SRL-RNN applies an off-policy actor-critic framework to handle complex relations among multiple medications, diseases, and individual characteristics. The "actor" in the framework is adjusted by both the indicator signal and the evaluation signal to ensure effective prescription and low mortality. An RNN is further utilized to solve the Partially Observed Markov Decision Process (POMDP) problem caused by the lack of fully observed states in real-world applications. Experiments on the publicly available real-world dataset MIMIC-3 illustrate that our model can reduce estimated mortality while providing promising accuracy in matching doctors' prescriptions.
Article
Model-free deep reinforcement learning has been shown to exhibit good performance in domains ranging from video games to simulated robotic manipulation and locomotion. However, model-free methods are known to perform poorly when the interaction time with the environment is limited, as is the case for most real-world robotic tasks. In this paper, we study how maximum entropy policies trained using soft Q-learning can be applied to real-world robotic manipulation. The application of this method to real-world manipulation is facilitated by two important features of soft Q-learning. First, soft Q-learning can learn multimodal exploration strategies by learning policies represented by expressive energy-based models. Second, we show that policies learned with soft Q-learning can be composed to create new policies, and that the optimality of the resulting policy can be bounded in terms of the divergence between the composed policies. This compositionality provides an especially valuable tool for real-world manipulation, where constructing new policies by composing existing skills can provide a large gain in efficiency over training from scratch. Our experimental evaluation demonstrates that soft Q-learning is substantially more sample efficient than prior model-free deep reinforcement learning methods, and that compositionality can be performed for both simulated and real-world tasks.
Article
In value-based reinforcement learning methods such as deep Q-learning, function approximation errors are known to lead to overestimated value estimates and suboptimal policies. We show that this problem persists in an actor-critic setting and propose novel mechanisms to minimize its effects on both the actor and critic. Our algorithm takes the minimum value between a pair of critics to restrict overestimation and delays policy updates to reduce per-update error. We evaluate our method on the suite of OpenAI gym tasks, outperforming the state of the art in every environment tested.
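The two mechanisms described above, taking the minimum over a pair of critics and delaying policy updates, can be sketched as follows. The sketch shows only the clipped double-Q target computation; target-network maintenance and training loops are omitted, and the noise scales, action bounds, and function names are assumptions.

```python
import torch

def clipped_double_q_target(reward, not_done, next_state, actor_target,
                            critic1_target, critic2_target,
                            gamma=0.99, policy_noise=0.2, noise_clip=0.5):
    """Clipped double-Q target: perturb the target action, then take the min of two critics."""
    mu = actor_target(next_state)
    noise = (torch.randn_like(mu) * policy_noise).clamp(-noise_clip, noise_clip)
    next_action = (mu + noise).clamp(-1.0, 1.0)                 # assumes actions normalized to [-1, 1]
    sa = torch.cat([next_state, next_action], dim=-1)
    q_next = torch.min(critic1_target(sa), critic2_target(sa))  # the min restricts overestimation
    return reward + gamma * not_done * q_next

# Delayed policy updates: the actor (and target networks) are refreshed only every few critic
# steps, e.g.  if step % policy_delay == 0: update_actor_and_soft_update_targets()
```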
Article
Model-free deep reinforcement learning (RL) algorithms have been demonstrated on a range of challenging decision making and control tasks. However, these methods typically suffer from two major challenges: very high sample complexity and brittle convergence properties, which necessitate meticulous hyperparameter tuning. Both of these challenges severely limit the applicability of such methods to complex, real-world domains. In this paper, we propose soft actor-critic, an off-policy actor-critic deep RL algorithm based on the maximum entropy reinforcement learning framework. In this framework, the actor aims to maximize expected reward while also maximizing entropy - that is, succeed at the task while acting as randomly as possible. Prior deep RL methods based on this framework have been formulated as Q-learning methods. By combining off-policy updates with a stable stochastic actor-critic formulation, our method achieves state-of-the-art performance on a range of continuous control benchmark tasks, outperforming prior on-policy and off-policy methods. Furthermore, we demonstrate that, in contrast to other off-policy algorithms, our approach is very stable, achieving very similar performance across different random seeds.
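The maximum-entropy objective described above augments the reward with an entropy bonus. A minimal sketch of the resulting soft value target used by the critic follows, assuming a stochastic policy object whose `sample` method returns an action and its log-probability, twin target critics, and a fixed temperature; all of these names and values are illustrative assumptions.

```python
import torch

def soft_q_target(reward, not_done, next_state, policy, q1_target, q2_target,
                  gamma=0.99, alpha=0.2):
    """Soft Bellman target: r + gamma * (min_i Q_i(s', a') - alpha * log pi(a'|s'))."""
    next_action, log_prob = policy.sample(next_state)   # assumed interface: action and log-probability
    sa = torch.cat([next_state, next_action], dim=-1)
    q_next = torch.min(q1_target(sa), q2_target(sa))
    return reward + gamma * not_done * (q_next - alpha * log_prob)  # entropy bonus enters the target
```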
Article
Model-free deep reinforcement learning methods have successfully learned complex behavioral strategies for a wide range of tasks, but typically require many samples to achieve good performance. Model-based algorithms, in principle, can provide for much more efficient learning, but have proven difficult to extend to expressive, high-capacity models such as deep neural networks. In this work, we demonstrate that medium-sized neural network models can in fact be combined with model predictive control to achieve excellent sample complexity in a model-based reinforcement learning algorithm, producing stable and plausible gaits to accomplish various complex locomotion tasks. We also propose using deep neural network dynamics models to initialize a model-free learner, in order to combine the sample efficiency of model-based approaches with the high task-specific performance of model-free methods. We perform this pre-initialization by using rollouts from the trained model-based controller as supervision to pre-train a policy, and then fine-tune the policy using a model-free method. We empirically demonstrate that this resulting hybrid algorithm can drastically accelerate model-free learning and outperform purely model-free learners on several MuJoCo locomotion benchmark tasks, achieving sample efficiency gains over a purely model-free learner of 330x on swimmer, 26x on hopper, 4x on half-cheetah, and 3x on ant. Videos can be found at https://sites.google.com/view/mbmf
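The model-predictive control loop around a learned dynamics model can be sketched as random-shooting MPC: sample candidate action sequences, roll them out through the learned model, and execute only the first action of the best sequence. This is a generic illustration under assumed interfaces; the dynamics model, reward function, horizon, and candidate count are stand-ins, not the authors' configuration.

```python
import torch

def random_shooting_mpc(state, dynamics_model, reward_fn, act_dim,
                        horizon=15, num_candidates=1000):
    """Return the first action of the best randomly sampled action sequence under the learned model."""
    actions = torch.rand(num_candidates, horizon, act_dim) * 2 - 1      # candidate sequences in [-1, 1]
    states = state.unsqueeze(0).expand(num_candidates, -1)              # state: 1-D tensor (state_dim,)
    returns = torch.zeros(num_candidates)
    for t in range(horizon):
        next_states = dynamics_model(states, actions[:, t])             # assumed learned model s' = f(s, a)
        returns += reward_fn(states, actions[:, t], next_states)        # assumed known/learned reward
        states = next_states
    best = returns.argmax()
    return actions[best, 0]                                             # execute only the first action
```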
Article
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.0 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
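The core of the attention mechanism referenced above is scaled dot-product attention, softmax(QKᵀ/√d_k)V. A minimal single-head sketch without masking follows; the tensor sizes are arbitrary assumptions chosen to echo the query/key-value memory usage elsewhere in this listing.

```python
import math
import torch

def scaled_dot_product_attention(query, key, value):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = query.size(-1)
    scores = query @ key.transpose(-2, -1) / math.sqrt(d_k)   # (batch, len_q, len_k)
    weights = torch.softmax(scores, dim=-1)                   # attention weights over the keys
    return weights @ value

q = torch.randn(2, 10, 64)   # e.g. one query per token or image in the sequence
k = torch.randn(2, 20, 64)   # keys from the memory of neighbouring items
v = torch.randn(2, 20, 64)
out = scaled_dot_product_attention(q, k, v)
print(out.shape)             # torch.Size([2, 10, 64])
```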
Conference Paper
Parcel singulation is one of the key technologies in automated logistics processes. To manipulate parcels independently, we propose to use a sparse actuator array, which reduces cost but makes singulation control difficult because actuators are shared among parcels. In this paper, an intuitive control scheme is introduced to line up a bundle of parcels in order of priority, with the assignment function learned from a large number of simulated episodes. Numerical simulation validates the proposed algorithm, and the results show that learning improves the singulation performance.
Conference Paper
Recently, researchers have made significant progress combining the advances in deep learning for learning feature representations with reinforcement learning. Some notable examples include training agents to play Atari games based on raw pixel data and to acquire advanced manipulation skills using raw sensory inputs. However, it has been difficult to quantify progress in the domain of continuous control due to the lack of a commonly adopted benchmark. In this work, we present a benchmark suite of continuous control tasks, including classic tasks like cart-pole swing-up, tasks with very high state and action dimensionality such as 3D humanoid locomotion, tasks with partial observations, and tasks with hierarchical structure. We report novel findings based on the systematic evaluation of a range of implemented reinforcement learning algorithms. Both the benchmark and reference implementations are released open-source in order to facilitate experimental reproducibility and to encourage adoption by other researchers.
Article
We study dynamical mass measurements of galaxy clusters contaminated by interlopers and show that a modern machine learning (ML) algorithm can predict masses by better than a factor of two compared to a standard scaling relation approach. We create two mock catalogs from Multidark's publicly-available N-body MDPL1 simulation, one with perfect galaxy cluster membership information and the other where a simple cylindrical cut around the cluster center allows interlopers to contaminate the clusters. In the standard approach, we use a power law scaling relation to infer cluster mass from galaxy line of sight (LOS) velocity dispersion. Assuming perfect membership knowledge, this unrealistic case produces a wide fractional mass error distribution, with width = 0.87. Interlopers introduce additional scatter, significantly widening the error distribution further (width = 2.13). We employ the Support Distribution Machine (SDM) class of algorithms to learn from distributions of data to predict single values. Applied to distributions of galaxy observables such as LOS velocity and projected distance from the cluster center, SDM yields better than a factor-of-two improvement (width = 0.67). Remarkably, SDM applied to contaminated clusters is better able to recover masses than even the scaling relation approach applied to uncontaminated clusters. We show that the SDM method more accurately reproduces the cluster mass function, making it a valuable tool for employing cluster observations to evaluate cosmological models.
Article
In this paper we consider deterministic policy gradient algorithms for reinforcement learning with continuous actions. The deterministic policy gradient has a particularly appealing form: it is the expected gradient of the action-value function. This simple form means that the deterministic policy gradient can be estimated much more efficiently than the usual stochastic policy gradient. To ensure adequate exploration, we introduce an off-policy actor-critic algorithm that learns a deterministic target policy from an exploratory behaviour policy. We demonstrate that deterministic policy gradient algorithms can significantly outperform their stochastic counterparts in high-dimensional action spaces.
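The "particularly appealing form" mentioned above is the deterministic policy gradient theorem, which writes the gradient of the objective for a deterministic policy μ_θ as the expected gradient of the action-value function along the policy:

```latex
\nabla_{\theta} J(\mu_{\theta})
  = \mathbb{E}_{s \sim \rho^{\mu}}\!\left[
      \nabla_{\theta}\, \mu_{\theta}(s)\,
      \nabla_{a} Q^{\mu}(s, a)\big|_{a = \mu_{\theta}(s)}
    \right]
```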
Article
This paper applies acceleration/deceleration control-based velocity profiles to an infeed control algorithm for a cross-belt-type sorting system to improve the accuracy and performance of the system's infeed. The velocity profiles are of a trapezoidal shape and often have to be modified to ensure that parcels correctly synchronize with their intended carriers. Under the proposed method, an infeed line can handle up to 5,600 items/h, which indicates a 40% increase in performance in comparison with its existing handling rate of 4,000 items/h. This improvement in performance may lead to a reduction in the number of infeed lines required in a sorting system. The proposed infeed control algorithm is applied to a cross-belt-type sorting system (model name: SCS 1500) manufactured by Vanderlande Industries.
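A trapezoidal velocity profile of the kind used for the infeed control accelerates at a constant rate, cruises at the speed limit, and then decelerates; when the travel distance is too short to reach cruise speed, the profile degenerates into a triangle. The sketch below computes the velocity at time t for given limits. It is a generic kinematic illustration; the numbers are illustrative and are not the SCS 1500 parameters.

```python
import math

def trapezoidal_velocity(t, distance, v_max, accel):
    """Velocity at time t of a trapezoidal (or triangular) profile covering `distance`."""
    t_acc = v_max / accel                      # time to reach cruise speed
    d_acc = 0.5 * accel * t_acc ** 2           # distance covered while accelerating
    if 2 * d_acc > distance:                   # too short to cruise: triangular profile
        t_acc = math.sqrt(distance / accel)
        v_peak, t_cruise = accel * t_acc, 0.0
    else:
        v_peak = v_max
        t_cruise = (distance - 2 * d_acc) / v_max
    t_total = 2 * t_acc + t_cruise
    if t < 0 or t > t_total:
        return 0.0
    if t < t_acc:                              # acceleration phase
        return accel * t
    if t < t_acc + t_cruise:                   # cruise phase
        return v_peak
    return v_peak - accel * (t - t_acc - t_cruise)   # deceleration phase

# Example: 1.2 m to the synchronization point, 2.0 m/s belt limit, 3.0 m/s^2 acceleration.
print(round(trapezoidal_velocity(0.4, 1.2, 2.0, 3.0), 3))
```

Synchronizing a parcel with its intended carrier then amounts to stretching or compressing these phases so the parcel arrives at the hand-over point at the carrier's passing time.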
Article
This article describes the use of principles of reinforcement learning to design feedback controllers for discrete- and continuous-time dynamical systems that combine features of adaptive control and optimal control. Adaptive control [1], [2] and optimal control [3] represent different philosophies for designing feedback controllers. Optimal controllers are normally designed offline by solving Hamilton-Jacobi-Bellman (HJB) equations, for example the Riccati equation, using complete knowledge of the system dynamics. Determining optimal control policies for nonlinear systems requires the offline solution of nonlinear HJB equations, which are often difficult or impossible to solve. By contrast, adaptive controllers learn online to control unknown systems using data measured in real time along the system trajectories. Adaptive controllers are not usually designed to be optimal in the sense of minimizing user-prescribed performance functions. Indirect adaptive controllers use system identification techniques to first identify the system parameters and then use the obtained model to solve optimal design equations [1]. Adaptive controllers may satisfy certain inverse optimality conditions [4].
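For the linear-quadratic special case mentioned above (the Riccati equation), the offline optimal-control design can be reproduced in a few lines with SciPy. This is a textbook illustration under full knowledge of a simple double-integrator model, not the adaptive or online procedures discussed in the article; the cost matrices are arbitrary choices.

```python
import numpy as np
from scipy.linalg import solve_continuous_are

# Double-integrator dynamics x_dot = A x + B u with quadratic cost x'Qx + u'Ru.
A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.array([[1.0]])

P = solve_continuous_are(A, B, Q, R)       # solves A'P + PA - P B R^{-1} B'P + Q = 0
K = np.linalg.solve(R, B.T @ P)            # optimal state-feedback gain, u = -K x
print("LQR gain:", K)
```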
Conference Paper
We focus on the distribution regression problem: regressing to a real-valued response from a probability distribution. Although there exist a large number of similarity measures between distributions, very little is known about their generalization performance in specific learning tasks. Learning problems formulated on distributions have an inherent two-stage sampled difficulty: in practice only samples from sampled distributions are observable, and one has to build an estimate on similarities computed between sets of points. To the best of our knowledge, the only existing method with consistency guarantees for distribution regression requires kernel density estimation as an intermediate step (which suffers from slow convergence issues in high dimensions), and the domain of the distributions to be compact Euclidean. In this paper, we provide theoretical guarantees for a remarkably simple algorithmic alternative to solve the distribution regression problem: embed the distributions to a reproducing kernel Hilbert space, and learn a ridge regressor from the embeddings to the outputs. Our main contribution is to prove the consistency of this technique in the two-stage sampled setting under mild conditions (on separable, topological domains endowed with kernels). For a given total number of observations, we derive convergence rates as an explicit function of the problem difficulty. As a special case, we answer a 15-year-old open question: we establish the consistency of the classical set kernel [Haussler, 1999; Gärtner et al., 2002] in regression, and cover more recent kernels on distributions, including those due to [Christmann and Steinwart, 2010].
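The "remarkably simple alternative" analyzed above amounts to representing each bag of samples by its kernel mean embedding and running kernel ridge regression on those embeddings. The sketch below is a small illustration of that recipe on synthetic bags, with the RBF set kernel computed as the mean of pairwise kernel values; the bandwidth, regularization, and data are assumptions, not the paper's experiments.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def set_kernel(bags_a, bags_b, gamma=1.0):
    """K(X, Y) = mean over pairs of rbf(x_i, y_j): inner product of kernel mean embeddings."""
    return np.array([[rbf_kernel(X, Y, gamma=gamma).mean() for Y in bags_b] for X in bags_a])

rng = np.random.default_rng(0)
# Each "distribution" is observed only through a sample (bag of points); the label is its true mean.
bags = [rng.normal(loc=m, size=(50, 1)) for m in rng.uniform(-2, 2, size=40)]
labels = np.array([bag.mean() for bag in bags])

train_bags, test_bags = bags[:30], bags[30:]
K_train = set_kernel(train_bags, train_bags)
alpha = np.linalg.solve(K_train + 1e-3 * np.eye(len(train_bags)), labels[:30])  # kernel ridge solve
pred = set_kernel(test_bags, train_bags) @ alpha
print("mean absolute error:", np.abs(pred - labels[30:]).mean())
```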
Article
Of several responses made to the same situation, those which are accompanied or closely followed by satisfaction to the animal will, other things being equal, be more firmly connected with the situation, so that, when it recurs, they will be more likely to recur; those which are accompanied or closely followed by discomfort to the animal will, other things being equal, have their connections with that situation weakened, so that, when it recurs, they will be less likely to occur. The greater the satisfaction or discomfort, the greater the strengthening or weakening of the bond. (Thorndike, 1911) The idea of learning to make appropriate responses based on reinforcing events has its roots in early psychological theories such as Thorndike's "law of effect" (quoted above). Although several important contributions were made in the 1950s, 1960s and 1970s by illustrious luminaries such as Bellman, Minsky, Klopf and others (Farley and Clark, 1954; Bellman, 1957; Minsky, 1961; Samuel, 1963; Michie and Chambers, 1968; Grossberg, 1975; Klopf, 1982), the last two decades have witnessed perhaps the strongest advances in the mathematical foundations of reinforcement learning, in addition to several impressive demonstrations of the performance of reinforcement learning algorithms in real world tasks. The introductory book by Sutton and Barto, two of the most influential and recognized leaders in the field, is therefore both timely and welcome. The book is divided into three parts. In the first part, the authors introduce and elaborate on the essential characteristics of the reinforcement learning problem, namely, the problem of learning "policies" or mappings from environmental states to actions so as to maximize the amount of "reward"
An optimistic perspective on offline reinforcement learning
  • Agarwal
Neural network-based control using actor-critic reinforcement learning and grey wolf optimizer with experimental servo system validation
  • Iuliu
VME-transformer: Enhancing visual memory encoding for navigation in interactive environments
  • Shen
Semantic-based padding in convolutional neural networks for improving the performance in natural language processing. A case of study in sentiment analysis
  • Giménez
Scalable deep reinforcement learning for vision-based robotic manipulation
  • Kalashnikov