Conference Paper

Controlling Tiltrotors Unmanned Aerial Vehicles (UAVs) with Deep Reinforcement Learning


... After the introduction of DAGOA, the PPO algorithm lags slightly behind SAC in terms of improvement, although its score improves for every number of users. This may be related to the characteristics of the PPO algorithm itself; for example, it focuses more on strategy optimization using known information and is relatively weak in global search and adaptive tuning [37]. Nevertheless, the addition of DAGOA still significantly improves the performance of PPO in a multi-user environment, suggesting that the performance gains it provides are of general interest. ...
Article
Full-text available
With the continuous progress of UAV technology and the rapid development of mobile edge computing (MEC), UAV-assisted MEC systems have shown great application potential in special fields such as disaster rescue and emergency response. However, traditional deep reinforcement learning (DRL) decision-making methods suffer from limitations such as difficulty in balancing multiple objectives and in achieving training convergence when making mixed-action-space decisions for UAV path planning and task offloading. This article proposes a novel hybrid decision framework based on an improved Dynamic Adaptive Genetic Optimization Algorithm (DAGOA) and a soft actor–critic (SAC) with hierarchical action decomposition, an uncertainty-quantified critic ensemble, and adaptive entropy temperature, where DAGOA performs an effective search and optimization in the discrete action space, while SAC performs fine control and adjustment in the continuous action space. By combining these algorithms, the joint optimization of drone path planning and task offloading can be achieved, improving the overall performance of the system. The experimental results show that the framework offers significant advantages in improving system performance, reducing energy consumption, and enhancing task completion efficiency. When the system adopts the hybrid decision framework, the reward score increases by up to 153.53% compared to pure deep reinforcement learning algorithms for decision-making. Moreover, it achieves an average improvement of 61.09% over various reinforcement learning algorithms such as the proposed SAC, proximal policy optimization (PPO), deep deterministic policy gradient (DDPG), and twin delayed deep deterministic policy gradient (TD3).
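The discrete/continuous split this abstract describes can be made concrete with a short Python sketch: a simple genetic search (standing in for DAGOA) evolves the binary offloading mask, while a stubbed function stands in for the SAC actor that would supply continuous flight control. The names sac_continuous_action and evaluate_reward, and the toy reward itself, are illustrative assumptions rather than the authors' implementation.

import numpy as np

rng = np.random.default_rng(0)
NUM_TASKS = 8            # discrete offloading decisions (0 = local, 1 = offload)
POP_SIZE, GENERATIONS = 20, 30

def sac_continuous_action(offload_mask):
    """Placeholder for the SAC actor: returns a continuous waypoint/velocity command."""
    return np.tanh(rng.normal(size=3))

def evaluate_reward(offload_mask, control):
    """Toy system reward trading off offloading gain against energy/motion cost."""
    offload_cost = 0.3 * offload_mask.sum()
    motion_cost = 0.1 * np.linalg.norm(control)
    return offload_mask.sum() - offload_cost - motion_cost

def genetic_search():
    """DAGOA-style search over the discrete action space; SAC handles the continuous part."""
    pop = rng.integers(0, 2, size=(POP_SIZE, NUM_TASKS))
    for _ in range(GENERATIONS):
        controls = [sac_continuous_action(ind) for ind in pop]
        fitness = np.array([evaluate_reward(ind, c) for ind, c in zip(pop, controls)])
        parents = pop[np.argsort(fitness)[-POP_SIZE // 2:]]           # selection
        children = parents[rng.permutation(len(parents))].copy()
        cut = rng.integers(1, NUM_TASKS)
        children[:, cut:] = parents[:, cut:]                          # one-point crossover
        mutation = rng.random(children.shape) < 0.05                  # (adaptive rate omitted)
        children[mutation] ^= 1
        pop = np.vstack([parents, children])
    scores = [evaluate_reward(ind, sac_continuous_action(ind)) for ind in pop]
    return pop[int(np.argmax(scores))]

print("best offloading mask:", genetic_search())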
... The subsequent presentation included the outcomes of two practical experiments, aimed at validating the system's proficiency in tasks such as maintaining a stationary position and following a predefined path. Another study [36] also investigated the use of PPO and TD3 for UAV control, comparing their performance in terms of stability, robustness, and trajectory accuracy across various UAV designs and scenarios. The results demonstrate that both algorithms effectively manage UAV control challenges in dynamic environments. ...
Article
Full-text available
The popularity of quadrotor Unmanned Aerial Vehicles (UAVs) stems from their simple propulsion systems and structural design. However, their complex and nonlinear dynamic behavior presents a significant challenge for control, necessitating sophisticated algorithms to ensure stability and accuracy in flight. Various strategies have been explored by researchers and control engineers, with learning-based methods like reinforcement learning, deep learning, and neural networks showing promise in enhancing the robustness and adaptability of quadrotor control systems. This paper investigates a Reinforcement Learning (RL) approach for both high and low-level quadrotor control systems, focusing on attitude stabilization and position tracking tasks. A novel reward function and actor-critic network structures are designed to stimulate high-order observable states, improving the agent’s understanding of the quadrotor’s dynamics and environmental constraints. To address the challenge of RL hyperparameter tuning, a new framework is introduced that combines Simulated Annealing (SA) with a reinforcement learning algorithm, specifically Simulated Annealing-Twin Delayed Deep Deterministic Policy Gradient (SA-TD3). This approach is evaluated for path-following and stabilization tasks through comparative assessments with two commonly used control methods: Backstepping and Sliding Mode Control (SMC). While the implementation of the well-trained agents exhibited unexpected behavior during real-world testing, a reduced neural network used for altitude control was successfully implemented on a Parrot Mambo mini drone. The results showcase the potential of the proposed SA-TD3 framework for real-world applications, demonstrating improved stability and precision across various test scenarios and highlighting its feasibility for practical deployment.
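The hyperparameter-tuning idea behind SA-TD3 can be sketched as a plain simulated-annealing loop wrapped around an RL training run. In the Python sketch below, train_and_evaluate is only a surrogate placeholder for "train TD3 with these hyperparameters and return the average reward"; the hyperparameter ranges and the cooling schedule are assumptions for illustration.

import math, random

random.seed(0)

def train_and_evaluate(hp):
    """Placeholder: would train TD3 with these hyperparameters and return average reward."""
    # Toy surrogate that prefers lr near 1e-3 and tau near 0.005.
    return -abs(math.log10(hp["lr"]) + 3) - 100 * abs(hp["tau"] - 0.005)

def perturb(hp):
    """Propose a neighbouring hyperparameter setting."""
    return {"lr": hp["lr"] * 10 ** random.uniform(-0.3, 0.3),
            "tau": min(max(hp["tau"] + random.uniform(-0.002, 0.002), 1e-4), 0.05)}

hp = {"lr": 1e-2, "tau": 0.02}
score = train_and_evaluate(hp)
T = 1.0
for step in range(200):
    candidate = perturb(hp)
    cand_score = train_and_evaluate(candidate)
    # Metropolis acceptance: always keep improvements, occasionally accept worse settings.
    if cand_score > score or random.random() < math.exp((cand_score - score) / T):
        hp, score = candidate, cand_score
    T *= 0.98  # geometric cooling schedule

print("selected hyperparameters:", hp, "surrogate score:", round(score, 3))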
Article
Full-text available
Deep reinforcement learning (RL) has made it possible to solve complex robotics problems using neural networks as function approximators. However, policies trained in stationary environments suffer in terms of generalization when transferred from one environment to another. In this work, we use Robust Markov Decision Processes (RMDP) to train the drone control policy, which combines ideas from Robust Control and RL. It opts for pessimistic optimization to handle potential mismatches when transferring a policy from one environment to another. The trained control policy is tested on the task of quadcopter positional control. RL agents were trained in a MuJoCo simulator. During testing, different environment parameters (unseen during training) were used to validate the robustness of the trained policy for transfer from one environment to another. The robust policy outperformed the standard agents in these environments, suggesting that the added robustness increases generality and allows adaptation to nonstationary environments.
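The pessimistic objective used in robust training can be illustrated in a few lines: the policy is scored over a set of sampled dynamics parameters and only its worst-case return is optimized. In the sketch below, rollout_return and the mass/drag ranges are made-up stand-ins, and a naive random search replaces the actual RL update; the point is only to show the min-over-environments structure.

import numpy as np

rng = np.random.default_rng(0)

def rollout_return(policy_params, mass, drag):
    """Placeholder for simulating the quadcopter under perturbed dynamics and scoring it."""
    target = np.array([1.0 + 0.5 * (mass - 1.0), -0.2 * drag])
    return -float(np.sum((policy_params - target) ** 2))

def robust_objective(policy_params, n_envs=16):
    """Score the policy on sampled dynamics and return its WORST return (pessimism)."""
    masses = rng.uniform(0.8, 1.2, n_envs)   # assumed +/-20% mass uncertainty
    drags = rng.uniform(0.0, 0.2, n_envs)    # assumed drag range
    return min(rollout_return(policy_params, m, d) for m, d in zip(masses, drags))

best_params, best_score = None, -np.inf
for _ in range(500):                          # random search stands in for the RL update
    theta = rng.normal(size=2)
    score = robust_objective(theta)
    if score > best_score:
        best_params, best_score = theta, score

print("robust parameters:", best_params, "worst-case return:", round(best_score, 3))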
Conference Paper
Full-text available
In this paper, we present a novel developmental reinforcement learning-based controller for a quadcopter with thrust vectoring capabilities. This multirotor UAV design has tilt-enabled rotors. It utilizes the rotor force magnitude and direction to achieve the desired state during flight. The control policy of this robot is learned using policy transfer from the learned controller of the quadcopter (a comparatively simple UAV design without thrust vectoring). This approach allows learning a control policy for systems with multiple inputs and multiple outputs. The performance of the learned policy is evaluated by physics-based simulations for the tasks of hovering and way-point navigation. The flight simulations utilize a flight controller based on reinforcement learning without any additional PID components. The results show faster learning with the presented approach as opposed to learning the control policy from scratch for this new UAV design, created by modifications to a conventional quadcopter, i.e., the addition of more degrees of freedom (from 4 actuators in the conventional quadcopter to 8 actuators in the tilt-rotor quadcopter). We demonstrate the robustness of our learned policy by showing the recovery of the tilt-rotor platform in simulation from various non-static initial conditions in order to reach a desired state. The developmental policy for the tilt-rotor UAV also showed superior fault tolerance when compared with the policy learned from scratch. The results show the ability of the presented approach to bootstrap the learned behavior from a simpler system (lower-dimensional action space) to a more complex robot (comparatively higher-dimensional action space) and reach better performance faster.
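The policy-transfer step, from a 4-actuator quadcopter to an 8-actuator tilt-rotor, can be illustrated by warm-starting a wider output layer from the trained narrower one. The sketch below uses assumed layer sizes and a toy two-layer policy, not the authors' network; it only shows how the shared trunk and the first four output units carry over while the new tilt actuators start near zero.

import numpy as np

rng = np.random.default_rng(0)
OBS_DIM, HIDDEN, QUAD_ACTS, TILT_ACTS = 18, 64, 4, 8

# Pretend these are the learned quadcopter policy weights (random here for the demo).
quad = {"W1": rng.normal(size=(OBS_DIM, HIDDEN)) * 0.1, "b1": np.zeros(HIDDEN),
        "W2": rng.normal(size=(HIDDEN, QUAD_ACTS)) * 0.1, "b2": np.zeros(QUAD_ACTS)}

# Build the tilt-rotor policy with a wider output layer.
tilt = {"W1": quad["W1"].copy(), "b1": quad["b1"].copy(),       # shared trunk transfers directly
        "W2": np.zeros((HIDDEN, TILT_ACTS)), "b2": np.zeros(TILT_ACTS)}
tilt["W2"][:, :QUAD_ACTS] = quad["W2"]                          # rotor-thrust outputs keep learned weights
tilt["W2"][:, QUAD_ACTS:] = rng.normal(size=(HIDDEN, TILT_ACTS - QUAD_ACTS)) * 0.01
# New tilt-angle outputs start near zero, so the bootstrapped policy initially flies like a quadcopter.

def act(params, obs):
    hidden = np.tanh(obs @ params["W1"] + params["b1"])
    return np.tanh(hidden @ params["W2"] + params["b2"])

print(act(tilt, rng.normal(size=OBS_DIM)))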
Article
Full-text available
The present work is a review of unmanned aerial systems technology and their subsystems (frame, propellers, motors and batteries, payloads, and data processing). Different applications are evaluated, related to remote sensing, spraying of liquids, and logistics. An overview of the regulatory framework is also developed.
Conference Paper
Full-text available
We describe a new physics engine tailored to model-based control. Multi-joint dynamics are represented in generalized coordinates and computed via recursive algorithms. Contact responses are computed via efficient new algorithms we have developed, based on the modern velocity-stepping approach which avoids the difficulties with spring-dampers. Models are specified using either a high-level C++ API or an intuitive XML file format. A built-in compiler transforms the user model into an optimized data structure used for runtime computation. The engine can compute both forward and inverse dynamics. The latter are well-defined even in the presence of contacts and equality constraints. The model can include tendon wrapping as well as actuator activation states (e.g. pneumatic cylinders or muscles). To facilitate optimal control applications and in particular sampling and finite differencing, the dynamics can be evaluated for different states and controls in parallel. Around 400,000 dynamics evaluations per second are possible on a 12-core machine, for a 3D humanoid with 18 dofs and 6 active contacts. We have already used the engine in a number of control applications. It will soon be made publicly available.
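A minimal example of driving this engine from Python (assuming the official mujoco bindings are installed) looks roughly as follows; the toy XML model is ours, not a UAV model from any of the works listed here.

import mujoco

XML = """
<mujoco>
  <option timestep="0.002"/>
  <worldbody>
    <body name="box" pos="0 0 1">
      <freejoint/>
      <geom type="box" size="0.1 0.1 0.1" mass="1"/>
    </body>
  </worldbody>
</mujoco>
"""

model = mujoco.MjModel.from_xml_string(XML)   # compile the user model into the runtime structure
data = mujoco.MjData(model)

for _ in range(500):                          # roughly one second of simulated time
    mujoco.mj_step(model, data)               # forward dynamics in generalized coordinates

print("free-body height after 1 s:", data.qpos[2])
# mujoco.mj_inverse(model, data) would compute inverse dynamics for the same state.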
Article
Full-text available
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2002. Includes bibliographical references (leaves 143-147). Many complex decision making problems like scheduling in manufacturing systems, portfolio management in finance, admission control in communication networks etc., with clear and precise objectives, can be formulated as stochastic dynamic programming problems in which the objective of decision making is to maximize a single "overall" reward. In these formulations, finding an optimal decision policy involves computing a certain "value function" which assigns to each state the optimal reward one would obtain if the system was started from that state. This function then naturally prescribes the optimal policy, which is to take decisions that drive the system to states with maximum value. For many practical problems, the computation of the exact value function is intractable, analytically and numerically, due to the enormous size of the state space. Therefore one has to resort to one of the following approximation methods to find a good sub-optimal policy: (1) Approximate the value function. (2) Restrict the search for a good policy to a smaller family of policies. In this thesis, we propose and study actor-critic algorithms which combine the above two approaches with simulation to find the best policy among a parameterized class of policies. Actor-critic algorithms have two learning units: an actor and a critic. An actor is a decision maker with a tunable parameter. A critic is a function approximator. The critic tries to approximate the value function of the policy used by the actor, and the actor in turn tries to improve its policy based on the current approximation provided by the critic. Furthermore, the critic evolves on a faster time-scale than the actor. We propose several variants of actor-critic algorithms. In all the variants, the critic uses Temporal Difference (TD) learning with linear function approximation. Some of the variants are inspired by a new geometric interpretation of the formula for the gradient of the overall reward with respect to the actor parameters. This interpretation suggests a natural set of basis functions for the critic, determined by the family of policies parameterized by the actor's parameters. We concentrate on the average expected reward criterion but we also show how the algorithms can be modified for other objective criteria. We prove convergence of the algorithms for problems with general (finite, countable, or continuous) state and decision spaces. To compute the rate of convergence (ROC) of our algorithms, we develop a general theory of the ROC of two-time-scale algorithms and we apply it to study our algorithms. In the process, we study the ROC of TD learning and compare it with related methods such as Least Squares TD (LSTD). We study the effect of the basis functions used for linear function approximation on the ROC of TD. We also show that the ROC of actor-critic algorithms does not depend on the actual basis functions used in the critic but depends only on the subspace spanned by them and study this dependence. Finally, we compare the performance of our algorithms with other algorithms that optimize over a parameterized family of policies. We show that when only the "natural" basis functions are used for the critic, the rate of convergence of the actor-critic algorithms is the same as that of certain stochastic gradient descent algorithms ...
Article
Full-text available
In this paper, we propose and analyze a class of actor-critic algorithms. These are two-time-scale algorithms in which the critic uses temporal difference (TD) learning with a linearly parameterized approximation architecture, and the actor is updated in an approximate gradient direction based on information provided by the critic. We show that the features for the critic should ideally span a subspace prescribed by the choice of parameterization of the actor. We study actor-critic algorithms for Markov decision processes with general state and action spaces. We state and prove two results regarding their convergence.
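The two-time-scale scheme analysed in these works can be condensed into a short sketch: the critic runs TD(0) with a linear (here tabular) value approximation on a fast step size, and the actor ascends the TD-error-weighted score function on a slower one. The toy two-state MDP below is an assumption made purely for illustration.

import numpy as np

rng = np.random.default_rng(0)
N_STATES, N_ACTIONS = 2, 2
GAMMA, ALPHA_CRITIC, ALPHA_ACTOR = 0.95, 0.1, 0.01   # critic learns on the faster time scale

theta = np.zeros((N_STATES, N_ACTIONS))   # actor: softmax policy parameters
w = np.zeros(N_STATES)                    # critic: one weight per state (tabular features)

def step_env(state, action):
    """Toy dynamics: action 1 usually reaches state 1, which pays reward 1."""
    next_state = 1 if (action == 1 and rng.random() < 0.9) else 0
    return next_state, float(next_state == 1)

state = 0
for _ in range(20000):
    logits = theta[state] - theta[state].max()
    probs = np.exp(logits) / np.exp(logits).sum()
    action = rng.choice(N_ACTIONS, p=probs)
    next_state, reward = step_env(state, action)
    td_error = reward + GAMMA * w[next_state] - w[state]    # critic's TD(0) error
    w[state] += ALPHA_CRITIC * td_error                     # fast critic update
    grad_log = -probs
    grad_log[action] += 1.0                                 # gradient of log softmax policy
    theta[state] += ALPHA_ACTOR * td_error * grad_log       # slow actor update
    state = next_state

print("policy in state 0:", np.exp(theta[0]) / np.exp(theta[0]).sum())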
Article
Recently, technologies related to Unmanned Aerial Vehicles (UAVs) have been advancing rapidly, particularly sensing, networking, and processing technologies. Accordingly, governments and industry have invested heavily in UAV research and in improving UAV performance for reliable and secure deployments. The design and investigation of UAV systems have progressed from single-UAV uses to multi-UAV and cooperative UAV systems, which demand a high level of coordination and collaboration to perform tasks and therefore require new networking models, approaches, and mechanisms suited to highly mobile nodes under many constraints. In this context, this paper provides a detailed and deep investigation of UAV communication protocols, networking systems, architectures, and applications. Furthermore, we discuss UAV-associated solutions and highlight technical challenges and open research issues.
Article
Drones are rapidly becoming an affordable and often faster solution for parcel delivery than terrestrial vehicles. Existing transportation drones and software infrastructures are mostly conceived by logistics companies for trained users and dedicated infrastructure, to be used either for long-range (>150 km) or last-mile (<20 km) delivery. This paper presents Dronistics, an integrated software and hardware system for last-centimetre (<5 km) person-to-person drone delivery. The system is conceived to be intrinsically safe and operationally intuitive in order to enable short-distance deliveries between inexperienced users. Dronistics is composed of a safe foldable drone (PackDrone) and a web application to intuitively control and track the drone in real time. In order to assess Dronistics' user acceptance, we conducted 150 deliveries over one month on the EPFL campus in Switzerland. Here we describe the results of these tests by analysing flight statistics, environmental conditions, and user reactions. Moreover, we also describe technical problems that occurred during flight tests and solutions that could prevent them.
Article
In value-based reinforcement learning methods such as deep Q-learning, function approximation errors are known to lead to overestimated value estimates and suboptimal policies. We show that this problem persists in an actor-critic setting and propose novel mechanisms to minimize its effects on both the actor and critic. Our algorithm takes the minimum value between a pair of critics to restrict overestimation and delays policy updates to reduce per-update error. We evaluate our method on the suite of OpenAI gym tasks, outperforming the state of the art in every environment tested.
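The two mechanisms named in this abstract, taking the minimum over a pair of critics and delaying the actor update, can be sketched in a few lines of PyTorch. The batch here is random data, target networks are omitted for brevity, and all sizes are assumptions; the sketch only shows where the minimum and the update delay enter.

import torch
import torch.nn as nn

obs_dim, act_dim, batch, gamma, policy_delay = 8, 2, 64, 0.99, 2

def make_critic():
    return nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))

q1, q2 = make_critic(), make_critic()
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim), nn.Tanh())
q_opt = torch.optim.Adam(list(q1.parameters()) + list(q2.parameters()), lr=3e-4)
pi_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)

for step in range(10):                                        # stand-in for sampling a replay buffer
    s, a = torch.randn(batch, obs_dim), torch.randn(batch, act_dim)
    r, s2 = torch.randn(batch, 1), torch.randn(batch, obs_dim)
    with torch.no_grad():
        a2 = actor(s2) + 0.1 * torch.randn(batch, act_dim)    # target policy smoothing noise
        sa2 = torch.cat([s2, a2], dim=1)
        target = r + gamma * torch.min(q1(sa2), q2(sa2))      # clipped double-Q target
    sa = torch.cat([s, a], dim=1)
    critic_loss = ((q1(sa) - target) ** 2).mean() + ((q2(sa) - target) ** 2).mean()
    q_opt.zero_grad(); critic_loss.backward(); q_opt.step()
    if step % policy_delay == 0:                              # delayed policy update
        actor_loss = -q1(torch.cat([s, actor(s)], dim=1)).mean()
        pi_opt.zero_grad(); actor_loss.backward(); pi_opt.step()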
Conference Paper
Recent terror incidents make us question whether existing surveillance systems are still the best solution. CCTV has proven to be a solution for large-scale surveillance, but when it comes to solving crimes, CCTV has played a very minimal role. The concept proposed in this paper aims to overcome these shortcomings and revolutionize surveillance systems. Based on the framework of a quadcopter with autonomous flight capabilities and an auto-tracking feature, the drone uses an image processing algorithm based on Probability Hypothesis Density (PHD) filtering with a Markov Chain Monte Carlo (MCMC) implementation. To efficiently control the swarm of quadcopters, we formulate an Energy Efficient Coverage Path Planning (EECPP) problem. The concept explained in this paper integrates a swarm of drones that can act autonomously with image processing and, when developed into a full-scale system, could be key to the future of public monitoring and security, saving precious lives.
Article
We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a "surrogate" objective function using stochastic gradient ascent. Whereas standard policy gradient methods perform one gradient update per data sample, we propose a novel objective function that enables multiple epochs of minibatch updates. The new methods, which we call proximal policy optimization (PPO), have some of the benefits of trust region policy optimization (TRPO), but they are much simpler to implement, more general, and have better sample complexity (empirically). Our experiments test PPO on a collection of benchmark tasks, including simulated robotic locomotion and Atari game playing, and we show that PPO outperforms other online policy gradient methods, and overall strikes a favorable balance between sample complexity, simplicity, and wall-time.
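The clipped surrogate objective at the core of PPO can be written out directly. The sketch below uses a dummy batch and an assumed clipping parameter of 0.2, and omits the value-function and entropy terms.

import torch

eps = 0.2                                                 # assumed clipping parameter
log_prob_old = torch.randn(256)                           # log pi_old(a|s) stored at rollout time
log_prob_new = log_prob_old + 0.05 * torch.randn(256)     # current policy log-probs (dummy data)
advantage = torch.randn(256)                              # advantage estimates (dummy data)

ratio = torch.exp(log_prob_new - log_prob_old)            # importance ratio pi_new / pi_old
unclipped = ratio * advantage
clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantage
# PPO maximises the elementwise minimum of the two terms, so the loss is its negation.
ppo_loss = -torch.min(unclipped, clipped).mean()
print("clipped surrogate loss:", float(ppo_loss))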
Conference Paper
Drones are taking flight through services in packaging and delivery, reconnaissance, video, film, and public services. It is only a matter of time before drones are fully integrated to enhance the entertainment experience at venues and social gatherings. This paper explores the concept and develops an abstract design of a confetti drone for “dronetainment”: a robotic unmanned aircraft system (UAS) that dispenses particles of debris. Its specifications were identified through a group 6-3-5 brainwriting method, which led to a model being implemented via a three-step system design approach. A confetti canister was introduced to test the feasibility of a drone dispensing paper particles.