Conference Paper

Multi-Agent Hierarchical Reinforcement Learning by Integrating Options into MAXQ

Abstract

MAXQ is a new framework for multi-agent reinforcement learning. However, the MAXQ framework cannot decompose all subtasks into more refined hierarchies, and its hierarchies are difficult to discover automatically. This paper presents OptMAXQ, a multi-agent hierarchical reinforcement learning approach that integrates Options into MAXQ. In the OptMAXQ framework, MAXQ is used to introduce knowledge into reinforcement learning, and the Option framework is used to construct hierarchies automatically. The performance of OptMAXQ is demonstrated on a two-robot trash collection task and compared with MAXQ. The simulation results show that OptMAXQ is more practical than MAXQ in partially known environments.

... On HRL in MASs, studies such as [28,29] are notable. In [28], an extended version of the MAXQ framework for multi-agent environments is used, which has a hierarchical structure for displacing the coordination data. Shen et al. [29] rely on the method proposed in [28] and integrate Options into MAXQ. Their approach uses the MAXQ framework to introduce knowledge into reinforcement learning and the Option framework to construct hierarchies automatically. ...
Article
Reinforcement learning (RL) for solving large and complex problems faces the curse of dimensionality. To overcome this problem, frameworks based on temporal abstraction have been presented, each with its own advantages and disadvantages. This paper proposes a new method, similar to the strategies introduced in hierarchical abstract machines (HAMs), that creates a high-level controller layer for reinforcement learning using options. The proposed framework uses a non-deterministic automaton as a controller to make more effective use of temporally extended actions and state space clustering. This method can be viewed as a bridge between the option and HAM frameworks: it creates connection structures between them in order to reduce the disadvantages of both while retaining their advantages. Experimental results on different test environments show the significant efficiency of the proposed method.
... Therefore, the options they use are still single-agent options, and the coordination in the multi-agent system can only be shown/utilized in the option-choosing process, not in the option discovery process. We can classify these works by the option discovery methods they used: the algorithms in [15], [16] directly defined the options based on their task without a learning process; the algorithms in [17], [5], [6] learned the options based on the task-related reward signals generated by the environment; the algorithm in [7] trained the options based on a reward function that is a weighted sum of the environment reward and the information-theoretic reward term proposed in [13]. ...
Preprint
The use of options can greatly accelerate exploration in reinforcement learning, especially when only sparse reward signals are available. While option discovery methods have been proposed for individual agents, discovering collaborative options that coordinate the behavior of multiple agents and encourage them to visit the under-explored regions of their joint state space has not been considered in multi-agent reinforcement learning (MARL) settings. To address this, we propose Multi-agent Deep Covering Option Discovery, which constructs multi-agent options by minimizing the expected cover time of the agents' joint state space. We also propose a novel framework for adopting the multi-agent options in the MARL process. In practice, a multi-agent task can usually be divided into sub-tasks, each of which can be completed by a sub-group of the agents. Therefore, our algorithm framework first leverages an attention mechanism to find collaborative agent sub-groups that would benefit most from coordinated actions. Then, a hierarchical algorithm, namely HA-MSAC, is developed to learn the multi-agent options for each sub-group to complete its sub-task first, and then to integrate them through a high-level policy as the solution of the whole task. This hierarchical option construction allows our framework to strike a balance between scalability and effective collaboration among the agents. The evaluation based on multi-agent collaborative tasks shows that the proposed algorithm can effectively capture the agent interactions with the attention mechanism, successfully identify multi-agent options, and significantly outperform prior works using single-agent options or no options, in terms of both faster exploration and higher task rewards.
... Therefore, the options they use are still single-agent options, and the coordination in the multi-agent system can only be shown/utilized in the option-choosing process, not in the option discovery process. We can classify these works by the option discovery methods they use: the algorithms in [6], [7] directly define the options based on their task without a learning process; the algorithms in [8]-[10] learn the options based on the task-related reward signals generated by the environment; the algorithm in [20] trains the options based on a reward function that is a weighted sum of the environment reward and the information-theoretic reward term proposed in [21]. ...
Preprint
Covering option discovery has been developed to improve the exploration of reinforcement learning in single-agent scenarios with sparse reward signals, by connecting the most distant states in the embedding space provided by the Fiedler vector of the state transition graph. However, these option discovery methods cannot be directly extended to multi-agent scenarios, since the joint state space grows exponentially with the number of agents in the system. Thus, existing research on adopting options in multi-agent scenarios still relies on single-agent option discovery and fails to directly discover the joint options that can improve the connectivity of the agents' joint state space. In this paper, we show that it is indeed possible to directly compute multi-agent options with collaborative exploratory behaviors among the agents, while still enjoying the ease of decomposition. Our key idea is to approximate the joint state space as a Kronecker graph, that is, the Kronecker product of the individual agents' state transition graphs, based on which we can directly estimate the Fiedler vector of the joint state space using the Laplacian spectrum of the individual agents' transition graphs. This decomposition enables us to efficiently construct multi-agent joint options by encouraging agents to connect the sub-goal joint states that correspond to the minimum or maximum values of the estimated joint Fiedler vector. The evaluation based on multi-agent collaborative tasks shows that the proposed algorithm can successfully identify multi-agent options and significantly outperforms prior works using single-agent options or no options, in terms of both faster exploration and higher cumulative rewards.
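The role of the Fiedler vector in picking sub-goal states can be illustrated for a single agent. The following is a minimal sketch, not the paper's algorithm: it does not implement the Kronecker-graph estimate of the joint Fiedler vector, and the function name `fiedler_vector`, the 5-state corridor graph, and the shifted power iteration used to obtain the eigenvector are all assumptions chosen for illustration.

```python
from typing import List


def fiedler_vector(adjacency: List[List[float]], iterations: int = 500) -> List[float]:
    """Approximate the Fiedler vector of a connected, undirected graph.

    Uses power iteration on (shift * I - L), where L = D - A is the graph
    Laplacian, while projecting out the constant eigenvector (eigenvalue 0).
    """
    n = len(adjacency)
    degrees = [sum(row) for row in adjacency]
    shift = 2.0 * max(degrees)  # upper bound on the largest Laplacian eigenvalue
    x = [float(i) for i in range(n)]  # assumed non-constant starting vector
    for _ in range(iterations):
        # Remove the component along the constant vector.
        mean = sum(x) / n
        x = [xi - mean for xi in x]
        # y = (shift * I - D + A) x
        y = [
            (shift - degrees[i]) * x[i]
            + sum(adjacency[i][j] * x[j] for j in range(n))
            for i in range(n)
        ]
        norm = sum(yi * yi for yi in y) ** 0.5
        x = [yi / norm for yi in y]
    return x


# Hypothetical state transition graph: a corridor of 5 states, 0 - 1 - 2 - 3 - 4.
n_states = 5
corridor = [[0.0] * n_states for _ in range(n_states)]
for i in range(n_states - 1):
    corridor[i][i + 1] = corridor[i + 1][i] = 1.0

v = fiedler_vector(corridor)
# Sub-goal states: where the Fiedler vector takes its minimum and maximum values.
subgoals = sorted({v.index(min(v)), v.index(max(v))})
```

For the corridor, the two extremes of the Fiedler vector fall on the two end states, which are the most distant pair of states in the graph; a covering option would connect them.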
... Moreover, Cao et al. (2020) proposed a potential field hierarchical reinforcement learning approach to improve the cooperation efficiency of multi-AUV systems in a target searching task. In their method, the multi-agent cooperative MAXQ algorithm was used for hierarchical reinforcement learning (HRL) (Cheng et al., 2007; Li et al., 2010; Shen et al., 2006) and a potential field was used to automatically adjust the parameters of HRL. In simulated experiments, the proposed method was shown to enable the multi-AUV system to bypass dynamic and static obstacles and to find the nearest target point to each AUV. ...
Article
Autonomous underwater vehicles (AUVs) play an increasingly important role in the exploration of marine resources. Path planning and obstacle avoidance are the core technologies for AUV autonomy and will determine the application prospects of AUVs. This paper describes the state-of-the-art methods of path planning and obstacle avoidance for AUVs and aims to be a starting point for researchers entering this field. Moreover, the objective of this paper is to give a comprehensive overview of recent advances and new breakthroughs, and to discuss future directions worth researching in this area. The focus of this article is on path planning algorithms that deal with the constraints and characteristics of AUVs and the influence of marine environments. Since AUVs mostly operate in environments full of obstacles, we divide AUV path planning methods into two categories: global path planning with known static obstacles, and local path planning with unknown and dynamic obstacles. We describe the basic principles of each method and survey the most closely related work. An in-depth discussion and comparison of different path planning algorithms is also provided. Lastly, we propose some potential future research directions that are worth investigating in this field.
... A medium level of automation is introduced in [59], which proposes defining some basic MAXQ hierarchy to introduce domain knowledge in the system and using options to learn subtasks in some hierarchy level. After constructing a transition-graph, vertices are clustered using an artificial immune network model until a preset number of clusters (options) are discovered. ...
Article
Reinforcement Learning (RL) as a paradigm aims to develop algorithms that train an agent to optimally achieve a goal with minimal feedback information about the desired behavior, which is not precisely specified. Scalar rewards are returned to the agent in response to its actions, endorsing or opposing them. RL algorithms have been successfully applied to robot control design. The extension of the RL paradigm to the design of control systems for Multi-Component Robotic Systems (MCRS) poses new challenges, mainly related to the scaling up of complexity due to exponential state space growth, coordination issues, and the propagation of rewards among agents. In this paper, we identify the main issues which offer opportunities to develop innovative solutions towards fully-scalable cooperative multi-agent systems. Keywords: reinforcement learning, multi-component robotic systems, multi-agent systems
Article
In order to accomplish target hunting by multi-AUV (multiple autonomous underwater vehicles) systems in 3-D underwater environments, the AUVs need to cooperate in the process of pursuing and capturing intelligent targets. To improve the efficiency of target hunting and the smoothness of the AUV’s trajectory, a fuzzy-based potential field hierarchical reinforcement learning (FPHRL) approach is proposed in this paper. Unlike other algorithms that need repeated training in the choice of parameters, the proposed approach automatically acquires all the required parameters by learning. The potential field hierarchy is established by combining segmental options with the traditional hierarchical reinforcement learning (HRL) algorithm. The potential field is applied to the parameters of the HRL, which provides a reasonable path for target hunting in undeveloped environments. In addition, a fuzzy algorithm is introduced to improve the smoothness of the AUV trajectory. The simulation results show that the proposed method can control multiple AUVs to achieve the multi-target hunting task, and has higher efficiency and adaptability than other algorithms (the particle swarm optimization algorithm and the bio-inspired neural network algorithm). The fuzzy obstacle avoidance also yields some improvement in trajectory smoothness.
Article
Target search by multiple autonomous underwater vehicles (multi-AUV) is an important element of underwater rescue and underwater detection. To improve the efficiency of multi-AUV target search in three-dimensional (3-D) underwater environments, a potential field hierarchical reinforcement learning (PHRL) approach is proposed in this paper. Unlike other algorithms that need repeated training in the choice of parameters, the proposed approach obtains all the required parameters automatically through learning. The potential field hierarchy is built by integrating segmental options with the traditional hierarchical reinforcement learning (HRL) algorithm. The potential field is implemented in the parameters of the HRL, which provides reasonable target search paths in unexplored environments. In search tasks, the designed method can control the multi-AUV system to find the target effectively. The simulation results show that the proposed approach is capable of controlling multiple AUVs to achieve the search task for multiple targets with higher efficiency and adaptability than the HRL algorithm and the lawn-mowing algorithm.
Article
The incorporation of macro-actions (temporally extended actions) into multi-agent decision problems has the potential to address the curse of dimensionality associated with such decision problems. Since macro-actions last for stochastic durations, multiple agents executing decentralized policies in cooperative environments must act asynchronously. We present an algorithm that modifies Generalized Advantage Estimation for temporally extended actions, allowing a state-of-the-art policy optimization algorithm to optimize policies in Dec-POMDPs in which agents act asynchronously. We show that our algorithm is capable of learning optimal policies in two cooperative domains, one involving real-time bus holding control and one involving wildfire fighting with unmanned aircraft. Our algorithm works by framing problems as "event-driven decision processes," which are scenarios where the sequence and timing of actions and events are random and governed by an underlying stochastic process. In addition to optimizing policies with continuous state and action spaces, our algorithm also facilitates the use of event-driven simulators, which do not require time to be discretized into time-steps. We demonstrate the benefit of using event-driven simulation in the context of multiple agents taking asynchronous actions. We show that fixed time-step simulation risks obfuscating the sequence in which closely-separated events occur, adversely affecting the policies learned. Additionally, we show that arbitrarily shrinking the time-step scales poorly with the number of agents.
Article
For large-scale or complex systems with stochastic dynamic programming, we can turn to hierarchical reinforcement learning (HRL) to overcome the curse of dimensionality and the curse of modeling according to their hierarchical structures or hierarchical control modes. HRL belongs to the methodology of sample data-driven optimization, and due to the introduction of a spatial or temporal abstraction mechanism, it can be used to accelerate the process of policy learning. The Option method is one of the HRL techniques that can decompose the task of the system into multiple subtasks for learning and implementation. The traditional Option methods are based on the discrete-time semi-Markov decision process (SMDP) with discounted criteria, and cannot be applied to continuous-time infinite tasks. Therefore, in this paper, we extend the existing Option algorithms to the continuous-time case by utilizing the relative learning formula of continuous-time SMDPs, and propose a unified online Option algorithm that applies to either average or discounted criteria. The algorithm is developed under the framework of performance potential theory and the continuous-time SMDP model. Finally, we illustrate the effectiveness of the proposed HRL algorithm in solving the optimization problem of continuous-time infinite tasks with a robotic garbage collection system. The simulation results show that it needs less memory, and has better optimization performance and faster learning speed than a continuous-time flat Q-learning algorithm based on the simulated annealing technique.
Article
Reinforcement learning is a good method for multi-robot systems to handle tasks in unknown environments or with obscure models. MAXQ is a hierarchical reinforcement learning algorithm, which is limited by some inherent problems. In addition, much research has focused on the completion of the task, rather than the ability to deal with new tasks. In this paper, an improved MAXQ approach is adopted to tune the parameters of the cooperation rules. The proposed scheme is applied to target searching tasks by multi-robots. The simulation results demonstrate the effectiveness and efficiency of the proposed scheme.
Conference Paper
Effective cooperation of multi-robots in unknown environments is essential in many robotic applications, such as environment exploration and target searching. In this paper, a combined hierarchical reinforcement learning approach, together with a designed cooperation strategy, is proposed for the real-time cooperation of multi-robots in completely unknown environments. Unlike other algorithms that need an explicit environment model or select parameters by trial and error, the proposed cooperation method obtains all the required parameters automatically through learning. By integrating segmental options with the traditional MAXQ algorithm, the cooperation hierarchy is built. In new tasks, the designed cooperation method can control the multi-robot system to complete the task effectively. The simulation results demonstrate that the proposed scheme is able to effectively and efficiently lead a team of robots to cooperatively accomplish target searching tasks in completely unknown environments.
Article
Multiagent systems are increasingly present in computational environments. However, the problem of agent design or control is an open research field. Reinforcement learning approaches offer solutions that allow autonomous learning with minimal supervision. The Q-learning algorithm is a model-free reinforcement learning solution that has proven its usefulness in single-agent domains; however, it suffers from the curse of dimensionality when applied to multiagent systems. In this article, we discuss two approaches, namely TRQ-learning and distributed Q-learning, that overcome the limitations of Q-learning and offer feasible solutions. We test these approaches in two separate domains. The first is the control of a hose by a team of robots. The second is the trash disposal problem. Computational results show the effectiveness of Q-learning solutions for the control of multiagent systems.
Article
Hierarchical reinforcement learning facilitates faster learning by structuring the policy space, encouraging reuse of subtasks in different contexts, and enabling more effective state abstraction. In this paper, we explore another source of power of hierarchies, namely facilitating sharing of subtask value functions across multiple agents. We show that, when combined with suitable coordination information, this approach can significantly speed up learning and make it more scalable with the number of agents. We introduce the multi-agent shared hierarchy (MASH) framework, which generalizes the MAXQ framework and allows selectively sharing subtask value functions across agents. We develop a model-based average-reward reinforcement learning algorithm for the MASH framework and show its effectiveness with empirical results in a multi-agent taxi domain.
Article
In this paper we investigate the use of hierarchical reinforcement learning to speed up the acquisition of cooperative multi-agent tasks. We extend the MAXQ framework to the multi-agent case. Each agent uses the same MAXQ hierarchy to decompose a task into sub-tasks. Learning is decentralized, with each agent learning three interrelated skills: how to perform subtasks, which order to do them in, and how to coordinate with other agents. Coordination skills among agents are learned by using joint actions at the highest level(s) of the hierarchy. The Q nodes at the highest level(s) of the hierarchy are configured to represent the joint task-action space among multiple agents. In this approach, each agent only knows what other agents are doing at the level of sub-tasks, and is unaware of lower level (primitive) actions. This hierarchical approach allows agents to learn coordination faster by sharing information at the level of sub-tasks, rather than attempting to learn coordination taking into account primitive joint state-action values. We apply this hierarchical multi-agent reinforcement learning algorithm to a complex AGV scheduling task and compare its performance and speed with other learning approaches, including flat multi-agent, single agent using MAXQ, selfish multiple agents using MAXQ (where each agent acts independently without communicating with the other agents), as well as several well-known AGV heuristics like "first come first serve", "highest queue first" and "nearest station first". We also compare the tradeoffs in learning speed vs. performance of modeling joint action values at multiple levels in the MAXQ hierarchy.
Conference Paper
This paper explores basic aspects of the immune system and proposes a novel immune network model with the main goals of clustering and filtering unlabelled numerical data sets. It is not our concern to reproduce with confidence any immune phenomenon, but to show that immune concepts can be used to develop powerful computational tools for data processing. As important results of our model, the network evolved will be capable of reducing redundancy, describing data structure, including the shape of the clusters. The network will be implemented in association with a statistical inference technique, and its performance will be illustrated using two benchmark problems. The paper is concluded with a trade-off between the proposed network and artificial neural networks used to perform unsupervised learning
Article
An open problem in reinforcement learning is discovering hierarchical structure. HEXQ, an algorithm which automatically attempts to decompose and solve a model-free factored MDP hierarchically, is described. By searching for aliased Markov sub-space regions based on the state variables, the algorithm uses temporal and state abstraction to construct a hierarchy of interlinked smaller MDPs.
Article
This paper presents the MAXQ approach to hierarchical reinforcement learning based on decomposing the target Markov decision process (MDP) into a hierarchy of smaller MDPs and decomposing the value function of the target MDP into an additive combination of the value functions of the smaller MDPs. The paper defines the MAXQ hierarchy, proves formal results on its representational power, and establishes five conditions for the safe use of state abstractions. The paper presents an online model-free learning algorithm, MAXQ-Q, and proves that it converges with probability 1 to a kind of locally-optimal policy known as a recursively optimal policy, even in the presence of the five kinds of state abstraction. The paper evaluates the MAXQ representation and MAXQ-Q through a series of experiments in three domains and shows experimentally that MAXQ-Q (with state abstractions) converges to a recursively optimal policy much faster than flat Q learning. The fact that MAXQ learns a representation of the value function has an important benefit: it makes it possible to compute and execute an improved, non-hierarchical policy via a procedure similar to the policy improvement step of policy iteration. The paper demonstrates the effectiveness of this non-hierarchical execution experimentally. Finally, the paper concludes with a comparison to related work and a discussion of the design tradeoffs in hierarchical reinforcement learning.
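The additive decomposition described above is commonly written as Q(i, s, a) = V(a, s) + C(i, s, a), where V(a, s) is the value of executing child a from state s and C(i, s, a) is the completion value of parent task i after a finishes. The following is a minimal sketch of how such a decomposed value function can be evaluated recursively. The three-level hierarchy, the task names, and all numbers are hypothetical and hand-filled for illustration; nothing here is learned, and this is not the MAXQ-Q algorithm itself.

```python
from typing import Dict, List, Tuple

# Hypothetical task hierarchy: composite tasks map to their child tasks.
# Tasks that do not appear as keys are treated as primitive actions.
children: Dict[str, List[str]] = {
    "Root": ["Navigate", "Pickup"],
    "Navigate": ["North", "South"],
}

# V(a, s) for primitive actions: expected immediate reward (made-up values).
primitive_value: Dict[Tuple[str, int], float] = {
    ("North", 0): -1.0,
    ("South", 0): -1.0,
    ("Pickup", 0): -10.0,
}

# C(i, s, a): expected reward for completing parent i after child a ends
# (made-up values).
completion: Dict[Tuple[str, int, str], float] = {
    ("Navigate", 0, "North"): -2.0,
    ("Navigate", 0, "South"): -4.0,
    ("Root", 0, "Navigate"): 5.0,
    ("Root", 0, "Pickup"): 0.0,
}


def value(task: str, state: int) -> float:
    """V(task, s): primitive reward, or the best Q-value over the children."""
    if task not in children:
        return primitive_value[(task, state)]
    return max(q_value(task, state, child) for child in children[task])


def q_value(parent: str, state: int, child: str) -> float:
    """Q(parent, s, child) = V(child, s) + C(parent, s, child)."""
    return value(child, state) + completion[(parent, state, child)]


root_value = value("Root", 0)
```

With these numbers, `Navigate` is worth -3.0 in state 0 (via `North`), so the root prefers `Navigate` (-3.0 + 5.0) over `Pickup` (-10.0 + 0.0).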
Article
This paper describes a formulation of reinforcement learning that enables learning in noisy, dynamic environments such as in the complex concurrent multi-robot learning domain. The methodology involves minimizing the learning space through the use of behaviors and conditions, and dealing with the credit assignment problem through shaped reinforcement in the form of heterogeneous reinforcement functions and progress estimators. We experimentally validate the approach on a group of four mobile robots learning a foraging task.
Article
Reinforcement learning is bedeviled by the curse of dimensionality: the number of parameters to be learned grows exponentially with the size of any compact encoding of a state. Recent attempts to combat the curse of dimensionality have turned to principled ways of exploiting temporal abstraction, where decisions are not required at each step, but rather invoke the execution of temporally-extended activities which follow their own policies until termination. This leads naturally to hierarchical control architectures and associated learning algorithms. We review several approaches to temporal abstraction and hierarchical organization that machine learning researchers have recently developed. Common to these approaches is a reliance on the theory of semi-Markov decision processes, which we emphasize in our review. We then discuss extensions of these ideas to concurrent activities, multiagent coordination, and hierarchical memory for addressing partial observability. Concluding remarks address open challenges facing the further development of reinforcement learning in a hierarchical setting.
Article
Hierarchical Control and Learning for Markov Decision Processes, by Ronald Edward Parr. Doctor of Philosophy in Computer Science, University of California at Berkeley. Professor Stuart Russell, Chair.
This dissertation investigates the use of hierarchy and problem decomposition as a means of solving large, stochastic, sequential decision problems. These problems are framed as Markov decision problems (MDPs). The new technical content of this dissertation begins with a discussion of the concept of temporal abstraction. Temporal abstraction is shown to be equivalent to the transformation of a policy defined over a region of an MDP to an action in a semi-Markov decision problem (SMDP). Several algorithms are presented for performing this transformation efficiently. This dissertation introduces the HAM method for generating hierarchical, temporally abstract actions. This method permits the partial specification of abstract actions in a way that corresponds to an abstract plan or strate...
Article
Learning, planning, and representing knowledge at multiple levels of temporal abstraction are key, longstanding challenges for AI. In this paper we consider how these challenges can be addressed within the mathematical framework of reinforcement learning and Markov decision processes (MDPs). We extend the usual notion of action in this framework to include options---closed-loop policies for taking action over a period of time. Examples of options include picking up an object, going to lunch, and traveling to a distant city, as well as primitive actions such as muscle twitches and joint torques. Overall, we show that options enable temporally abstract knowledge and action to be included in the reinforcement learning framework in a natural and general way. In particular, we show that options may be used interchangeably with primitive actions in planning methods such as dynamic programming and in learning methods such as Q-learning.
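An option in this sense is usually modelled as three parts: a set of states where it may start, a closed-loop policy, and a termination condition. The following is a minimal sketch under that assumption. The `Option` class, the `run_option` helper, and the six-state corridor environment in `step` are hypothetical names introduced for illustration and are not taken from the paper.

```python
import random
from dataclasses import dataclass
from typing import Callable, Set, Tuple


@dataclass
class Option:
    initiation_set: Set[int]             # states where the option may start
    policy: Callable[[int], int]         # closed-loop policy: state -> action
    termination: Callable[[int], float]  # probability of stopping in a state


def step(state: int, action: int) -> Tuple[int, float]:
    """Hypothetical corridor with states 0..5; reward 1.0 on entering state 5."""
    next_state = max(0, min(5, state + action))
    reward = 1.0 if (next_state == 5 and state != 5) else 0.0
    return next_state, reward


def run_option(option: Option, state: int, gamma: float = 0.9) -> Tuple[int, float, int]:
    """Run an option to termination.

    Returns the final state, the discounted reward accumulated while the
    option was running, and the number of primitive steps it took.
    """
    if state not in option.initiation_set:
        raise ValueError("option cannot be started in this state")
    total, discount, duration = 0.0, 1.0, 0
    while True:
        state, reward = step(state, option.policy(state))
        total += discount * reward
        discount *= gamma
        duration += 1
        if random.random() < option.termination(state):
            return state, total, duration


# An option that walks right until it reaches state 5, then stops.
go_right = Option(
    initiation_set={0, 1, 2, 3, 4},
    policy=lambda s: 1,
    termination=lambda s: 1.0 if s == 5 else 0.0,
)
final_state, discounted_return, duration = run_option(go_right, 2)
```

A primitive action fits the same interface as an option that always terminates after one step, which is what allows options and primitive actions to be used interchangeably by a learner such as Q-learning: the triple returned by `run_option` is the information an SMDP-style update would need.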