Tianwei Ni’s research while affiliated with Université de Montréal and other places


Publications (7)


LSTM hyperparameters used for all experiments.
Transformer hyperparameters used for all experiments.
RL agent hyperparameters used in all experiments.
Training hyperparameters in our experiments.
When Do Transformers Shine in RL? Decoupling Memory from Credit Assignment
  • Preprint
  • File available

July 2023

·

69 Reads

Tianwei Ni

·

Michel Ma

·

Benjamin Eysenbach

·

Pierre-Luc Bacon

Reinforcement learning (RL) algorithms face two distinct challenges: learning effective representations of past and present observations, and determining how actions influence future returns. Both challenges involve modeling long-term dependencies. The transformer architecture has been very successful at solving problems that involve long-term dependencies, including in the RL domain. However, the underlying reason for the strong performance of Transformer-based RL methods remains unclear: is it because they learn effective memory, or because they perform effective credit assignment? After introducing formal definitions of memory length and credit assignment length, we design simple configurable tasks to measure these distinct quantities. Our empirical results reveal that Transformers can enhance the memory capacity of RL algorithms, scaling up to tasks that require memorizing observations from 1500 steps earlier. However, Transformers do not improve long-term credit assignment. In summary, our results provide an explanation for the success of Transformers in RL, while also highlighting an important area for future research and benchmark design.
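To make the notion of a configurable memory length concrete, here is a loose, hypothetical sketch of a "passive recall" task in the spirit of the abstract; the class name, reward scheme, and observation encoding are assumptions, not the paper's actual benchmark. The required memory length equals the configurable delay, while the credit-assignment length stays at one step.

```python
import random

class PassiveRecallEnv:
    """Toy POMDP sketch: a cue is shown at t=0 and must be recalled after
    `memory_len` steps. Only the final action is rewarded, so the required
    memory length scales with `memory_len` while the credit-assignment
    length stays at 1. (Hypothetical task, not the paper's benchmark.)"""

    def __init__(self, memory_len=50, num_cues=2):
        self.memory_len = memory_len
        self.num_cues = num_cues

    def reset(self):
        self.t = 0
        self.cue = random.randrange(self.num_cues)
        return self.cue                  # the cue is observable only at t=0

    def step(self, action):
        self.t += 1
        done = self.t >= self.memory_len
        # Reward is given only at the final step, for recalling the cue.
        reward = float(done and action == self.cue)
        obs = -1                         # uninformative observation afterwards
        return obs, reward, done, {}
```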


Figure 6: Comparison between shared and separate recurrent actor-critic architectures, with all other hyperparameters kept the same, on Semi-Circle, a toy meta-RL environment. We show the performance metric (left) and the gradient norm of the RNN encoder(s) (right, in log scale). For the separate architecture, :critic and :actor refer to the separate RNNs in the critic and actor networks, respectively.
Figure 10: Learning curves on robust RL environments. We show the average returns (left figures) and worst returns (right figures) from the single best variant of our recurrent model-free RL implementation and from the specialized robust RL method MRPO [45]. Note that our method is much slower than MRPO, so we only run it for 3M environment steps; nevertheless, the results show that our method has much better sample efficiency than MRPO.
Figure 11: Learning curves on generalization in RL environments. We show the interpolation success rates (left figures) and extrapolation success rates (right figures) from the single best variant of our recurrent model-free RL implementation. We also show the final performance of the specialized method EPOpt-PPO-FF [78] and of another recurrent model-free (on-policy) RL method (A2C-RC), copied from Tables 7 & 8 in Packer et al. [69].
Figure 17: Comparison between shared and separate recurrent actor-critic architectures, with all other hyperparameters kept the same (caption truncated).
Recurrent Model-Free RL is a Strong Baseline for Many POMDPs

October 2021

·

198 Reads

Many problems in RL, such as meta RL, robust RL, and generalization in RL, can be cast as POMDPs. In theory, simply augmenting model-free RL with memory, such as recurrent neural networks, provides a general approach to solving all types of POMDPs. However, prior work has found that such recurrent model-free RL methods tend to perform worse than more specialized algorithms that are designed for specific types of POMDPs. This paper revisits this claim. We find that careful architecture and hyperparameter decisions yield a recurrent model-free implementation that performs on par with (and occasionally substantially better than) more sophisticated recent techniques in their respective domains. We also release a simple and efficient implementation of recurrent model-free RL for future work to use as a baseline for POMDPs. Code is available at https://github.com/twni2016/pomdp-baselines
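The recipe evaluated here, model-free RL with a recurrent encoder over past observations, actions, and rewards, can be illustrated with a minimal PyTorch sketch; this is not the released pomdp-baselines code, and the layer sizes and interface are arbitrary choices. Whether the actor and critic share such an encoder or keep separate copies is exactly the architectural choice compared in Figures 6 and 17 above.

```python
import torch
import torch.nn as nn

class RecurrentActor(nn.Module):
    """Minimal recurrent policy sketch: an LSTM summarizes the history of
    (observation, previous action, previous reward) into a hidden state,
    which a small MLP maps to action logits. Hyperparameters are arbitrary."""

    def __init__(self, obs_dim, act_dim, hidden=128):
        super().__init__()
        self.rnn = nn.LSTM(obs_dim + act_dim + 1, hidden, batch_first=True)
        self.head = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                  nn.Linear(hidden, act_dim))

    def forward(self, obs, prev_act, prev_rew, hx=None):
        # obs: (B, T, obs_dim), prev_act: (B, T, act_dim), prev_rew: (B, T, 1)
        x = torch.cat([obs, prev_act, prev_rew], dim=-1)
        out, hx = self.rnn(x, hx)
        return self.head(out), hx        # per-timestep logits, new hidden state
```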


Figure 1: Sample TSF game screen (line-drawing version; the original screen has a black background). The spaceships are labeled as shooter and bait. The entity at the center is the rotating fortress, with the border around it as its shield. The activation region is the hexagonal area around the players' spaceships. One projectile (the ellipsoid) is emitted from the shooter towards the fortress, and another from the fortress towards the bait. All entities are within the rectangular map borders.
Figure 2: Flowchart of the proposed adaptive agent architecture. The adaptation module (in the dotted border) takes as input the trajectory at the current timestep and then assigns the adaptive agent a new policy at the next timestep. The adaptation procedure can be deployed in real time (online).
Figure 3: Policy representations of each human bait (left) and shooter (right) in the static agent dataset (after PCA dimensionality reduction). Each colored node represents the average policy of a human player, and its size indicates that player's average team performance. Red nodes are reference points of exemplar policies.

... based on the CEM measurement, and then assigns the agent the corresponding complementary policy from the self-play table. One way of verifying this method is to check whether human-agent teams performed better when the predicted human policy was closer to its complementary match (i.e., best partner) in the self-play table P. Assuming each human maintains a consistent policy over each 1-minute interaction when paired with a specific teammate with a static policy, we can calculate, for each human-agent pair, the similarity between the human policy and the best-partner policy for the agent that the human was playing with. This "similarity to best partner" quantifies the degree to which a human player's policy resembles the optimal partner policy for an agent teammate in our architecture. Correlation analysis shows that "similarity to best partner" is positively correlated with team performance in both the bait (r = 0.636, p = .0002) and shooter (r = 0.834, p < .0001) groups. These results, in which complementary pairings of the human shooter accounted for 70% of the variance among teams, show the high payoff potentially available from our approach to matching. They indicate that the complementary policy pairs we found in agent-agent self-play can be successfully extended to human-agent teams, and that our proposed architecture can accurately identify human policy types and predict team performance.
Figure 4: (Best viewed in color) Average performance of human-agent teams. The solid blue line represents the learning curve of teams in the Adaptive condition, while the two dashed lines represent the Random and Fixed baselines, respectively. Shaded areas indicate one standard error from the mean. Red horizontal lines show the average performance of agent-agent teams in self-play as a reference.
Individualized Mutual Adaptation in Human-Agent Teams

September 2021

·

113 Reads

·

23 Citations

IEEE Transactions on Human-Machine Systems

Huao Li

·

Tianwei Ni

·

Siddharth Agrawal

·

[...]

·

The ability to collaborate with previously unseen human teammates is crucial for artificial agents to be effective in human-agent teams (HATs). Due to individual differences and complex team dynamics, it is hard to develop a single agent policy that matches all potential teammates. In this paper, we study both human-human and human-agent teams in a dyadic cooperative task, Team Space Fortress (TSF). Results show that team performance is influenced both by players' individual skill levels and by their ability to collaborate with different teammates by adopting complementary policies. Based on the human-human team results, we propose an adaptive agent that identifies different human policies and assigns a complementary partner policy to optimize team performance. The adaptation method relies on a novel similarity metric to infer the human policy and then selects the most complementary policy from a pre-trained library of exemplar policies. We conducted human-agent experiments to evaluate the adaptive agent and examine mutual adaptation in human-agent teams. Results show that both human adaptation and agent adaptation contribute to team performance.
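As a rough sketch of the inference step described above (not the paper's exact similarity metric), one could score how well each exemplar policy explains the observed human trajectory and pick the closest match; the function name and the use of average negative log-likelihood are assumptions.

```python
import numpy as np

def infer_human_policy(trajectory, exemplar_policies):
    """trajectory: list of (state, action) pairs observed from the human.
    exemplar_policies: dict mapping a name to a callable that returns action
    probabilities for a state. Returns the exemplar whose action distribution
    best explains the data (lowest average negative log-likelihood). Sketch only."""
    scores = {}
    for name, policy in exemplar_policies.items():
        nll = [-np.log(policy(s)[a] + 1e-8) for s, a in trajectory]
        scores[name] = float(np.mean(nll))
    return min(scores, key=scores.get)
```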


Figure 1: Sample TSF game screen (line-drawing version; the original screen has a black background). The spaceships are labeled as shooter and bait. The entity at the center is the rotating fortress, with the border around it as its shield. The activation region is the hexagonal area around the players' spaceships. The black arrow is a projectile emitted from the shooter towards the fortress. All entities are within the rectangular map borders.
Figure 2: Flowchart of the proposed adaptive agent architecture. The adaptation module (in the dotted border) takes as input the trajectory at the current timestep and then assigns the adaptive agent a new policy at the next timestep. The adaptation procedure can be deployed in real time (online).
Figure 3: Policy representations of each human bait (left) and shooter (right) in the static agent dataset (after PCA dimensionality reduction). Each colored node represents the average policy of a human player, and its size indicates that player's average team performance. Red nodes are reference points of baseline agent policies.
Figure 5: Human-agent team performance when humans were paired with adaptive or static agent policies. Error bars represent one standard error from the mean.
Adaptive Agent Architecture for Real-time Human-Agent Teaming

March 2021

·

165 Reads

Teamwork is a set of interrelated reasoning, actions, and behaviors of team members that facilitate common objectives. Teamwork theory and experiments have resulted in a set of states and processes for team effectiveness in both human-human and agent-agent teams. However, human-agent teaming is less well studied because it is so new and involves asymmetries in policy and intent not present in human teams. To optimize team performance in human-agent teaming, it is critical that agents infer human intent and adapt their policies for smooth coordination. Most of the literature in human-agent teaming builds agents that reference a learned human model. Though these agents are guaranteed to perform well with the learned model, they place strong assumptions on the human policy, such as optimality and consistency, which are unlikely to hold in many real-world scenarios. In this paper, we propose a novel adaptive agent architecture in a human-model-free setting on a two-player cooperative game, namely Team Space Fortress (TSF). Previous human-human team research has shown complementary policies in the TSF game and diversity in human players' skill, which encourages us to relax the assumptions on human policy. Therefore, we discard learning human models from human data and instead use an adaptation strategy on a pre-trained library of exemplar policies composed of RL algorithms or rule-based methods with minimal assumptions about human behavior. The adaptation strategy relies on a novel similarity metric to infer the human policy and then selects the most complementary policy in our library to maximize team performance. The adaptive agent architecture can be deployed in real time and generalizes to any off-the-shelf static agents. We conducted human-agent experiments to evaluate the proposed adaptive agent framework and demonstrated the suboptimality, diversity, and adaptability of human policies in human-agent teams.
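A hedged sketch of the selection step in such an architecture: once the teammate's policy type has been inferred (for example, with a routine like the `infer_human_policy` sketch above), the agent looks up the partner policy that scored best with that type in agent-agent self-play. The table layout and the `infer_policy_type` helper in the usage comment are assumptions, not details from the paper.

```python
def select_complementary_policy(human_type, selfplay_table):
    """selfplay_table maps (human_policy_type, agent_policy_name) to the mean
    team score observed in agent-agent self-play. Returns the agent policy
    that performed best when paired with the inferred human type. Sketch only."""
    best, best_score = None, float("-inf")
    for (h, a), score in selfplay_table.items():
        if h == human_type and score > best_score:
            best, best_score = a, score
    return best

# Hypothetical real-time usage: periodically re-infer the teammate's policy
# type from the recent trajectory and switch to its complementary partner.
# human_type = infer_policy_type(recent_trajectory)            # assumed helper
# agent.policy = library[select_complementary_policy(human_type, table)]
```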


Team Synchronization and Individual Contributions in Coop-Space Fortress

December 2020

·

17 Reads

·

3 Citations

Proceedings of the Human Factors and Ergonomics Society Annual Meeting

This work studied human teamwork with a focus on the influence of team synchronization and individual differences on performance. Human participants were paired to complete collaborative tasks in a simulated game environment, in which they were assigned roles with corresponding responsibilities. Cross-correlation analysis was employed to quantify the degree of team synchronization and the time lag between two teammates' collective actions. Results showed that team performance is determined by factors at both the individual and team levels. We found interaction effects between team synchronization and individual differences and quantified their contributions to team performance. The application of our research findings and the proposed quantitative methods to developing adaptive agents for human-autonomy teaming is discussed.
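The cross-correlation analysis mentioned in the abstract can be illustrated with a small NumPy sketch that estimates both the synchronization strength and the time lag between two teammates' action time series; the inputs and the lag window are placeholders, and this is not the study's exact analysis pipeline.

```python
import numpy as np

def team_sync_and_lag(actions_a, actions_b, max_lag=30):
    """Normalized cross-correlation between two equally sampled action
    time series. Returns the peak correlation (synchronization strength)
    and the lag, in samples, at which it occurs. Sketch only."""
    a = np.asarray(actions_a, dtype=float)
    b = np.asarray(actions_b, dtype=float)
    a = (a - a.mean()) / (a.std() + 1e-8)
    b = (b - b.mean()) / (b.std() + 1e-8)
    lags = list(range(-max_lag, max_lag + 1))
    corrs = []
    for lag in lags:
        if lag < 0:
            x, y = a[:lag], b[-lag:]
        elif lag > 0:
            x, y = a[lag:], b[:-lag]
        else:
            x, y = a, b
        corrs.append(float(np.mean(x * y)))
    best = int(np.argmax(corrs))
    return corrs[best], lags[best]
```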


f-IRL: Inverse Reinforcement Learning via State Marginal Matching

November 2020

·

34 Reads

Imitation learning is well-suited for robotic tasks where it is difficult to directly program the behavior or specify a cost for optimal control. In this work, we propose a method for learning the reward function (and the corresponding policy) to match the expert state density. Our main result is the analytic gradient of any f-divergence between the agent and expert state distributions with respect to reward parameters. Based on the derived gradient, we present an algorithm, f-IRL, that recovers a stationary reward function from the expert density by gradient descent. We show that f-IRL can learn behaviors from a hand-designed target state density or implicitly through expert observations. Our method outperforms adversarial imitation learning methods in terms of sample efficiency and the required number of expert trajectories on IRL benchmarks. Moreover, we show that the recovered reward function can be used to quickly solve downstream tasks, and empirically demonstrate its utility on hard-to-explore tasks and for behavior transfer across changes in dynamics.
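For intuition only, the state-marginal-matching objective can be probed on low-dimensional toy data by estimating an f-divergence (here the forward KL) between expert and agent state samples with kernel density estimates; the Gaussian KDE and the function name are assumptions, and this is not f-IRL's analytic reward gradient.

```python
import numpy as np
from scipy.stats import gaussian_kde

def forward_kl_state_marginals(expert_states, agent_states):
    """Monte Carlo estimate of KL(rho_expert || rho_agent) over visited states,
    using Gaussian kernel density estimates fit to each sample set (arrays of
    shape (n_samples, state_dim)). Toy, low-dimensional illustration of the
    matching objective only, not f-IRL's reward update."""
    rho_e = gaussian_kde(np.asarray(expert_states).T)  # scipy wants (dim, n)
    rho_a = gaussian_kde(np.asarray(agent_states).T)
    s = np.asarray(expert_states).T                    # expert state samples
    log_ratio = rho_e.logpdf(s) - rho_a.logpdf(s)
    return float(np.mean(log_ratio))
```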


Figure 2: Rendering of simulated environments in Ant-v2 (left) and Humanoid-v2 (right).

Task name   | dim(s) | dim(a) | Description
Ant-v2      | 111    | 8      | Make a 3D four-legged robot walk forward as fast as possible.
Hopper-v2   | 11     | 3      | Make a 2D one-legged robot hop forward as fast as possible.
Humanoid-v2 | 376    | 17     | Make a 3D bipedal robot walk forward as fast as possible.
Walker2d-v2 | 17     | 6      | Make a 2D bipedal robot walk forward as fast as possible.
Meta-SAC: Auto-tune the Entropy Temperature of Soft Actor-Critic via Metagradient

July 2020

·

206 Reads

The exploration-exploitation dilemma has long been a crucial issue in reinforcement learning. In this paper, we propose a new approach to automatically balance these two. Our method is built upon the Soft Actor-Critic (SAC) algorithm, which uses an "entropy temperature" that balances the original task reward and the policy entropy, and hence controls the trade-off between exploitation and exploration. It has been empirically shown that SAC is very sensitive to this hyperparameter, and the follow-up work (SAC-v2), which uses constrained optimization for automatic adjustment, has some limitations. The core of our method, namely Meta-SAC, is to use a metagradient along with a novel meta objective to automatically tune the entropy temperature in SAC. We show that Meta-SAC achieves promising performance on several of the MuJoCo benchmark tasks, and outperforms SAC-v2 by over 10% on one of the most challenging tasks, Humanoid-v2.
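For context, the SAC-v2 temperature adjustment that the abstract contrasts against is commonly implemented as a simple loss on log(alpha) that drives policy entropy toward a fixed target; the PyTorch sketch below shows that standard update (not the Meta-SAC meta-gradient), and the action dimension and learning rate are placeholders.

```python
import torch

# Standard SAC-v2-style temperature auto-tuning, shown for contrast with
# Meta-SAC's meta-gradient approach (this is NOT the Meta-SAC update).
act_dim = 17                                    # placeholder action dimension
target_entropy = -float(act_dim)                # common heuristic: -|A|
log_alpha = torch.zeros(1, requires_grad=True)
alpha_opt = torch.optim.Adam([log_alpha], lr=3e-4)

def update_temperature(log_probs):
    """log_probs: log pi(a|s) for a batch of actions sampled from the policy.
    Pushes alpha so that the policy entropy tracks the fixed target entropy."""
    alpha_loss = -(log_alpha * (log_probs + target_entropy).detach()).mean()
    alpha_opt.zero_grad()
    alpha_loss.backward()
    alpha_opt.step()
    return log_alpha.exp().item()               # current temperature alpha
```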

Citations (2)


... Both human and agent team members must demonstrate adaptability: humans being open to working with artificial teammates while understanding their capabilities and limitations, and agents being capable of adjusting their behavior based on team needs and situational demands [40]. This adaptive dynamic, combined with clear performance metrics and continuous feedback mechanisms, enables the team to maintain high performance while effectively responding to changing task requirements and environmental conditions [12]. The concept of Autonomous Agent Teammate-likeness (AAT), introduced by [85], offers another perspective on perfect teams through the eyes of human perception. ...

Reference:

Intent Visualization in Human-Agent Teams
Individualized Mutual Adaptation in Human-Agent Teams

IEEE Transactions on Human-Machine Systems

... The effect of the type of avatar may be due to the perception of the subject regarding the agent as a peer when it has similar physical aspects performing the same action. As it is known, people feel more motivated to use avatars when the avatars share similar characteristics with them [45], and humans are more willing to cooperate by synchronizing their starting/ending actions in cooperative spaces when they share similar characteristics [46]. Therefore, in the synchronization experiment, a more similar agent, i.e., the human agent, led to more synchronized movement by the participant. ...

Team Synchronization and Individual Contributions in Coop-Space Fortress
  • Citing Article
  • December 2020

Proceedings of the Human Factors and Ergonomics Society Annual Meeting