This project looks at extending single-agent skill discovery/learning frameworks (e.g., DADS, DIAYN) to multi-agent reinforcement learning (MARL), allowing a team of agents to learn a library of cooperative strategies (i.e., team-level coordination skills). An example is shown above with SMAC (StarCraft Multi-Agent Challenge), where agents have learned to defeat their enemies in different ways.
Single-agent skill research focuses on learning representations of the environment, ‘primitives’ that enable efficient learning of downstream tasks. This is a form of pre-training whose aim is to learn useful high-level policies, ‘skills’ (or ‘options’), in the form of latent embeddings that condition the agent’s policy to yield temporally extended behaviors, which can then be sequenced to complete a task. Skills are often learned in an unsupervised, reward-free setting, by designing an intrinsic reward instead of relying on the environment reward (a minimal sketch of this setup is given below).
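To make this concrete, here is a minimal, illustrative sketch of a skill-conditioned policy and a DIAYN-style intrinsic reward. It is not the project's implementation; the class names, network sizes, and the choice of a discrete skill library are assumptions made purely for the example.

```python
# Hedged sketch: DIAYN-style skill conditioning and intrinsic reward.
# All names and hyperparameters below are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

N_SKILLS = 8              # size of the (assumed discrete) skill library
OBS_DIM, ACT_DIM = 32, 5  # placeholder dimensions

class SkillPolicy(nn.Module):
    """Policy conditioned on the observation and a one-hot skill z."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(OBS_DIM + N_SKILLS, 128), nn.ReLU(),
            nn.Linear(128, ACT_DIM))

    def forward(self, obs, skill_onehot):
        return torch.distributions.Categorical(
            logits=self.net(torch.cat([obs, skill_onehot], dim=-1)))

class SkillDiscriminator(nn.Module):
    """q(z | s): infers which skill produced the visited state."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(OBS_DIM, 128), nn.ReLU(), nn.Linear(128, N_SKILLS))

    def forward(self, obs):
        return self.net(obs)

def intrinsic_reward(discriminator, obs, skill_idx):
    """DIAYN-style reward: log q(z|s) - log p(z), with a uniform prior p(z)."""
    with torch.no_grad():
        log_q = F.log_softmax(discriminator(obs), dim=-1)
        log_p = -torch.log(torch.tensor(float(N_SKILLS)))
        return log_q.gather(-1, skill_idx.unsqueeze(-1)).squeeze(-1) - log_p
```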
Skills enable temporal abstraction, allowing a learning agent to identify, reason about, and sequence high-level milestones, which alleviates the problem of delayed rewards in complex, long-horizon reinforcement learning tasks. A natural extension of this is model-predictive control (MPC), as implemented in Dynamics-Aware Unsupervised Discovery of Skills (DADS): a learned world model predicts the short-term effect of executing each skill from the current state, letting the agent select a skill to enact for a given duration (a simple planning sketch follows).
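The sketch below shows the general shape of such MPC-style skill selection over a learned skill-dynamics model, in the spirit of DADS but not its exact planner; `model_step`, `task_reward`, and the horizon are illustrative assumptions.

```python
# Hedged sketch of skill selection via a learned skill-dynamics model.
def select_skill(model_step, task_reward, state, skills, horizon=10):
    """Pick the skill whose predicted rollout maximizes cumulative task reward.

    model_step(state, skill) -> predicted next state (learned skill dynamics)
    task_reward(state)       -> scalar reward under the downstream task
    """
    best_skill, best_return = None, float("-inf")
    for z in skills:
        s, ret = state, 0.0
        for _ in range(horizon):
            s = model_step(s, z)          # predicted effect of enacting skill z
            ret += task_reward(s)
        if ret > best_return:
            best_skill, best_return = z, ret
    return best_skill
```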
Our work aims to allow a team of agents to learn a library of cooperative strategies (i.e., team-level coordination skills) that are selected using a learned world model and executed through a common, decentralized agent policy conditioned on the chosen strategy. Specifically, we propose to learn a long-term world model that predicts, from the current joint observation of all agents, a high-level representation of the system state after executing a given strategy for several steps. Given a task, the team can then rely on this world model to select and update its strategy, upon which each agent conditions its policy to output individual, cooperative actions. This world model also lets us derive a ‘predictability/diversity’ intrinsic reward, similar to DADS, which serves as the intrinsic reward for training the action model (a rough sketch is given below).
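As a rough illustration of this idea, the following sketch computes a DADS-style predictability/diversity intrinsic reward from a team-level world model q(h' | joint observation, strategy). It is not the project's final design: the names (`WorldModel`), the Gaussian strategy prior, the dimensions, and the number of prior samples are all assumptions made for the example.

```python
# Hedged sketch: DADS-style intrinsic reward from a long-term, team-level
# world model. All names, dimensions, and the Gaussian strategy prior are
# illustrative assumptions.
import torch
import torch.nn as nn

JOINT_OBS_DIM, STRATEGY_DIM, HIGH_STATE_DIM = 64, 8, 16
N_PRIOR_SAMPLES = 32   # strategies sampled from the prior for the denominator

class WorldModel(nn.Module):
    """Predicts a Gaussian over the high-level state reached after executing a
    strategy for several steps, given the current joint observation."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(JOINT_OBS_DIM + STRATEGY_DIM, 256), nn.ReLU(),
            nn.Linear(256, 2 * HIGH_STATE_DIM))

    def forward(self, joint_obs, strategy):
        mean, log_std = self.net(
            torch.cat([joint_obs, strategy], dim=-1)).chunk(2, dim=-1)
        return torch.distributions.Normal(mean, log_std.exp())

def intrinsic_reward(model, joint_obs, strategy, next_high_state):
    """r = log q(h'|o, z) - log (1/L) sum_i q(h'|o, z_i), with z_i ~ prior."""
    with torch.no_grad():
        log_p = model(joint_obs, strategy).log_prob(next_high_state).sum(-1)
        # Diversity term: compare against strategies sampled from the prior.
        alt = torch.randn(N_PRIOR_SAMPLES, joint_obs.shape[0], STRATEGY_DIM)
        alt_log_p = torch.stack([
            model(joint_obs, z).log_prob(next_high_state).sum(-1) for z in alt])
        return log_p - (torch.logsumexp(alt_log_p, dim=0)
                        - torch.log(torch.tensor(float(N_PRIOR_SAMPLES))))
```

Strategies that lead to high-level outcomes the world model can predict well, yet that differ from the outcomes of other strategies, receive high reward, mirroring the predictability/diversity trade-off in DADS.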
Related publications: