Unsupervised Skill Discovery and Skill Learning in Minecraft


Pre-training Reinforcement Learning agents in a task-agnostic manner has shown promising results. However, previous works still struggle to discover and learn meaningful skills in high-dimensional state spaces, such as pixel spaces. We approach the problem by leveraging unsupervised skill discovery and self-supervised learning of state representations. In our work, we learn a compact latent representation using variational and contrastive techniques. We demonstrate that both enable RL agents to learn a set of basic navigation skills by maximizing an information-theoretic objective. We assess our method on Minecraft 3D pixel maps of different complexities. Our results show that representations and conditioned policies learned from pixels are enough for toy examples, but do not scale to realistic and complex maps. To overcome these limitations, we explore alternative input observations, such as the relative position of the agent along with the raw pixels.



Reinforcement Learning, ICML, Unsupervised Learning, Skill Discovery, Self-supervised Learning, Intrinsic Motivation, Empowerment



Reinforcement Learning (RL) [29] has witnessed a wave of outstanding works in the last decade, with a special focus on games (Schrittwieser et al. [27], Vinyals et al. [32], Berner et al. [4]), but also on robotics (Akkaya et al. [2], Hwangbo et al. [18]). In general, these works follow the classic RL paradigm in which an agent interacts with an environment by performing actions and receives rewards in response. These agents are optimized to maximize the expected sum of future rewards.



Rewards are usually handcrafted or overparametrized, and this becomes a bottleneck that prevents RL from scaling. For this reason, there has been increasing interest in recent years in training agents in a task-agnostic manner, making use of intrinsic motivations and unsupervised techniques. Recent works have explored the unsupervised learning paradigm (Campos et al. [7], Gregor et al. [16], Eysenbach et al. [13], Warde-Farley et al. [33], Burda et al. [6], Pathak et al. [24]), but RL is still far from the remarkable results obtained in other domains. For instance, in computer vision, Chen et al. [10] achieve 81% accuracy on ImageNet with self-supervised training, and Caron et al. [9] achieve state-of-the-art results in image and video object segmentation using Vision Transformers [12] and no labels at all. Also, in natural language processing, pre-trained language models such as GPT-3 [5] have become the basis for other downstream tasks.



Humans and animals are sometimes guided through the process of learning. We have good priors that allow us to properly explore our surroundings, which leads to discovering new skills. For machines, learning skills in a task-agnostic manner has proved to be challenging [33, 21]. These works state that training pixel-based RL agents end-to-end is not efficient, because learning a good state representation is unfeasible given the high dimensionality of the observations. Moreover, most successes in RL come from training agents for thousands of simulated years (Berner et al. [4]) or millions of games (Vinyals et al. [32]). This learning approach is very sample-inefficient, and its high computational budget sometimes limits research. As a response, benchmarks such as MineRL (Guss et al. [17]) and the ProcGen Benchmark (Cobbe et al. [11]) have been proposed to promote the development of algorithms that reduce the number of samples needed to solve complex tasks.



Our work is inspired by Campos et al. [7] and their Explore, Discover and Learn (EDL) paradigm. EDL relies on empowerment [25] for motivating an agent intrinsically. Empowerment aims to maximize the influence of the agent over the environment while discovering novel skills. As stated by Salge et al. [25], this can be achieved by maximizing the mutual information between sequences of actions and final states. Gregor et al. [16] introduce a novel approach where, instead of blindly committing to a sequence of actions, each action depends on the observation from the environment. This is achieved by maximizing the mutual information between inputs and some latent variables. Campos et al. [7] embrace this approach, as we do. However, their implementation makes some assumptions that are not realistic for pixel observations. Due to the Gaussian assumption at the output of variational approaches, the intrinsic reward is computed as the reconstruction error, and in the pixel domain this metric does not necessarily match distances in the environment space. Therefore, we look for alternatives that suit our requirements: we derive a different reward from the mutual information, and we study alternatives to the variational approach.



This work focuses on learning meaningful representations, discovering skills and training latent-conditioned policies. In all cases, our methodology requires no supervision and works directly from pixel observations. Additionally, we also study the impact of extra input information in the form of position coordinates. Our proposal is tested on the MineRL [17] environment, which is based on the popular Minecraft videogame. Even though the game proposes a final goal, Minecraft is well known for the freedom it gives to players, and in fact most human players use this freedom to explore the virtual world following their intrinsic motivations. Similarly, we aim at discovering skills in Minecraft without any extrinsic reward.



We generate random trajectories in Minecraft maps that pose few exploratory challenges, and also study contrastive alternatives that exploit the temporal information throughout a trajectory. The contrastive approach aims at learning an embedding space where observations that are close in time are also close in the embedding space. A similar result can be achieved by leveraging the agent's relative position in the form of coordinates. In the latter case, the objective is to infer skills that do not rely solely on pixel resemblance, but also take into account temporal and spatial relationships.



Our final goal is to discover and learn skills that can potentially be used in broader and more complex tasks, either by transferring the policy knowledge or by using hierarchical approaches. Some works have already assessed this idea, especially in robotics [14] and 2D games [8]. Once the pre-training stage is completed and the agent has learned some basic behaviours or skills, the agent is exposed to an extrinsic reward. These works show how agents leverage the skill knowledge to learn much faster and to explore the environment properly in unrelated downstream tasks. However, transferring policy knowledge is not as straightforward as in other deep learning tasks: if one wants to transfer behaviours (policies), the change in the task might lead to catastrophic forgetting.



Our contributions are the following:



• We demonstrate that variational techniques are not the only ones capable of maximizing the mutual information between inputs and latent variables: contrastive techniques can do so as well.



• We provide alternatives for discovering and learning skills in procedurally generated maps by leveraging the agent's coordinate information.



• We successfully implement the reverse form of the mutual information for optimizing pixel-based agents in a complex 3D environment.



Intrinsic Motivations (IM) are very helpful mechanisms to deal with sparse rewards. In some environments the extrinsic rewards are very difficult to obtain and, therefore, the agent does not receive any feedback to make progress. In order to drive the learning process without supervision, we can derive intrinsic motivations as proxy rewards that guide the agents towards the extrinsic reward or simply towards better exploration.



Skill Discovery. We relate Intrinsic Motivations to the concept of empowerment [25], an RL paradigm in which the agent looks for the states where it has the most control over the environment. Mohamed and Rezende [22] derived a variational lower bound on the mutual information that allows empowerment to be maximized. Skill discovery extends this idea from single actions to temporally-extended actions. Florensa et al. [14] merge skill discovery and hierarchical architectures: they learn a high-level policy on top of basic skills learned in a task-agnostic way, and show how this set-up improves exploration and enables faster training in downstream tasks. Similarly, Achiam et al. [1] focus on learning the skills dynamically with a curriculum learning approach, allowing the method to learn up to a hundred skills; instead of maximizing the mutual information between states and skills, they maximize it between whole trajectories and skills. Eysenbach et al. [13] demonstrate that learned skills can serve as an effective pre-training mechanism for robotics. Our work follows their approach regarding the use of a categorical, uniform prior over the latent variables. Campos et al. [7] expose the lack of coverage of previous works and propose Explore, Discover and Learn (EDL), a method for skill discovery that breaks the dependency on the distributions induced by the policy. Warde-Farley et al. [33] provide an algorithm for learning goal-conditioned policies using an imitator and a teacher, and demonstrate its effectiveness in pixel-based environments like Atari, the DeepMind Control Suite and DeepMind Lab.



Intrinsic curiosity. In a broader spectrum, we find methods that leverage intrinsic rewards to encourage exploratory behaviours. Pathak et al. [24] present an Intrinsic Curiosity Module that defines the curiosity reward as the error in predicting the consequences of the agent's own actions in a visual feature space. Similarly, Burda et al. [6] use a Siamese network where one encoder tries to predict the output of the other, randomly initialized one; the bonus reward is computed as the error between the prediction and the random output.



Goal-oriented RL. Many of the works dealing with skill discovery end up parameterizing a policy. This policy is usually conditioned on some goal or latent variable $z \sim Z$. The goal-conditioned policy formulation along with function approximation was introduced by Schaul et al. [26]. Hindsight Experience Replay (HER) by Andrychowicz et al. [3] allows sample-efficient learning in environments with sparse rewards. HER assumes that any state can be a goal state; leveraging this idea, the agent learns from failed trajectories as if the final state achieved had been the goal state. Trott et al. [30] propose Sibling Rivalry, which uses pairs of trajectories that encourage progress towards a goal while learning to avoid local optima. Among different environments, Sibling Rivalry is evaluated on a 3D construction task in Minecraft, where the goal of the agent is to build a structure by placing and removing blocks. Using the number of block-wise differences as a naive shaped reward causes the agent to avoid placing blocks; simply by adding Sibling Rivalry, they manage to improve the construction accuracy.



Representation Learning in RL. In RL we seek low-dimensional state representations that preserve all the information and variability of the state space in order to make decisions that eventually maximize the reward. This becomes crucial when dealing with pixel-based environments. Ghosh et al. [15] aim to capture the factors of variation that are important for decision making and aware of the dynamics of the environment, without the need for explicit reconstruction. Lee et al. [21] also make use of variational inference techniques for estimating the posterior distribution, but break the Markovian assumption by conditioning the probability of the latent variables on past observations. Oord et al. [23] leverage powerful autoregressive models and negative sampling to learn a latent space from high-dimensional data. This work introduces the InfoNCE loss, based on Noise Contrastive Estimation; the intuition is to learn a latent space that allows positive pairs to be classified correctly while discriminating them from negative samples. We make use of this loss in our contrastive experiments, as do the works discussed next. Laskin et al. [20] train an end-to-end model by performing off-policy learning on top of extracted features. These features are computed using a contrastive approach based on pairs of augmented observations, while Stooke et al. [28] pick pairs of time-delayed observations. All these works learn representations from pixel-based observations, except for the first one by Ghosh et al. [15].



3 Information-theoretic skill discovery



Intrinsic Motivations (IM) can drive the agent's learning process in the absence of extrinsic rewards. With IM, the agents do not receive any feedback from the environment, but must autonomously learn the possibilities available in it. To achieve that, the agents aim to gain resources and influence over what can be done in the environment. In the empowerment [25] framework, an agent looks for the state in which it has the most control over the environment. This concept usually deals with simple actions. In contrast, skill discovery temporally abstracts these simple actions to create high-level actions dubbed skills or options. Skill discovery is formulated as maximizing the mutual information between inputs and skills. This encourages the agent to learn skills that produce input sequences as different as possible, while avoiding overlap between sequences guided by different skills. Therefore, skill discovery can ease exploration in complex downstream tasks: instead of executing random actions, the agent can take advantage of the learned behaviours to perform smarter moves towards states with potential extrinsic rewards.



In the next section we formulate the mathematical framework used throughout our work. First, we define the classic Markov decision process that typically provides the mathematical formulation of reinforcement learning tasks. Later on, we introduce the tools from information theory that allow us to define the maximization of the mutual information.



3.1 Preliminaries



Let us consider a Markov decision process (MDP) without rewards, $\mathcal{M}=(\mathcal{S},\mathcal{A},\mathcal{P})$, where $\mathcal{S}$ is the high-dimensional state space (pixel images), $\mathcal{A}$ is the set of actions available in the environment, and $\mathcal{P}$ defines the transition probability $p(s_{t+1}|s_t,a)$. We learn latent-conditioned policies $\pi(a|s,z)$, where the latent $z\in\mathcal{Z}$ is a random variable.



Given the property of symmetry, the mutual information ($\mathcal{I}$) can be defined using the Shannon entropy ($\mathcal{H}$) in two ways (Gregor et al. [16]):



$$
\begin{aligned}
\mathcal{I}(S,Z) &= \mathcal{H}(Z) - \mathcal{H}(Z|S) &&\rightarrow\ \text{reverse} \\
                 &= \mathcal{H}(S) - \mathcal{H}(S|Z) &&\rightarrow\ \text{forward}
\end{aligned}
\qquad (1)
$$



In particular, we derive the reverse form of the mutual information (MI):



$$\mathcal{I}(S,Z) = \mathbb{E}_{s,z\sim p(z,s)}[\log p(z|s)] - \mathbb{E}_{z\sim p(z)}[\log p(z)] \qquad (2)$$



The posterior $p(z|s)$ is unknown due to the complexity of marginalizing the evidence $p(s)=\int p(s|z)p(z)\,dz$. Hence, we approximate it with $q_\phi(z|s)$ by using variational inference or contrastive techniques. As a result, we aim at maximizing a lower bound based on the KL divergence. For a detailed derivation we refer the reader to the original work by Mohamed and Rezende [22].



Therefore, the MI lower bound becomes:



$$\mathcal{I}(S,Z) \geq \mathbb{E}_{s,z\sim p(z,s)}[\log q_\phi(z|s)] - \mathbb{E}_{z\sim p(z)}[\log p(z)] \qquad (3)$$



The prior $p(z)$ can either be learned through the optimization process [16] or fixed uniformly beforehand [13]. In our case, since we want to maximize the uncertainty over $p(z)$, we define a categorical uniform distribution.
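To make the objective concrete, the bound in Eq. 3 can be estimated from samples. The following is a minimal sketch, assuming a categorical uniform prior over $K$ skills and an approximate posterior that returns log-probabilities; all names are illustrative.

```python
import numpy as np

def mi_lower_bound(log_q_z_given_s: np.ndarray, z: np.ndarray, num_skills: int) -> float:
    """Estimate the lower bound of Eq. 3 from a batch of (state, skill) samples.

    log_q_z_given_s: (B, K) log-probabilities from the approximate posterior q_phi(z|s).
    z: (B,) index of the skill that generated each state.
    """
    # E_{s,z ~ p(s,z)}[log q_phi(z|s)]: pick the log-probability of the true skill.
    posterior_term = log_q_z_given_s[np.arange(len(z)), z].mean()
    # E_{z ~ p(z)}[log p(z)] is a constant for a uniform prior: log(1/K).
    prior_term = -np.log(num_skills)
    return float(posterior_term - prior_term)
```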



3.1.1 Variational Inference



Variational inference is a well-known method for modelling posterior distributions. Its main advantage is that it avoids computing the marginal $p(s)$, which is usually intractable. Instead, one selects some tractable family of distributions $q$ as an approximation of $p$.



$$p(z|s) \approx q(z|\theta) \qquad (4)$$



We fit $q$ with sample data to learn the distribution parameters $\theta$. In particular, we make use of Variational Auto-Encoders (VAE) (Kingma and Welling [19]) with categorical latent variables, namely VQ-VAE [31], for modelling the mappings from inputs to latents and back. The encoder $q_\phi(z|s)$ models $p(z|s)$ and the decoder $q_\psi(s|z)$ models $p(s|z)$.
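As an illustration of this bottleneck, the following is a minimal PyTorch sketch of the VQ-VAE quantization step, where each encoder output is replaced by its nearest codebook embedding; the full encoder/decoder and the codebook and commitment losses of VQ-VAE [31] are omitted, and all names are illustrative.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes: int, code_dim: int):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)

    def forward(self, z_e: torch.Tensor):
        # z_e: (B, D) encoder outputs; compute distances to every codebook vector.
        distances = torch.cdist(z_e, self.codebook.weight)   # (B, K)
        codes = distances.argmin(dim=1)                       # hard assignment, i.e. q(z|s)
        z_q = self.codebook(codes)                            # quantized latents
        # Straight-through estimator so gradients flow back to the encoder.
        z_q = z_e + (z_q - z_e).detach()
        return z_q, codes
```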



3.1.2 Contrastive Learning



Contrastive learning is a subtype of self-supervised learning that consists of learning representations from data by comparison. This field has evolved quickly in recent years, especially in computer vision, where the comparisons are between pairs of images. Positive pairs may be augmented versions of a given image (crops, rotations, color transformations, etc.), while the negative pairs may be the other images from the dataset. In this case, representations are learned by training deep neural networks to distinguish between positive and negative pairs of observations. The main difference between contrastive self-supervised (or unsupervised) learning and metric learning is that the former does not require any human annotation, while the latter does. Learning representations with no need for labeled data allows scaling up the process and overcoming the main bottleneck when training deep neural networks: the scarcity of supervisory signals.



Similar to what is done with variational techniques, we can maximize the mutual information between inputs and latents. In this case, instead of learning the latents through the reconstruction error, they are learned by modeling global features shared between positive pairs and absent in negative pairs of images. Intuitively, two distinct augmentations of the same image (a positive pair) should be closer in the embedding space than two distinct images from the dataset (a negative pair). Therefore, we need two encoders in parallel instead of an encoder-decoder architecture. The second encoder is usually called the momentum encoder (Chen et al. [10]), and its weights are updated using an exponential moving average of the main encoder's weights.



In each training step, we forward a batch of original images through the main encoder, while we forward the positive pair of each of those images through the momentum encoder. In a batch of $N$ samples, each positive pair has $N-1$ negative pairs. We define $z$ as the output of the main encoder and $z'$ as the output of the momentum encoder. Here, $e$ is a convolutional encoder and $h$ is a projection head (a small multi-layer perceptron) that returns a latent $z$, i.e., $z = h(e_\theta(s))$.
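A minimal sketch of the momentum update, assuming both encoders share the same architecture; the coefficient value and function name are illustrative.

```python
import torch

@torch.no_grad()
def momentum_update(encoder, momentum_encoder, m: float = 0.99):
    """EMA update: the momentum encoder slowly tracks the main encoder and receives no gradients."""
    for p, p_m in zip(encoder.parameters(), momentum_encoder.parameters()):
        p_m.data.mul_(m).add_(p.data, alpha=1.0 - m)
```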



At the output of the network, we can compute a categorical cross-entropy loss where the correct classes are the positive pairs in the batch. The correct classes lie on the diagonal of the resulting matrix $z^{T}Wz'$, where $W$ is a projection matrix learned during training; the other positions contain the similarities between negative pairs in the embedding space. Minimizing this loss, known as InfoNCE [23], encourages the model to find global features shared by augmented versions of an image. Moreover, as stated by Oord et al. [23], minimizing the InfoNCE loss (Eq. 5) maximizes the mutual information between inputs and latents. As a result, we learn the desired mapping from input states $s$ to latent vectors $z$.



$$\mathcal{L}_{\text{InfoNCE}} = -\log\frac{\exp(z^{T}Wz')}{\exp(z^{T}Wz') + \sum_{i=0}^{K-1}\exp(z^{T}Wz_i')} \qquad (5)$$
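In practice, Eq. 5 can be written as a cross-entropy whose correct class is the diagonal of the similarity matrix. A minimal PyTorch sketch with illustrative names, assuming $z$ and $z'$ come from the main and momentum encoders respectively:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z: torch.Tensor, z_prime: torch.Tensor, W: torch.Tensor) -> torch.Tensor:
    """InfoNCE with a bilinear similarity z^T W z' (Eq. 5).

    z, z_prime: (B, D) latents; W: (D, D) learned projection matrix.
    The positive pair of each sample lies on the diagonal of the logits matrix;
    every other element of the batch acts as a negative.
    """
    logits = z @ W @ z_prime.t()                       # (B, B) similarity matrix
    labels = torch.arange(z.size(0), device=z.device)  # correct class = diagonal
    return F.cross_entropy(logits, labels)
```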



In this section we describe the implementation details adopted for each stage of our method.



4.1 Exploration



EDL (Explore, Discover and Learn) [7] provides empirical and theoretical analyses of the lack of coverage of some methods that leverage the mutual information for discovering skills. Whether using the forward or the reverse form of the MI, the reward given to novel states is always smaller than the one given to known states. The main problem lies in the induced state distributions used to maximize the mutual information: most works reinforce already discovered behaviours, since they induce the state distribution from a random policy, $p(s)\approx\rho_\pi(s)=\mathbb{E}_z[\rho_\pi(s|z)]$. EDL is agnostic to how $p(s)$ is obtained, so we can infer it in very different ways. One option is to induce it from a dataset of expert trajectories, which encodes human priors that are usually learned poorly with information-theoretic objectives. Another is to use an exploratory policy that induces a uniform distribution over the state space. Since this is not yet solved in complex 3D environments, we explore with random policies in bounded maps.
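A gym-style sketch of this exploration stage, collecting random trajectories in a bounded map; the environment id, the `pov` observation key, and the classic `reset`/`step` interface are assumptions about the MineRL set-up rather than a definitive implementation.

```python
import gym
# import minerl  # registering MineRL environments typically requires this import

def collect_random_trajectories(env_id: str, episodes: int, max_steps: int):
    """Roll out a random policy and store the pixel observations of each trajectory."""
    env = gym.make(env_id)
    dataset = []
    for _ in range(episodes):
        obs = env.reset()
        trajectory = [obs["pov"]]
        for _ in range(max_steps):
            action = env.action_space.sample()   # random exploratory policy
            obs, _, done, _ = env.step(action)
            trajectory.append(obs["pov"])        # keep only the raw pixels
            if done:
                break
        dataset.append(trajectory)
    return dataset
```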



4.2 Skill-Discovery



In Section 3, we proposed two distinct approaches for modelling the mapping from the observations $s$ to the latent variables $z$: variational inference and contrastive learning. The skill-discovery pipelines for both approaches are depicted in Figure 1 and described in the remainder of this section.



4.2.1 Variational Inference



Vector-Quantized VAE (VQ-VAE) [31] is a variational model with a categorical distribution $p(z)$ as bottleneck (also called codebook), an encoder $q_\phi$ that estimates the posterior $p(z|s)$, and a decoder $q_\psi$ that estimates $p(s|z)$.



Before training, the model requires fixing the length of the codebook, which determines the granularity of the latent variables. If we choose a large number of codes, we will end up with latent variables that encode very similar states. Instead, if we choose a small number, we will encourage the model to find latents that generalize across diverse scenarios. The perplexity metric measures the number of codewords needed to encode our whole state distribution; in practice, it is computed per batch during training. Since it is not possible to know beforehand the number of useful latent variables that our model will discover, one can iterate over different codebook lengths until finding a good trade-off between generalization and granularity.
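The perplexity mentioned above can be computed per batch as the exponential of the entropy of the empirical code usage: a value close to the codebook length means all codes are in use, while a value close to 1 signals codebook collapse. A minimal sketch with illustrative names:

```python
import torch

def codebook_perplexity(codes: torch.Tensor, num_codes: int) -> torch.Tensor:
    """codes: (B,) integer code assignments of a batch."""
    counts = torch.bincount(codes, minlength=num_codes).float()
    probs = counts / counts.sum()
    entropy = -(probs * torch.log(probs + 1e-10)).sum()
    return torch.exp(entropy)  # approximate number of codes effectively in use
```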



In EDL [7], the purpose of training a variational auto-encoder was to discover the latent variables that condition the policies in the Learning stage, but the learned representations were discarded. In our case, we also leverage the representations learned by the encoder and decoder for training the RL agent, which allows faster and more efficient training.



4.2.2 Contrastive Learning



In the contrastive case, we follow the idea of Stooke et al. [28], where the positive pairs are time-delayed observations from the same trajectory. Once the latents are learned, we define a categorical distribution over latents by clustering the embedding space of the image representations [34]. This step is performed using K-Means with a number of clusters $K$ equal to the length of the VQ-VAE codebook.
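A minimal sketch of this clustering step, assuming the exploration dataset has already been encoded into contrastive embeddings; function and variable names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def discover_skills(embeddings: np.ndarray, num_skills: int) -> KMeans:
    """Cluster the contrastive embedding space into a categorical skill space.

    embeddings: (N, D) latents of the exploration dataset.
    The resulting centroids play the same role as the VQ-VAE codebook vectors.
    """
    kmeans = KMeans(n_clusters=num_skills, n_init=10, random_state=0)
    kmeans.fit(embeddings)
    return kmeans
```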



At this point, we have the same set-up for both the contrastive and variational approaches. The K-Means centroids are equivalent to the VQ-VAE codebook embeddings, so we can maximize the mutual information between the categorical latents and the inputs with the same Equation 2.



4.3 Skill-Learning



In the last stage of the process, we aim to train a policy $\pi(a|s,z)$ that maximizes the mutual information between inputs and the discovered latent variables. At the start of each training episode, we sample a latent variable $z\sim p(z)$. Then, at each step, its embedding is concatenated with the embedding of the encoded observation at timestep $t$ and forwarded through the network. This process is depicted in Figure 2. At this stage, the latent variables are interpreted as navigation goals: the agent is encouraged to visit those states that bring it closer to the navigation goal or latent variable. Since each latent variable encodes a different part of the state distribution, this results in covering different regions for each $z$. Our method adopts the reverse form of the mutual information. Since we fix $p(z)$ as a uniform distribution, we can remove the constant $\log p(z)$ term from Equation 3, which results in the reward of Equation 7 for the variational case and Equation 8 for the contrastive case. The agent receives a reward of 1 only if the embedding conditioning the policy is the closest one to the encoded observation.



$$r(s,z) = q_\phi(z|s) \qquad (6)$$
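As a rough illustration of this reward, the sketch below returns 1 only when the skill embedding conditioning the policy is the nearest one to the encoded observation; the same rule applies whether the skill embeddings are VQ-VAE codebook vectors or K-Means centroids. Names are illustrative.

```python
import numpy as np

def skill_reward(obs_embedding: np.ndarray, skill_embeddings: np.ndarray, z: int) -> float:
    """obs_embedding: (D,) encoded observation; skill_embeddings: (K, D); z: active skill index."""
    distances = np.linalg.norm(skill_embeddings - obs_embedding, axis=1)
    return 1.0 if int(distances.argmin()) == z else 0.0
```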



$q_\phi(z=k|s) =$