procgen | Procgen Benchmark: Procedurally-Generated Game Environments | Reinforcement Learning library

by openai | C++ | Version: 0.10.7 | License: MIT

kandi X-RAY | procgen Summary

procgen is a C++ library typically used in Artificial Intelligence and Reinforcement Learning applications. procgen has no bugs and no reported vulnerabilities, it has a Permissive License, and it has medium support. You can download it from GitHub.
16 simple-to-use procedurally-generated gym environments which provide a direct measure of how quickly a reinforcement learning agent learns generalizable skills. The environments run at high speed (thousands of steps per second) on a single core.

Support

procgen has a medium active ecosystem.
It has 828 stars, 175 forks, and 114 watchers.
It had no major release in the last 12 months.
There are 10 open issues and 56 closed issues. On average, issues are closed in 39 days. There are 2 open pull requests and 0 closed pull requests.
It has a neutral sentiment in the developer community.
The latest version of procgen is 0.10.7.

Quality

procgen has 0 bugs and 0 code smells.

Security

procgen has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.
procgen code analysis shows 0 unresolved vulnerabilities.
There are 0 security hotspots that need review.

License

procgen is licensed under the MIT License. This license is Permissive.
Permissive licenses have the least restrictions, and you can use them in most projects.

Reuse

procgen releases are available to install and integrate.
Installation instructions, examples and code snippets are available.
It has 970 lines of code, 48 functions and 17 files.
It has high code complexity. Code complexity directly impacts maintainability of the code.

                                                                                  procgen Key Features

                                                                                  Faster: Gym Retro environments are already fast, but Procgen environments can run >4x faster.
                                                                                  Randomized: Gym Retro environments are always the same, so you can memorize a sequence of actions that will get the highest reward. Procgen environments are randomized so this is not possible.
                                                                                  Customizable: If you install from source, you can perform experiments where you change the environments, or build your own environments. The environment-specific code for each environment is often less than 300 lines. This is almost impossible with Gym Retro.
Supported platforms: Windows 10; macOS 10.14 (Mojave), 10.15 (Catalina); Linux (manylinux2010)
Supported Python versions: 3.7, 3.8, 3.9, 3.10 (64-bit only)
CPU requirement: must support at least AVX

                                                                                  procgen Examples and Code Snippets

                                                                                  No Code Snippets are available at this moment for procgen.
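As a starting point, here is a minimal usage sketch (not taken from the kandi page); it assumes the procgen pip package and gym are installed, and that the environment id pattern procgen:procgen-<name>-v0 and the start_level/num_levels keyword arguments follow the project README.

import gym

# Create one procedurally-generated CoinRun environment.
# num_levels=0 means an unbounded set of levels; start_level seeds the level generator.
env = gym.make("procgen:procgen-coinrun-v0", start_level=0, num_levels=0)

obs = env.reset()
total_reward = 0.0
for _ in range(1000):
    # Act randomly here; in practice, substitute a trained policy's action.
    obs, reward, done, info = env.step(env.action_space.sample())
    total_reward += reward
    if done:
        obs = env.reset()
env.close()
print("Random-policy reward over 1000 steps:", total_reward)
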
                                                                                  Community Discussions

                                                                                  Trending Discussions on Reinforcement Learning

Keras: AttributeError: 'Adam' object has no attribute '_name'
What are vectorized environments in reinforcement learning?
How does a gradient backpropagates through random samples?
Relationship of Horizon and Discount factor in Reinforcement Learning
OpenAI-Gym and Keras-RL: DQN expects a model that has one dimension for each action
gym package not identifying ten-armed-bandits-v0 env
ValueError: Input 0 of layer "max_pooling2d" is incompatible with the layer: expected ndim=4, found ndim=5. Full shape received: (None, 3, 51, 39, 32)
Stablebaselines3 logging reward with custom gym
What is the purpose of [np.arange(0, self.batch_size), action] after the neural network?
DQN predicts same action value for every state (cart pole)

                                                                                  QUESTION

                                                                                  Keras: AttributeError: 'Adam' object has no attribute '_name'
                                                                                  Asked 2022-Apr-16 at 15:05

I want to compile my DQN agent, but I get the error AttributeError: 'Adam' object has no attribute '_name':

                                                                                  DQN = buildAgent(model, actions)
                                                                                  DQN.compile(Adam(lr=1e-3), metrics=['mae'])
                                                                                  

I tried adding a fake _name, but it doesn't work. I'm following a tutorial and the same code works on the tutor's machine, so it's probably due to a recent update. How can I fix this?

                                                                                  Here is my full code:

                                                                                  from keras.layers import Dense, Flatten
                                                                                  import gym
                                                                                  from keras.optimizer_v1 import Adam
                                                                                  from rl.agents.dqn import DQNAgent
                                                                                  from rl.policy import BoltzmannQPolicy
                                                                                  from rl.memory import SequentialMemory
                                                                                  
                                                                                  env = gym.make('CartPole-v0')
                                                                                  states = env.observation_space.shape[0]
                                                                                  actions = env.action_space.n
                                                                                  
                                                                                  episodes = 10
                                                                                  
                                                                                  def buildModel(statez, actiones):
                                                                                      model = Sequential()
                                                                                      model.add(Flatten(input_shape=(1, statez)))
                                                                                      model.add(Dense(24, activation='relu'))
                                                                                      model.add(Dense(24, activation='relu'))
                                                                                      model.add(Dense(actiones, activation='linear'))
                                                                                      return model
                                                                                  
                                                                                  model = buildModel(states, actions)
                                                                                  
                                                                                  def buildAgent(modell, actionz):
                                                                                      policy = BoltzmannQPolicy()
                                                                                      memory = SequentialMemory(limit=50000, window_length=1)
                                                                                      dqn = DQNAgent(model=modell, memory=memory, policy=policy, nb_actions=actionz, nb_steps_warmup=10, target_model_update=1e-2)
                                                                                      return dqn
                                                                                  
                                                                                  DQN = buildAgent(model, actions)
                                                                                  DQN.compile(Adam(lr=1e-3), metrics=['mae'])
                                                                                  DQN.fit(env, nb_steps=50000, visualize=False, verbose=1)
                                                                                  

                                                                                  ANSWER

                                                                                  Answered 2022-Apr-16 at 15:05

Your error comes from importing Adam with from keras.optimizer_v1 import Adam. You can solve the problem by using tf.keras.optimizers.Adam from TensorFlow >= 2, as below:

(The lr argument is deprecated; it's better to use learning_rate instead.)

                                                                                  # !pip install keras-rl2
                                                                                  import tensorflow as tf
                                                                                  from keras.layers import Dense, Flatten
                                                                                  import gym
                                                                                  from rl.agents.dqn import DQNAgent
                                                                                  from rl.policy import BoltzmannQPolicy
                                                                                  from rl.memory import SequentialMemory
                                                                                  
                                                                                  env = gym.make('CartPole-v0')
                                                                                  states = env.observation_space.shape[0]
                                                                                  actions = env.action_space.n
                                                                                  episodes = 10
                                                                                  
                                                                                  def buildModel(statez, actiones):
                                                                                      model = tf.keras.Sequential()
                                                                                      model.add(Flatten(input_shape=(1, statez)))
                                                                                      model.add(Dense(24, activation='relu'))
                                                                                      model.add(Dense(24, activation='relu'))
                                                                                      model.add(Dense(actiones, activation='linear'))
                                                                                      return model
                                                                                  
                                                                                  def buildAgent(modell, actionz):
                                                                                      policy = BoltzmannQPolicy()
                                                                                      memory = SequentialMemory(limit=50000, window_length=1)
                                                                                      dqn = DQNAgent(model=modell, memory=memory, policy=policy, 
                                                                                                     nb_actions=actionz, nb_steps_warmup=10, 
                                                                                                     target_model_update=1e-2)
                                                                                      return dqn
                                                                                  
                                                                                  model = buildModel(states, actions)
                                                                                  DQN = buildAgent(model, actions)
                                                                                  DQN.compile(tf.keras.optimizers.Adam(learning_rate=1e-3), metrics=['mae'])
                                                                                  DQN.fit(env, nb_steps=50000, visualize=False, verbose=1)
                                                                                  

                                                                                  Source https://stackoverflow.com/questions/71894769

                                                                                  QUESTION

                                                                                  What are vectorized environments in reinforcement learning?
                                                                                  Asked 2022-Mar-25 at 10:37

I'm having a hard time wrapping my head around what vectorized environments are and when they should be used. If you can provide an example of a use case, that would be great.

                                                                                  Documentation of vectorized environments in SB3: https://stable-baselines3.readthedocs.io/en/master/guide/vec_envs.html

                                                                                  ANSWER

                                                                                  Answered 2022-Mar-25 at 10:37

Vectorized Environments are a method for stacking multiple independent environments into a single environment. Instead of executing and training an agent on one environment per step, it allows the agent to be trained on multiple environments per step.

Usually you also want these environments to have different seeds, in order to gain more diverse experience. This is very useful to speed up training.

I think they are called "vectorized" because at each training step the agent observes multiple states (stacked in a vector), outputs multiple actions (one per environment, also stacked in a vector), and receives multiple rewards; hence the term "vectorized".
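For illustration, a minimal sketch using Stable-Baselines3 (assuming stable-baselines3 is installed; the env id and hyperparameters below are arbitrary choices, not from the question):

from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

# Build 4 copies of CartPole-v1 that step in lockstep, each with a different seed.
vec_env = make_vec_env("CartPole-v1", n_envs=4, seed=0)

# Each call to the vectorized env's step() returns 4 observations, 4 rewards, etc.,
# so the agent collects experience from 4 environments per training step.
model = PPO("MlpPolicy", vec_env, verbose=0)
model.learn(total_timesteps=10_000)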

                                                                                  Source https://stackoverflow.com/questions/71549439

                                                                                  QUESTION

                                                                                  How does a gradient backpropagates through random samples?
                                                                                  Asked 2022-Mar-25 at 03:06

I'm learning about policy gradients and I'm having a hard time understanding how the gradient passes through a random operation. From here: "It is not possible to directly backpropagate through random samples. However, there are two main methods for creating surrogate functions that can be backpropagated through."

                                                                                  They have an example of the score function:

                                                                                  probs = policy_network(state)
                                                                                  # Note that this is equivalent to what used to be called multinomial
                                                                                  m = Categorical(probs)
                                                                                  action = m.sample()
                                                                                  next_state, reward = env.step(action)
                                                                                  loss = -m.log_prob(action) * reward
                                                                                  loss.backward()
                                                                                  

I tried to create an example of this:

                                                                                  import torch
                                                                                  import torch.nn as nn
                                                                                  import torch.optim as optim
                                                                                  from torch.distributions import Normal
                                                                                  import matplotlib.pyplot as plt
                                                                                  from tqdm import tqdm
                                                                                  
                                                                                  softplus = torch.nn.Softplus()
                                                                                  
                                                                                  class Model_RL(nn.Module):
                                                                                      def __init__(self):
                                                                                          super(Model_RL, self).__init__()
                                                                                          self.fc1 = nn.Linear(1, 20)
                                                                                          self.fc2 = nn.Linear(20, 30)
                                                                                          self.fc3 = nn.Linear(30, 2)
                                                                                  
                                                                                      def forward(self, x):
                                                                                          x1 = self.fc1(x)
                                                                                          x = torch.relu(x1)
                                                                                          x2 = self.fc2(x)
                                                                                          x = torch.relu(x2)
                                                                                          x3 = softplus(self.fc3(x))
                                                                                          return x3, x2, x1
                                                                                  
                                                                                  # basic 
                                                                                  
                                                                                  net_RL = Model_RL()
                                                                                  
                                                                                  features = torch.tensor([1.0]) 
                                                                                  x = torch.tensor([1.0]) 
                                                                                  y = torch.tensor(3.0)
                                                                                  
                                                                                  baseline = 0
                                                                                  baseline_lr = 0.1
                                                                                  
                                                                                  epochs = 3
                                                                                  
                                                                                  opt_RL = optim.Adam(net_RL.parameters(), lr=1e-3)
                                                                                  losses = []
                                                                                  xs = []
                                                                                  for _ in tqdm(range(epochs)):
                                                                                      out_RL = net_RL(x)
                                                                                      mu, std = out_RL[0]
                                                                                      dist = Normal(mu, std)
                                                                                      print(dist)
                                                                                      a = dist.sample()
                                                                                      log_p = dist.log_prob(a)
                                                                                      
                                                                                      out = features * a
                                                                                      reward = -torch.square((y - out))
                                                                                      baseline = (1-baseline_lr)*baseline + baseline_lr*reward
                                                                                      
                                                                                      loss = -(reward-baseline)*log_p
                                                                                  
                                                                                      opt_RL.zero_grad()
                                                                                      loss.backward()
                                                                                      opt_RL.step()
                                                                                      losses.append(loss.item())
                                                                                  

This seems to magically work fine, which again I don't understand: they say the gradient can't pass through the random operation, yet somehow it does.

Now, since the gradient can't flow through the random operation, I tried to replace mu, std = out_RL[0] with mu, std = out_RL[0].detach(), and that caused the error: RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn. If the gradient doesn't pass through the random operation, I don't understand why detaching a tensor before the operation would matter.

                                                                                  ANSWER

                                                                                  Answered 2021-Nov-30 at 05:48

It is indeed true that sampling is not a differentiable operation per se. However, there exist two (broad) ways to mitigate this: [1] the REINFORCE way and [2] the reparameterization way. Since your example is related to [1], I will restrict my answer to REINFORCE.

What REINFORCE does is entirely remove the sampling operation from the computation graph; the sampling still happens, but outside the graph. So your statement

                                                                                  .. how does the gradient passes through a random operation ..

isn't correct. The gradient does not pass through any random operation. Let's look at your example:

                                                                                  mu, std = out_RL[0]
                                                                                  dist = Normal(mu, std)
                                                                                  a = dist.sample()
                                                                                  log_p = dist.log_prob(a)
                                                                                  

The computation of a does not involve creating a computation graph. It is technically equivalent to plugging in some offline data from a dataset (as in supervised learning):

                                                                                  mu, std = out_RL[0]
                                                                                  dist = Normal(mu, std)
                                                                                  # a = dist.sample()
                                                                                  a = torch.tensor([1.23, 4.01, -1.2, ...], device='cuda')
                                                                                  log_p = dist.log_prob(a)
                                                                                  

Since we don't have offline data beforehand, we create it on the fly, and the .sample() method does exactly that.

So, there is no random operation on the graph. log_p depends on mu and std deterministically, just like any standard computation graph. If you cut the connection like this

                                                                                  mu, std = out_RL[0].detach()
                                                                                  

.. of course it is going to complain.

                                                                                  Also, do not get confused by this operation

                                                                                  dist = Normal(mu, std)
                                                                                  log_p = dist.log_prob(a)
                                                                                  

as it does not contain any randomness by itself. It is merely a shortcut for writing the tedious log-likelihood formula for the Normal distribution.
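As a side note, the reparameterization way ([2] above) is what PyTorch exposes through .rsample(); a minimal sketch contrasting the two, with toy values chosen only for illustration:

import torch
from torch.distributions import Normal

mu = torch.tensor(0.0, requires_grad=True)
std = torch.tensor(1.0, requires_grad=True)
dist = Normal(mu, std)

# [1] REINFORCE: .sample() is treated as fixed data; gradients reach mu and std
#     only through log_prob(a), which is a deterministic function of them.
a = dist.sample()
reward = 1.0  # stand-in scalar reward
(-dist.log_prob(a) * reward).backward()
print(mu.grad, std.grad)

# [2] Reparameterization: .rsample() returns mu + std * eps with eps ~ N(0, 1),
#     so gradients flow through the sample itself.
mu.grad, std.grad = None, None
a = dist.rsample()
(a ** 2).backward()  # any differentiable function of the sample
print(mu.grad, std.grad)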

                                                                                  Source https://stackoverflow.com/questions/70163823

                                                                                  QUESTION

                                                                                  Relationship of Horizon and Discount factor in Reinforcement Learning
                                                                                  Asked 2022-Mar-13 at 17:50

What is the connection between the discount factor gamma and the horizon in RL?

What I have learned so far is that the horizon is the agent's time to live. Intuitively, an agent with a finite horizon will choose actions differently than one that has to live forever. In the latter case, the agent will try to maximize all the expected rewards it may get far in the future.

But the idea of the discount factor seems to be the same. Do values of gamma near zero make the horizon finite?

                                                                                  ANSWER

                                                                                  Answered 2022-Mar-13 at 17:50

                                                                                  Horizon refers to how many steps into the future the agent cares about the reward it can receive, which is a little different from the agent's time to live. In general, you could potentially define any arbitrary horizon you want as the objective. You could define a 10 step horizon, in which the agent makes a decision that will enable it to maximize the reward it will receive in the next 10 time steps. Or we could choose a 100, or 1000, or n step horizon!

Usually, the n-step horizon is defined using n = 1 / (1 - gamma). Therefore, a 10-step horizon is achieved with gamma = 0.9, while a 100-step horizon is achieved with gamma = 0.99.

Consequently, any value of gamma less than 1 implies that the horizon is finite.
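A quick sanity check of that rule of thumb (plain Python; the gamma values are chosen only for illustration):

# Effective horizon implied by a discount factor: n = 1 / (1 - gamma)
for gamma in (0.9, 0.99, 0.999):
    print(f"gamma = {gamma}: effective horizon of about {1 / (1 - gamma):.0f} steps")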

                                                                                  Source https://stackoverflow.com/questions/71459191

                                                                                  QUESTION

                                                                                  OpenAI-Gym and Keras-RL: DQN expects a model that has one dimension for each action
                                                                                  Asked 2022-Mar-02 at 10:55

I am trying to set up a Deep Q-Learning agent with a custom environment in OpenAI Gym. I have 4 continuous state variables with individual limits and 3 integer action variables with individual limits.

                                                                                  Here is the code:

                                                                                  #%% import 
                                                                                  from gym import Env
                                                                                  from gym.spaces import Discrete, Box, Tuple
                                                                                  import numpy as np
                                                                                  
                                                                                  
                                                                                  #%%
                                                                                  class Custom_Env(Env):
                                                                                  
                                                                                      def __init__(self):
                                                                                          
                                                                                         # Define the state space
                                                                                         
                                                                                         #State variables
                                                                                         self.state_1 = 0
                                                                                         self.state_2 =  0
                                                                                         self.state_3 = 0
                                                                                         self.state_4_currentTimeSlots = 0
                                                                                         
                                                                                         #Define the gym components
                                                                                         self.action_space = Box(low=np.array([0, 0, 0]), high=np.array([10, 20, 27]), dtype=np.int)    
                                                                                                                                                               
                                                                                         self.observation_space = Box(low=np.array([20, -20, 0, 0]), high=np.array([22, 250, 100, 287]),dtype=np.float16)
                                                                                  
                                                                                      def step(self, action ):
                                                                                  
                                                                                          # Update state variables
                                                                                          self.state_1 = self.state_1 + action [0]
                                                                                          self.state_2 = self.state_2 + action [1]
                                                                                          self.state_3 = self.state_3 + action [2]
                                                                                  
                                                                                          #Calculate reward
                                                                                          reward = self.state_1 + self.state_2 + self.state_3
                                                                                         
                                                                                          #Set placeholder for info
                                                                                          info = {}    
                                                                                          
                                                                                          #Check if it's the end of the day
                                                                                          if self.state_4_currentTimeSlots >= 287:
                                                                                              done = True
                                                                                          if self.state_4_currentTimeSlots < 287:
                                                                                              done = False       
                                                                                          
                                                                                          #Move to the next timeslot 
                                                                                          self.state_4_currentTimeSlots +=1
                                                                                  
                                                                                          state = np.array([self.state_1,self.state_2, self.state_3, self.state_4_currentTimeSlots ])
                                                                                  
                                                                                          #Return step information
                                                                                          return state, reward, done, info
                                                                                          
                                                                                      def render (self):
                                                                                          pass
                                                                                      
                                                                                      def reset (self):
                                                                                         self.state_1 = 0
                                                                                         self.state_2 =  0
                                                                                         self.state_3 = 0
                                                                                         self.state_4_currentTimeSlots = 0
                                                                                         state = np.array([self.state_1,self.state_2, self.state_3, self.state_4_currentTimeSlots ])
                                                                                         return state
                                                                                  
                                                                                  #%% Set up the environment
                                                                                  env = Custom_Env()
                                                                                  
                                                                                  #%% Create a deep learning model with keras
                                                                                  
                                                                                  
                                                                                  from tensorflow.keras.models import Sequential
                                                                                  from tensorflow.keras.layers import Dense, Flatten
                                                                                  from tensorflow.keras.optimizers import Adam
                                                                                  
                                                                                  def build_model(states, actions):
                                                                                      model = Sequential()
                                                                                      model.add(Dense(24, activation='relu', input_shape=states))
                                                                                      model.add(Dense(24, activation='relu'))
                                                                                      model.add(Dense(actions[0] , activation='linear'))
                                                                                      return model
                                                                                  
                                                                                  states = env.observation_space.shape 
                                                                                  actions = env.action_space.shape 
                                                                                  print("env.observation_space: ", env.observation_space)
                                                                                  print("env.observation_space.shape : ", env.observation_space.shape )
                                                                                  print("action_space: ", env.action_space)
                                                                                  print("action_space.shape : ", env.action_space.shape )
                                                                                  
                                                                                  
                                                                                  model = build_model(states, actions)
                                                                                  print(model.summary())
                                                                                  
                                                                                  #%% Build Agent wit Keras-RL
                                                                                  from rl.agents import DQNAgent
                                                                                  from rl.policy import BoltzmannQPolicy
                                                                                  from rl.memory import SequentialMemory
                                                                                  
                                                                                  def build_agent (model, actions):
                                                                                      policy = BoltzmannQPolicy()
                                                                                      memory = SequentialMemory(limit = 50000, window_length=1)
                                                                                      dqn = DQNAgent (model = model, memory = memory, policy=policy,
                                                                                                      nb_actions=actions, nb_steps_warmup=10, target_model_update= 1e-2)
                                                                                      return dqn
                                                                                  
                                                                                  dqn = build_agent(model, actions)
                                                                                  dqn.compile(Adam(lr=1e-3), metrics = ['mae'])
                                                                                  dqn.fit (env, nb_steps = 4000, visualize=False, verbose = 1)
                                                                                  

                                                                                  When I run this code I get the following error message

                                                                                  ValueError: Model output "Tensor("dense_23/BiasAdd:0", shape=(None, 3), dtype=float32)" has invalid shape. DQN expects a model that has one dimension for each action, in this case (3,).
                                                                                  

                                                                                  thrown by the line dqn = DQNAgent (model = model, memory = memory, policy=policy, nb_actions=actions, nb_steps_warmup=10, target_model_update= 1e-2)

Can anyone tell me why this problem is occurring and how to solve it? I assume it has something to do with the model and thus with the action and state spaces, but I could not figure out what exactly the problem is.

Reminder on the bounty: my bounty is expiring quite soon and, unfortunately, I still have not received an answer. If you have even a guess at how to tackle this problem, I would highly appreciate you sharing your thoughts.

                                                                                  ANSWER

                                                                                  Answered 2021-Dec-23 at 11:19

As we talked about in the comments, it seems that the Keras-rl library is no longer supported (the last update to the repository was in 2019), so it's possible that everything is inside Keras now. I took a look at the Keras documentation; there are no high-level functions to build a reinforcement learning model, but it is possible to use lower-level functions for this.

                                                                                  • Here is an example of how to use Deep Q-Learning with Keras: link

Another solution may be to downgrade to TensorFlow 1.x, as it seems the compatibility problem is caused by some changes in version 2.0. I didn't test it, but Keras-rl + TensorFlow 1.x may work.

There is also a branch of Keras-rl that supports TensorFlow 2.0; the repository is archived, but there is a chance that it will work for you.

                                                                                  Source https://stackoverflow.com/questions/70261352

                                                                                  QUESTION

                                                                                  gym package not identifying ten-armed-bandits-v0 env
                                                                                  Asked 2022-Feb-08 at 08:01

                                                                                  Environment:

                                                                                  • Python: 3.9
                                                                                  • OS: Windows 10

When I try to create the ten-armed-bandits environment using the following code, the error below is thrown, and I'm not sure of the reason.

                                                                                  import gym
                                                                                  import gym_armed_bandits
                                                                                  
                                                                                  env = gym.make('ten-armed-bandits-v0')
                                                                                  

                                                                                  The error:

                                                                                  ---------------------------------------------------------------------------
                                                                                  KeyError                                  Traceback (most recent call last)
                                                                                  File D:\00_PythonEnvironments\01_RL\lib\site-packages\gym\envs\registration.py:158, in EnvRegistry.spec(self, path)
                                                                                      157 try:
                                                                                  --> 158     return self.env_specs[id]
                                                                                      159 except KeyError:
                                                                                      160     # Parse the env name and check to see if it matches the non-version
                                                                                      161     # part of a valid env (could also check the exact number here)
                                                                                  
                                                                                  KeyError: 'ten-armed-bandits-v0'
                                                                                  
                                                                                  During handling of the above exception, another exception occurred:
                                                                                  
                                                                                  UnregisteredEnv                           Traceback (most recent call last)
                                                                                  Input In [6], in 
                                                                                  ----> 1 env = gym.make('ten-armed-bandits-v0')
                                                                                  
                                                                                  File D:\00_PythonEnvironments\01_RL\lib\site-packages\gym\envs\registration.py:235, in make(id, **kwargs)
                                                                                      234 def make(id, **kwargs):
                                                                                  --> 235     return registry.make(id, **kwargs)
                                                                                  
                                                                                  File D:\00_PythonEnvironments\01_RL\lib\site-packages\gym\envs\registration.py:128, in EnvRegistry.make(self, path, **kwargs)
                                                                                      126 else:
                                                                                      127     logger.info("Making new env: %s", path)
                                                                                  --> 128 spec = self.spec(path)
                                                                                      129 env = spec.make(**kwargs)
                                                                                      130 return env
                                                                                  
                                                                                  File D:\00_PythonEnvironments\01_RL\lib\site-packages\gym\envs\registration.py:203, in EnvRegistry.spec(self, path)
                                                                                      197     raise error.UnregisteredEnv(
                                                                                      198         "Toytext environment {} has been moved out of Gym. Install it via `pip install gym-legacy-toytext` and add `import gym_toytext` before using it.".format(
                                                                                      199             id
                                                                                      200         )
                                                                                      201     )
                                                                                      202 else:
                                                                                  --> 203     raise error.UnregisteredEnv("No registered env with id: {}".format(id))
                                                                                  
                                                                                  UnregisteredEnv: No registered env with id: ten-armed-bandits-v0
                                                                                  

When I check the available environments, I am able to see it there.

                                                                                  from gym import envs
                                                                                  print(envs.registry.all())
                                                                                  
                                                                                  dict_values([EnvSpec(CartPole-v0), EnvSpec(CartPole-v1), EnvSpec(MountainCar-v0), EnvSpec(MountainCarContinuous-v0), EnvSpec(Pendulum-v1), EnvSpec(Acrobot-v1), EnvSpec(LunarLander-v2), EnvSpec(LunarLanderContinuous-v2), EnvSpec(BipedalWalker-v3), EnvSpec(BipedalWalkerHardcore-v3), EnvSpec(CarRacing-v0), EnvSpec(Blackjack-v1), EnvSpec(FrozenLake-v1), EnvSpec(FrozenLake8x8-v1), EnvSpec(CliffWalking-v0), EnvSpec(Taxi-v3), EnvSpec(Reacher-v2), EnvSpec(Pusher-v2), EnvSpec(Thrower-v2), EnvSpec(Striker-v2), EnvSpec(InvertedPendulum-v2), EnvSpec(InvertedDoublePendulum-v2), EnvSpec(HalfCheetah-v2), EnvSpec(HalfCheetah-v3), EnvSpec(Hopper-v2), EnvSpec(Hopper-v3), EnvSpec(Swimmer-v2), EnvSpec(Swimmer-v3), EnvSpec(Walker2d-v2), EnvSpec(Walker2d-v3), EnvSpec(Ant-v2), EnvSpec(Ant-v3), EnvSpec(Humanoid-v2), EnvSpec(Humanoid-v3), EnvSpec(HumanoidStandup-v2), EnvSpec(FetchSlide-v1), EnvSpec(FetchPickAndPlace-v1), EnvSpec(FetchReach-v1), EnvSpec(FetchPush-v1), EnvSpec(HandReach-v0), EnvSpec(HandManipulateBlockRotateZ-v0), EnvSpec(HandManipulateBlockRotateZTouchSensors-v0), EnvSpec(HandManipulateBlockRotateZTouchSensors-v1), EnvSpec(HandManipulateBlockRotateParallel-v0), EnvSpec(HandManipulateBlockRotateParallelTouchSensors-v0), EnvSpec(HandManipulateBlockRotateParallelTouchSensors-v1), EnvSpec(HandManipulateBlockRotateXYZ-v0), EnvSpec(HandManipulateBlockRotateXYZTouchSensors-v0), EnvSpec(HandManipulateBlockRotateXYZTouchSensors-v1), EnvSpec(HandManipulateBlockFull-v0), EnvSpec(HandManipulateBlock-v0), EnvSpec(HandManipulateBlockTouchSensors-v0), EnvSpec(HandManipulateBlockTouchSensors-v1), EnvSpec(HandManipulateEggRotate-v0), EnvSpec(HandManipulateEggRotateTouchSensors-v0), EnvSpec(HandManipulateEggRotateTouchSensors-v1), EnvSpec(HandManipulateEggFull-v0), EnvSpec(HandManipulateEgg-v0), EnvSpec(HandManipulateEggTouchSensors-v0), EnvSpec(HandManipulateEggTouchSensors-v1), EnvSpec(HandManipulatePenRotate-v0), EnvSpec(HandManipulatePenRotateTouchSensors-v0), EnvSpec(HandManipulatePenRotateTouchSensors-v1), EnvSpec(HandManipulatePenFull-v0), EnvSpec(HandManipulatePen-v0), EnvSpec(HandManipulatePenTouchSensors-v0), EnvSpec(HandManipulatePenTouchSensors-v1), EnvSpec(FetchSlideDense-v1), EnvSpec(FetchPickAndPlaceDense-v1), EnvSpec(FetchReachDense-v1), EnvSpec(FetchPushDense-v1), EnvSpec(HandReachDense-v0), EnvSpec(HandManipulateBlockRotateZDense-v0), EnvSpec(HandManipulateBlockRotateZTouchSensorsDense-v0), EnvSpec(HandManipulateBlockRotateZTouchSensorsDense-v1), EnvSpec(HandManipulateBlockRotateParallelDense-v0), EnvSpec(HandManipulateBlockRotateParallelTouchSensorsDense-v0), EnvSpec(HandManipulateBlockRotateParallelTouchSensorsDense-v1), EnvSpec(HandManipulateBlockRotateXYZDense-v0), EnvSpec(HandManipulateBlockRotateXYZTouchSensorsDense-v0), EnvSpec(HandManipulateBlockRotateXYZTouchSensorsDense-v1), EnvSpec(HandManipulateBlockFullDense-v0), EnvSpec(HandManipulateBlockDense-v0), EnvSpec(HandManipulateBlockTouchSensorsDense-v0), EnvSpec(HandManipulateBlockTouchSensorsDense-v1), EnvSpec(HandManipulateEggRotateDense-v0), EnvSpec(HandManipulateEggRotateTouchSensorsDense-v0), EnvSpec(HandManipulateEggRotateTouchSensorsDense-v1), EnvSpec(HandManipulateEggFullDense-v0), EnvSpec(HandManipulateEggDense-v0), EnvSpec(HandManipulateEggTouchSensorsDense-v0), EnvSpec(HandManipulateEggTouchSensorsDense-v1), EnvSpec(HandManipulatePenRotateDense-v0), EnvSpec(HandManipulatePenRotateTouchSensorsDense-v0), 
EnvSpec(HandManipulatePenRotateTouchSensorsDense-v1), EnvSpec(HandManipulatePenFullDense-v0), EnvSpec(HandManipulatePenDense-v0), EnvSpec(HandManipulatePenTouchSensorsDense-v0), EnvSpec(HandManipulatePenTouchSensorsDense-v1), EnvSpec(CubeCrash-v0), EnvSpec(CubeCrashSparse-v0), EnvSpec(CubeCrashScreenBecomesBlack-v0), EnvSpec(MemorizeDigits-v0), EnvSpec(three-armed-bandits-v0), EnvSpec(five-armed-bandits-v0), EnvSpec(ten-armed-bandits-v0), EnvSpec(MultiarmedBandits-v0)])
                                                                                  

                                                                                  ANSWER

                                                                                  Answered 2022-Feb-08 at 08:01

It could be a problem with your Python version: the k-armed-bandits library was made 4 years ago, when Python 3.9 didn't exist. Besides this, the configuration files in the repo indicate that the Python version is 2.7 (not 3.9).

If you create an environment with Python 2.7 and follow the setup instructions, it works correctly on Windows:

git clone <URL of the gym_armed_bandits repository>
cd gym_armed_bandits
pip install -e .
                                                                                  

                                                                                  Source https://stackoverflow.com/questions/70858340

                                                                                  QUESTION

                                                                                  ValueError: Input 0 of layer "max_pooling2d" is incompatible with the layer: expected ndim=4, found ndim=5. Full shape received: (None, 3, 51, 39, 32)
                                                                                  Asked 2022-Feb-01 at 07:31

I have two different problems occurring at the same time.

I am having a dimensionality problem with MaxPooling2D and a similar dimensionality problem with DQNAgent.

The thing is, I can fix each of them separately, but not both at the same time.

                                                                                  First Problem

I am trying to build a CNN with several layers. After I build my model, when I try to run it, it gives me an error.

                                                                                  !pip install PyOpenGL==3.1.* PyOpenGL-accelerate==3.1.*
                                                                                  !pip install tensorflow gym keras-rl2 gym[atari] keras pyvirtualdisplay 
                                                                                  
import gym
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten, Convolution2D, MaxPooling2D, Activation
from keras_visualizer import visualizer
from tensorflow.keras.optimizers import Adam
                                                                                  
                                                                                  env = gym.make('Boxing-v0')
                                                                                  height, width, channels = env.observation_space.shape
                                                                                  actions = env.action_space.n
                                                                                  
                                                                                  input_shape = (3, 210, 160, 3)   ## input_shape = (batch_size, height, width, channels)
                                                                                  
                                                                                  def build_model(height, width, channels, actions):
                                                                                    model = Sequential()
                                                                                    model.add(Convolution2D(32, (8,8), strides=(4,4), activation="relu", input_shape=input_shape, data_format="channels_last"))
                                                                                    model.add(MaxPooling2D(pool_size=(2, 2), data_format="channels_last"))
                                                                                    model.add(Convolution2D(64, (4,4), strides=(1,1), activation="relu"))
                                                                                    model.add(MaxPooling2D(pool_size=(2, 2), data_format="channels_last"))
                                                                                    model.add(Convolution2D(64, (3,3), activation="relu"))
                                                                                    model.add(Flatten())
                                                                                    model.add(Dense(512, activation="relu"))
                                                                                    model.add(Dense(256, activation="relu"))
                                                                                    model.add(Dense(actions, activation="linear"))
                                                                                    return model
                                                                                  
                                                                                  model = build_model(height, width, channels, actions)
                                                                                  

It gives the error below:

                                                                                  ValueError: Input 0 of layer "max_pooling2d_12" is incompatible with the layer: expected ndim=4, found ndim=5. Full shape received: (None, 3, 51, 39, 32)

                                                                                  Second Problem

My input_shape is (3, 210, 160, 3). I am including the leading 3 on purpose, because I have to specify the batch_size in advance. If I do not specify it and instead pass (210, 160, 3) to the build_model function, the build_agent function below gives me another error:

                                                                                  def build_agent(model, actions):
                                                                                    policy = LinearAnnealedPolicy(EpsGreedyQPolicy(), attr="eps", value_max=1., value_min=.1, value_test=.2, nb_steps=10000)
                                                                                    memory = SequentialMemory(limit=1000, window_length=3)
                                                                                    dqn = DQNAgent(model=model, memory=memory, policy=policy,
                                                                                                   enable_dueling_network=True, dueling_type="avg",
                                                                                                   nb_actions=actions, nb_steps_warmup=1000)
                                                                                    return dqn
                                                                                  
                                                                                  dqn = build_agent(model, actions)
                                                                                  dqn.compile(Adam(learning_rate=1e-4))
                                                                                  
                                                                                  dqn.fit(env, nb_steps=10000, visualize=False, verbose=1)
                                                                                  

                                                                                  ValueError: Error when checking input: expected conv2d_11_input to have 4 dimensions, but got array with shape (1, 3, 210, 160, 3)

Removing the batch size from the input shape when constructing the model fixes the MaxPooling2D incompatibility error but triggers the DQNAgent dimensionality error; adding it back removes the DQNAgent error but brings back the MaxPooling2D incompatibility error.

I am really stuck.

                                                                                  ANSWER

                                                                                  Answered 2022-Feb-01 at 07:31

The issue is with input_shape: build the model with input_shape=input_shape[1:], i.e. drop the leading batch dimension.

                                                                                  Working sample code

                                                                                  from tensorflow.keras.models import Sequential
                                                                                  from tensorflow.keras.layers import Dense, Flatten, Convolution2D, MaxPooling2D, Activation
                                                                                  from tensorflow.keras.optimizers import Adam
                                                                                  
                                                                                  input_shape = (3, 210, 160, 3)
                                                                                  
                                                                                  model = Sequential()
                                                                                  model.add(Convolution2D(32, (8,8), strides=(4,4), activation="relu", input_shape=input_shape[1:], data_format="channels_last"))
                                                                                  model.add(MaxPooling2D(pool_size=(2,2), data_format="channels_last"))
                                                                                  model.add(Convolution2D(64, (4,4), strides=(1,1), activation="relu"))
                                                                                  model.add(MaxPooling2D(pool_size=(2, 2), data_format="channels_last"))
                                                                                  model.add(Convolution2D(64, (3,3), activation="relu"))
                                                                                  model.add(Flatten())
                                                                                  model.add(Dense(512, activation="relu"))
                                                                                  model.add(Dense(256, activation="relu"))
                                                                                  model.add(Dense(2, activation="linear"))
                                                                                  
                                                                                  model.summary()
                                                                                  

                                                                                  Output

                                                                                  Model: "sequential_7"
                                                                                  _________________________________________________________________
                                                                                   Layer (type)                Output Shape              Param #   
                                                                                  =================================================================
                                                                                   conv2d_9 (Conv2D)           (None, 51, 39, 32)        6176      
                                                                                                                                                   
                                                                                   max_pooling2d_5 (MaxPooling  (None, 25, 19, 32)       0         
                                                                                   2D)                                                             
                                                                                                                                                   
                                                                                   conv2d_10 (Conv2D)          (None, 22, 16, 64)        32832     
                                                                                                                                                   
                                                                                   max_pooling2d_6 (MaxPooling  (None, 11, 8, 64)        0         
                                                                                   2D)                                                             
                                                                                                                                                   
                                                                                   conv2d_11 (Conv2D)          (None, 9, 6, 64)          36928     
                                                                                                                                                   
                                                                                   flatten_1 (Flatten)         (None, 3456)              0         
                                                                                                                                                   
                                                                                   dense_4 (Dense)             (None, 512)               1769984   
                                                                                                                                                   
                                                                                   dense_5 (Dense)             (None, 256)               131328    
                                                                                                                                                   
                                                                                   dense_6 (Dense)             (None, 2)                 514       
                                                                                                                                                   
                                                                                  =================================================================
                                                                                  Total params: 1,977,762
                                                                                  Trainable params: 1,977,762
                                                                                  Non-trainable params: 0
                                                                                  

                                                                                  Source https://stackoverflow.com/questions/70808035

                                                                                  QUESTION

                                                                                  Stablebaselines3 logging reward with custom gym
                                                                                  Asked 2021-Dec-25 at 01:10

I have this custom callback to log the reward in my custom vectorized environment, but the reward printed to the console is always [0] and it is not logged in TensorBoard at all.

                                                                                  class TensorboardCallback(BaseCallback):
                                                                                      """
                                                                                      Custom callback for plotting additional values in tensorboard.
                                                                                      """
                                                                                  
                                                                                      def __init__(self, verbose=0):
                                                                                          super(TensorboardCallback, self).__init__(verbose)
                                                                                  
                                                                                      def _on_step(self) -> bool:                
                                                                                          self.logger.record('reward', self.training_env.get_attr('total_reward'))
                                                                                          return True
                                                                                  

                                                                                  And this is part of the main function

model = PPO(
        "MlpPolicy", env,
        learning_rate=3e-4,
        policy_kwargs=policy_kwargs,
        verbose=1,
        tensorboard_log="./tensorboard/")

# as the environment is not serializable, we need to set a new instance of the environment
loaded_model = model = PPO.load("model", env=env)
loaded_model.set_env(env)

# and continue training
loaded_model.learn(1e+6, callback=TensorboardCallback())
                                                                                  

                                                                                  ANSWER

                                                                                  Answered 2021-Dec-25 at 01:10

You need to add [0] as an index,

so where you wrote self.logger.record('reward', self.training_env.get_attr('total_reward')) you just need to index it with self.logger.record('reward', self.training_env.get_attr('total_reward')[0]).

                                                                                  class TensorboardCallback(BaseCallback):
                                                                                      """
                                                                                      Custom callback for plotting additional values in tensorboard.
                                                                                      """
                                                                                  
                                                                                      def __init__(self, verbose=0):
                                                                                          super(TensorboardCallback, self).__init__(verbose)
                                                                                  
                                                                                      def _on_step(self) -> bool:                
                                                                                          self.logger.record('reward', self.training_env.get_attr('total_reward')[0])
                                                                                  
                                                                                          return True
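
If the training environment is a VecEnv with more than one sub-environment, get_attr returns one value per sub-environment; as a variation on the callback above (not part of the original answer), you could log the mean across sub-environments instead of only the first one:

import numpy as np
from stable_baselines3.common.callbacks import BaseCallback

class MeanRewardCallback(BaseCallback):
    """Hypothetical variant: log the mean 'total_reward' across all sub-environments."""

    def _on_step(self) -> bool:
        # get_attr returns a list with one value per sub-environment in the VecEnv.
        rewards = self.training_env.get_attr("total_reward")
        self.logger.record("reward", float(np.mean(rewards)))
        return True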
                                                                                  

                                                                                  Source https://stackoverflow.com/questions/70468394

                                                                                  QUESTION

                                                                                  What is the purpose of [np.arange(0, self.batch_size), action] after the neural network?
                                                                                  Asked 2021-Dec-23 at 11:07

I followed a PyTorch tutorial to learn reinforcement learning (Train a Mario-playing RL Agent), but I am confused about the following code:

                                                                                  current_Q = self.net(state, model="online")[np.arange(0, self.batch_size), action] # Q_online(s,a)
                                                                                  

What is the purpose of [np.arange(0, self.batch_size), action] after the neural network? (I know that TD_estimate takes in state and action; I'm just confused about this on the programming side.) What does this usage mean (putting a list after self.net)?

                                                                                  More related code referenced from the tutorial:

import copy
import torch.nn as nn

class MarioNet(nn.Module):

    def __init__(self, input_dim, output_dim):
        super().__init__()
        c, h, w = input_dim

        if h != 84:
            raise ValueError(f"Expecting input height: 84, got: {h}")
        if w != 84:
            raise ValueError(f"Expecting input width: 84, got: {w}")

        self.online = nn.Sequential(
            nn.Conv2d(in_channels=c, out_channels=32, kernel_size=8, stride=4),
            nn.ReLU(),
            nn.Conv2d(in_channels=32, out_channels=64, kernel_size=4, stride=2),
            nn.ReLU(),
            nn.Conv2d(in_channels=64, out_channels=64, kernel_size=3, stride=1),
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(3136, 512),
            nn.ReLU(),
            nn.Linear(512, output_dim),
        )

        self.target = copy.deepcopy(self.online)

        # Q_target parameters are frozen.
        for p in self.target.parameters():
            p.requires_grad = False

    def forward(self, input, model):
        if model == "online":
            return self.online(input)
        elif model == "target":
            return self.target(input)
                                                                                  

                                                                                  self.net:

                                                                                  self.net = MarioNet(self.state_dim, self.action_dim).float()
                                                                                  

                                                                                  Thanks for any help!

                                                                                  ANSWER

                                                                                  Answered 2021-Dec-23 at 11:07

                                                                                  Essentially, what happens here is that the output of the net is being sliced to get the desired part of the Q table.

                                                                                  The (somewhat confusing) index of [np.arange(0, self.batch_size), action] indexes each axis. So, for axis with index 1, we pick the item indicated by action. For index 0, we pick all items between 0 and self.batch_size.

                                                                                  If self.batch_size is the same as the length of dimension 0 of this array, then this slice can be simplified to [:, action] which is probably more familiar to most users.
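
As an illustration (not from the tutorial), the same indexing pattern can be reproduced on a small tensor:

import numpy as np
import torch

batch_size = 4
# Hypothetical Q-value batch: 4 states x 3 actions.
q_values = torch.arange(12.0).reshape(batch_size, 3)
action = np.array([2, 0, 1, 2])  # action taken for each element of the batch

# Row i is paired with column action[i]: one Q value per sampled transition.
selected = q_values[np.arange(0, batch_size), action]
print(selected)  # tensor([ 2.,  3.,  7., 11.])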

                                                                                  Source https://stackoverflow.com/questions/70458347

                                                                                  QUESTION

                                                                                  DQN predicts same action value for every state (cart pole)
                                                                                  Asked 2021-Dec-22 at 15:55

I'm trying to implement a DQN. As a warm-up I want to solve CartPole-v0 with an MLP consisting of two hidden layers along with the input and output layers. The input is a 4-element array [cart position, cart velocity, pole angle, pole angular velocity] and the output is an action value for each action (left or right). I am not exactly implementing the DQN from the "Playing Atari with Deep Reinforcement Learning" paper (no frame stacking for inputs, etc.). I also made a few non-standard choices, like putting done and the target network's action-value prediction in the experience replay, but those choices shouldn't affect learning.

In any case, I'm having a lot of trouble getting it to work. No matter how long I train the agent, it keeps predicting a higher value for one action over the other, for example Q(s, Right) > Q(s, Left) for all states s. Below are my learning code, my network definition, and some results I get from training.

                                                                                  class DQN:
                                                                                      def __init__(self, env, steps_per_episode=200):
                                                                                          self.env = env
                                                                                          self.agent_network = MlpPolicy(self.env)
                                                                                          self.target_network = MlpPolicy(self.env)
                                                                                          self.target_network.load_state_dict(self.agent_network.state_dict())
                                                                                          self.target_network.eval()
                                                                                          self.optimizer = torch.optim.RMSprop(
                                                                                              self.agent_network.parameters(), lr=0.005, momentum=0.95
                                                                                          )
                                                                                          self.replay_memory = ReplayMemory()
                                                                                          self.gamma = 0.99
                                                                                          self.steps_per_episode = steps_per_episode
                                                                                          self.random_policy_stop = 1000
                                                                                          self.start_learning_time = 1000
                                                                                          self.batch_size = 32
                                                                                  
                                                                                      def learn(self, episodes):
                                                                                          time = 0
                                                                                          for episode in tqdm(range(episodes)):
                                                                                              state = self.env.reset()
                                                                                              for step in range(self.steps_per_episode):
                                                                                                  if time < self.random_policy_stop:
                                                                                                      action = self.env.action_space.sample()
                                                                                                  else:
                                                                                                      action = select_action(self.env, time, state, self.agent_network)
                                                                                                  new_state, reward, done, _ = self.env.step(action)
                                                                                                  target_value_pred = predict_target_value(
                                                                                                      new_state, reward, done, self.target_network, self.gamma
                                                                                                  )
                                                                                                  experience = Experience(
                                                                                                      state, action, reward, new_state, done, target_value_pred
                                                                                                  )
                                                                                                  self.replay_memory.append(experience)
                                                                                                  if time > self.start_learning_time:  # learning step
                                                                                                      experience_batch = self.replay_memory.sample(self.batch_size)
                                                                                                      target_preds = extract_value_predictions(experience_batch)
                                                                                                      agent_preds = agent_batch_preds(
                                                                                                          experience_batch, self.agent_network
                                                                                                      )
                                                                                                      loss = torch.square(agent_preds - target_preds).sum()
                                                                                                      self.optimizer.zero_grad()
                                                                                                      loss.backward()
                                                                                                      self.optimizer.step()
                                                                                                  if time % 1_000 == 0:  # how frequently to update target net
                                                                                                      self.target_network.load_state_dict(self.agent_network.state_dict())
                                                                                                      self.target_network.eval()
                                                                                  
                                                                                                  state = new_state
                                                                                                  time += 1
                                                                                  
                                                                                                  if done:
                                                                                                      break
                                                                                  
                                                                                  
                                                                                  def agent_batch_preds(experience_batch: list, agent_network: MlpPolicy):
                                                                                      """
                                                                                      Calculate the agent action value estimates using the old states and the
                                                                                      actual actions that the agent took at that step.
                                                                                      """
                                                                                      old_states = extract_old_states(experience_batch)
                                                                                      actions = extract_actions(experience_batch)
                                                                                      agent_preds = agent_network(old_states)
                                                                                      experienced_action_values = agent_preds.index_select(1, actions).diag()
                                                                                      return experienced_action_values
                                                                                  
                                                                                  def extract_actions(experience_batch: list) -> list:
                                                                                      """
                                                                                      Extract the list of actions from experience replay batch and torchify
                                                                                      """
                                                                                      actions = [exp.action for exp in experience_batch]
                                                                                      actions = torch.tensor(actions)
                                                                                      return actions
                                                                                  
                                                                                  class MlpPolicy(nn.Module):
                                                                                      """
                                                                                      This class implements the MLP which will be used as the Q network. I only
                                                                                      intend to solve classic control problems with this.
                                                                                      """
                                                                                  
                                                                                      def __init__(self, env):
                                                                                          super(MlpPolicy, self).__init__()
                                                                                          self.env = env
                                                                                          self.input_dim = self.env.observation_space.shape[0]
                                                                                          self.output_dim = self.env.action_space.n
                                                                                          self.fc1 = nn.Linear(self.input_dim, 32)
                                                                                          self.fc2 = nn.Linear(32, 128)
                                                                                          self.fc3 = nn.Linear(128, 32)
                                                                                          self.fc4 = nn.Linear(32, self.output_dim)
                                                                                  
                                                                                      def forward(self, x):
                                                                                          if type(x) != torch.Tensor:
                                                                                              x = torch.tensor(x).float()
                                                                                          x = F.relu(self.fc1(x))
                                                                                          x = F.relu(self.fc2(x))
                                                                                          x = F.relu(self.fc3(x))
                                                                                          x = self.fc4(x)
                                                                                          return x
                                                                                  

                                                                                  Learning results:

                                                                                  Here I'm seeing one action always valued over the others (Q(right, s) > Q(left, s)). It's also clear that the network is predicting the same action values for every state.

Does anyone have an idea about what's going on? I've done a lot of debugging and carefully re-read the original papers (I also thought about "normalizing" the observation space, even though the velocities can be infinite), and I could be missing something obvious at this point. I can include more code for the helper functions if that would be useful.

                                                                                  ANSWER

                                                                                  Answered 2021-Dec-19 at 16:09

There was nothing wrong with the network definition. It turns out the learning rate was too high, and reducing it to 0.00025 (as in the original Nature paper introducing the DQN) led to an agent that can solve CartPole-v0.
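
For reference, here is a self-contained sketch of where that learning-rate change lands, using a hypothetical stand-in network in place of MlpPolicy(env) but keeping the RMSprop settings from the constructor shown in the question:

import torch
import torch.nn as nn

# Hypothetical stand-in for MlpPolicy(env); only the optimizer line matters here.
net = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))

# Learning rate lowered from 0.005 to 0.00025, keeping momentum as in the question.
optimizer = torch.optim.RMSprop(net.parameters(), lr=0.00025, momentum=0.95)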

That said, the learning algorithm was incorrect. In particular, I was using the wrong target action-value predictions. Note that the algorithm laid out above does not use the most recent version of the target network to make predictions. This leads to poor results as training progresses because the agent is learning from stale target data. The fix is to just put (s, a, r, s', done) into the replay memory and then make target predictions using the most up-to-date version of the target network when sampling a mini-batch. See the code below for an updated learning loop.

                                                                                  def learn(self, episodes):
                                                                                          time = 0
                                                                                          for episode in tqdm(range(episodes)):
                                                                                              state = self.env.reset()
                                                                                              for step in range(self.steps_per_episode):
                                                                                                  if time < self.random_policy_stop:
                                                                                                      action = self.env.action_space.sample()
                                                                                                  else:
                                                                                                      action = select_action(self.env, time, state, self.agent_network)
                                                                                                  new_state, reward, done, _ = self.env.step(action)
                                                                                                  experience = Experience(state, action, reward, new_state, done)
                                                                                                  self.replay_memory.append(experience)
                                                                                                  if time > self.start_learning_time:  # learning step.
                                                                                                      experience_batch = self.replay_memory.sample(self.batch_size)
                                                                                                      target_preds = target_batch_preds(
                                                                                                          experience_batch, self.target_network, self.gamma
                                                                                                      )
                                                                                                      agent_preds = agent_batch_preds(
                                                                                                          experience_batch, self.agent_network
                                                                                                      )
                                                                                                      loss = torch.square(agent_preds - target_preds).sum()
                                                                                                      self.optimizer.zero_grad()
                                                                                                      loss.backward()
                                                                                                      self.optimizer.step()
                                                                                                  if time % 1_000 == 0:  # how frequently to update target net
                                                                                                      self.target_network.load_state_dict(self.agent_network.state_dict())
                                                                                                      self.target_network.eval()
                                                                                  
                                                                                                  state = new_state
                                                                                                  time += 1
                                                                                                  if done:
                                                                                                      break
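
The updated loop calls a target_batch_preds helper that is not shown here; a minimal sketch of what it could look like, assuming the Experience fields used above (state, action, reward, new_state, done), is:

import torch

def target_batch_preds(experience_batch: list, target_network, gamma: float):
    """Hypothetical sketch of the helper: compute r + gamma * max_a' Q_target(s', a'),
    zeroing the bootstrap term on terminal transitions."""
    new_states = torch.tensor([exp.new_state for exp in experience_batch]).float()
    rewards = torch.tensor([exp.reward for exp in experience_batch]).float()
    dones = torch.tensor([float(exp.done) for exp in experience_batch])

    with torch.no_grad():
        next_q = target_network(new_states).max(dim=1).values
    return rewards + gamma * next_q * (1.0 - dones)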
                                                                                  

                                                                                  Source https://stackoverflow.com/questions/70382999

Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.

                                                                                  Vulnerabilities

                                                                                  No vulnerabilities reported

                                                                                  Install procgen

First make sure you have a supported version of Python.
If you want to change the environments or create new ones, you should build from source. If you don't have miniconda, you can get it from https://docs.conda.io/en/latest/miniconda.html, or install the dependencies from environment.yml manually. On Windows you will also need "Visual Studio 16 2019" installed. The environment code is written in C++ and is compiled into a shared library that exposes the gym3.libenv C interface, which is then loaded by Python. The C++ code uses Qt for drawing.
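
As a quick check that a build (or install) produced a working package, you can construct a single environment through procgen's gym3-style entry point and step it with random actions. The sketch below follows the usage shown in the project's README; treat the exact helper names (ProcgenGym3Env, gym3.types_np) as assumptions to verify against your installed versions.

    from gym3 import types_np
    from procgen import ProcgenGym3Env

    # One "coinrun" environment; num controls how many run in parallel.
    env = ProcgenGym3Env(num=1, env_name="coinrun")

    # Step with random actions until the first episode boundary; gym3 marks
    # episode starts with the "first" flag returned by observe().
    step = 0
    while True:
        env.act(types_np.sample(env.ac_space, bshape=(env.num,)))
        rew, obs, first = env.observe()
        print(f"step {step} reward {rew} first {first}")
        if step > 0 and first:
            break
        step += 1

If this runs and prints rewards, the compiled shared library was found and loaded correctly.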

                                                                                  Support

See CONTRIBUTING for information on contributing.

                                                                                  Install
• PyPI (a short usage sketch through the Gym interface follows the clone options below)

                                                                                  pip install procgen

                                                                                • CLONE
                                                                                • HTTPS

                                                                                  https://github.com/openai/procgen.git

                                                                                • CLI

                                                                                  gh repo clone openai/procgen

                                                                                • sshUrl

                                                                                  git@github.com:openai/procgen.git
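
Once installed from PyPI or from a clone of the repository, the environments are also usable through the standard Gym interface, which is what most training code expects. This is a minimal random-agent sketch, assuming the Gym registration id "procgen:procgen-coinrun-v0" from the project's documentation; any of the other 15 environment names can be substituted.

    import gym

    # Procgen registers its environments under the "procgen:" prefix.
    env = gym.make("procgen:procgen-coinrun-v0")
    obs = env.reset()

    # Take random actions; procgen targets the classic 4-tuple step API.
    for _ in range(100):
        obs, reward, done, info = env.step(env.action_space.sample())
        if done:
            obs = env.reset()

    env.close()

Keyword arguments such as num_levels, start_level, and distribution_mode (as documented by the project) control the range of procedurally generated levels, which is how the benchmark exercises generalization.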



Try Top Libraries by openai

• gym (Python)
• whisper (Python)
• openai-cookbook (Jupyter Notebook)
• gpt-2 (Python)
• baselines (Python)

Compare Reinforcement Learning Libraries with Highest Support

• ml-agents by Unity-Technologies
• gym by openai
• AirSim by microsoft
• acme by deepmind
