DDPG | Continuous Control with Deep Reinforcement Learning | Reinforcement Learning library
kandi X-RAY | DDPG Summary
A reimplementation of DDPG from "Continuous Control with Deep Reinforcement Learning," built on OpenAI Gym and TensorFlow. Applying Batch Normalization to the critic network is still an open problem, although the actor network works well with Batch Normalization. Some MuJoCo environments on OpenAI Gym remain unsolved.
Top functions reviewed by kandi - BETA
- Create Q network
- Batch norm layer
- Return a tf Variable
- Creates the network
- Batch normalization layer
- Returns a tf Variable
- Creates the target network
- Evaluate a given transition
- Train the model
- Add an experience
- Return the number of experiences
- Return a batch of data from the buffer
- Resets the state
- Creates a target q - norm
- Calculate noise
- Compute the action based on the current policy
- Generate the noise
- Returns the action of the actor
DDPG Key Features
DDPG Examples and Code Snippets
Community Discussions
Trending Discussions on DDPG
QUESTION
Is there a way to model action masking for continuous action spaces? I want to model economic problems with reinforcement learning. These problems often have continuous action and state spaces. In addition, the state often influences what actions are possible and, thus, the allowed actions change from step to step.
Simple example:
The agent has wealth (continuous state) and decides about spending (continuous action). The next period's wealth is then wealth minus spending. But the agent is restricted by a budget constraint: it is not allowed to spend more than its wealth. What is the best way to model this?
What I tried: For discrete actions it is possible to use action masking, so at each time step I provided the agent with information about which actions are allowed and which are not. I also tried to do this with a continuous action space by providing lower and upper bounds on the allowed actions and clipping the actions sampled from the actor network (e.g. DDPG).
I am wondering if this is a valid thing to do (it works in a simple toy model), because I did not find any RL library that implements this. Or is there a smarter way/best practice to give the agent information about allowed actions?
...ANSWER
Answered 2022-Mar-17 at 08:28: I think you are on the right track. I've looked into masked actions and found two possible approaches: give a negative reward when the agent tries to take an invalid action (without letting the environment evolve), or dive deeper into the neural network code and let the network output only valid actions. I've always considered this last approach the most efficient, and your approach of introducing boundaries seems very similar to it. So as long as this is the type of mask (boundaries) you are looking for, I think you are good to go.
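For the budget example above, a minimal sketch of that boundary-clipping idea might look like the following (the helper name constrain_spending and the scalar state/action are hypothetical, not from the question):

```python
import numpy as np

def constrain_spending(raw_action, wealth):
    """Clip the actor's raw spending proposal to the budget constraint.

    `raw_action` is the (possibly noisy) output of a DDPG-style actor and
    `wealth` is the current continuous state; spending must lie in [0, wealth].
    """
    return float(np.clip(raw_action, 0.0, wealth))

# Example: the actor proposes to spend 1.3 units while only 1.0 is available.
spending = constrain_spending(raw_action=1.3, wealth=1.0)
next_wealth = 1.0 - spending  # 0.0 -- the budget constraint holds
```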
QUESTION
I am trying to create a batched environment version of an SAC agent example from the TensorFlow Agents library; the original code can be found here. I am also using a custom environment.
I am pursuing a batched environment setup to better leverage GPU resources and speed up training. My understanding is that by passing batches of trajectories to the GPU, there will be less overhead when passing data from the host (CPU) to the device (GPU).
My custom environment is called SacEnv, and I attempt to create a batched environment like so:
ANSWER
Answered 2022-Feb-19 at 18:11: It turns out I neglected to pass batch_size when initializing the AverageReturnMetric and AverageEpisodeLengthMetric instances.
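The corrected initialization isn't shown above, but a hedged sketch of passing batch_size to those TF-Agents metrics (the num_parallel_environments value is an assumption for illustration) could look like this:

```python
from tf_agents.metrics import tf_metrics

# With a batched TF environment, the per-step metrics need to know how many
# parallel environments they aggregate over.
num_parallel_environments = 4  # assumed value, should match the batched env

train_metrics = [
    tf_metrics.NumberOfEpisodes(),
    tf_metrics.EnvironmentSteps(),
    tf_metrics.AverageReturnMetric(batch_size=num_parallel_environments),
    tf_metrics.AverageEpisodeLengthMetric(batch_size=num_parallel_environments),
]
```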
QUESTION
I am training a DDPG agent on a custom environment that I wrote using OpenAI Gym, and I am getting an error while training the model.
When I searched for a solution on the web, I found that some people who faced a similar issue were able to resolve it by initializing the variables.
...ANSWER
Answered 2021-Jun-10 at 07:00: For now I was able to solve this error by replacing the imports from keras with imports from tensorflow.keras, although I don't know why keras itself doesn't work.
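A hedged sketch of that import swap (the exact modules depend on the original code; Sequential and Dense are just placeholders):

```python
# Before: imports from standalone Keras, which raised the error
# from keras.models import Sequential
# from keras.layers import Dense

# After: import the Keras API bundled with TensorFlow instead
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
```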
QUESTION
Background
I'm currently trying to implement a DDPG framework to control a simple car agent. At first, the car agent would only need to learn how to reach the end of a straight path as quickly as possible by adjusting its acceleration. This task was simple enough, so I decided to introduce an additional steering action as well. I updated my observation and action spaces accordingly.
The lines below are the for loop that runs each episode:
...ANSWER
Answered 2021-Jun-05 at 19:06: The issue has been resolved thanks to some simple but helpful advice I received on Reddit. I was disrupting the tracking of my variables by making changes using my custom for-loop; I should have used a TensorFlow function instead. The following changes fixed the problem for me:
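The poster's actual changes are not reproduced here, but one common DDPG spot where a plain Python loop over weights disrupts variable tracking is the target-network soft update; a hedged sketch of doing it with TensorFlow ops inside a tf.function (the model names are assumptions) might look like this:

```python
import tensorflow as tf

@tf.function
def soft_update(target_variables, online_variables, tau=0.005):
    # Polyak-average the online weights into the target network using
    # TensorFlow ops, so the update stays inside the graph and variable
    # tracking is not disrupted by NumPy-side mutation.
    for target_var, online_var in zip(target_variables, online_variables):
        target_var.assign(tau * online_var + (1.0 - tau) * target_var)

# Usage with hypothetical actor/target_actor models:
# soft_update(target_actor.variables, actor.variables)
```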
QUESTION
I want to set "actor_hiddens", a.k.a. the hidden layers of the policy network of PPO in RLlib, and be able to set their weights. Is this possible? If yes, please tell me how. I know how to do it for DDPG in RLlib, but the problem with PPO is that I can't find the policy network. Thanks.
...ANSWER
Answered 2021-May-28 at 09:59: You can always create your own custom policy network; then you have full control over the layers and also the initialization of the weights.
If you want to use the default model you have the following params to adapt it to your needs:
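The parameter listing from the original answer is not reproduced above; as a hedged sketch (RLlib's trainer API from around the time of the answer, with example layer sizes), adapting the default fully connected model and reading or writing its weights could look like this:

```python
from ray.rllib.agents.ppo import PPOTrainer

# Example sizes only; "fcnet_hiddens" controls the hidden layers of the
# default fully connected policy network.
config = {
    "env": "CartPole-v0",
    "model": {
        "fcnet_hiddens": [64, 64],
        "fcnet_activation": "tanh",
    },
}

trainer = PPOTrainer(config=config)

# Read and write the policy network weights.
weights = trainer.get_policy().get_weights()
trainer.get_policy().set_weights(weights)
```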
QUESTION
Below is a high-level diagram of how my agent should look in order to be able to interact with a custom gym environment I made.
States and actions: The environment has three states [s1, s2, s3] and six actions [a1, a2, a3, a4, a5, a6]; states and actions can be any value between 0 and 1.
Question: Which algorithms are suitable for my problem? I am aware that there are algorithms that are good at handling continuous action spaces (DDPG, PPO, etc.), but I can't see how they might operate when they should output multiple actions at each time step. Finally, are there any gym environments that have the described property (multiple actions), and are there any Python implementations for solving those particular environments?
...ANSWER
Answered 2021-Mar-02 at 04:01: As you mentioned in your question, PPO, DDPG, TRPO, SAC, etc. are indeed suitable for handling continuous action spaces in reinforcement learning problems. These algorithms output a vector whose size equals your action dimension, and each element of this vector is a real number instead of a discrete value. Note that stochastic algorithms like PPO give a multivariate probability distribution from which you sample the actions.
Most of the robotic environments in Mujoco-py, PyBullet, Robosuite, etc. are environments with multiple continuous actions. Here the action space can be of the form [torque_for_joint_1, torque_for_joint_2, ..., torque_for_joint_n], where torque_for_joint_i is a real-valued number determining how much that joint moves.
Regarding implementations for solving these environments, robosuite does offer sample solutions for benchmarking the environments with different algorithms. You could also look up stable-baselines or one of the standard RL libraries.
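As a hedged illustration of such a multi-dimensional continuous action space, a toy gym environment (hypothetical class name, old-style gym API matching that era) might declare its spaces like this:

```python
import numpy as np
import gym
from gym import spaces

class SixActionEnv(gym.Env):
    """Toy environment with 3 continuous states and 6 continuous actions,
    all bounded in [0, 1] (for illustration only)."""

    def __init__(self):
        self.observation_space = spaces.Box(low=0.0, high=1.0, shape=(3,), dtype=np.float32)
        self.action_space = spaces.Box(low=0.0, high=1.0, shape=(6,), dtype=np.float32)

    def reset(self):
        return self.observation_space.sample()

    def step(self, action):
        # `action` is a length-6 vector; a DDPG/PPO/SAC actor outputs the
        # whole vector in a single forward pass.
        obs = self.observation_space.sample()
        return obs, 0.0, False, {}
```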
QUESTION
I have an Actor-Critic neural network where the Actor is its own class and the Critic is its own class with its own neural network and .forward() function. I am then creating an object of each of these classes in a larger Model class. My setup is as follows:
...ANSWER
Answered 2021-Jan-20 at 19:09: Yes, you shouldn't do it like that. What you should do instead is propagate through parts of the graph.
What the graph contains: Now the graph contains both actor and critic. If the computations pass through the same part of the graph (say, twice through actor), it will raise this error.
- And they will, as you clearly use actor and critic joined with the loss value (this line: loss_actor = -self.critic(state, action)).
- Different optimizers do not change anything here, as it's a backward problem (optimizers simply apply calculated gradients onto models).
- This is how to fix it in GANs, but not in this case; see the Actual fix paragraph below, and read on if you are curious about the topic.
Actual fix: If part of a neural network (critic in this case) does not take part in the current optimization step, it should be treated as a constant (and vice versa). To do that, you could disable gradients using the torch.no_grad context manager (documentation) and set critic to eval mode (documentation), something along those lines:
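The answer's original snippet isn't reproduced above; a hedged sketch of the "treat the part that is not being optimised as a constant" idea in a DDPG-style update (all network, optimiser, and batch names are assumptions) could look like this:

```python
import torch
import torch.nn.functional as F

def ddpg_update(batch, actor, critic, actor_target, critic_target,
                actor_optimizer, critic_optimizer, gamma=0.99):
    """One DDPG-style update; each backward pass touches the graph only once."""
    state, action, reward, next_state, done = batch

    # Critic step: the bootstrapped target is a constant, so build it under
    # torch.no_grad() -- no gradients flow through the target networks.
    with torch.no_grad():
        next_action = actor_target(next_state)
        target_q = reward + gamma * (1.0 - done) * critic_target(next_state, next_action)
    critic_optimizer.zero_grad()
    critic_loss = F.mse_loss(critic(state, action), target_q)
    critic_loss.backward()
    critic_optimizer.step()

    # Actor step: a fresh forward pass, so backward() does not revisit the
    # graph used for the critic loss; only the actor's parameters are stepped.
    actor_optimizer.zero_grad()
    actor_loss = -critic(state, actor(state)).mean()
    actor_loss.backward()
    actor_optimizer.step()
```

Note that in the actor step the critic is deliberately kept inside the graph, since loss_actor = -critic(state, actor(state)) needs gradients to flow through the critic back into the actor; the no_grad/eval treatment applies to the parts that should stay constant, such as the target networks above.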
QUESTION
In the training phase of Deep Deterministic Policy Gradient (DDPG) algorithm, the action selection would be simply
...ANSWER
Answered 2021-Jan-04 at 02:09: The actor is usually a neural network, and the reason the actor's action is restricted to [-1, 1] is usually that the output layer of the actor net uses an activation function like Tanh; one can then rescale this output so the action belongs to any range.
The reason the actor can choose a good action depending on the environment is that, in an MDP (Markov decision process), the actor does trial and error in the environment and gets a reward or a penalty for doing well or badly, i.e. the actor net gets gradients pointing towards better actions.
Note that algorithms like PPG, PPO, SAC, and DDPG can guarantee the actor selects the best action for all states only in theory (i.e. assuming infinite learning time, infinite actor net capacity, etc.); in practice there is usually no guarantee unless the action space is discrete and the environment is very simple.
Understanding the ideas behind RL algorithms will greatly help you understand their source code; after all, code is an implementation of the idea.
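A hedged sketch of rescaling a Tanh-bounded actor output to an arbitrary action range (the function name is illustrative):

```python
def scale_action(tanh_output, low, high):
    """Map an actor output in [-1, 1] (e.g. from a Tanh output layer)
    linearly onto the environment's action range [low, high]."""
    return low + 0.5 * (tanh_output + 1.0) * (high - low)

# Example: a Tanh output of 0.5 mapped onto a torque range of [-2, 2].
print(scale_action(0.5, low=-2.0, high=2.0))  # 1.0
```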
QUESTION
Here they mention the need to include optim.zero_grad() when training, to zero the parameter gradients. My question is: could I do net.zero_grad() as well, and would that have the same effect? Or is it necessary to do optim.zero_grad()? Moreover, what happens if I do both? If I do neither, then the gradients get accumulated, but what does that exactly mean? Do they get added? In other words, what's the difference between doing optim.zero_grad() and net.zero_grad()? I am asking because here, at line 115, they use net.zero_grad(), and it is the first time I see that. It is an implementation of a reinforcement learning algorithm, where one has to be especially careful with the gradients because there are multiple networks and gradients, so I suppose there is a reason for them to use net.zero_grad() as opposed to optim.zero_grad().
ANSWER
Answered 2020-May-19 at 22:05: net.zero_grad() sets the gradients of all its parameters (including the parameters of submodules) to zero. If you call optim.zero_grad(), that will do the same, but for all parameters that have been specified to be optimised. If you are using only net.parameters() in your optimiser, e.g. optim = Adam(net.parameters(), lr=1e-3), then both are equivalent, since they contain the exact same parameters.
You could have other parameters that are being optimised by the same optimiser, which are not part of net, in which case you would either have to manually set their gradients to zero (and therefore keep track of all the parameters), or you can simply call optim.zero_grad() to ensure that all parameters being optimised have their gradients set to zero.
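A hedged toy sketch of that difference (the extra parameter is hypothetical):

```python
import torch
from torch import nn
from torch.optim import Adam

net = nn.Linear(4, 1)
extra_param = nn.Parameter(torch.zeros(1))  # optimised, but not part of `net`

# The optimiser tracks net's parameters plus the extra parameter.
optim = Adam(list(net.parameters()) + [extra_param], lr=1e-3)

loss = net(torch.randn(2, 4)).sum() + extra_param.sum()
loss.backward()

net.zero_grad()          # clears only net's gradients
print(extra_param.grad)  # still tensor([1.])
optim.zero_grad()        # clears gradients of every optimised parameter
```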
Moreover, what happens if I do both?
Nothing, the gradients would just be set to zero again, but since they were already zero, it makes absolutely no difference.
If I do none, then the gradients get accumulated, but what does that exactly mean? do they get added?
Yes, they are added to the existing gradients. In the backward pass the gradients with respect to every parameter are calculated, and then each gradient is added to the parameter's gradient (param.grad). That allows you to have multiple backward passes that affect the same parameters, which would not be possible if the gradients were overwritten instead of added.
For example, you could accumulate the gradients over multiple batches if you need bigger batches for training stability but don't have enough memory to increase the batch size. This is trivial to achieve in PyTorch: essentially, leave off optim.zero_grad() and delay optim.step() until you have gathered enough steps, as shown in HuggingFace - Training Neural Nets on Larger Batches: Practical Tips for 1-GPU, Multi-GPU & Distributed setups.
That flexibility comes at the cost of having to manually set the gradients to zero. Frankly, one line is a very small cost to pay, even though many users won't make use of it and especially beginners might find it confusing.
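A hedged sketch of that gradient-accumulation pattern (the toy model and data are placeholders):

```python
import torch
from torch import nn
from torch.optim import Adam

model = nn.Linear(10, 1)
optim = Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

loader = [(torch.randn(8, 10), torch.randn(8, 1)) for _ in range(16)]  # toy data
accumulation_steps = 4  # effective batch size = 4 x actual batch size

for step, (inputs, targets) in enumerate(loader):
    loss = loss_fn(model(inputs), targets)
    # Scale so the accumulated gradient matches one big-batch update;
    # backward() *adds* to the existing .grad buffers.
    (loss / accumulation_steps).backward()

    if (step + 1) % accumulation_steps == 0:
        optim.step()       # apply the accumulated gradients
        optim.zero_grad()  # only now reset them to zero
```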
QUESTION
I try to save the model using the saver method (I use the save function in the DDPG class to save), but when restoring the model, the result is far from the one I saved (I save the model when the episodic reward is zero; the restore method in the code is commented out). My code is below with all the features. I use Python 3.7, gym 0.16.0, and TensorFlow version 1.13.1.
...ANSWER
Answered 2020-May-05 at 16:54: I solved this problem completely by rewriting the code and adding the learning function in a separate session.
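The rewritten code is not shown in the answer, but a minimal TensorFlow 1.x save/restore sketch (the variable and checkpoint path are illustrative) looks roughly like this:

```python
import tensorflow as tf  # TensorFlow 1.x, as used in the question

# A single variable stands in for the real DDPG graph (actor/critic, train ops).
weights = tf.get_variable("weights", shape=[4, 2])

saver = tf.train.Saver()

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # ... training loop would run here ...
    saver.save(sess, "./checkpoints/ddpg.ckpt")

# Later: restore into the same (or an identically built) graph.
with tf.Session() as sess:
    saver.restore(sess, "./checkpoints/ddpg.ckpt")
    restored = sess.run(weights)
```

A common pitfall is running tf.global_variables_initializer() again after saver.restore(), which overwrites the restored weights, so restoring and evaluating are best kept in a session where the variables are not re-initialized.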
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported
Install DDPG
You can use DDPG like any standard Python library. You will need to make sure that you have a development environment consisting of a Python distribution including header files, a compiler, pip, and git installed. Make sure that your pip, setuptools, and wheel are up to date. When using pip it is generally recommended to install packages in a virtual environment to avoid changes to the system.