CartPole | Various DQN methods with CartPole
kandi X-RAY | CartPole Summary
Top functions reviewed by kandi - BETA
- Build the model
- Sample noise
- Train the model
- Return the action corresponding to the given state
- Add an experience to the replay memory
- Load a saved model
- Save the model weights
- Update the target network weights
CartPole Key Features
CartPole Examples and Code Snippets
import gym, torch, numpy as np, torch.nn as nn
from torch.utils.tensorboard import SummaryWriter
import tianshou as ts

task = 'CartPole-v0'
lr, epoch, batch_size = 1e-3, 10, 64        # learning rate, training epochs, minibatch size
train_num, test_num = 10, 100               # number of training / test environments
gamma, n_step, target_freq = 0.9, 3, 320    # discount factor, n-step return, target network update frequency
def main():
    env = gym.make('CartPole-v0')
    D = env.observation_space.shape[0]   # observation dimension (4 for CartPole)
    K = env.action_space.n               # number of discrete actions (2 for CartPole)
    pmodel = PolicyModel(D, K, [])       # policy network (defined elsewhere in the repo)
    vmodel = ValueModel(D, [10])         # value network with one hidden layer of 10 units
    init = tf.global_variables_initializer()
    session = tf.InteractiveSession()
    gamma = 0.99
    if 'monitor' in sys.argv:
        filename = os.path.basename(__file__)
Community Discussions
Trending Discussions on CartPole
QUESTION
I am working on a DQN training model for the game "CartPole-v1". The program did not report any errors in the terminal; however, the evaluation results got worse. This is the output data:
ANSWER
Answered 2022-Apr-17 at 15:08 Check out the code. For the most part it's the same as in the snippet above, but there are some changes:
- for steps in the replay buffer (which is called memory_store in the code) a namedtuple is used, so in the update it's much easier to read t.reward than to work out what every index of a step t means (see the sketch below)
- class DQN has a method update; it's better to keep the optimizer as an attribute of the class than to create it every time backprbgt is called
- the usage of torch.autograd.Variable here is unnecessary, so it was also taken away
- the update in backprbgt is taken per batch
- the hidden layer size was decreased from 360 to 32, while the batch size was increased from 40 to 128
- the network is updated once every 10 episodes, but on 10 batches from the replay buffer
- the average score is printed every 50 episodes, based on the last 10 episodes
- seeds were added
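For illustration, here is a minimal sketch of the namedtuple-based replay buffer described above; the names Transition, ReplayBuffer, and the field order are assumptions, not the linked code.

import random
from collections import deque, namedtuple

# Hypothetical names; the actual code in the linked answer may differ.
Transition = namedtuple('Transition', ('state', 'action', 'reward', 'next_state', 'done'))

class ReplayBuffer:
    def __init__(self, capacity=10000):
        self.memory = deque(maxlen=capacity)

    def memory_store(self, *args):
        # Store each step as a namedtuple so later code can read t.reward
        # instead of guessing what t[2] means.
        self.memory.append(Transition(*args))

    def sample(self, batch_size):
        return random.sample(self.memory, batch_size)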
Also, in RL it takes a long time to learn anything, so hoping that after 100 episodes it will be close to even 100 points is somewhat optimistic. For the code in the link, averaging over 5 runs gives the following dynamics:
X axis -- number of episodes (yes, 70K, but that is about 20 minutes of real time)
Y axis -- number of steps per episode
As can be seen, after 70K episodes the algorithm achieves a reward comparable to the highest possible in this environment (which is 500). A faster rate can be achieved by tweaking hyperparameters, but also remember this is DQN without any modifications.
QUESTION
I want to compile my DQN agent but I get the error:
AttributeError: 'Adam' object has no attribute '_name'
ANSWER
Answered 2022-Apr-16 at 15:05 Your error comes from importing Adam with from keras.optimizer_v1 import Adam. You can solve the problem with tf.keras.optimizers.Adam from TensorFlow >= 2, like below. (The lr argument is deprecated; it's better to use learning_rate instead.)
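A minimal sketch of the suggested fix, assuming a small Keras Q-network; the layer sizes and loss are placeholders rather than the asker's actual model.

import tensorflow as tf
from tensorflow import keras

# Placeholder Q-network for CartPole (4 observations, 2 actions).
model = keras.Sequential([
    keras.layers.Dense(24, activation='relu', input_shape=(4,)),
    keras.layers.Dense(2, activation='linear'),
])

# Use tf.keras.optimizers.Adam instead of keras.optimizer_v1.Adam,
# and pass learning_rate rather than the deprecated lr argument.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3), loss='mse')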
QUESTION
I am trying to create a batched environment version of an SAC agent example from the TensorFlow Agents library; the original code can be found here. I am also using a custom environment.
I am pursuing a batched environment setup to better leverage GPU resources and speed up training. My understanding is that by passing batches of trajectories to the GPU, there will be less overhead incurred when passing data from the host (CPU) to the device (GPU).
My custom environment is called SacEnv, and I attempt to create a batched environment like so:
ANSWER
Answered 2022-Feb-19 at 18:11 It turns out I neglected to pass batch_size when initializing the AverageReturnMetric and AverageEpisodeLengthMetric instances.
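A hedged sketch of the fix, assuming TF-Agents' tf_metrics module; the parallel-environment count is a placeholder.

from tf_agents.metrics import tf_metrics

num_parallel_environments = 4  # assumed batch size of the batched environment

train_metrics = [
    tf_metrics.NumberOfEpisodes(),
    tf_metrics.EnvironmentSteps(),
    # Pass batch_size so the metrics aggregate correctly over the batched environment.
    tf_metrics.AverageReturnMetric(batch_size=num_parallel_environments),
    tf_metrics.AverageEpisodeLengthMetric(batch_size=num_parallel_environments),
]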
QUESTION
Environment:
- Python: 3.9
- OS: Windows 10
When I try to create the ten-armed bandits environment using the following code, an error is thrown and I am not sure of the reason.
ANSWER
Answered 2022-Feb-08 at 08:01 It could be a problem with your Python version: the k-armed-bandits library was made 4 years ago, when Python 3.9 didn't exist. Besides this, the configuration files in the repo indicate that the Python version is 2.7 (not 3.9).
If you create an environment with Python 2.7 and follow the setup instructions, it works correctly on Windows:
QUESTION
I'm trying to implement a DQN. As a warm up I want to solve CartPole-v0 with an MLP consisting of two hidden layers along with input and output layers. The input is a 4 element array [cart position, cart velocity, pole angle, pole angular velocity] and the output is an action value for each action (left or right). I am not exactly implementing the DQN from the "Playing Atari with DRL" paper (no frame stacking for inputs, etc.). I also made a few non-standard choices, like putting done and the target network's prediction of the action value in the experience replay, but those choices shouldn't affect learning.
In any case I'm having a lot of trouble getting the thing to work. No matter how long I train the agent it keeps predicting a higher value for one action over another, for example Q(s, Right) > Q(s, Left) for all states s. Below is my learning code, my network definition, and some results I get from training.
ANSWER
Answered 2021-Dec-19 at 16:09 There was nothing wrong with the network definition. It turns out the learning rate was too high and reducing it to 0.00025 (as in the original Nature paper introducing the DQN) led to an agent that can solve CartPole-v0.
That said, the learning algorithm was incorrect. In particular, I was using the wrong target action-value predictions. Note that the algorithm laid out above does not use the most recent version of the target network to make predictions. This leads to poor results as training progresses because the agent learns from stale target data. The way to fix this is to just put (s, a, r, s', done) into the replay memory and then make target predictions using the most up-to-date version of the target network when sampling a mini-batch. See the code below for an updated learning loop.
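As a rough illustration of that fix, here is a minimal sketch of computing targets at sampling time with the current target network; the function and tensor names are assumptions, not the asker's code.

import torch

def compute_targets(target_net, rewards, next_states, dones, gamma=0.99):
    # y = r + gamma * max_a' Q_target(s', a') for non-terminal transitions,
    # evaluated with the *current* target network when the mini-batch is sampled,
    # rather than with stale predictions stored in the replay memory.
    # dones is expected to be a float tensor of 0/1 flags.
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
    return rewards + gamma * next_q * (1.0 - dones)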
QUESTION
I am implementing REINFORCE applied to the CartPole-v0 OpenAI gym environment. I am trying 2 different implementations, and the issue I am not able to resolve is the following:
Upon passing a single state to the Policy Network, I get an output Tensor of size 2, containing the action probabilities of the 2 actions. However, when I pass a 'batch of states' to the Policy Network to compute the output action probabilities for all of them, the values that I obtain are very different from when each state is passed individually to the network.
Can someone help me understand the issue?
My code for the same is below: (Note: this is NOT the complete REINFORCE algorithm -- I am aware that I need to compute the loss from the probabilities. But I am trying to understand the difference in the computation of the two probabilities, which I think should be the same, before proceeding.)
ANSWER
Answered 2021-Nov-27 at 08:21 In your policy, you have Softmax over dim 0. This normalizes the probability of each action across your batch. You want to do it across actions, with dim=1.
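A minimal sketch of the fix, assuming the policy network ends in an nn.Softmax layer; the layer sizes are placeholders.

import torch.nn as nn

policy = nn.Sequential(
    nn.Linear(4, 128),
    nn.ReLU(),
    nn.Linear(128, 2),
    # For a (batch, actions) input, dim=1 normalizes across actions;
    # dim=0 would incorrectly normalize each action's probability across the batch.
    nn.Softmax(dim=1),
)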
QUESTION
I successfully followed this official tensorflow tutorial for training an agent to solve the 'CartPole-v0' gym environment. I only diverged from the tutorial in that I did not use reverb, because it's not supported on Windows. I tried to modify the example to train the agent to solve my own (extremely simple) environment, but it fails to converge on a solution after 10,000 iterations, which I feel should be more than plenty.
I tried adjusting training iterations, learning rates, batch sizes, discounts, and everything else I could think of. Nothing had an effect on the result.
I would like the agent to converge on a policy that always gets +1 reward (ideally in only a few hundred iterations, since this environment is so extremely simple), instead of one that occasionally dips to -1. Instead, here's a graph of the actual outcome:
(The text is small so I will say that orange is episode length in steps, and blue is the average reward. The X axis is the number of training iterations, from 0 to 10,000.)
CODE: Everything here is run top to bottom, but I put it in separate code blocks to make it easier to read/debug.
Imports
ANSWER
Answered 2021-Oct-16 at 23:56 The cause of the issue was that the agent had no incentive to solve the problem quickly, because going to the right after 10 steps and after 3 steps both result in equal reward. Because the step counter was not observed, the agent could not possibly correlate taking too long with losing; so it would occasionally take more than 10 steps, lose, and be unable to learn from the experience.
I solved this by giving a -0.1 reward on every step, which incentivized the agent to solve the environment in as few steps as possible (causing it to never break the 10-step loss rule).
I also sped up the learning process by increasing the epsilon_greedy parameter of the DqnAgent's constructor to 0.5 (from its default of 0.1) to allow it to explore the entire environment more quickly.
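A hedged sketch of the epsilon_greedy change in TF-Agents, using CartPole-v0 as a stand-in for the custom environment; the network size, optimizer, and loss are assumptions, and the -0.1 per-step reward would live inside the custom environment's step logic (not shown here).

import tensorflow as tf
from tf_agents.agents.dqn import dqn_agent
from tf_agents.environments import suite_gym, tf_py_environment
from tf_agents.networks import q_network
from tf_agents.utils import common

# CartPole-v0 stands in for the custom environment described above.
train_env = tf_py_environment.TFPyEnvironment(suite_gym.load('CartPole-v0'))

q_net = q_network.QNetwork(
    train_env.observation_spec(),
    train_env.action_spec(),
    fc_layer_params=(100,))

agent = dqn_agent.DqnAgent(
    train_env.time_step_spec(),
    train_env.action_spec(),
    q_network=q_net,
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    # Raised from the default of 0.1 so the agent explores more aggressively.
    epsilon_greedy=0.5,
    td_errors_loss_fn=common.element_wise_squared_loss)
agent.initialize()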
QUESTION
I've been reading through the Drake docs and any tutorials I found, but I've yet to see any detailed information about this problem/goal, so I thought I'd try to ask here.
Goal:
My current project is that I have a real-world furuta pendulum with sensors to obtain the cartpole equivalent of the state vector q_sim = [x, theta, xdot, thetadot].
I want to have a real-world state vector, q_real, that "overwrites" the simulated cartpole state vector, q_sim (acquired from the MBP derived from Parser(plant=cart_pole).AddModelFromFile(sdf_path)), so that the resulting behavior would be something like this: moving the real-world pendulum upright would be reflected (in real time) in the simulation pendulum moving upright. Eventually my goal would be to have the furuta pendulum do a swing-up and balance behavior with the cartpole simulation reflecting the state vector, q_real, in real time.
What I've Thought About/Tried:
- Extending LeafSystem to read the real-world sensors and return a state vector as if it were a modeled dynamical system, with its own actuator_port for the controller output. My problem with this approach was that I wasn't sure if it is possible to do this whilst still visualizing a cart_pole MBP with the same state vector. I came to this dead end through my reading of the MultibodyPlant doc [ https://drake.mit.edu/doxygen_cxx/classdrake_1_1multibody_1_1_multibody_plant.html#details ]. I did not see any function to replace/overwrite the simulation MBP's state_output_vector.
- I also considered the possibility of having two separate MBPs, one being the cart_pole derived from an .sdf file and the other being an extended LeafSystem that takes an actuator input, reads sensors, and returns a state vector output. However, with this setup I don't believe I am able to achieve the desired behavior of "moving the real-world pendulum upright and having the simulation pendulum move upright in real time".
Extra:
I also noticed that there's an issue open on the Drake GitHub ( https://github.com/RobotLocomotion/drake/issues/12912 ) about an official tutorial on Drake's MBP from 2020. Does anyone know if there's any update to that tutorial beyond the Doxygen?
ANSWER
Answered 2021-Sep-07 at 01:54 I don't think you want to overwrite your MBP plant with the data from simulation. The normal workflow would be to offer a different system that reads from your sensors and offers the same ports that your MBP would have offered.
In the extreme, you can even wrap all of your sensor and actuator drivers up into a system of their own that acts like a "mock" of the MBP. That's the pattern I offered in the ManipulationStation and ManipulationStationHardwareInterface. In addition to the doxygen, you could look here: https://manipulation.csail.mit.edu/robot.html#section4
QUESTION
I am doing reinforcement learning for CartPole and I ran into this problem:
ANSWER
Answered 2021-Aug-30 at 06:02 You have given 4 inputs, and for these 4 inputs the model is predicting 4 outputs. As your output layer has 2 neurons, each of the 4 outputs has 2 values. Everything seems to be fine, and the output shape is (4, 2), not (2, 4).
If you are wondering how it is counted as (4, 2): to find the shape of a tensor manually, start from the left side. Entering the outermost [ you will find 4 one-dimensional tensors; similarly, accessing the inside of any of these tensors you will find 2 zero-dimensional tensors (i.e. scalars). Once you have reached a zero-dimensional tensor, stop the process. This is how the shape is (4, 2).
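A small illustration of that counting, using a made-up prediction for 4 CartPole states and 2 actions:

import numpy as np

# Hypothetical model output: one row per input state, one column per action.
output = [[0.1, 0.2],
          [0.3, 0.4],
          [0.5, 0.6],
          [0.7, 0.8]]

# The outer brackets contain 4 one-dimensional tensors,
# and each of those contains 2 scalars, hence shape (4, 2).
print(np.array(output).shape)  # (4, 2)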
QUESTION
I'm a student teaching myself Drake, specifically pydrake with Dr. Russ Tedrake's excellent Underactuated Robotics course. I am trying to write a combined energy shaping and lqr controller for keeping a cartpole system balanced upright. I based the diagram on the cartpole example found in Chapter 3 of Underactuated Robotics [http://underactuated.mit.edu/acrobot.html], and the SwingUpAndBalanceController on Chapter 2: [http://underactuated.mit.edu/pend.html].
I have found that, due to my use of the cart_pole.sdf model, I have to create an abstract input port to receive a FramePoseVector from cart_pole.get_output_port(0). From there I know that I have to create a control signal output of type BasicVector to feed into a Saturation block before feeding into the cartpole's actuation port.
The problem I'm encountering right now is that I'm not sure how to get the system's current state data in the DeclareVectorOutputPort's callback function. I was under the assumption I would use the LeafContext parameter in the callback function, OutputControlSignal, to obtain the BasicVector continuous state vector. However, this resulting vector, x_bar, is always NaN. Out of desperation (and to test that the rest of my program worked) I set x_bar to the controller's initialization cart_pole_context and found that the simulation runs with a control signal of 0.0 (as expected). I can also set output to 100 and the cartpole simulation just flies off into endless space (as expected).
TL;DR: What is the proper way to obtain the continuous state vector in a custom controller extending LeafSystem with a DeclareVectorOutputPort?
Thank you for any help! I really appreciate it :) I've been teaching myself so it's been a little arduous haha.
ANSWER
Answered 2021-Aug-29 at 09:02 Here are two things that might help:
- If you want to get the state of the cart-pole from MultibodyPlant, you probably want to be connecting to the continuous_state output port, which gives you a normal vector instead of the abstract-type FramePoseVector. In that case, your call to get_input_port().Eval(context) should work just fine (see the sketch after this list).
- If you do really want to read the FramePoseVector, then you have to evaluate the input port slightly differently. You can find an example of that here.
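For illustration, a minimal sketch of the first suggestion under stated assumptions: a controller LeafSystem with a 4-element vector input port wired to the plant's state output port. The class name, port names, and the zero control law are placeholders, not the answer's code.

import numpy as np
from pydrake.systems.framework import BasicVector, LeafSystem

class CartPoleController(LeafSystem):
    def __init__(self):
        LeafSystem.__init__(self)
        # Vector-valued state input, to be connected to the plant's state output port.
        self.DeclareVectorInputPort("cart_pole_state", BasicVector(4))
        self.DeclareVectorOutputPort("control", BasicVector(1), self.OutputControlSignal)

    def OutputControlSignal(self, context, output):
        # Evaluate the state input port from the controller's own context.
        x = self.get_input_port(0).Eval(context)
        u = np.zeros(1)  # placeholder control law based on x
        output.SetFromVector(u)

# Wiring sketch (inside a DiagramBuilder, with cart_pole being the MultibodyPlant):
#   builder.Connect(cart_pole.get_state_output_port(),
#                   controller.get_input_port(0))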
Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.
Vulnerabilities
No vulnerabilities reported
Install CartPole
You can use CartPole like any standard Python library. You will need to make sure that you have a development environment consisting of a Python distribution including header files, a compiler, pip, and git installed. Make sure that your pip, setuptools, and wheel are up to date. When using pip it is generally recommended to install packages in a virtual environment to avoid changes to the system.