cartpole | OpenAI's CartPole env solver | Machine Learning library
kandi X-RAY | cartpole Summary
A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The system is controlled by applying a force of +1 or -1 to the cart. The pendulum starts upright, and the goal is to prevent it from falling over. A reward of +1 is provided for every timestep that the pole remains upright. The episode ends when the pole is more than 15 degrees from vertical, or the cart moves more than 2.4 units from the center. (source)
Top functions reviewed by kandi - BETA
- Saves the results to a PNG file.
- Run a cart-pole episode.
- Add a score.
- Perform an experience replay.
- Initialize the model.
- Save the score to a CSV file.
- Return the action corresponding to the given state.
- Store a context manager.
cartpole Key Features
cartpole Examples and Code Snippets
import gym, torch, numpy as np, torch.nn as nn
from torch.utils.tensorboard import SummaryWriter
import tianshou as ts

task = 'CartPole-v0'
lr, epoch, batch_size = 1e-3, 10, 64      # learning rate, training epochs, minibatch size
train_num, test_num = 10, 100             # number of parallel training / test environments
gamma, n_step, target_freq = 0.9, 3, 320  # discount factor, n-step return, target-network sync interval
def main():
    env = gym.make('CartPole-v0')
    D = env.observation_space.shape[0]  # state dimensionality (4 for CartPole)
    K = env.action_space.n              # number of discrete actions (2)
    pmodel = PolicyModel(D, K, [])      # policy network (actor)
    vmodel = ValueModel(D, [10])        # value network (critic), one hidden layer of 10 units
    init = tf.global_variables_initializer()
    session = tf.InteractiveSession()
def main():
    env = gym.make('CartPole-v0')
    D = env.observation_space.shape[0]
    K = env.action_space.n
    pmodel = PolicyModel(D, K, [])
    vmodel = ValueModel(D, [10])
    gamma = 0.99
    if 'monitor' in sys.argv:
        filename = os.path.basename(__file__)
Community Discussions
Trending Discussions on cartpole
QUESTION
I am implementing a simple DQN algorithm using pytorch, to solve the CartPole environment from gym. I have been debugging for a while now, and I can't figure out why the model is not learning.
Observations:
- using SmoothL1Loss performs worse than MSELoss, but loss increases for both
- a smaller LR in Adam does not work; I have tested 0.0001, 0.00025, 0.0005 and the default
Notes:
- I have debugged various parts of the algorithm individually, and can say with good confidence that the issue is in the learn function. I am wondering if this bug is due to me misunderstanding detach in pytorch or some other framework mistake I'm making.
- I am trying to stick as close to the original paper as possible (linked above).
References:
...ANSWER
Answered 2021-Jun-02 at 17:39
The main problem, I think, is the discount factor, gamma. You are setting it to 1.0, which means that you are giving the same weight to future rewards as to the current one. Usually in reinforcement learning we care more about the immediate reward than about the future, so gamma should always be less than 1.
Just to give it a try, I set gamma = 0.99 and ran your code.
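To see concretely how much weight gamma puts on later rewards, here is a small illustration (not part of the original answer); it assumes CartPole's +1-per-step reward over a 200-step episode:

rewards = [1.0] * 200  # CartPole-v0 gives +1 per surviving timestep

def discounted_return(rewards, gamma):
    # with gamma = 1.0 every future reward counts as much as the immediate one;
    # with gamma = 0.99 the weight decays geometrically with the delay
    return sum(gamma ** t * r for t, r in enumerate(rewards))

print(discounted_return(rewards, 1.0))   # 200.0
print(discounted_return(rewards, 0.99))  # ~86.6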
QUESTION
I have a concern in understanding the Cartpole code as an example for Deep Q Learning. The DQL Agent part of the code is as follows:
...ANSWER
Answered 2021-May-31 at 22:21
self.model.predict(state) will return a tensor of shape (1, 2) containing the estimated Q values for each action (in cartpole the action space is {0, 1}).
As you know the Q value is a measure of the expected reward.
By setting self.model.predict(state)[0][action] = target (where target is the expected sum of rewards) it is creating a target Q value on which to train the model. By then calling model.fit(state, train_target) it is using that target Q value to train said model to approximate better Q values for each state.
I don't understand why you are saying that the loss becomes 0: the target is set to the discounted sum of rewards plus the current reward.
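The replay step being described would look roughly like the sketch below (hedged: model, gamma, and the state/action/reward/done variables are assumed to exist as in a typical Keras DQN, not taken from the original code):

import numpy as np

target = reward
if not done:
    # bootstrap from the best next-state Q value
    target = reward + gamma * np.amax(model.predict(next_state)[0])

train_target = model.predict(state)        # shape (1, 2): current Q estimates
train_target[0][action] = target           # overwrite only the taken action's value
model.fit(state, train_target, verbose=0)  # regress the network toward the new target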
QUESTION
I am making a comparison between both kinds of algorithms against the CartPole environment. Having the imports as:
...ANSWER
Answered 2021-Feb-11 at 18:29
The A2C code fails due to the configuration you copied from the PPO trial: "sgd_minibatch_size", "kl_coeff" and many others are PPO-specific configs, which cause the problem when running A2C.
The error is explained in the "error.txt" in the logdir.
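As a hedged illustration of the fix (the keys and stopping criterion below are assumptions for a minimal run, not values from the original trial), an A2C experiment would keep only algorithm-agnostic settings and drop the PPO-only ones:

import ray
from ray import tune

ray.init()
a2c_config = {
    "env": "CartPole-v0",
    "framework": "torch",
    "lr": 0.001,
    "gamma": 0.99,
    # no "sgd_minibatch_size", "kl_coeff", etc. -- those are PPO-specific
}
tune.run("A2C", config=a2c_config, stop={"episode_reward_mean": 195})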
QUESTION
I apologize in advance for the question in the title not being very clear. I'm trying to train a reinforcement learning policy using tf-agents in which there exists some unobservable stochastic variable that affects the state.
For example, consider the standard CartPole problem, but we add wind where the velocity changes over time. I don't want to train an agent that relies on having observed the wind velocity at each step; I instead want the wind to affect the position and angular velocity of the pole, and the agent to learn to adapt just as it would in the wind-free environment. In this example however, we would need the wind velocity at the current time to be correlated with the wind velocity at the previous time e.g. we wouldn't want the wind velocity to change from 10m/s at time t to -10m/s at time t+1.
The problem I'm trying to solve is how to track the state of the exogenous variable without making it part of the observation spec that gets fed into the neural network when training the agent. Any guidance would be appreciated.
...ANSWER
Answered 2021-Jan-13 at 08:17
Yes, that is no problem at all. Your environment object (a subclass of PyEnvironment or TFEnvironment) can do whatever you want within it. The observation_spec requirement is only related to the TimeStep that you output in the step and reset methods (more precisely, in your implementation of the _step and _reset abstract methods).
Your environment, however, is completely free to have any additional attributes that you might want (like parameters to control wind generation) and any number of additional methods you like (like methods to generate the wind at this timestep according to self._wind_hyper_params). A quick schematic of your code is below:
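The original schematic is not reproduced here; a minimal sketch of the idea, assuming the standard tf_agents PyEnvironment API (the class name and the random-walk wind model are illustrative), could look like this:

import numpy as np
from tf_agents.environments import py_environment
from tf_agents.specs import array_spec
from tf_agents.trajectories import time_step as ts

class WindyCartPoleEnv(py_environment.PyEnvironment):
    """Wind lives in internal attributes, not in the observation spec."""

    def __init__(self):
        self._action_spec = array_spec.BoundedArraySpec(
            shape=(), dtype=np.int32, minimum=0, maximum=1, name='action')
        self._observation_spec = array_spec.ArraySpec(
            shape=(4,), dtype=np.float32, name='observation')  # cart-pole state only
        self._wind = 0.0                      # exogenous variable, never observed
        self._state = np.zeros(4, dtype=np.float32)

    def action_spec(self):
        return self._action_spec

    def observation_spec(self):
        return self._observation_spec

    def _update_wind(self):
        # random walk keeps the wind at time t correlated with the wind at time t-1
        self._wind += np.random.normal(0.0, 0.1)

    def _reset(self):
        self._state = np.zeros(4, dtype=np.float32)
        self._wind = 0.0
        return ts.restart(self._state)

    def _step(self, action):
        self._update_wind()
        # ...apply cart-pole dynamics plus a force proportional to self._wind...
        return ts.transition(self._state, reward=1.0)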
QUESTION
I implemented a simple actor-critic model in Tensorflow==2.3.1 to learn the Cartpole environment. But it is not learning at all. The average score over every 50 episodes is below 20. Can someone please point out why the model isn't learning?
I based my algorithm on the following pseudocode:
...ANSWER
Answered 2020-Dec-02 at 13:11
I was able to fix your code. Main changes:
- replace math.log() with tfp.distributions.Categorical.log_prob()
- change the error calculation method
But I'm not entirely sure why it works this way, so further clarification is appreciated.
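For reference, the distribution-based replacement mentioned above would look roughly like this (the logits and the advantage value are illustrative placeholders, not taken from the original code):

import tensorflow as tf
import tensorflow_probability as tfp

logits = tf.constant([[0.2, 1.5]])      # actor output for one state, shape (1, n_actions)
dist = tfp.distributions.Categorical(logits=logits)
action = dist.sample()                  # sampled action
log_prob = dist.log_prob(action)        # replaces the manual math.log(prob)
advantage = tf.constant([0.7])          # placeholder one-step advantage
actor_loss = -log_prob * tf.stop_gradient(advantage)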
QUESTION
I want to extend the PPO agent class in ChainerRL. I did the following:
...ANSWER
Answered 2020-Nov-25 at 11:50
action = super().act_and_train(obs, reward)
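For context, a hedged sketch of how that call might sit inside a ChainerRL subclass (the class name and the extra bookkeeping are hypothetical):

from chainerrl.agents import PPO

class LoggingPPO(PPO):
    """Hypothetical extension: delegate to the parent, then add custom logic."""

    def act_and_train(self, obs, reward):
        action = super().act_and_train(obs, reward)
        # custom bookkeeping (logging, reward shaping hooks, etc.) goes here
        return action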
QUESTION
I've created a diagram with an LQR controller, a MultibodyPlant, a SceneGraph, and a PlanarSceneGraphVisualizer.
While trying to run this simulation, I set the random initial conditions using the function context.SetDiscreteState(randInitState). However, with this, I get the following error:
RuntimeError: Context::SetDiscreteState(): expected exactly 1 discrete state group but there were 2 groups. Use the other signature if you have multiple groups.
And indeed when I check the number of groups using context.num_discrete_state_groups(), it returns 2. So then I have to specify the group index while setting the state, using the command context.SetDiscreteState(0, randInitState). This works, but I don't exactly know why. I understand that I have to select the correct group to set the state for, but what exactly is a group here? In the cartpole example given here, the context was set using context.SetContinuousState(UprightState() + 0.1 * np.random.randn(4,)) without specifying any group(s).
Are groups only valid for discrete systems? The context documentation talks about groups but doesn't define them.
Is there a place to find the definition of what a group is while setting up a drake simulation with multiple systems inside a diagram and how to check the group index of a system?
...ANSWER
Answered 2020-Nov-24 at 17:57
We would typically recommend that you use a workflow that sets the context using a subsystem interface. E.g.
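The answer's example code is not reproduced here; a hedged sketch of that subsystem-interface workflow, assuming diagram, plant, and UprightState() from the cart-pole example are already defined, could look like:

import numpy as np
from pydrake.systems.analysis import Simulator

context = diagram.CreateDefaultContext()
# ask the diagram for the plant's own sub-context, then set state through the plant
plant_context = diagram.GetMutableSubsystemContext(plant, context)
plant.SetPositionsAndVelocities(
    plant_context, UprightState() + 0.1 * np.random.randn(4,))
simulator = Simulator(diagram, context)
simulator.AdvanceTo(10.0)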
QUESTION
I would like to record multiple meshcat visualizers with different prefix ids so I can eventually save them to an HTML file that I can interact with later. I would like the different visualizations to play on top of each other and to be able to select/unselect different visualizations.
Here is a minimal replication of the problem. I expect there to be two cartpole visualizations that are on top of each other. However, I only end up seeing one of them being recorded. The other cartpole seems stuck at the end position when I play back the recording.
...ANSWER
Answered 2020-Nov-11 at 02:41
Here is one solution. I don't love it, but it accomplishes the stated goal. Note the changes:
- saving and publishing only one animation
- naming the model in the MultibodyPlant with a distinct name for each sim
- setting delete_prefix_on_load=False so it doesn't clear the old geometry
QUESTION
Here is my implementation of DQN and DDQN for CartPole-v0 which I think is correct.
...ANSWER
Answered 2020-Nov-09 at 16:36
One mistake in your implementation is that you never add the end of an episode to your replay buffer. In your train function you return if sign == 1 (end of the episode). Remove that return and adjust the target calculation via (1 - dones) * ... in case you sample a transition from the end of an episode. The reason the end of the episode is important is that it is the only experience where the target is not approximated via bootstrapping. Then DQN trains. For reproducibility I used a discount rate of 0.99 and the seed 2020 (for torch, numpy and the gym environment). I achieved a reward of 199.100 after 241 episodes of training.
Hope that helps, code is very readable btw.
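A hedged sketch of the adjusted target computation (the network and batch variable names are hypothetical, not from the original code):

import torch
import torch.nn.functional as F

gamma = 0.99
# states, actions, rewards, next_states, dones sampled from the replay buffer,
# including transitions where the episode ended (dones == 1)
q_values = policy_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
with torch.no_grad():
    next_q = target_net(next_states).max(dim=1).values
targets = rewards + gamma * (1 - dones) * next_q  # no bootstrapping past episode end
loss = F.mse_loss(q_values, targets)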
QUESTION
I have made a small script in Python to solve various Gym environments with policy gradients.
...ANSWER
Answered 2020-Sep-09 at 06:58
The loss here depends on what each problem outputs. Generally, the loss you backpropagate should be a single number that summarizes everything you have processed. For policy gradients, it compares the reward the policy expects to get with the reward actually received, and the log simply maps the action probability back into a quantity you can weight by that reward, so you end up with a single scalar. If you want to inspect what the code is doing, always check the shapes/dimensions between each step to fully understand it.
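As a hedged illustration of that "single number" for the policy-gradient case (the tensor names are hypothetical):

import torch

log_probs = torch.stack(episode_log_probs)   # log pi(a_t | s_t) per step, shape [T]
returns = torch.as_tensor(episode_returns)   # discounted return G_t per step, shape [T]
loss = -(log_probs * returns).sum()          # one scalar summarizing the whole episode
loss.backward()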
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported
Install cartpole
You can use cartpole like any standard Python library. You will need to make sure that you have a development environment consisting of a Python distribution including header files, a compiler, pip, and git installed. Make sure that your pip, setuptools, and wheel are up to date. When using pip it is generally recommended to install packages in a virtual environment to avoid changes to the system.
Support