trpo | trust region policy optimization based on gym and tensorflow | Reinforcement Learning library

by jjkke88 | Language: Python | Version: Current | License: No License


kandi X-RAY | trpo Summary

trpo is a Python library typically used in Artificial Intelligence, Reinforcement Learning, and TensorFlow applications. It has no reported bugs or vulnerabilities, and it has low support. However, no build file is available for trpo. You can download it from GitHub.

Trust region policy optimization based on gym and TensorFlow; it can run in distributed mode.

Support

trpo has a low active ecosystem.

It has 17 star(s) with 13 fork(s). There are 2 watchers for this library.

It had no major release in the last 6 months.

There are 0 open issues and 1 has been closed. On average, issues are closed in 4 days. There are no pull requests.

It has a neutral sentiment in the developer community.

The latest version of trpo is current.

Quality

trpo has 0 bugs and 0 code smells.

Security

trpo has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.

trpo code analysis shows 0 unresolved vulnerabilities.

There are 0 security hotspots that need review.

License

trpo does not have a standard license declared.

Check the repository for any license declaration and review the terms closely.

Without a license, all rights are reserved, and you cannot use the library in your applications.

Reuse

trpo releases are not available. You will need to build from source code and install.

trpo has no build file. You will need to create the build yourself to build the component from source.

trpo saves you 1272 person hours of effort in developing the same functionality from scratch.

It has 2859 lines of code, 208 functions and 50 files.

It has medium code complexity. Code complexity directly impacts maintainability of the code.

Top functions reviewed by kandi - BETA

kandi has reviewed trpo and discovered the below as its top functions. This is intended to give you an instant insight into the functionality trpo implements, and to help you decide if it suits your requirements.

Get a single path
Replaces the observation from the history
Perform a step
Process paths
Discretize a signal
Compute the likelihood ratio of a given dataset
Slice a 2d array
Computes the likelihood ratio of the input data
Compute the logarithm of the log - likelihood
Fit the model
Create the network
Compute the KL divergence of the first fixed distribution
Computes the KL divergence between two distributions
Predict for a given path
Predict given path
The observation space
Convert a gym space
Calculate the gradient of the loss function
Return the shape of x


trpo Key Features

No Key Features are available at this moment for trpo.

trpo Examples and Code Snippets

No Code Snippets are available at this moment for trpo.

Community Discussions

Trending Discussions on trpo

Simulation parameters and reward calculation in benchmark scenario "Merge"

Why is Trust Region Policy Optimization an on-policy algorithm?

Python- error in importing rllab

QUESTION

Simulation parameters and reward calculation in benchmark scenario "Merge"

Asked 2019-Jul-02 at 20:26

I am currently trying to reproduce some results from your previous papers on my installation of flow. I ran into the following questions, because I am not clear about the exact parameters used in the experiments and the results given in the papers.

For [1], I expected to be able to reproduce the results by running stabilizing_highway.py from your repo (at commit "bc44b21"; I also tried the current version, but could not find differences related to my questions). I expected the merge scenario to be the same as the one used in [2].

The differences I have already found between the papers and the code are:

1) The reward function in [2] (2) is different from the one in [1] (6): the first uses a max and a normalizing term in the first part of the sum. Why this difference? Looking at the code, I interpret it as follows: depending on the evaluate flag, you compute the reward either (a) as the average speed over all vehicles in the simulation or (b) as the function given in [2] (without the normalizing term on the speed), but with a value of alpha (eta2 in the code) = 0.1 (see merge.py, line 167, compute_reward). I could not find the alpha parameter in the papers, so I assume the code version was used?

2) I further read the code as calculating the reward by iterating over ALL vehicles in the simulation, not just the observed ones. This seems counterintuitive to me: a reward function in a partially observed environment trains the agent using information from the fully observed state!?

3) This leads to the next question: you eventually want to evaluate the reward as given when the evaluate flag is set, namely the average speed of all vehicles in the simulation, as given in Table 1 of [1]. Are these values calculated by averaging over the "speed" column in the emissions.csv file that you can produce by running the visualizer tool?

4) The next question concerns the cumulative return in the figures of [1] and [2]. In [1], Figure 3, the cumulative returns in the merge scenarios reach a maximum of around 500, while the maximum values in [2], Figure 5 are around 200000. Why this difference? Is it due to the different reward functions used? Could you please provide the alpha values for both and verify which version is correct (paper or code)?

5) I also observe in [1], Table 1, Merge1&2, that ES clearly has the highest values of average speed, but TRPO and PPO have a better cumulative return. Does this suggest that the 40 rollouts for evaluation were not enough to get a representative mean value? Or that maximizing the training reward function does not necessarily give good evaluation results?

6) Some other parameters are unclear to me. In [1], Fig. 3, 50 rollouts are mentioned, while N_ROLLOUTS=20 in the code. What do you recommend using? In [1], A.2 Merge, T=400, while HORIZON=600, and [2], C. Simulations, talks about 3600s. Looking at a replay in Sumo produced when running visualizer_rllib.py, the simulation terminates at time 120.40, which would match the HORIZON of 600 with time steps of 0.2s (this information is given in [2]). So I assume that, for this scenario, the horizon should be set much higher than in both [1] and the code, namely to 18,000?
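
The step-count arithmetic behind point 6 can be checked with a short sketch. It uses only the numbers quoted above (HORIZON=600, T=400, a 0.2 s time step, and 3600 s of simulated time); the helper names are hypothetical and are not part of flow.

```python
from fractions import Fraction


def simulated_seconds(horizon_steps, sim_step):
    """Simulated time covered by `horizon_steps` steps of `sim_step` seconds each."""
    return horizon_steps * Fraction(sim_step)


def steps_for_duration(duration_seconds, sim_step):
    """Number of steps needed to cover `duration_seconds` of simulated time."""
    return Fraction(duration_seconds) / Fraction(sim_step)


# The step length is passed as a string so that Fraction keeps it exact
# and no floating-point noise enters the comparison.
print(simulated_seconds(600, "0.2"))    # HORIZON=600 from the code
print(simulated_seconds(400, "0.2"))    # T=400 from [1], A.2
print(steps_for_duration(3600, "0.2"))  # 3600 s from [2]
```

With a 0.2 s step, 600 steps cover 120 s and 400 steps cover 80 s, while 3600 s of simulated time would need 18,000 steps.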

Thanks for any hints! KR M

[1] Vinitsky, E., Kreidieh, A., Le Flem, L., Kheterpal, N., Jang, K., Wu, F., ... & Bayen, A. M. (2018, October). Benchmarks for reinforcement learning in mixed-autonomy traffic. In Conference on Robot Learning (pp. 399-409).

[2] Kreidieh, A. R., Wu, C., & Bayen, A. M. (2018). Dissipating stop-and-go waves in closed and open networks via deep reinforcement learning. In 2018 21st International Conference on Intelligent Transportation Systems (ITSC) (pp. 1475-1480). IEEE.

...

ANSWER

Answered 2019-Jul-02 at 20:26

Apologies for the delay in the answer.

1) The version described in the code is the one that was used. Paper [1] was written after paper [2] (despite one being published earlier), and we added a normalizing term to help standardize the learning rate across problems. The reward function is the one used in the codebase; the evaluate flag being true corresponds to actually computing the traffic statistic (i.e. speed), whereas it being false corresponds to the reward function we use at train time.
2) As you point out, using all of the vehicles in the reward function is a bad assumption; we obviously do not have access to all of that data (though you could imagine we are able to read it out through an induction loop). Future work will focus on removing this assumption.
3) You can do it this way. However, we just calculate it by running the experiment with the trained policy, storing all the vehicle speeds at each step, and then computing the result at the end of the experiment.
4) Unfortunately, both versions are "correct". As you point out, the difference has to do with the addition of the "eta" term in [2] and the normalization in [1].
5) It's as you say: the training reward function is not the same as the test reward function, so an algorithm that does well with the evaluate flag off may not do as well with the evaluate flag on. This is a choice we made, to have separate training and testing functions. You're welcome to use the testing function at train time!
6) Both should work; I suspect the N=20 in the codebase is something that crept in over time as people found that 50 was not necessary for that scenario. However, every RL algorithm does better with more rollouts, so setting N=50 won't hurt. As for the horizon, as far as I can tell from the codebase, the sim_step is 0.5 and the horizon is 750, so the experiment should run for 375 seconds.
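
The split between training and evaluation rewards can be illustrated with a minimal sketch, assuming the per-vehicle speeds are already available as plain lists. `training_reward` is a hypothetical placeholder and is not the reward from either paper or from merge.py; only the `evaluate` switch and the "store every speed, average at the end" procedure come from the answer above.

```python
from statistics import mean


def training_reward(speeds, target_speed=30.0):
    """Hypothetical placeholder: closeness to an assumed target speed, clipped at 0."""
    if not speeds:
        return 0.0
    deviation = mean(abs(target_speed - v) for v in speeds)
    return max(1.0 - deviation / target_speed, 0.0)


def compute_reward(speeds, evaluate):
    """Return the traffic statistic when evaluating, the training reward otherwise."""
    if evaluate:
        return mean(speeds) if speeds else 0.0
    return training_reward(speeds)


def average_speed_over_rollout(speeds_per_step):
    """Store all vehicle speeds at each step, then average once at the end."""
    all_speeds = [v for step in speeds_per_step for v in step]
    return mean(all_speeds) if all_speeds else 0.0


rollout = [[10.0, 20.0], [30.0]]  # made-up speeds (m/s) for two steps
print(compute_reward(rollout[0], evaluate=True))
print(compute_reward(rollout[0], evaluate=False))
print(average_speed_over_rollout(rollout))
```

Pooling every stored speed weights each vehicle-step equally, so the result can differ from an average of per-step means when the number of vehicles changes between steps.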

If you have more questions, please email the corresponding author (me)! I'd love to help you work through this in more detail.

Source https://stackoverflow.com/questions/56650573

QUESTION

Why is Trust Region Policy Optimization an on-policy algorithm?

Asked 2019-Mar-27 at 13:47

I'm wondering why Trust Region Policy Optimization is an on-policy algorithm.

In my opinion, in TRPO we sample with the old policy, update the new policy, and apply importance sampling to correct the bias. Thus, it looks more like an off-policy algorithm. But recently, I read a paper which said:

In contrast to off-policy algorithms, on-policy methods require updating function approximators according to the currently followed policy. In particular, we will consider Trust Region Policy Optimization, an extension of traditional policy gradient methods using the natural gradient direction.

Is there any point I misunderstand?

...

ANSWER

Answered 2019-Mar-27 at 13:47

The key feature of on-policy methods is that they must use the estimated policy in order to interact with the environment. Trust Region Policy Optimization effectively acquires samples (i.e., interacts with the environment) using the current policy, then updates the policy and uses the new policy estimate in the next iteration.

So, the algorithm is using the estimated policy during the learning process, which is the definition of on-policy methods.
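
To make the role of the old policy concrete, here is a minimal sketch of the importance-weighted surrogate objective and the KL term that TRPO's trust region bounds, written in plain Python rather than TensorFlow. The function names and the toy numbers are illustrative assumptions and are not taken from the trpo repository.

```python
import math


def surrogate_objective(new_logp, old_logp, advantages):
    """Mean of ratio * advantage, with ratio = pi_new(a|s) / pi_old(a|s)."""
    ratios = [math.exp(n - o) for n, o in zip(new_logp, old_logp)]
    return sum(r * a for r, a in zip(ratios, advantages)) / len(advantages)


def categorical_kl(old_probs, new_probs):
    """KL(old || new) for one state's discrete action distribution."""
    return sum(p * math.log(p / q) for p, q in zip(old_probs, new_probs) if p > 0)


# Samples come from the current (old) policy; before any update, new == old.
old_logp = [math.log(0.5), math.log(0.5)]
advantages = [1.0, -0.5]
print(surrogate_objective(old_logp, old_logp, advantages))  # all ratios equal 1
print(categorical_kl([0.5, 0.5], [0.5, 0.5]))               # identical policies
print(categorical_kl([0.5, 0.5], [0.6, 0.4]))               # small policy change
```

In this reading, the ratio only corrects for the small gap between the policy that collected the batch and the candidate update, and fresh samples are gathered with the updated policy in the next iteration.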

Source https://stackoverflow.com/questions/55371106

QUESTION

Python- error in importing rllab

Asked 2018-Apr-14 at 04:58

I installed rllab successfully:

...

ANSWER

Answered 2018-Apr-14 at 04:58

I know this thread is quite old, but I started working on rllab lately and this is my understanding. rllab3 is a conda environment, similar to a virtual environment, as mentioned in the rllab documentation. It doesn't have the actual modules installed within it; you'd need to install them separately.
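
One way to check this from inside the activated environment is sketched below. It uses only the standard library; `rllab` is the module name from the question, and `CONDA_DEFAULT_ENV` is the variable conda sets when an environment is activated.

```python
import importlib.util
import os
import sys


def is_importable(module_name):
    """True if the active interpreter can locate a top-level module."""
    return importlib.util.find_spec(module_name) is not None


print("interpreter:", sys.executable)
print("conda env:", os.environ.get("CONDA_DEFAULT_ENV", "<none active>"))
print("rllab importable:", is_importable("rllab"))
```

If the last line reports False while the expected environment is active, the environment exists but the package is not on its import path.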

Source https://stackoverflow.com/questions/47123353

Community Discussions, Code Snippets contain sources that include Stack Exchange Network

Vulnerabilities

No vulnerabilities reported

Install trpo

You can download it from GitHub.
You can use trpo like any standard Python library. You will need to make sure that you have a development environment consisting of a Python distribution including header files, a compiler, pip, and git installed. Make sure that your pip, setuptools, and wheel are up to date. When using pip it is generally recommended to install packages in a virtual environment to avoid changes to the system.
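
As a sketch of the virtual-environment advice, the standard-library `venv` module can create an isolated environment from Python itself. The directory name is a placeholder, and because trpo ships no build file, installing it into the environment is left out here.

```python
import venv
from pathlib import Path


def create_env(env_dir, with_pip=True):
    """Create a virtual environment at env_dir and return the path of its config file."""
    venv.create(env_dir, with_pip=with_pip)
    return Path(env_dir) / "pyvenv.cfg"


# Example call (placeholder directory name):
# create_env("trpo-env")
```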

Support

For new features, suggestions, and bugs, create an issue on GitHub. If you have any questions, check and ask on the Stack Overflow community page.
