trpo | trust region policy optimization based on gym and tensorflow | Reinforcement Learning library
kandi X-RAY | trpo Summary
Trust region policy optimization based on gym and tensorflow; can run in distributed mode.
Top functions reviewed by kandi - BETA
- Get a single path
- Replaces the observation from the history
- Perform a step
- Process paths
- Discretize a signal
- Compute the likelihood ratio of a given dataset
- Slice a 2d array
- Computes the likelihood ratio of the input data
- Compute the logarithm of the log - likelihood
- Fit the model
- Create the network
- Compute the KL divergence of the first fixed distribution
- Computes the KL divergence between two distributions
- Predict for a given path
- Predict given path
- The observation space
- Convert a gym space
- Calculate the gradient of the loss function
- Return the shape of x
Community Discussions
Trending Discussions on trpo
QUESTION
I am currently trying to reproduce some results from your previous papers on my installation of flow. I ran into the following questions, where I am not clear about the exact parameters used in the experiments and the results given in the papers.
For [1], I expected to be able to reproduce the results by running stabilizing_highway.py from your repo (at commit "bc44b21"; I also tried the current version but could not find differences related to my questions). I expected the merge scenario to be the same as the one used in [2].
Where I already found differences in the papers/code was:
1) The reward function in [2] (2) is different from the one in [1] (6): the first uses a max and normalizing in the first part of the sum. Why this difference? Looking at the code, I interpret it as follows: depending on the evaluate flag, you either compute (a) the reward as the average speed over all vehicles in the simulation or (b) the function given in [2] (without the normalizing term on the speed), but with a value of alpha (eta2 in the code) = 0.1 (see merge.py, line 167, compute_reward). I could not find the alpha parameter given in the papers, so I assume the code version was used?
2) I further read the code as if you were calculating it by iterating over ALL vehicles in the simulation, not just the observed ones? This seems counterintuitive to me, using a reward function in a partially observed environment to train the agent by using information from the fully observed state information...!?
3) This leads to the next question: you eventually want to evaluate the reward as given when the evaluate flag is set, namely the average speed of all vehicles in the simulation, as given in Table 1 of [1]. Are these values calculated by averaging over the "speed" column in the emissions.csv file you can produce running the visualizer tool?
4) The next question is regarding the cumulative return in the Figures of [1] and [2]. In [1], Figure 3, in the merge scenarios, the cumulative returns have a max of around 500, while the max values in [2], Figure 5 are around 200000. Why this difference? Is it the different reward functions used? Please, could you provide the alpha values for both and verify which version is correct (paper or code)?
5) What I also observe looking at [1] Table 1, Merge1&2: ES clearly has the highest values of average speed, but TRPO and PPO have a better cumulative return. Does this suggest that the 40 rollouts for evaluation were not enough to get a representative mean value? Or that maximizing the training reward function does not necessarily give good evaluation results?
6) Some other parameters are unclear to me: In [1] Fig. 3, 50 rollouts are mentioned, while N_ROLLOUTS=20. What do you recommend using? In [1] A.2 Merge, T=400, while HORIZON=600, and [2] C. Simulations talks about 3600s. Looking at a replay in Sumo produced when running visualizer_rllib.py, the simulation terminates at time 120.40, which would match the HORIZON of 600 with time steps of 0.2s (this information is given in [2]). So I assume that, for this scenario, the horizon should be set much higher than in both [1] and the code, namely to 18000?
Thanks for any hints! KR M
[1] Vinitsky, E., Kreidieh, A., Le Flem, L., Kheterpal, N., Jang, K., Wu, F., ... & Bayen, A. M. (2018, October). Benchmarks for reinforcement learning in mixed-autonomy traffic. In Conference on Robot Learning (pp. 399-409)
[2] Kreidieh, Abdul Rahman, Cathy Wu, and Alexandre M. Bayen. "Dissipating stop-and-go waves in closed and open networks via deep reinforcement learning." In 2018 21st International Conference on Intelligent Transportation Systems (ITSC), pp. 1475-1480. IEEE, 2018.
...ANSWER
Answered 2019-Jul-02 at 20:26
Apologies for the delay in the answer.
1) The version described in the code is the one that was used. Paper [1] was written after paper [2] (despite one being published earlier), and we added a normalizing term to help standardize the learning rate across problems. The reward function is the one used in the codebase: when the evaluate flag is true, it computes the actual traffic statistic (i.e. speed), and when it is false, it computes the reward function we use at train time.
2) As you point out, using all of the vehicles in the reward function is a bad assumption; we obviously do not have access to all of that data (though you could imagine we are able to read it out through an induction loop). Future work will focus on removing this assumption.
3) You can do it this way. However, we just calculate it by running the experiment with the trained policy, storing all the vehicle speeds at each step, and then computing the result at the end of the experiment.
4) Unfortunately, both versions are "correct". As you point out, the difference has to do with the addition of the "eta" term in [2] and the normalization in [1].
5) It's as you say: the training reward function is not the same as the test reward function, so an algorithm that does well with the evaluate flag off may not do as well with the evaluate flag on. This is a choice we made, to have separate training and testing functions. You're welcome to use the testing function at train time!
6) Both should work; I suspect the N=20 in the codebase is something that crept in over time as people found that 50 was not necessary for that scenario. However, every RL algorithm does better with more rollouts, so setting N=50 won't hurt. As for the horizon, as far as I can tell from the codebase, sim_step is 0.5 and the horizon is 750, so the experiment should run for 375 seconds.
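The horizon figures quoted in this thread all follow from one relation: simulated time equals the number of steps multiplied by the step length. The sketch below only restates that arithmetic with the numbers from the question and the answer; the function names are illustrative and are not part of flow.

```python
def simulated_seconds(horizon: int, sim_step: float) -> float:
    """Simulated time covered by `horizon` steps of `sim_step` seconds each."""
    return horizon * sim_step


def horizon_for(duration_s: float, sim_step: float) -> int:
    """Number of steps needed to cover `duration_s` seconds of simulation."""
    return round(duration_s / sim_step)


if __name__ == "__main__":
    # Values from the question: HORIZON=600 with 0.2 s steps.
    print(simulated_seconds(600, 0.2))
    # Values from the answer: horizon 750 with sim_step 0.5.
    print(simulated_seconds(750, 0.5))
    # Steps needed for the 3600 s mentioned in [2] at 0.2 s per step.
    print(horizon_for(3600, 0.2))
```

With the question's values this gives roughly 120 seconds, consistent with the replay ending at time 120.40; the answer's values give 375 seconds, and covering 3600 seconds at 0.2 s per step would need 18000 steps.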
If you have more questions, please email the corresponding author (me)! I'd love to help you work through this in more detail.
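The evaluation procedure described in the answer (run the trained policy, store every vehicle's speed at each step, compute the result at the end) can be sketched as follows. This is a minimal illustration, not flow's code: `average_speed` and the nested-list input are hypothetical, and pooling all samples into one mean is an assumption, since the answer does not say whether the average is taken per step first.

```python
from typing import Iterable


def average_speed(speeds_per_step: Iterable[Iterable[float]]) -> float:
    """Mean speed over every (step, vehicle) sample recorded in a rollout.

    `speeds_per_step` holds one collection of vehicle speeds per simulation
    step. Samples are pooled (assumption), so steps with more vehicles
    contribute more to the result.
    """
    total = 0.0
    count = 0
    for step_speeds in speeds_per_step:
        for speed in step_speeds:
            total += speed
            count += 1
    if count == 0:
        raise ValueError("no speed samples recorded")
    return total / count


if __name__ == "__main__":
    # Two vehicles in the first step, one in the second (made-up numbers).
    recorded = [[10.0, 20.0], [30.0]]
    print(average_speed(recorded))
```

In an open network such as the merge scenario the number of vehicles changes from step to step, so a pooled mean and a mean of per-step means can differ; which one matches the published tables would need to be confirmed with the authors.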
QUESTION
I'm wondering why Trust Region Policy Optimization is an on-policy algorithm.
In my opinion, TRPO samples with the old policy, updates the new policy, and applies importance sampling to correct the bias. Thus, it seems more like an off-policy algorithm. But recently, I read a paper which said:
In contrast to off-policy algorithms, on-policy methods require updating function approximators according to the currently followed policy. In particular, we will consider Trust Region Policy Optimization, an extension of traditional policy gradient methods using the natural gradient direction.
Am I misunderstanding something?
...ANSWER
Answered 2019-Mar-27 at 13:47
The key feature of on-policy methods is that they must use the estimated policy in order to interact with the environment. In the case of Trust Region Policy Optimization, it effectively acquires samples (i.e., interacts with the environment) using the current policy, then updates the policy and uses the new policy estimate in the next iteration.
So, the algorithm is using the estimated policy during the learning process, which is the definition of on-policy methods.
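The importance-sampling ratio mentioned in the question can be made concrete. The sketch below assumes a discrete-action (categorical) policy parameterized by logits and uses NumPy; the function names are illustrative and not taken from this repository. It computes the TRPO surrogate objective (the mean of the likelihood ratio times the advantage) and the mean KL divergence between the data-collecting policy and a candidate policy.

```python
import numpy as np


def softmax(logits: np.ndarray) -> np.ndarray:
    """Row-wise softmax, shifted by the row max for numerical stability."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def surrogate_and_kl(logits_old, logits_new, actions, advantages):
    """Surrogate objective and mean KL(old || new) on one batch.

    The batch (states, actions, advantages) is assumed to have been
    collected with the *old* policy, i.e. the current one at sampling time.
    """
    p_old = softmax(logits_old)
    p_new = softmax(logits_new)
    rows = np.arange(len(actions))
    # Likelihood ratio pi_new(a|s) / pi_old(a|s) for the sampled actions.
    ratio = p_new[rows, actions] / p_old[rows, actions]
    surrogate = np.mean(ratio * advantages)
    kl = np.mean(np.sum(p_old * (np.log(p_old) - np.log(p_new)), axis=-1))
    return surrogate, kl


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    logits_old = rng.normal(size=(5, 3))      # 5 states, 3 actions
    actions = rng.integers(0, 3, size=5)
    advantages = rng.normal(size=5)
    # Candidate equal to the sampling policy: ratio is 1 and KL is 0.
    print(surrogate_and_kl(logits_old, logits_old, actions, advantages))
    # A slightly perturbed candidate policy.
    logits_new = logits_old + 0.1 * rng.normal(size=(5, 3))
    print(surrogate_and_kl(logits_old, logits_new, actions, advantages))
```

The ratio only compares a candidate policy against the policy that just collected the batch, within a small KL neighborhood. After the update the batch is discarded and fresh samples are drawn with the updated policy, which is what the answer means by on-policy.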
QUESTION
I installed rllab successfully:
...ANSWER
Answered 2018-Apr-14 at 04:58
I know this thread is quite old, but I started working on rllab lately and this is my understanding. rllab3 is a conda environment, similar to a virtual environment, as mentioned in the rllab documentation. It doesn't have the actual modules installed within it; you'd need to install them separately.
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported
Install trpo
You can use trpo like any standard Python library. You will need to make sure that you have a development environment consisting of a Python distribution including header files, a compiler, pip, and git installed. Make sure that your pip, setuptools, and wheel are up to date. When using pip it is generally recommended to install packages in a virtual environment to avoid changes to the system.