Popular New Releases in Reinforcement Learning
gym
0.23.1
AirSim
v1.7.0 - Linux
ml-agents
ML-Agents Release 19
pwnagotchi
Practical_RL
Spring 2020
Popular Libraries in Reinforcement Learning
by openai python
26869 NOASSERTION
A toolkit for developing and comparing reinforcement learning algorithms.
by dennybritz jupyter notebook
17019 MIT
Implementation of Reinforcement Learning Algorithms. Python, OpenAI Gym, Tensorflow. Exercises and Solutions to accompany Sutton's Book and David Silver's course.
by microsoft c++
12921 NOASSERTION
Open source simulator for autonomous vehicles built on Unreal Engine / Unity, from Microsoft AI & Research
by Unity-Technologies csharp
12332 NOASSERTION
Unity Machine Learning Agents Toolkit
by MorvanZhou python
6578 MIT
Simple reinforcement learning tutorials, with AI lessons in Chinese from 莫烦Python (MorvanZhou)
by openai python
6054 MIT
An educational resource to help anyone learn deep reinforcement learning.
by evilsocket javascript
4717 NOASSERTION
(⌐■_■) - Deep Reinforcement Learning instrumenting bettercap for WiFi pwning.
by yandexdataschool jupyter notebook
4686 Unlicense
A course in reinforcement learning in the wild
by thu-ml python
4494 MIT
An elegant PyTorch deep reinforcement learning library.
Trending New libraries in Reinforcement Learning
by DLR-RM python
3249 MIT
PyTorch version of Stable Baselines, reliable implementations of reinforcement learning algorithms.
by deepmind python
2517 Apache-2.0
A library of reinforcement learning components and agents
by kwai python
2041 Apache-2.0
[ICML 2021] DouZero: Mastering DouDizhu with Self-Play Deep Reinforcement Learning | DouDizhu AI
by datawhalechina python
1323 NOASSERTION
Notes on Hung-yi Lee's "Deep Reinforcement Learning" course; read online at: https://datawhalechina.github.io/leedeeprl-notes/
by deepmind python
774 Apache-2.0
by salesforce python
769 BSD-3-Clause
Foundation is a flexible, modular, and composable framework to model socio-economic behaviors and dynamics with both agents and governments. This framework can be used in conjunction with reinforcement learning to learn optimal economic policies, as done by the AI Economist (https://www.einstein.ai/the-ai-economist).
by PettingZoo-Team python
757 NOASSERTION
Gym for multi-agent reinforcement learning
by pfnet python
690 MIT
PFRL: a PyTorch-based deep reinforcement learning library
by DLR-RM python
639 MIT
A training framework for Stable Baselines3 reinforcement learning agents, with hyperparameter optimization and pre-trained agents included.
Top Authors in Reinforcement Learning
1
27 Libraries
10258
2
21 Libraries
43203
3
13 Libraries
14902
4
12 Libraries
14357
5
12 Libraries
1103
6
9 Libraries
752
7
9 Libraries
1942
8
8 Libraries
1445
9
7 Libraries
2478
10
7 Libraries
135
Trending Discussions on Reinforcement Learning
tensorboard not showing results using ray rllib
Why does my model not learn? Very high loss
Action masking for continuous action space in reinforcement learning
Using BatchedPyEnvironment in tf_agents
Keras GradientType: Calculating gradients with respect to the output node
RuntimeError: Found dtype Double but expected Float - PyTorch
What is the purpose of [np.arange(0, self.batch_size), action] after the neural network?
Weird-looking curve in DRL
keras-rl model with multiple outputs
no method matching logpdf when sampling from uniform distribution
QUESTION
tensorboard not showing results using ray rllib
Asked 2022-Mar-28 at 09:14
I am training a reinforcement learning model on Google Colab using tune and rllib.
At first I was able to show the training results using TensorBoard, but it is no longer working and I can't figure out what is causing it. I didn't change anything, so I feel a bit lost here.
What it shows (the directory is the right one):
The training phase:
1ray.init(ignore_reinit_error=True)
2
3tune.run("PPO",
4 config = {"env" : CustomEnv2,
5 # "evaluation_interval" : 2,
6 # "evaluation_num_episodes" : 2,
7 "num_workers" :1},
8 num_samples=1,
9 # checkpoint_at_end=True,
10 stop={"training_iteration": 10},
11 local_dir = './test1')
12
Plotting results:
12%load_ext tensorboard
13
14%tensorboard --logdir='/content/test1/PPO/PPO_CustomEnv2_024da_00000_0_2022-03-23_09-02-47'
15
ANSWER
Answered 2022-Mar-25 at 02:06
You are using RLlib, right? I actually don't see the TensorBoard event file (i.e. events.out.tfevents.xxx.xxx) in your path. Maybe you should check whether you have this file first.
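As a quick sanity check, the sketch below looks for TensorBoard event files under the Tune results directory before pointing TensorBoard at it; the path is copied from the question and may differ on your run:

import glob
import os

# Path taken from the question; replace with your own local_dir / experiment directory.
logdir = "/content/test1/PPO"

# Recursively collect TensorBoard event files written by the trials.
event_files = glob.glob(os.path.join(logdir, "**", "events.out.tfevents.*"), recursive=True)
print(event_files)

# If the list is empty, TensorBoard has nothing to display.
# If it is not empty, point TensorBoard at the experiment directory, e.g.:
#   %tensorboard --logdir /content/test1/PPO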
QUESTION
Why does my model not learn? Very high loss
Asked 2022-Mar-25 at 10:49
I built a simulation model where trucks collect garbage containers based on their fill level. I used OpenAI Gym and TensorFlow/Keras to create my deep reinforcement learning model, but my training has a very high loss. Where did I go wrong? Thanks in advance.
This is the Env:
1class Marltf(Env):
2 def __init__(self):
3
4 self.i= 0
5 self.containers1 = Container(3,3)
6 self.containers2 = Container(1,3)
7 self.containers3 = Container(3,1)
8 self.containers4 = Container(5,6)
9 self.containers5 = Container(8,6)
10 self.containers6 = Container(10,10)
11 self.containers7 = Container(11,11)
12 self.containers8 = Container(7,12)
13 self.passo = 0
14 self.containers2.lv = 2
15 self.containers3.lv = 4
16 self.containers5.lv = 4
17 self.containers6.lv = 1
18 self.containers8.lv = 2
19 self.shower_length= 300
20
21 self.containers = [self.containers1,self.containers2,self.containers3,self.containers4, self.containers5, self.containers6, self.containers7, self.containers8]
22 self.positions ={}
23 self.capacities ={}
24 self.camions= []
25 b = 0
26 for cont in self.containers:
27 b += cont.lv
28 reward = 0
29 nCamionFloat = 0
30 while b > 6:
31 b +=-10
32 nCamionFloat +=1
33 nCamionInt = int(nCamionFloat)
34
35 for ic in range(nCamionInt):
36 self.camions.append(Camion(1,1,None,ic))
37
38
39 for cam in self.camions:
40
41 self.positions[cam.name] = cam.position
42 self.capacities[cam.name] = 10
43
44
45 self.frames = []
46 self.cnt=0
47
48
49 self.mapp = Map(15,15,self.camions,self.containers)
50
51 self.state = (15*15)/5
52 self.action_space = gym.spaces.Discrete(4)
53 self.observation_space = Box(low = np.array([0]), high= np.array([51]))
54
55 def step(self, action):
56
57 moves = {0: (-1, 0),1: (1, 0),2: (0, -1),3: (0, 1)}
58
59 done = False
60
61 ic = 0
62 for cam in self.camions:
63 cam.position = (self.positions[ic][0],self.positions[ic][1])
64 cam.capacity = self.capacities[ic]
65
66 self.state += -5
67
68
69 mossa = moves[action]
70 x=self.camions[self.i].position
71 reward = 0
72 nuovaposizione = [mossa[0] + x[0],mossa[1] +x[1]]
73 self.shower_length -= 1
74 if self.mapp.mapp[nuovaposizione[0],nuovaposizione[1]] == -1:
75 reward += -5
76 self.state += -5
77
78 else:
79 self.mapp.mapp[x[0],x[1]] = 0
80 self.camions[self.i].position=nuovaposizione
81 self.mapp.mapp[nuovaposizione[0],nuovaposizione[1]] = 9
82 self.positions.update({self.camions[self.i].name : nuovaposizione})
83
84
85
86
87 reward += -1
88 self.state = -2
89
90
91 for contain in self.containers:
92 if self.camions[self.i].position[0] == contain.position[0] and camion.position[1] == contain.position[1] :
93
94 if contain.lv ==3 and self.camions[self.i].capacity >=3:
95 self.camions[self.i].reward += 100
96 self.camions[self.i].capacity += -3
97 self.capacities.update({self.camions[self.i].name : self.camions[self.i].capacity})
98 reward +=20
99
100 self.state +=20
101
102 contain.lv=0
103
104 elif contain.lv == 2 and self.camions[self.i].capacity >=2:
105 self.camions[self.i].reward += 50
106 self.camions[self.i].capacity += -2
107 self.capacities.update({self.camions[self.i].name : self.camions[self.i].capacity})
108 self.state +=10
109
110 reward += 50
111
112 contain.lv=0
113
114 elif contain.lv == 1 and self.camions[self.i].capacity >=1:
115
116 reward += 10
117 self.camions[self.i].reward +=5
118 self.camions[self.i].capacity += -1
119 self.capacities.update({self.camions[self.i].name : self.camions[self.i].capacity})
120 contain.lv=0
121 self.state+=1
122
123 elif contain.lv==4 and self.camions[self.i].capacity >=4:
124 reward +=50
125 self.camions[self.i].reward +=50
126 self.camions[self.i].capacity += -4
127 self.capacities.update({self.camions[self.i].name : self.camions[self.i].capacity})
128 self.state +=50
129 contain.lv=0
130
131
132 elif contain.lv==0 and self.camions[self.i].capacity >=4:
133 reward += -20
134 self.camions[self.i].reward +=-20
135 self.camions[self.i].capacity += 0
136 self.state += -20
137 contain.lv=0
138
139
140 if self.camions[self.i].capacity <=2:
141 self.camions[self.i].positions=(1,1)
142 self.positions.update({self.camions[self.i].name : (1,1)})
143
144 self.camions[self.i].capacity = 10
145 self.capacities.update({self.camions[self.i].name : self.camions[self.i].capacity})
146
147
148
149
150
151
152 if self.i ==1:
153 self.i= 0
154 self.i = 0
155 self.i = 0
156 elif self.i ==0:
157 self.i= 1
158
159
160 if self.shower_length <= 0:
161 done = True
162 else:
163 done = False
164
165
166 self.passo +=1
167
168
169
170
171
172
173 info = {}
174
175 return self.state,reward,done,info
176
177
178
179 def render(self, mode="human"):
180
181 BLACK = (0, 0, 0)
182 WHITE = (200, 200, 200)
183
184 WINDOW_HEIGHT = len(self.mapp.mapp[0]) *50
185 WINDOW_WIDTH = len(self.mapp.mapp[0]) *50
186
187 whiteC=pygame.image.load('white.jpg')
188 whiteC=pygame.transform.scale(whiteC,(50, 50))
189
190 greenC=pygame.image.load('green.jpg')
191 greenC=pygame.transform.scale(greenC,(50, 50))
192
193 yellowC=pygame.image.load('yellow.jpg')
194 yellowC=pygame.transform.scale(yellowC,(50, 50))
195
196 orangeC=pygame.image.load('orange.jpg')
197 orangeC=pygame.transform.scale(orangeC,(50, 50))
198
199 redC=pygame.image.load('red.jpg')
200 redC=pygame.transform.scale(redC,(50, 50))
201
202
203 gT=pygame.image.load('greenCamion.jpg')
204 gT=pygame.transform.scale(gT,(50, 50))
205
206 yT=pygame.image.load('yellowCamion.jpg')
207 yT=pygame.transform.scale(yT,(50, 50))
208
209 rT=pygame.image.load('redCamion.jpg')
210 rT=pygame.transform.scale(rT,(50, 50))
211
212
213
214
215 global SCREEN, CLOCK
216 pygame.init()
217 SCREEN = pygame.display.set_mode((WINDOW_WIDTH, WINDOW_HEIGHT))
218 CLOCK = pygame.time.Clock()
219 SCREEN.fill(BLACK)
220
221 pygame.draw.rect(SCREEN, WHITE, pygame.Rect( 10, 0, 50, 50))
222 blockSize = 50 #Set the size of the grid block
223
224 for i in range(0,len(self.mapp.mapp[0])):
225 for j in range(0,len(self.mapp.mapp[0])):
226 a=i*50
227 b=j*50
228
229 if self.mapp.mapp[i][j] == -1:
230 pygame.draw.rect(SCREEN, WHITE, pygame.Rect( a, b, 50, 50))
231
232 for c in self.camions :
233 if c.capacity > 6:
234 SCREEN.blit(gT, (c.position[0]*50, c.position[1]*50))
235
236 if c.capacity > 3 and c.capacity <= 6:
237 SCREEN.blit(yT, (c.position[0]*50, c.position[1]*50))
238
239 if c.capacity <= 3:
240 SCREEN.blit(rT, (c.position[0]*50, c.position[1]*50))
241
242
243 for contain in self.containers :
244 if contain.lv == 0:
245 SCREEN.blit(whiteC,(contain.position[0]*50 , contain.position[1]*50))
246
247 elif contain.lv == 1:
248 SCREEN.blit(greenC,(contain.position[0]*50 , contain.position[1]*50))
249
250 elif contain.lv == 2:
251 SCREEN.blit(yellowC,(contain.position[0]*50 , contain.position[1]*50))
252
253 elif contain.lv == 3:
254 SCREEN.blit(orangeC,(contain.position[0]*50 , contain.position[1]*50))
255
256 if contain.lv == 4:
257 SCREEN.blit(redC,(contain.position[0]*50 , contain.position[1]*50))
258
259
260
261 for x in range(0, WINDOW_WIDTH, blockSize):
262 for y in range(0, WINDOW_HEIGHT, blockSize):
263 rect = pygame.Rect(x, y, blockSize, blockSize)
264 pygame.draw.rect(SCREEN, WHITE, rect, 1)
265
266 pygame.display.flip()
267
268 view = pygame.surfarray.array3d(SCREEN)
269 view = view.transpose([1, 0, 2])
270
271 img_bgr = cv2.cvtColor(view, cv2.COLOR_RGB2BGR)
272
273
274
275
276
277 pygame.image.save(SCREEN, f"screenshot{self.cnt}.png")
278 self.cnt +=1
279 pygame.event.get()
280
281
282
283
284 def reset(self):
285 self.state = (15*15)/4
286 self.shower_length = 300
287
288 self.containers1.lv=3
289 self.containers2.lv=1
290 self.containers7.lv = 2
291 self.containers3.lv = 4
292 self.containers5.lv = 4
293 self.containers6.lv = 1
294 self.containers8.lv = 2
295 self.passo = 0
296 self.positions ={}
297 self.capacities ={}
298 self.camions= []
299 b = 0
300 for cont in self.containers:
301 b += cont.lv
302 reward = 0
303 nCamionFloat = 0
304 while b > 6:
305 b +=-10
306 nCamionFloat +=1
307 nCamionInt = int(nCamionFloat)
308
309 for ic in range(nCamionInt):
310 self.camions.append(Camion(1,1,None,ic))
311
312
313 for cam in self.camions:
314
315 self.positions[cam.name] = cam.position
316 self.capacities[cam.name] = 10
317
318 self.shower_length =60
319 self.cnt=0
320 self.i = 0
321
322
323
324
325
326
327
328 containers = [ containers1, containers2, containers3, containers4]
329 containers.append( containers1)
330
331
332states = env.observation_space.shape
333actions = env.action_space.n
334b = env.action_space.sample()
My model:
335def build_model(states,actions):
336 model = tf.keras.Sequential([
337 keras.layers.Dense(64, input_shape=states),
338 keras.layers.LeakyReLU(0.24,),
339 keras.layers.Dense(64),
340 keras.layers.LeakyReLU(0.24,),
341 keras.layers.Dense(32),
342 keras.layers.LeakyReLU(0.24,),
343
344 keras.layers.Dense(16),
345 keras.layers.LeakyReLU(0.24,),
346 keras.layers.Dense(8),
347 keras.layers.LeakyReLU(0.24,),
348
349 keras.layers.Dense(actions, activation='linear'),
350
351])
352 return model
353
354
355model = build_model(states, actions)
356model.compile(loss='mse', metrics=['accuracy'])
357
358def build_agent(model, actions):
359 policy = GreedyQPolicy()
360 memory = SequentialMemory(limit=10000, window_length=1)
361 dqn = DQNAgent(model=model, memory=memory, policy=policy,nb_actions=actions, nb_steps_warmup=10, target_model_update=1e-2)
362
363 return dqn
364
365
366dqn = build_agent(model, actions)
367dqn.compile(tf.keras.optimizers.Adadelta(
368 learning_rate=0.1, rho=0.95, epsilon=1e-07, name='Adadelta'), metrics= ["accuracy"]
369)
370
371a =dqn.fit(env, nb_steps=5000, visualize=True, verbose=2,)
372
373
The loss starts at around 50 and climbs to roughly 200.
ANSWER
Answered 2022-Mar-25 at 02:47The loss value does not really matter much in RL; a very high loss is actually normal. In RL, what we care about most is the reward.
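As a hedged illustration (not the poster's code), one way to see whether the agent is actually improving is to track the per-episode reward that keras-rl normally records in the History object returned by fit. The smoothing window below is an arbitrary choice, and the 'episode_reward' key is read defensively in case the installed version logs it differently:

import numpy as np

# Sketch: judge learning by the reward curve rather than the loss.
# `dqn` and `env` are the agent and environment built in the question above.
history = dqn.fit(env, nb_steps=5000, visualize=False, verbose=2)
rewards = np.array(history.history.get('episode_reward', []))
if rewards.size:
    window = min(20, rewards.size)
    moving_avg = np.convolve(rewards, np.ones(window) / window, mode='valid')
    print('moving-average episode reward:', moving_avg[-1])

If that moving average trends upward, the agent is learning even while the TD loss fluctuates or grows.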
QUESTION
Action masking for continuous action space in reinforcement learning
Asked 2022-Mar-17 at 08:28Is there a way to model action masking for continuous action spaces? I want to model economic problems with reinforcement learning. These problems often have continuous action and state spaces. In addition, the state often influences what actions are possible and, thus, the allowed actions change from step to step.
Simple example:
The agent has some wealth (continuous state) and decides how much to spend (continuous action). The next period's wealth is then wealth minus spending. But the agent is restricted by a budget constraint: it is not allowed to spend more than its wealth. What is the best way to model this?
What I tried: for discrete actions it is possible to use action masking, so at each time step I provided the agent with information about which actions are allowed and which are not. I also tried to do this with a continuous action space by providing lower and upper bounds on the allowed actions and clipping the actions sampled from the actor network (e.g. DDPG).
I am wondering whether this is a valid thing to do (it works in a simple toy model), because I did not find any RL library that implements it. Or is there a smarter way/best practice to pass the information about allowed actions to the agent?
ANSWER
Answered 2022-Mar-17 at 08:28I think you are on the right track. I've looked into masked actions and found two possible approaches: give a negative reward when trying to take an invalid action (without letting the environment evolve), or dive deeper into the neural network code and let the neural network output only valid actions. I've always considered this last approach as the most efficient, and your approach of introducing boundaries seems very similar to it. So as long as this is the type of mask (boundaries) you are looking for, I think you are good to go.
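To make the clipping idea concrete, here is a minimal sketch for the budget-constraint example (the function and variable names are hypothetical, and in a real environment the bounds would come from the current state):

import numpy as np

def clip_spending(raw_action, wealth):
    # Project the actor's raw output onto the feasible interval [0, wealth].
    return float(np.clip(raw_action, 0.0, wealth))

wealth = 0.8
raw_action = 1.3                              # actor proposes spending more than it has
spending = clip_spending(raw_action, wealth)  # -> 0.8
next_wealth = wealth - spending               # the budget constraint is never violated

An alternative that keeps gradients more informative near the boundary is to have the policy output a fraction in [0, 1] (e.g. via a sigmoid) and multiply it by the current wealth, so every raw action maps to a feasible one.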
QUESTION
Using BatchedPyEnvironment in tf_agents
Asked 2022-Feb-19 at 18:11I am trying to create a batched environment version of an SAC agent example from the Tensorflow Agents library, the original code can be found here. I am also using a custom environment.
I am pursuing a batched environment setup to better leverage GPU resources and speed up training. My understanding is that by passing batches of trajectories to the GPU, there will be less overhead incurred when passing data from the host (CPU) to the device (GPU).
My custom environment is called SacEnv
, and I attempt to create a batched environment like so:
1py_envs = [SacEnv() for _ in range(0, batch_size)]
2batched_env = batched_py_environment.BatchedPyEnvironment(envs=py_envs)
3tf_env = tf_py_environment.TFPyEnvironment(batched_env)
4
My hope is that this will create a batched environment consisting of a 'batch' of non-batched environments. However, I am receiving the following error when running the code:
ValueError: Cannot assign value to variable ' Accumulator:0': Shape mismatch.The variable shape (1,), and the assigned value shape (32,) are incompatible.
with the stack trace:
5Traceback (most recent call last):
6 File "/home/gary/Desktop/code/sac_test/sac_main2.py", line 370, in <module>
7 app.run(main)
8 File "/home/gary/anaconda3/envs/py39/lib/python3.9/site-packages/absl/app.py", line 312, in run
9 _run_main(main, args)
10 File "/home/gary/anaconda3/envs/py39/lib/python3.9/site-packages/absl/app.py", line 258, in _run_main
11 sys.exit(main(argv))
12 File "/home/gary/Desktop/code/sac_test/sac_main2.py", line 366, in main
13 train_eval(FLAGS.root_dir)
14 File "/home/gary/anaconda3/envs/py39/lib/python3.9/site-packages/gin/config.py", line 1605, in gin_wrapper
15 utils.augment_exception_message_and_reraise(e, err_str)
16 File "/home/gary/anaconda3/envs/py39/lib/python3.9/site-packages/gin/utils.py", line 41, in augment_exception_message_and_reraise
17 raise proxy.with_traceback(exception.__traceback__) from None
18 File "/home/gary/anaconda3/envs/py39/lib/python3.9/site-packages/gin/config.py", line 1582, in gin_wrapper
19 return fn(*new_args, **new_kwargs)
20 File "/home/gary/Desktop/code/sac_test/sac_main2.py", line 274, in train_eval
21 results = metric_utils.eager_compute(
22 File "/home/gary/anaconda3/envs/py39/lib/python3.9/site-packages/gin/config.py", line 1605, in gin_wrapper
23 utils.augment_exception_message_and_reraise(e, err_str)
24 File "/home/gary/anaconda3/envs/py39/lib/python3.9/site-packages/gin/utils.py", line 41, in augment_exception_message_and_reraise
25 raise proxy.with_traceback(exception.__traceback__) from None
26 File "/home/gary/anaconda3/envs/py39/lib/python3.9/site-packages/gin/config.py", line 1582, in gin_wrapper
27 return fn(*new_args, **new_kwargs)
28 File "/home/gary/anaconda3/envs/py39/lib/python3.9/site-packages/tf_agents/eval/metric_utils.py", line 163, in eager_compute
29 common.function(driver.run)(time_step, policy_state)
30 File "/home/gary/anaconda3/envs/py39/lib/python3.9/site-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
31 raise e.with_traceback(filtered_tb) from None
32 File "/home/gary/anaconda3/envs/py39/lib/python3.9/site-packages/tf_agents/drivers/dynamic_episode_driver.py", line 211, in run
33 return self._run_fn(
34 File "/home/gary/anaconda3/envs/py39/lib/python3.9/site-packages/tf_agents/utils/common.py", line 188, in with_check_resource_vars
35 return fn(*fn_args, **fn_kwargs)
36 File "/home/gary/anaconda3/envs/py39/lib/python3.9/site-packages/tf_agents/drivers/dynamic_episode_driver.py", line 238, in _run
37 tf.while_loop(
38 File "/home/gary/anaconda3/envs/py39/lib/python3.9/site-packages/tf_agents/drivers/dynamic_episode_driver.py", line 154, in loop_body
39 observer_ops = [observer(traj) for observer in self._observers]
40 File "/home/gary/anaconda3/envs/py39/lib/python3.9/site-packages/tf_agents/drivers/dynamic_episode_driver.py", line 154, in <listcomp>
41 observer_ops = [observer(traj) for observer in self._observers]
42 File "/home/gary/anaconda3/envs/py39/lib/python3.9/site-packages/tf_agents/metrics/tf_metric.py", line 93, in __call__
43 return self._update_state(*args, **kwargs)
44 File "/home/gary/anaconda3/envs/py39/lib/python3.9/site-packages/tf_agents/metrics/tf_metric.py", line 81, in _update_state
45 return self.call(*arg, **kwargs)
46ValueError: in user code:
47
48 File "/home/gary/anaconda3/envs/py39/lib/python3.9/site-packages/tf_agents/metrics/tf_metrics.py", line 176, in call *
49 self._return_accumulator.assign(
50
51 ValueError: Cannot assign value to variable ' Accumulator:0': Shape mismatch.The variable shape (1,), and the assigned value shape (32,) are incompatible.
52
53 In call to configurable 'eager_compute' (<function eager_compute at 0x7fa4d6e5e040>)
54 In call to configurable 'train_eval' (<function train_eval at 0x7fa4c8622dc0>)
55
I have dug through the tf_metric.py code to try to understand the error, but I have been unsuccessful. A related issue was solved when I added the batch size (32) to the initializer for the AverageReturnMetric instance, and this issue seems related.
The full code is:
55# coding=utf-8
56# Copyright 2020 The TF-Agents Authors.
57#
58# Licensed under the Apache License, Version 2.0 (the "License");
59# you may not use this file except in compliance with the License.
60# You may obtain a copy of the License at
61#
62# https://www.apache.org/licenses/LICENSE-2.0
63#
64# Unless required by applicable law or agreed to in writing, software
65# distributed under the License is distributed on an "AS IS" BASIS,
66# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
67# See the License for the specific language governing permissions and
68# limitations under the License.
69
70# Lint as: python2, python3
71
72r"""Train and Eval SAC.
73
74All hyperparameters come from the SAC paper
75https://arxiv.org/pdf/1812.05905.pdf
76
77To run:
78
79```bash
80tensorboard --logdir $HOME/tmp/sac/gym/HalfCheetah-v2/ --port 2223 &
81
82python tf_agents/agents/sac/examples/v2/train_eval.py \
83 --root_dir=$HOME/tmp/sac/gym/HalfCheetah-v2/ \
84 --alsologtostderr
85\```
86"""
87
88from __future__ import absolute_import
89from __future__ import division
90from __future__ import print_function
91
92from sac_env import SacEnv
93
94import os
95import time
96
97from absl import app
98from absl import flags
99from absl import logging
100
101import gin
102from six.moves import range
103import tensorflow as tf # pylint: disable=g-explicit-tensorflow-version-import
104
105from tf_agents.agents.ddpg import critic_network
106from tf_agents.agents.sac import sac_agent
107from tf_agents.agents.sac import tanh_normal_projection_network
108from tf_agents.drivers import dynamic_step_driver
109#from tf_agents.environments import suite_mujoco
110from tf_agents.environments import tf_py_environment
111from tf_agents.environments import batched_py_environment
112from tf_agents.eval import metric_utils
113from tf_agents.metrics import tf_metrics
114from tf_agents.networks import actor_distribution_network
115from tf_agents.policies import greedy_policy
116from tf_agents.policies import random_tf_policy
117from tf_agents.replay_buffers import tf_uniform_replay_buffer
118from tf_agents.utils import common
119from tf_agents.train.utils import strategy_utils
120
121
122flags.DEFINE_string('root_dir', os.getenv('TEST_UNDECLARED_OUTPUTS_DIR'),
123 'Root directory for writing logs/summaries/checkpoints.')
124flags.DEFINE_multi_string('gin_file', None, 'Path to the trainer config files.')
125flags.DEFINE_multi_string('gin_param', None, 'Gin binding to pass through.')
126
127FLAGS = flags.FLAGS
128
129gpus = tf.config.list_physical_devices('GPU')
130if gpus:
131 try:
132 for gpu in gpus:
133 tf.config.experimental.set_memory_growth(gpu, True)
134 logical_gpus = tf.config.experimental.list_logical_devices('GPU')
135 print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
136 except RuntimeError as e:
137 print(e)
138
139@gin.configurable
140def train_eval(
141 root_dir,
142 env_name='SacEnv',
143 # The SAC paper reported:
144 # Hopper and Cartpole results up to 1000000 iters,
145 # Humanoid results up to 10000000 iters,
146 # Other mujoco tasks up to 3000000 iters.
147 num_iterations=3000000,
148 actor_fc_layers=(256, 256),
149 critic_obs_fc_layers=None,
150 critic_action_fc_layers=None,
151 critic_joint_fc_layers=(256, 256),
152 # Params for collect
153 # Follow https://github.com/haarnoja/sac/blob/master/examples/variants.py
154 # HalfCheetah and Ant take 10000 initial collection steps.
155 # Other mujoco tasks take 1000.
156 # Different choices roughly keep the initial episodes about the same.
157 #initial_collect_steps=10000,
158 initial_collect_steps=2000,
159 collect_steps_per_iteration=1,
160 replay_buffer_capacity=31250, # 1000000 / 32
161 # Params for target update
162 target_update_tau=0.005,
163 target_update_period=1,
164 # Params for train
165 train_steps_per_iteration=1,
166 #batch_size=256,
167 batch_size=32,
168 actor_learning_rate=3e-4,
169 critic_learning_rate=3e-4,
170 alpha_learning_rate=3e-4,
171 td_errors_loss_fn=tf.math.squared_difference,
172 gamma=0.99,
173 reward_scale_factor=0.1,
174 gradient_clipping=None,
175 use_tf_functions=True,
176 # Params for eval
177 num_eval_episodes=30,
178 eval_interval=10000,
179 # Params for summaries and logging
180 train_checkpoint_interval=50000,
181 policy_checkpoint_interval=50000,
182 rb_checkpoint_interval=50000,
183 log_interval=1000,
184 summary_interval=1000,
185 summaries_flush_secs=10,
186 debug_summaries=False,
187 summarize_grads_and_vars=False,
188 eval_metrics_callback=None):
189 """A simple train and eval for SAC."""
190 root_dir = os.path.expanduser(root_dir)
191 train_dir = os.path.join(root_dir, 'train')
192 eval_dir = os.path.join(root_dir, 'eval')
193
194 train_summary_writer = tf.compat.v2.summary.create_file_writer(
195 train_dir, flush_millis=summaries_flush_secs * 1000)
196 train_summary_writer.set_as_default()
197
198 eval_summary_writer = tf.compat.v2.summary.create_file_writer(
199 eval_dir, flush_millis=summaries_flush_secs * 1000)
200 eval_metrics = [
201 tf_metrics.AverageReturnMetric(buffer_size=num_eval_episodes),
202 tf_metrics.AverageEpisodeLengthMetric(buffer_size=num_eval_episodes)
203 ]
204
205 global_step = tf.compat.v1.train.get_or_create_global_step()
206 with tf.compat.v2.summary.record_if(
207 lambda: tf.math.equal(global_step % summary_interval, 0)):
208
209
210 py_envs = [SacEnv() for _ in range(0, batch_size)]
211 batched_env = batched_py_environment.BatchedPyEnvironment(envs=py_envs)
212 tf_env = tf_py_environment.TFPyEnvironment(batched_env)
213
214 eval_py_envs = [SacEnv() for _ in range(0, batch_size)]
215 eval_batched_env = batched_py_environment.BatchedPyEnvironment(envs=eval_py_envs)
216 eval_tf_env = tf_py_environment.TFPyEnvironment(eval_batched_env)
217
218 time_step_spec = tf_env.time_step_spec()
219 observation_spec = time_step_spec.observation
220 action_spec = tf_env.action_spec()
221
222 strategy = strategy_utils.get_strategy(tpu=False, use_gpu=True)
223
224 with strategy.scope():
225 actor_net = actor_distribution_network.ActorDistributionNetwork(
226 observation_spec,
227 action_spec,
228 fc_layer_params=actor_fc_layers,
229 continuous_projection_net=tanh_normal_projection_network
230 .TanhNormalProjectionNetwork)
231 critic_net = critic_network.CriticNetwork(
232 (observation_spec, action_spec),
233 observation_fc_layer_params=critic_obs_fc_layers,
234 action_fc_layer_params=critic_action_fc_layers,
235 joint_fc_layer_params=critic_joint_fc_layers,
236 kernel_initializer='glorot_uniform',
237 last_kernel_initializer='glorot_uniform')
238
239 tf_agent = sac_agent.SacAgent(
240 time_step_spec,
241 action_spec,
242 actor_network=actor_net,
243 critic_network=critic_net,
244 actor_optimizer=tf.compat.v1.train.AdamOptimizer(
245 learning_rate=actor_learning_rate),
246 critic_optimizer=tf.compat.v1.train.AdamOptimizer(
247 learning_rate=critic_learning_rate),
248 alpha_optimizer=tf.compat.v1.train.AdamOptimizer(
249 learning_rate=alpha_learning_rate),
250 target_update_tau=target_update_tau,
251 target_update_period=target_update_period,
252 td_errors_loss_fn=td_errors_loss_fn,
253 gamma=gamma,
254 reward_scale_factor=reward_scale_factor,
255 gradient_clipping=gradient_clipping,
256 debug_summaries=debug_summaries,
257 summarize_grads_and_vars=summarize_grads_and_vars,
258 train_step_counter=global_step)
259 tf_agent.initialize()
260
261 # Make the replay buffer.
262 replay_buffer = tf_uniform_replay_buffer.TFUniformReplayBuffer(
263 data_spec=tf_agent.collect_data_spec,
264 batch_size=batch_size,
265 max_length=replay_buffer_capacity,
266 device="/device:GPU:0")
267 replay_observer = [replay_buffer.add_batch]
268
269 train_metrics = [
270 tf_metrics.NumberOfEpisodes(),
271 tf_metrics.EnvironmentSteps(),
272 tf_metrics.AverageReturnMetric(
273 buffer_size=num_eval_episodes, batch_size=tf_env.batch_size),
274 tf_metrics.AverageEpisodeLengthMetric(
275 buffer_size=num_eval_episodes, batch_size=tf_env.batch_size),
276 ]
277
278 eval_policy = greedy_policy.GreedyPolicy(tf_agent.policy)
279 initial_collect_policy = random_tf_policy.RandomTFPolicy(
280 tf_env.time_step_spec(), tf_env.action_spec())
281 collect_policy = tf_agent.collect_policy
282
283 train_checkpointer = common.Checkpointer(
284 ckpt_dir=train_dir,
285 agent=tf_agent,
286 global_step=global_step,
287 metrics=metric_utils.MetricsGroup(train_metrics, 'train_metrics'))
288 policy_checkpointer = common.Checkpointer(
289 ckpt_dir=os.path.join(train_dir, 'policy'),
290 policy=eval_policy,
291 global_step=global_step)
292 rb_checkpointer = common.Checkpointer(
293 ckpt_dir=os.path.join(train_dir, 'replay_buffer'),
294 max_to_keep=1,
295 replay_buffer=replay_buffer)
296
297 train_checkpointer.initialize_or_restore()
298 rb_checkpointer.initialize_or_restore()
299
300 initial_collect_driver = dynamic_step_driver.DynamicStepDriver(
301 tf_env,
302 initial_collect_policy,
303 observers=replay_observer + train_metrics,
304 num_steps=initial_collect_steps)
305
306 collect_driver = dynamic_step_driver.DynamicStepDriver(
307 tf_env,
308 collect_policy,
309 observers=replay_observer + train_metrics,
310 num_steps=collect_steps_per_iteration)
311
312 if use_tf_functions:
313 initial_collect_driver.run = common.function(initial_collect_driver.run)
314 collect_driver.run = common.function(collect_driver.run)
315 tf_agent.train = common.function(tf_agent.train)
316
317 if replay_buffer.num_frames() == 0:
318 # Collect initial replay data.
319 logging.info(
320 'Initializing replay buffer by collecting experience for %d steps '
321 'with a random policy.', initial_collect_steps)
322 initial_collect_driver.run()
323
324 results = metric_utils.eager_compute(
325 eval_metrics,
326 eval_tf_env,
327 eval_policy,
328 num_episodes=num_eval_episodes,
329 train_step=global_step,
330 summary_writer=eval_summary_writer,
331 summary_prefix='Metrics',
332 )
333 if eval_metrics_callback is not None:
334 eval_metrics_callback(results, global_step.numpy())
335 metric_utils.log_metrics(eval_metrics)
336
337 time_step = None
338 policy_state = collect_policy.get_initial_state(tf_env.batch_size)
339
340 timed_at_step = global_step.numpy()
341 time_acc = 0
342
343 # Prepare replay buffer as dataset with invalid transitions filtered.
344 def _filter_invalid_transition(trajectories, unused_arg1):
345 return ~trajectories.is_boundary()[0]
346 dataset = replay_buffer.as_dataset(
347 sample_batch_size=batch_size,
348 num_steps=2).unbatch().filter(
349 _filter_invalid_transition).batch(batch_size).prefetch(5)
350 # Dataset generates trajectories with shape [Bx2x...]
351 iterator = iter(dataset)
352
353 def train_step():
354 experience, _ = next(iterator)
355 return tf_agent.train(experience)
356
357 if use_tf_functions:
358 train_step = common.function(train_step)
359
360 global_step_val = global_step.numpy()
361 while global_step_val < num_iterations:
362 start_time = time.time()
363 time_step, policy_state = collect_driver.run(
364 time_step=time_step,
365 policy_state=policy_state,
366 )
367 for _ in range(train_steps_per_iteration):
368 train_loss = train_step()
369 time_acc += time.time() - start_time
370
371 global_step_val = global_step.numpy()
372
373 if global_step_val % log_interval == 0:
374 logging.info('step = %d, loss = %f', global_step_val,
375 train_loss.loss)
376 steps_per_sec = (global_step_val - timed_at_step) / time_acc
377 logging.info('%.3f steps/sec', steps_per_sec)
378 tf.compat.v2.summary.scalar(
379 name='global_steps_per_sec', data=steps_per_sec, step=global_step)
380 timed_at_step = global_step_val
381 time_acc = 0
382
383 for train_metric in train_metrics:
384 train_metric.tf_summaries(
385 train_step=global_step, step_metrics=train_metrics[:2])
386
387 if global_step_val % eval_interval == 0:
388 results = metric_utils.eager_compute(
389 eval_metrics,
390 eval_tf_env,
391 eval_policy,
392 num_episodes=num_eval_episodes,
393 train_step=global_step,
394 summary_writer=eval_summary_writer,
395 summary_prefix='Metrics',
396 )
397 if eval_metrics_callback is not None:
398 eval_metrics_callback(results, global_step_val)
399 metric_utils.log_metrics(eval_metrics)
400
401 if global_step_val % train_checkpoint_interval == 0:
402 train_checkpointer.save(global_step=global_step_val)
403
404 if global_step_val % policy_checkpoint_interval == 0:
405 policy_checkpointer.save(global_step=global_step_val)
406
407 if global_step_val % rb_checkpoint_interval == 0:
408 rb_checkpointer.save(global_step=global_step_val)
409 return train_loss
410
411
412def main(_):
413 tf.compat.v1.enable_v2_behavior()
414 logging.set_verbosity(logging.INFO)
415 gin.parse_config_files_and_bindings(FLAGS.gin_file, FLAGS.gin_param)
416 train_eval(FLAGS.root_dir)
417
418if __name__ == '__main__':
419 flags.mark_flag_as_required('root_dir')
420 app.run(main)
421
What is the appropriate way to create a batched environment for a custom, non-batched environment? I can share my custom environment, but I don't believe the issue lies there, as the code works fine when using a batch size of 1.
Also, any tips on increasing GPU utilization in reinforcement learning scenarios would be greatly appreciated. I have examined examples of using the TensorBoard profiler to measure GPU utilization, but these seem to require callbacks and a fit function, which don't appear to be applicable in RL use cases.
ANSWER
Answered 2022-Feb-19 at 18:11It turns out I neglected to pass batch_size
when initializing the AverageReturnMetric
and AverageEpisodeLengthMetric
instances.
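For concreteness, a sketch of that fix in the context of the script above: the eval metrics need to know the environment batch size, just as the posted train metrics already do. This assumes the batched eval environment exists before the metrics are built (otherwise the known batch size, e.g. 32, can be passed directly):

from tf_agents.metrics import tf_metrics

# Build the eval metrics with the batch size of the batched eval environment,
# so their return accumulators have shape (batch_size,) instead of (1,).
eval_metrics = [
    tf_metrics.AverageReturnMetric(
        buffer_size=num_eval_episodes, batch_size=eval_tf_env.batch_size),
    tf_metrics.AverageEpisodeLengthMetric(
        buffer_size=num_eval_episodes, batch_size=eval_tf_env.batch_size),
]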
QUESTION
Keras GradientType: Calculating gradients with respect to the output node
Asked 2022-Jan-19 at 10:54For starters: this question does not ask for help regarding reinforcement learning (RL); RL is only used as an example.
The Keras documentation contains an example actor-critic reinforcement learning implementation using Gradient Tape. Basically, they've created a model with two separate outputs: one for the actor (n
actions) and one for the critic (1
reward). The following lines describe the backpropagation process (found somewhere in the code example):
1# Backpropagation
2loss_value = sum(actor_losses) + sum(critic_losses)
3grads = tape.gradient(loss_value, model.trainable_variables)
4optimizer.apply_gradients(zip(grads, model.trainable_variables))
5
Despite the fact that the actor and critic losses are calculated differently, they sum up those two losses to obtain the final loss value used for calculating the gradients.
When looking at this code example, one question came to my mind: Is there a way to calculate the gradients of the output layer with respect to the corresponding losses, i.e. calculate the gradients of the first n
output nodes based on the actor loss and the gradients of the last output node based on the critic loss? To my understanding, this would be much more convenient than adding both (different!) losses and updating the gradients based on this cumulative approach. Do you agree?
ANSWER
Answered 2022-Jan-19 at 10:54Well, after some research I found the answer myself: It is possible to extract the trainable variables of a given layer based on the layer name. Then we can apply tape.gradient
and optimizer.apply_gradients
to the extracted set of trainable variables. My current solution is pretty slow, but it works. I just need to figure out how to improve its runtime.
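A self-contained sketch of that idea follows. The layer names, dummy model, and placeholder losses are invented for illustration and are not the Keras example's actual code: trainable variables are looked up per output head by layer name, and each head gets its own gradient and update from its own loss.

import tensorflow as tf
from tensorflow import keras

# Tiny two-headed model purely for illustration.
inputs = keras.Input(shape=(4,))
trunk = keras.layers.Dense(32, activation='relu')(inputs)
actor_out = keras.layers.Dense(2, activation='softmax', name='actor_out')(trunk)
critic_out = keras.layers.Dense(1, name='critic_out')(trunk)
model = keras.Model(inputs, [actor_out, critic_out])
optimizer = keras.optimizers.Adam(1e-3)

states = tf.random.normal((8, 4))     # dummy batch of states
returns = tf.random.normal((8, 1))    # dummy critic targets

actor_vars = model.get_layer('actor_out').trainable_variables
critic_vars = model.get_layer('critic_out').trainable_variables

with tf.GradientTape(persistent=True) as tape:
    probs, values = model(states)
    actor_loss = -tf.reduce_mean(tf.math.log(probs[:, 0] + 1e-8))   # placeholder loss
    critic_loss = tf.reduce_mean(tf.square(returns - values))       # placeholder loss

# One gradient computation and update per head, restricted to that head's variables.
optimizer.apply_gradients(zip(tape.gradient(actor_loss, actor_vars), actor_vars))
optimizer.apply_gradients(zip(tape.gradient(critic_loss, critic_vars), critic_vars))
del tape  # a persistent tape should be released explicitly

Note that this only updates the two output layers; the shared trunk would still need to be trained with one of the losses (or their sum), which is one reason the combined-loss approach in the Keras example remains common.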
QUESTION
RuntimeError: Found dtype Double but expected Float - PyTorch
Asked 2022-Jan-08 at 23:25I am new to PyTorch and I am working on a DQN for a time series using reinforcement learning. I needed a composite observation of the time series plus some sensor readings, so I merged two neural networks, and I am not sure whether that is what is breaking my loss.backward() or whether it is something else.
I know there are multiple questions with the same title, but none of them worked for me; maybe I am missing something.
First of all, this is my network:
1class DQN(nn.Module):
2 def __init__(self, list_shape, score_shape, n_actions):
3 super(DQN, self).__init__()
4
5 self.FeatureList = nn.Sequential(
6 nn.Conv1d(list_shape[1], 32, kernel_size=8, stride=4),
7 nn.ReLU(),
8 nn.Conv1d(32, 64, kernel_size=4, stride=2),
9 nn.ReLU(),
10 nn.Conv1d(64, 64, kernel_size=3, stride=1),
11 nn.ReLU(),
12 nn.Flatten()
13 )
14
15 self.FeatureScore = nn.Sequential(
16 nn.Linear(score_shape[1], 512),
17 nn.ReLU(),
18 nn.Linear(512, 128)
19 )
20
21 t_list_test = torch.zeros(list_shape)
22 t_score_test = torch.zeros(score_shape)
23 merge_shape = self.FeatureList(t_list_test).shape[1] + self.FeatureScore(t_score_test).shape[1]
24
25 self.FinalNN = nn.Sequential(
26 nn.Linear(merge_shape, 512),
27 nn.ReLU(),
28 nn.Linear(512, 128),
29 nn.ReLU(),
30 nn.Linear(128, n_actions),
31 )
32
33 def forward(self, list, score):
34 listOut = self.FeatureList(list)
35 scoreOut = self.FeatureScore(score)
36 MergedTensor = torch.cat((listOut,scoreOut),1)
37 return self.FinalNN(MergedTensor)
38
I have a function called calc_loss, and at its end it returns the MSE loss as below:
    print(state_action_values.dtype)
    print(expected_state_action_values.dtype)
    return nn.MSELoss()(state_action_values, expected_state_action_values)
and the print shows float32 and float64 respectively.
I get the error when I run loss.backward(), as below:
LEARNING_RATE = 0.01
optimizer = optim.Adam(net.parameters(), lr=LEARNING_RATE)

for i in range(50):
    optimizer.zero_grad()
    loss_v = calc_loss(sample(obs, 500, 200, 64), net, tgt_net)
    print(loss_v.dtype)
    print(loss_v)
    loss_v.backward()
    optimizer.step()
and the print output is as below:
torch.float64
tensor(1887.4831, dtype=torch.float64, grad_fn=)
Update 1:
I tried using a simpler model, but I got the same issue. When I tried to cast the inputs to Float, I got an error:
RuntimeError: expected scalar type Double but found Float
What makes the model expect Double?
Update 2:
I tried to add the line below at the top, right after the torch import, but I got the same RuntimeError: Found dtype Double but expected Float
>>> torch.set_default_tensor_type(torch.FloatTensor)
But when I used the DoubleTensor I got:
RuntimeError: Input type (torch.FloatTensor) and weight type (torch.DoubleTensor) should be the same or input should be a MKLDNN tensor and weight is a dense tensor
ANSWER
Answered 2022-Jan-08 at 23:25The issue wasn't in the input to the network but in the inputs to the MSELoss criterion, so it worked fine after casting them to float as below:
return nn.MSELoss()(state_action_values.float(), expected_state_action_values.float())
I decided to leave the answer up for beginners like me who might be stuck and who didn't expect to have to check the data types of the tensors going into the loss criterion.
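For other readers, a minimal standalone reproduction of the mismatch and of the fix (this is a sketch with made-up tensors, not the question's code): the network output is float32 while a target built from NumPy/float data ends up float64, and casting one side resolves it.

import torch
import torch.nn as nn

pred = torch.rand(4, requires_grad=True)         # float32, like a network output
target = torch.rand(4, dtype=torch.float64)      # float64, e.g. built from numpy data

# nn.MSELoss()(pred, target).backward() would reproduce the dtype error above.
loss = nn.MSELoss()(pred, target.float())        # cast so both sides are float32
loss.backward()
print(loss.dtype)                                # torch.float32

Casting the targets once, where they are created (e.g. torch.tensor(x, dtype=torch.float32)), avoids having to sprinkle .float() calls inside the loss computation.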
QUESTION
What is the purpose of [np.arange(0, self.batch_size), action] after the neural network?
Asked 2021-Dec-23 at 11:07I followed a PyTorch tutorial to learn reinforcement learning(TRAIN A MARIO-PLAYING RL AGENT) but I am confused about the following code:
1current_Q = self.net(state, model="online")[np.arange(0, self.batch_size), action] # Q_online(s,a)
2
What's the purpose of [np.arange(0, self.batch_size), action] after the neural network? (I know that TD_estimate takes in state and action; I am just confused about this on the programming side.) What is this usage (putting a list after self.net)?
More related code referenced from the tutorial:
2class MarioNet(nn.Module):
3
4def __init__(self, input_dim, output_dim):
5 super().__init__()
6 c, h, w = input_dim
7
8 if h != 84:
9 raise ValueError(f"Expecting input height: 84, got: {h}")
10 if w != 84:
11 raise ValueError(f"Expecting input width: 84, got: {w}")
12
13 self.online = nn.Sequential(
14 nn.Conv2d(in_channels=c, out_channels=32, kernel_size=8, stride=4),
15 nn.ReLU(),
16 nn.Conv2d(in_channels=32, out_channels=64, kernel_size=4, stride=2),
17 nn.ReLU(),
18 nn.Conv2d(in_channels=64, out_channels=64, kernel_size=3, stride=1),
19 nn.ReLU(),
20 nn.Flatten(),
21 nn.Linear(3136, 512),
22 nn.ReLU(),
23 nn.Linear(512, output_dim),
24 )
25
26 self.target = copy.deepcopy(self.online)
27
28 # Q_target parameters are frozen.
29 for p in self.target.parameters():
30 p.requires_grad = False
31
32def forward(self, input, model):
33 if model == "online":
34 return self.online(input)
35 elif model == "target":
36 return self.target(input)
37
self.net:
self.net = MarioNet(self.state_dim, self.action_dim).float()
Thanks for any help!
ANSWER
Answered 2021-Dec-23 at 11:07Essentially, what happens here is that the output of the net is being sliced to get the desired part of the Q table.
The (somewhat confusing) index of [np.arange(0, self.batch_size), action]
indexes each axis. So, for axis with index 1, we pick the item indicated by action
. For index 0, we pick all items between 0 and self.batch_size
.
If self.batch_size
is the same as the length of dimension 0 of this array, then this slice can be simplified to [:, action]
which is probably more familiar to most users.
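A tiny NumPy example of the same indexing pattern, with made-up numbers, may make this clearer:

import numpy as np

batch_size = 4
q_values = np.arange(12).reshape(batch_size, 3)   # pretend Q-values: 4 states x 3 actions
action = np.array([2, 0, 1, 2])                   # action taken in each sampled transition

# Pair row i with column action[i]: one Q(s_i, a_i) per batch element.
current_q = q_values[np.arange(0, batch_size), action]
print(current_q)                                  # [ 2  3  7 11]

PyTorch tensors support the same advanced indexing, which is why the tutorial can apply it directly to the network output.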
QUESTION
Weird-looking curve in DRL
Asked 2021-Dec-12 at 14:38I have a deep reinforcement learning agent that interacts with a customized environment, and I display the reward value for every episode using TensorBoard.
For some reason the curve jumps from step 17 to step 80 every time, and I cannot understand why; I don't even know what part of the code I should copy-paste here.
Anyone has any idea why it does that ?
ANSWER
Answered 2021-Dec-12 at 14:38It turns out the step number was being incremented elsewhere; after commenting out that line it works fine now.
QUESTION
keras-rl model with multiple outputs
Asked 2021-Dec-02 at 12:27I want to build a reinforcement learning model with Keras that needs to have two outputs. Can it be done the way plain Keras models do it, and is it even doable?
This is what I want to do:
1inp = Input(shape=(input_layer_size, ))
2x = Dense(hidden_layer_size, activation="relu")(inp)
3for i in range(nb_hidden_layer):
4 x = Dense(hidden_layer_size, activation="relu")(x)
5a1 = Dense(1, activation='sigmoid')(x)
6a2 = Dense(1, activation='sigmoid')(x)
7
8
ANSWER
Answered 2021-Dec-02 at 12:27yes, it is possible, just use:
1inp = Input(shape=(input_layer_size, ))
2x = Dense(hidden_layer_size, activation="relu")(inp)
3for i in range(nb_hidden_layer):
4 x = Dense(hidden_layer_size, activation="relu")(x)
5a1 = Dense(1, activation='sigmoid')(x)
6a2 = Dense(1, activation='sigmoid')(x)
7
8model = Model(inp, [a1,a2])
9
and pay attention to the order of your outputs so you don't mix them up.
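For reference, a minimal standalone version of that answer (the layer sizes and output names below are invented), showing how such a two-output model can be compiled with one loss per head when trained directly with Keras. keras-rl's built-in agents generally assume a single Q-value output per action, so plugging a two-headed model into them typically needs a custom agent or processor.

from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.models import Model

input_layer_size, hidden_layer_size, nb_hidden_layer = 8, 32, 2   # illustrative sizes

inp = Input(shape=(input_layer_size,))
x = Dense(hidden_layer_size, activation='relu')(inp)
for _ in range(nb_hidden_layer):
    x = Dense(hidden_layer_size, activation='relu')(x)
a1 = Dense(1, activation='sigmoid', name='head_1')(x)
a2 = Dense(1, activation='sigmoid', name='head_2')(x)

model = Model(inp, [a1, a2])
# One loss per named output; the targets passed to fit() must follow the same names/order.
model.compile(optimizer='adam',
              loss={'head_1': 'binary_crossentropy', 'head_2': 'binary_crossentropy'})
model.summary()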
QUESTION
no method matching logpdf when sampling from uniform distribution
Asked 2021-Nov-18 at 23:01I am trying to use reinforcement learning in Julia to teach a car, which is constantly being accelerated backwards (but starts with a positive velocity), to apply its brakes so that it gets as close as possible to a target distance before it starts moving backwards.
To do this, I am making use of POMDPs.jl
and crux.jl
which provides many solvers (I'm using DQN). I will list what I believe to be the relevant parts of the script first, and then more of it towards the end.
To define the MDP, I set the initial position, velocity, and force from the brakes as a uniform distribution over some values.
1@with_kw struct SliderMDP <: MDP{Array{Float32}, Array{Float32}}
2 x0 = Distributions.Uniform(0., 80.)# Distribution to sample initial position
3 v0 = Distributions.Uniform(0., 25.) # Distribution to sample initial velocity
4 d0 = Distributions.Uniform(0., 2.) # Distribution to sample brake force
5 ...
6end
7
My state holds the values of (position, velocity, brake force)
, and the initial state is given as:
function POMDPs.initialstate(mdp::SliderMDP)
    ImplicitDistribution((rng) -> Float32.([rand(rng, mdp.x0), rand(rng, mdp.v0), rand(rng, mdp.d0)]))
end
Then, I set up my DQN solver using crux.jl
and called a function to solve for the policy
solver_dqn = DQN(π=Q_network(), S=s, N=30000)
policy_dqn = solve(solver_dqn, mdp)
calling solve()
gives me the error MethodError: no method matching logpdf(::Distributions.Categorical{Float64, Vector{Float64}}, ::Nothing)
. I am quite sure that this comes from the initial state sampling, but I am not sure why or how to fix it. I have only been learning RL from various books and online lectures for a very short time, so any help regarding the error or the model I set up (or anything else I'm oblivious to) would be appreciated.
More comprehensive code:
Packages:
using POMDPs
using POMDPModelTools
using POMDPPolicies
using POMDPSimulators

using Parameters
using Random

using Crux
using Flux

using Distributions
Rest of it:
24@with_kw struct SliderMDP <: MDP{Array{Float32}, Array{Float32}}
25 x0 = Distributions.Uniform(0., 80.)# Distribution to sample initial position
26 v0 = Distributions.Uniform(0., 25.) # Distribution to sample initial velocity
27 d0 = Distributions.Uniform(0., 2.) # Distribution to sample brake force
28
29 m::Float64 = 1.
30 tension::Float64 = 3.
31 dmax::Float64 = 2.
32 target::Float64 = 80.
33 dt::Float64 = .05
34
35 γ::Float32 = 1.
36 actions::Vector{Float64} = [-.1, 0., .1]
37end
38
39function POMDPs.gen(env::SliderMDP, s, a, rng::AbstractRNG = Random.GLOBAL_RNG)
40 x, ẋ, d = s
41
42 if x >= env.target
43 a = .1
44 end
45 if d+a >= env.dmax || d+a <= 0
46 a = 0.
47 end
48
49 force = (d + env.tension) * -1
50 ẍ = force/env.m
51
52 # Simulation
53 x_ = x + env.dt * ẋ
54 ẋ_ = ẋ + env.dt * ẍ
55 d_ = d + a
56
57 sp = vcat(x_, ẋ_, d_)
58 reward = abs(env.target - x) * -1
59
60 return (sp=sp, r=reward)
61end
62
63
64
65function POMDPs.initialstate(mdp::SliderMDP)
66 ImplicitDistribution((rng) -> Float32.([rand(rng, mdp.x0), rand(rng, mdp.v0), rand(rng, mdp.d0)]))
67end
68
69POMDPs.isterminal(mdp::SliderMDP, s) = s[2] <= 0
70POMDPs.discount(mdp::SliderMDP) = mdp.γ
71
72mdp = SliderMDP();
73s = state_space(mdp); # Using Crux.jl
74
75function Q_network()
76 layer1 = Dense(3, 64, relu)
77 layer2 = Dense(64, 64, relu)
78 layer3 = Dense(64, length(3))
79 return DiscreteNetwork(Chain(layer1, layer2, layer3), [-.1, 0, .1])
80end
81
82solver_dqn = DQN(π=Q_network(), S=s, N=30000) # Using Crux.jl
83policy_dqn = solve(solver_dqn, mdp) # Error comes here
84
85
Stacktrace:
85policy_dqn
86MethodError: no method matching logpdf(::Distributions.Categorical{Float64, Vector{Float64}}, ::Nothing)
87
88Closest candidates are:
89
90logpdf(::Distributions.DiscreteNonParametric, !Matched::Real) at C:\Users\name\.julia\packages\Distributions\Xrm9e\src\univariate\discrete\discretenonparametric.jl:106
91
92logpdf(::Distributions.UnivariateDistribution{S} where S<:Distributions.ValueSupport, !Matched::AbstractArray) at deprecated.jl:70
93
94logpdf(!Matched::POMDPPolicies.PlaybackPolicy, ::Any) at C:\Users\name\.julia\packages\POMDPPolicies\wMOK3\src\playback.jl:34
95
96...
97
98logpdf(::Crux.ObjectCategorical, ::Float32)@utils.jl:16
99logpdf(::Crux.DistributionPolicy, ::Vector{Float64}, ::Float32)@policies.jl:305
100var"#exploration#133"(::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, ::typeof(Crux.exploration), ::Crux.DistributionPolicy, ::Vector{Float64})@policies.jl:302
101exploration@policies.jl:297[inlined]
102action(::Crux.DistributionPolicy, ::Vector{Float64})@policies.jl:294
103var"#exploration#136"(::Crux.DiscreteNetwork, ::Int64, ::typeof(Crux.exploration), ::Crux.MixedPolicy, ::Vector{Float64})@policies.jl:326
104var"#step!#173"(::Bool, ::Int64, ::typeof(Crux.step!), ::Dict{Symbol, Array}, ::Int64, ::Crux.Sampler{Main.workspace#2.SliderMDP, Vector{Float32}, Crux.DiscreteNetwork, Crux.ContinuousSpace{Tuple{Int64}}, Crux.DiscreteSpace})@sampler.jl:55
105var"#steps!#174"(::Int64, ::Bool, ::Int64, ::Bool, ::Bool, ::Bool, ::typeof(Crux.steps!), ::Crux.Sampler{Main.workspace#2.SliderMDP, Vector{Float32}, Crux.DiscreteNetwork, Crux.ContinuousSpace{Tuple{Int64}}, Crux.DiscreteSpace})@sampler.jl:108
106var"#fillto!#177"(::Int64, ::Bool, ::typeof(Crux.fillto!), ::Crux.ExperienceBuffer{Array}, ::Crux.Sampler{Main.workspace#2.SliderMDP, Vector{Float32}, Crux.DiscreteNetwork, Crux.ContinuousSpace{Tuple{Int64}}, Crux.DiscreteSpace}, ::Int64)@sampler.jl:156
107solve(::Crux.OffPolicySolver, ::Main.workspace#2.SliderMDP)@off_policy.jl:86
108top-level scope@Local: 1[inlined]
109
ANSWER
Answered 2021-Nov-18 at 23:01Short answer:
Change your output vector to Float32, i.e. Float32[-.1, 0, .1].
Long answer:
Crux creates a Distribution
over your network's output values, and at some point (policies.jl:298) samples a random value from it. It then converts this value to a Float32
. Later (utils.jl:15) it does a findfirst
to find the index of this value in the original output array (stored as objs
within the distribution), but because the original array is still Float64
, this fails and returns a nothing
. Hence the error.
I believe this (converting the sampled value but not the objs
array and/or not using an approximate equality check, i.e. findfirst(isapprox(x), d.objs)) to be a bug in the package, and I would encourage you to raise it as an issue on GitHub.
Community Discussions contain sources that include Stack Exchange Network
Tutorials and Learning Resources in Reinforcement Learning
Tutorials and Learning Resources are not available at this moment for Reinforcement Learning