Popular New Releases in Reinforcement Learning
gym
0.23.1
AirSim
v1.7.0 - Linux
ml-agents
ML-Agents Release 19
pwnagotchi
Practical_RL
Spring 2020
Popular Libraries in Reinforcement Learning
by openai python
26869 NOASSERTION
A toolkit for developing and comparing reinforcement learning algorithms.
by dennybritz jupyter notebook
17019 MIT
Implementation of Reinforcement Learning Algorithms. Python, OpenAI Gym, Tensorflow. Exercises and Solutions to accompany Sutton's Book and David Silver's course.
by microsoft c++
12921 NOASSERTION
Open source simulator for autonomous vehicles built on Unreal Engine / Unity, from Microsoft AI & Research
by Unity-Technologies csharp
12332 NOASSERTION
Unity Machine Learning Agents Toolkit
by MorvanZhou python
6578 MIT
Simple reinforcement learning tutorials, with AI lessons in Chinese from 莫烦Python (MorvanZhou)
by openai python
6054 MIT
An educational resource to help anyone learn deep reinforcement learning.
by evilsocket javascript
4717 NOASSERTION
(⌐■_■) - Deep Reinforcement Learning instrumenting bettercap for WiFi pwning.
by yandexdataschool jupyter notebook
4686 Unlicense
A course in reinforcement learning in the wild
by thu-ml python
4494 MIT
An elegant PyTorch deep reinforcement learning library.
Trending New libraries in Reinforcement Learning
by DLR-RM python
3249 MIT
PyTorch version of Stable Baselines, reliable implementations of reinforcement learning algorithms.
by deepmind python
2517 Apache-2.0
A library of reinforcement learning components and agents
by kwai python
2041 Apache-2.0
[ICML 2021] DouZero: Mastering DouDizhu with Self-Play Deep Reinforcement Learning | DouDizhu AI
by datawhalechina python
1323 NOASSERTION
Notes on Hung-yi Lee's "Deep Reinforcement Learning" course; read online at: https://datawhalechina.github.io/leedeeprl-notes/
by deepmind python
774 Apache-2.0
by salesforce python
769 BSD-3-Clause
Foundation is a flexible, modular, and composable framework to model socio-economic behaviors and dynamics with both agents and governments. This framework can be used in conjunction with reinforcement learning to learn optimal economic policies, as done by the AI Economist (https://www.einstein.ai/the-ai-economist).
by PettingZoo-Team python
757 NOASSERTION
Gym for multi-agent reinforcement learning
by pfnet python
690 MIT
PFRL: a PyTorch-based deep reinforcement learning library
by DLR-RM python
639 MIT
A training framework for Stable Baselines3 reinforcement learning agents, with hyperparameter optimization and pre-trained agents included.
Top Authors in Reinforcement Learning
1
27 Libraries
10258
2
21 Libraries
43203
3
13 Libraries
14902
4
12 Libraries
14357
5
12 Libraries
1103
6
9 Libraries
752
7
9 Libraries
1942
8
8 Libraries
1445
9
7 Libraries
2478
10
7 Libraries
135
Trending Discussions on Reinforcement Learning
tensorboard not showing results using ray rllib
Why does my model not learn? Very high loss
Action masking for continuous action space in reinforcement learning
Using BatchedPyEnvironment in tf_agents
Keras GradientType: Calculating gradients with respect to the output node
RuntimeError: Found dtype Double but expected Float - PyTorch
What is the purpose of [np.arange(0, self.batch_size), action] after the neural network?
Weird-looking curve in DRL
keras-rl model with multiple outputs
no method matching logpdf when sampling from uniform distribution
QUESTION
tensorboard not showing results using ray rllib
Asked 2022-Mar-28 at 09:14
I am training a reinforcement learning model on Google Colab using tune and rllib.
At first I was able to show the training results using TensorBoard, but it is no longer working and I can't figure out what is causing it. I didn't change anything, so I feel a bit lost here.
What it shows (the directory is the right one):
The training phase:
1ray.init(ignore_reinit_error=True)
2
3tune.run("PPO",
4 config = {"env" : CustomEnv2,
5 # "evaluation_interval" : 2,
6 # "evaluation_num_episodes" : 2,
7 "num_workers" :1},
8 num_samples=1,
9 # checkpoint_at_end=True,
10 stop={"training_iteration": 10},
11 local_dir = './test1')
12
Plotting results:
12%load_ext tensorboard
13
14%tensorboard --logdir='/content/test1/PPO/PPO_CustomEnv2_024da_00000_0_2022-03-23_09-02-47'
15
ANSWER
Answered 2022-Mar-25 at 02:06
You are using RLlib, right? I actually don't see the TensorBoard event file (i.e. events.out.tfevents.xxx.xxx) in your path. Maybe you should check whether you have this file first.
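As a quick sanity check, the sketch below looks for TensorBoard event files under the Tune results directory before pointing TensorBoard at it; the path is copied from the question and may differ on your run:

import glob
import os

# Path taken from the question; replace with your own local_dir / experiment directory.
logdir = "/content/test1/PPO"

# Recursively collect TensorBoard event files written by the trials.
event_files = glob.glob(os.path.join(logdir, "**", "events.out.tfevents.*"), recursive=True)
print(event_files)

# If the list is empty, TensorBoard has nothing to display.
# If it is not empty, point TensorBoard at the experiment directory, e.g.:
#   %tensorboard --logdir /content/test1/PPO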
QUESTION
Why does my model not learn? Very high loss
Asked 2022-Mar-25 at 10:49
I built a simulation model where trucks collect garbage containers based on their fill level. I used OpenAI Gym and TensorFlow/Keras to create my deep reinforcement learning model, but my training has a very high loss. Where did I go wrong? Thanks in advance.
This is the Env:
1class Marltf(Env):
2 def __init__(self):
3
4 self.i= 0
5 self.containers1 = Container(3,3)
6 self.containers2 = Container(1,3)
7 self.containers3 = Container(3,1)
8 self.containers4 = Container(5,6)
9 self.containers5 = Container(8,6)
10 self.containers6 = Container(10,10)
11 self.containers7 = Container(11,11)
12 self.containers8 = Container(7,12)
13 self.passo = 0
14 self.containers2.lv = 2
15 self.containers3.lv = 4
16 self.containers5.lv = 4
17 self.containers6.lv = 1
18 self.containers8.lv = 2
19 self.shower_length= 300
20
21 self.containers = [self.containers1,self.containers2,self.containers3,self.containers4, self.containers5, self.containers6, self.containers7, self.containers8]
22 self.positions ={}
23 self.capacities ={}
24 self.camions= []
25 b = 0
26 for cont in self.containers:
27 b += cont.lv
28 reward = 0
29 nCamionFloat = 0
30 while b > 6:
31 b +=-10
32 nCamionFloat +=1
33 nCamionInt = int(nCamionFloat)
34
35 for ic in range(nCamionInt):
36 self.camions.append(Camion(1,1,None,ic))
37
38
39 for cam in self.camions:
40
41 self.positions[cam.name] = cam.position
42 self.capacities[cam.name] = 10
43
44
45 self.frames = []
46 self.cnt=0
47
48
49 self.mapp = Map(15,15,self.camions,self.containers)
50
51 self.state = (15*15)/5
52 self.action_space = gym.spaces.Discrete(4)
53 self.observation_space = Box(low = np.array([0]), high= np.array([51]))
54
55 def step(self, action):
56
57 moves = {0: (-1, 0),1: (1, 0),2: (0, -1),3: (0, 1)}
58
59 done = False
60
61 ic = 0
62 for cam in self.camions:
63 cam.position = (self.positions[ic][0],self.positions[ic][1])
64 cam.capacity = self.capacities[ic]
65
66 self.state += -5
67
68
69 mossa = moves[action]
70 x=self.camions[self.i].position
71 reward = 0
72 nuovaposizione = [mossa[0] + x[0],mossa[1] +x[1]]
73 self.shower_length -= 1
74 if self.mapp.mapp[nuovaposizione[0],nuovaposizione[1]] == -1:
75 reward += -5
76 self.state += -5
77
78 else:
79 self.mapp.mapp[x[0],x[1]] = 0
80 self.camions[self.i].position=nuovaposizione
81 self.mapp.mapp[nuovaposizione[0],nuovaposizione[1]] = 9
82 self.positions.update({self.camions[self.i].name : nuovaposizione})
83
84
85
86
87 reward += -1
88 self.state = -2
89
90
91 for contain in self.containers:
92 if self.camions[self.i].position[0] == contain.position[0] and camion.position[1] == contain.position[1] :
93
94 if contain.lv ==3 and self.camions[self.i].capacity >=3:
95 self.camions[self.i].reward += 100
96 self.camions[self.i].capacity += -3
97 self.capacities.update({self.camions[self.i].name : self.camions[self.i].capacity})
98 reward +=20
99
100 self.state +=20
101
102 contain.lv=0
103
104 elif contain.lv == 2 and self.camions[self.i].capacity >=2:
105 self.camions[self.i].reward += 50
106 self.camions[self.i].capacity += -2
107 self.capacities.update({self.camions[self.i].name : self.camions[self.i].capacity})
108 self.state +=10
109
110 reward += 50
111
112 contain.lv=0
113
114 elif contain.lv == 1 and self.camions[self.i].capacity >=1:
115
116 reward += 10
117 self.camions[self.i].reward +=5
118 self.camions[self.i].capacity += -1
119 self.capacities.update({self.camions[self.i].name : self.camions[self.i].capacity})
120 contain.lv=0
121 self.state+=1
122
123 elif contain.lv==4 and self.camions[self.i].capacity >=4:
124 reward +=50
125 self.camions[self.i].reward +=50
126 self.camions[self.i].capacity += -4
127 self.capacities.update({self.camions[self.i].name : self.camions[self.i].capacity})
128 self.state +=50
129 contain.lv=0
130
131
132 elif contain.lv==0 and self.camions[self.i].capacity >=4:
133 reward += -20
134 self.camions[self.i].reward +=-20
135 self.camions[self.i].capacity += 0
136 self.state += -20
137 contain.lv=0
138
139
140 if self.camions[self.i].capacity <=2:
141 self.camions[self.i].positions=(1,1)
142 self.positions.update({self.camions[self.i].name : (1,1)})
143
144 self.camions[self.i].capacity = 10
145 self.capacities.update({self.camions[self.i].name : self.camions[self.i].capacity})
146
147
148
149
150
151
152 if self.i ==1:
153 self.i= 0
154 self.i = 0
155 self.i = 0
156 elif self.i ==0:
157 self.i= 1
158
159
160 if self.shower_length <= 0:
161 done = True
162 else:
163 done = False
164
165
166 self.passo +=1
167
168
169
170
171
172
173 info = {}
174
175 return self.state,reward,done,info
176
177
178
179 def render(self, mode="human"):
180
181 BLACK = (0, 0, 0)
182 WHITE = (200, 200, 200)
183
184 WINDOW_HEIGHT = len(self.mapp.mapp[0]) *50
185 WINDOW_WIDTH = len(self.mapp.mapp[0]) *50
186
187 whiteC=pygame.image.load('white.jpg')
188 whiteC=pygame.transform.scale(whiteC,(50, 50))
189
190 greenC=pygame.image.load('green.jpg')
191 greenC=pygame.transform.scale(greenC,(50, 50))
192
193 yellowC=pygame.image.load('yellow.jpg')
194 yellowC=pygame.transform.scale(yellowC,(50, 50))
195
196 orangeC=pygame.image.load('orange.jpg')
197 orangeC=pygame.transform.scale(orangeC,(50, 50))
198
199 redC=pygame.image.load('red.jpg')
200 redC=pygame.transform.scale(redC,(50, 50))
201
202
203 gT=pygame.image.load('greenCamion.jpg')
204 gT=pygame.transform.scale(gT,(50, 50))
205
206 yT=pygame.image.load('yellowCamion.jpg')
207 yT=pygame.transform.scale(yT,(50, 50))
208
209 rT=pygame.image.load('redCamion.jpg')
210 rT=pygame.transform.scale(rT,(50, 50))
211
212
213
214
215 global SCREEN, CLOCK
216 pygame.init()
217 SCREEN = pygame.display.set_mode((WINDOW_WIDTH, WINDOW_HEIGHT))
218 CLOCK = pygame.time.Clock()
219 SCREEN.fill(BLACK)
220
221 pygame.draw.rect(SCREEN, WHITE, pygame.Rect( 10, 0, 50, 50))
222 blockSize = 50 #Set the size of the grid block
223
224 for i in range(0,len(self.mapp.mapp[0])):
225 for j in range(0,len(self.mapp.mapp[0])):
226 a=i*50
227 b=j*50
228
229 if self.mapp.mapp[i][j] == -1:
230 pygame.draw.rect(SCREEN, WHITE, pygame.Rect( a, b, 50, 50))
231
232 for c in self.camions :
233 if c.capacity > 6:
234 SCREEN.blit(gT, (c.position[0]*50, c.position[1]*50))
235
236 if c.capacity > 3 and c.capacity <= 6:
237 SCREEN.blit(yT, (c.position[0]*50, c.position[1]*50))
238
239 if c.capacity <= 3:
240 SCREEN.blit(rT, (c.position[0]*50, c.position[1]*50))
241
242
243 for contain in self.containers :
244 if contain.lv == 0:
245 SCREEN.blit(whiteC,(contain.position[0]*50 , contain.position[1]*50))
246
247 elif contain.lv == 1:
248 SCREEN.blit(greenC,(contain.position[0]*50 , contain.position[1]*50))
249
250 elif contain.lv == 2:
251 SCREEN.blit(yellowC,(contain.position[0]*50 , contain.position[1]*50))
252
253 elif contain.lv == 3:
254 SCREEN.blit(orangeC,(contain.position[0]*50 , contain.position[1]*50))
255
256 if contain.lv == 4:
257 SCREEN.blit(redC,(contain.position[0]*50 , contain.position[1]*50))
258
259
260
261 for x in range(0, WINDOW_WIDTH, blockSize):
262 for y in range(0, WINDOW_HEIGHT, blockSize):
263 rect = pygame.Rect(x, y, blockSize, blockSize)
264 pygame.draw.rect(SCREEN, WHITE, rect, 1)
265
266 pygame.display.flip()
267
268 view = pygame.surfarray.array3d(SCREEN)
269 view = view.transpose([1, 0, 2])
270
271 img_bgr = cv2.cvtColor(view, cv2.COLOR_RGB2BGR)
272
273
274
275
276
277 pygame.image.save(SCREEN, f"screenshot{self.cnt}.png")
278 self.cnt +=1
279 pygame.event.get()
280
281
282
283
284 def reset(self):
285 self.state = (15*15)/4
286 self.shower_length = 300
287
288 self.containers1.lv=3
289 self.containers2.lv=1
290 self.containers7.lv = 2
291 self.containers3.lv = 4
292 self.containers5.lv = 4
293 self.containers6.lv = 1
294 self.containers8.lv = 2
295 self.passo = 0
296 self.positions ={}
297 self.capacities ={}
298 self.camions= []
299 b = 0
300 for cont in self.containers:
301 b += cont.lv
302 reward = 0
303 nCamionFloat = 0
304 while b > 6:
305 b +=-10
306 nCamionFloat +=1
307 nCamionInt = int(nCamionFloat)
308
309 for ic in range(nCamionInt):
310 self.camions.append(Camion(1,1,None,ic))
311
312
313 for cam in self.camions:
314
315 self.positions[cam.name] = cam.position
316 self.capacities[cam.name] = 10
317
318 self.shower_length =60
319 self.cnt=0
320 self.i = 0
321
322
323
324
325
326
327
328 containers = [ containers1, containers2, containers3, containers4]
329 containers.append( containers1)
330
331
332states = env.observation_space.shape
333actions = env.action_space.n
334b = env.action_space.sample()
My model:
335def build_model(states,actions):
336 model = tf.keras.Sequential([
337 keras.layers.Dense(64, input_shape=states),
338 keras.layers.LeakyReLU(0.24,),
339 keras.layers.Dense(64),
340 keras.layers.LeakyReLU(0.24,),
341 keras.layers.Dense(32),
342 keras.layers.LeakyReLU(0.24,),
343
344 keras.layers.Dense(16),
345 keras.layers.LeakyReLU(0.24,),
346 keras.layers.Dense(8),
347 keras.layers.LeakyReLU(0.24,),
348
349 keras.layers.Dense(actions, activation='linear'),
350
351])
352 return model
353
354
355model = build_model(states, actions)
356model.compile(loss='mse', metrics=['accuracy'])
357
358def build_agent(model, actions):
359 policy = GreedyQPolicy()
360 memory = SequentialMemory(limit=10000, window_length=1)
361 dqn = DQNAgent(model=model, memory=memory, policy=policy,nb_actions=actions, nb_steps_warmup=10, target_model_update=1e-2)
362
363 return dqn
364
365
366dqn = build_agent(model, actions)
367dqn.compile(tf.keras.optimizers.Adadelta(
368 learning_rate=0.1, rho=0.95, epsilon=1e-07, name='Adadelta'), metrics= ["accuracy"]
369)
370
371a =dqn.fit(env, nb_steps=5000, visualize=True, verbose=2,)
372
373
The loss starts at around 50 and climbs to roughly 200.
ANSWER
Answered 2022-Mar-25 at 02:47The loss value does not really matter much in RL; a very high loss is actually normal. In RL, what we care about most is the reward.
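As a hedged illustration (not the poster's code), one way to see whether the agent is actually improving is to track the per-episode reward that keras-rl normally records in the History object returned by fit. The smoothing window below is an arbitrary choice, and the 'episode_reward' key is read defensively in case the installed version logs it differently:

import numpy as np

# Sketch: judge learning by the reward curve rather than the loss.
# `dqn` and `env` are the agent and environment built in the question above.
history = dqn.fit(env, nb_steps=5000, visualize=False, verbose=2)
rewards = np.array(history.history.get('episode_reward', []))
if rewards.size:
    window = min(20, rewards.size)
    moving_avg = np.convolve(rewards, np.ones(window) / window, mode='valid')
    print('moving-average episode reward:', moving_avg[-1])

If that moving average trends upward, the agent is learning even while the TD loss fluctuates or grows.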
QUESTION
Action masking for continuous action space in reinforcement learning
Asked 2022-Mar-17 at 08:28Is there a way to model action masking for continuous action spaces? I want to model economic problems with reinforcement learning. These problems often have continuous action and state spaces. In addition, the state often influences what actions are possible and, thus, the allowed actions change from step to step.
Simple example:
The agent has some wealth (continuous state) and decides how much to spend (continuous action). The next period's wealth is then wealth minus spending. But the agent is restricted by a budget constraint: it is not allowed to spend more than its wealth. What is the best way to model this?
What I tried: for discrete actions it is possible to use action masking, so at each time step I provided the agent with information about which actions are allowed and which are not. I also tried to do this with a continuous action space by providing lower and upper bounds on the allowed actions and clipping the actions sampled from the actor network (e.g. DDPG).
I am wondering whether this is a valid thing to do (it works in a simple toy model), because I did not find any RL library that implements it. Or is there a smarter way/best practice to pass the information about allowed actions to the agent?
ANSWER
Answered 2022-Mar-17 at 08:28I think you are on the right track. I've looked into masked actions and found two possible approaches: give a negative reward when trying to take an invalid action (without letting the environment evolve), or dive deeper into the neural network code and let the neural network output only valid actions. I've always considered this last approach as the most efficient, and your approach of introducing boundaries seems very similar to it. So as long as this is the type of mask (boundaries) you are looking for, I think you are good to go.
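To make the clipping idea concrete, here is a minimal sketch for the budget-constraint example (the function and variable names are hypothetical, and in a real environment the bounds would come from the current state):

import numpy as np

def clip_spending(raw_action, wealth):
    # Project the actor's raw output onto the feasible interval [0, wealth].
    return float(np.clip(raw_action, 0.0, wealth))

wealth = 0.8
raw_action = 1.3                              # actor proposes spending more than it has
spending = clip_spending(raw_action, wealth)  # -> 0.8
next_wealth = wealth - spending               # the budget constraint is never violated

An alternative that keeps gradients more informative near the boundary is to have the policy output a fraction in [0, 1] (e.g. via a sigmoid) and multiply it by the current wealth, so every raw action maps to a feasible one.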
QUESTION
Using BatchedPyEnvironment in tf_agents
Asked 2022-Feb-19 at 18:11I am trying to create a batched environment version of an SAC agent example from the Tensorflow Agents library, the original code can be found here. I am also using a custom environment.
I am pursuing a batched environment setup to better leverage GPU resources and speed up training. My understanding is that by passing batches of trajectories to the GPU, there will be less overhead incurred when passing data from the host (CPU) to the device (GPU).
My custom environment is called SacEnv
, and I attempt to create a batched environment like so:
1py_envs = [SacEnv() for _ in range(0, batch_size)]
2batched_env = batched_py_environment.BatchedPyEnvironment(envs=py_envs)
3tf_env = tf_py_environment.TFPyEnvironment(batched_env)
4
My hope is that this will create a batched environment consisting of a 'batch' of non-batched environments. However, I am receiving the following error when running the code:
ValueError: Cannot assign value to variable ' Accumulator:0': Shape mismatch.The variable shape (1,), and the assigned value shape (32,) are incompatible.
with the stack trace:
5Traceback (most recent call last):
6 File "/home/gary/Desktop/code/sac_test/sac_main2.py", line 370, in <module>
7 app.run(main)
8 File "/home/gary/anaconda3/envs/py39/lib/python3.9/site-packages/absl/app.py", line 312, in run
9 _run_main(main, args)
10 File "/home/gary/anaconda3/envs/py39/lib/python3.9/site-packages/absl/app.py", line 258, in _run_main
11 sys.exit(main(argv))
12 File "/home/gary/Desktop/code/sac_test/sac_main2.py", line 366, in main
13 train_eval(FLAGS.root_dir)
14 File "/home/gary/anaconda3/envs/py39/lib/python3.9/site-packages/gin/config.py", line 1605, in gin_wrapper
15 utils.augment_exception_message_and_reraise(e, err_str)
16 File "/home/gary/anaconda3/envs/py39/lib/python3.9/site-packages/gin/utils.py", line 41, in augment_exception_message_and_reraise
17 raise proxy.with_traceback(exception.__traceback__) from None
18 File "/home/gary/anaconda3/envs/py39/lib/python3.9/site-packages/gin/config.py", line 1582, in gin_wrapper
19 return fn(*new_args, **new_kwargs)
20 File "/home/gary/Desktop/code/sac_test/sac_main2.py", line 274, in train_eval
21 results = metric_utils.eager_compute(
22 File "/home/gary/anaconda3/envs/py39/lib/python3.9/site-packages/gin/config.py", line 1605, in gin_wrapper
23 utils.augment_exception_message_and_reraise(e, err_str)
24 File "/home/gary/anaconda3/envs/py39/lib/python3.9/site-packages/gin/utils.py", line 41, in augment_exception_message_and_reraise
25 raise proxy.with_traceback(exception.__traceback__) from None
26 File "/home/gary/anaconda3/envs/py39/lib/python3.9/site-packages/gin/config.py", line 1582, in gin_wrapper
27 return fn(*new_args, **new_kwargs)
28 File "/home/gary/anaconda3/envs/py39/lib/python3.9/site-packages/tf_agents/eval/metric_utils.py", line 163, in eager_compute
29 common.function(driver.run)(time_step, policy_state)
30 File "/home/gary/anaconda3/envs/py39/lib/python3.9/site-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
31 raise e.with_traceback(filtered_tb) from None
32 File "/home/gary/anaconda3/envs/py39/lib/python3.9/site-packages/tf_agents/drivers/dynamic_episode_driver.py", line 211, in run
33 return self._run_fn(
34 File "/home/gary/anaconda3/envs/py39/lib/python3.9/site-packages/tf_agents/utils/common.py", line 188, in with_check_resource_vars
35 return fn(*fn_args, **fn_kwargs)
36 File "/home/gary/anaconda3/envs/py39/lib/python3.9/site-packages/tf_agents/drivers/dynamic_episode_driver.py", line 238, in _run
37 tf.while_loop(
38 File "/home/gary/anaconda3/envs/py39/lib/python3.9/site-packages/tf_agents/drivers/dynamic_episode_driver.py", line 154, in loop_body
39 observer_ops = [observer(traj) for observer in self._observers]
40 File "/home/gary/anaconda3/envs/py39/lib/python3.9/site-packages/tf_agents/drivers/dynamic_episode_driver.py", line 154, in <listcomp>
41 observer_ops = [observer(traj) for observer in self._observers]
42 File "/home/gary/anaconda3/envs/py39/lib/python3.9/site-packages/tf_agents/metrics/tf_metric.py", line 93, in __call__
43 return self._update_state(*args, **kwargs)
44 File "/home/gary/anaconda3/envs/py39/lib/python3.9/site-packages/tf_agents/metrics/tf_metric.py", line 81, in _update_state
45 return self.call(*arg, **kwargs)
46ValueError: in user code:
47
48 File "/home/gary/anaconda3/envs/py39/lib/python3.9/site-packages/tf_agents/metrics/tf_metrics.py", line 176, in call *
49 self._return_accumulator.assign(
50
51 ValueError: Cannot assign value to variable ' Accumulator:0': Shape mismatch.The variable shape (1,), and the assigned value shape (32,) are incompatible.
52
53 In call to configurable 'eager_compute' (<function eager_compute at 0x7fa4d6e5e040>)
54 In call to configurable 'train_eval' (<function train_eval at 0x7fa4c8622dc0>)
55
I have dug through the tf_metric.py code to try to understand the error, but I have been unsuccessful. A related issue was solved when I added the batch size (32) to the initializer for the AverageReturnMetric instance, and this issue seems related.
The full code is:
55# coding=utf-8
56# Copyright 2020 The TF-Agents Authors.
57#
58# Licensed under the Apache License, Version 2.0 (the "License");
59# you may not use this file except in compliance with the License.
60# You may obtain a copy of the License at
61#
62# https://www.apache.org/licenses/LICENSE-2.0
63#
64# Unless required by applicable law or agreed to in writing, software
65# distributed under the License is distributed on an "AS IS" BASIS,
66# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
67# See the License for the specific language governing permissions and
68# limitations under the License.
69
70# Lint as: python2, python3
71
72r"""Train and Eval SAC.
73
74All hyperparameters come from the SAC paper
75https://arxiv.org/pdf/1812.05905.pdf
76
77To run:
78
79```bash
80tensorboard --logdir $HOME/tmp/sac/gym/HalfCheetah-v2/ --port 2223 &
81
82python tf_agents/agents/sac/examples/v2/train_eval.py \
83 --root_dir=$HOME/tmp/sac/gym/HalfCheetah-v2/ \
84 --alsologtostderr
85\```
86"""
87
88from __future__ import absolute_import
89from __future__ import division
90from __future__ import print_function
91
92from sac_env import SacEnv
93
94import os
95import time
96
97from absl import app
98from absl import flags
99from absl import logging
100
101import gin
102from six.moves import range
103import tensorflow as tf # pylint: disable=g-explicit-tensorflow-version-import
104
105from tf_agents.agents.ddpg import critic_network
106from tf_agents.agents.sac import sac_agent
107from tf_agents.agents.sac import tanh_normal_projection_network
108from tf_agents.drivers import dynamic_step_driver
109#from tf_agents.environments import suite_mujoco
110from tf_agents.environments import tf_py_environment
111from tf_agents.environments import batched_py_environment
112from tf_agents.eval import metric_utils
113from tf_agents.metrics import tf_metrics
114from tf_agents.networks import actor_distribution_network
115from tf_agents.policies import greedy_policy
116from tf_agents.policies import random_tf_policy
117from tf_agents.replay_buffers import tf_uniform_replay_buffer
118from tf_agents.utils import common
119from tf_agents.train.utils import strategy_utils
120
121
122flags.DEFINE_string('root_dir', os.getenv('TEST_UNDECLARED_OUTPUTS_DIR'),
123 'Root directory for writing logs/summaries/checkpoints.')
124flags.DEFINE_multi_string('gin_file', None, 'Path to the trainer config files.')
125flags.DEFINE_multi_string('gin_param', None, 'Gin binding to pass through.')
126
127FLAGS = flags.FLAGS
128
129gpus = tf.config.list_physical_devices('GPU')
130if gpus:
131 try:
132 for gpu in gpus:
133 tf.config.experimental.set_memory_growth(gpu, True)
134 logical_gpus = tf.config.experimental.list_logical_devices('GPU')
135 print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
136 except RuntimeError as e:
137 print(e)
138
139@gin.configurable
140def train_eval(
141 root_dir,
142 env_name='SacEnv',
143 # The SAC paper reported:
144 # Hopper and Cartpole results up to 1000000 iters,
145 # Humanoid results up to 10000000 iters,
146 # Other mujoco tasks up to 3000000 iters.
147 num_iterations=3000000,
148 actor_fc_layers=(256, 256),
149 critic_obs_fc_layers=None,
150 critic_action_fc_layers=None,
151 critic_joint_fc_layers=(256, 256),
152 # Params for collect
153 # Follow https://github.com/haarnoja/sac/blob/master/examples/variants.py
154 # HalfCheetah and Ant take 10000 initial collection steps.
155 # Other mujoco tasks take 1000.
156 # Different choices roughly keep the initial episodes about the same.
157 #initial_collect_steps=10000,
158 initial_collect_steps=2000,
159 collect_steps_per_iteration=1,
160 replay_buffer_capacity=31250, # 1000000 / 32
161 # Params for target update
162 target_update_tau=0.005,
163 target_update_period=1,
164 # Params for train
165 train_steps_per_iteration=1,
166 #batch_size=256,
167 batch_size=32,
168 actor_learning_rate=3e-4,
169 critic_learning_rate=3e-4,
170 alpha_learning_rate=3e-4,
171 td_errors_loss_fn=tf.math.squared_difference,
172 gamma=0.99,
173 reward_scale_factor=0.1,
174 gradient_clipping=None,
175 use_tf_functions=True,
176 # Params for eval
177 num_eval_episodes=30,
178 eval_interval=10000,
179 # Params for summaries and logging
180 train_checkpoint_interval=50000,
181 policy_checkpoint_interval=50000,
182 rb_checkpoint_interval=50000,
183 log_interval=1000,
184 summary_interval=1000,
185 summaries_flush_secs=10,
186 debug_summaries=False,
187 summarize_grads_and_vars=False,
188 eval_metrics_callback=None):
189 """A simple train and eval for SAC."""
190 root_dir = os.path.expanduser(root_dir)
191 train_dir = os.path.join(root_dir, 'train')
192 eval_dir = os.path.join(root_dir, 'eval')
193
194 train_summary_writer = tf.compat.v2.summary.create_file_writer(
195 train_dir, flush_millis=summaries_flush_secs * 1000)
196 train_summary_writer.set_as_default()
197
198 eval_summary_writer = tf.compat.v2.summary.create_file_writer(
199 eval_dir, flush_millis=summaries_flush_secs * 1000)
200 eval_metrics = [
201 tf_metrics.AverageReturnMetric(buffer_size=num_eval_episodes),
202 tf_metrics.AverageEpisodeLengthMetric(buffer_size=num_eval_episodes)
203 ]
204
205 global_step = tf.compat.v1.train.get_or_create_global_step()
206 with tf.compat.v2.summary.record_if(
207 lambda: tf.math.equal(global_step % summary_interval, 0)):
208
209
210 py_envs = [SacEnv() for _ in range(0, batch_size)]
211 batched_env = batched_py_environment.BatchedPyEnvironment(envs=py_envs)
212 tf_env = tf_py_environment.TFPyEnvironment(batched_env)
213
214 eval_py_envs = [SacEnv() for _ in range(0, batch_size)]
215 eval_batched_env = batched_py_environment.BatchedPyEnvironment(envs=eval_py_envs)
216 eval_tf_env = tf_py_environment.TFPyEnvironment(eval_batched_env)
217
218 time_step_spec = tf_env.time_step_spec()
219 observation_spec = time_step_spec.observation
220 action_spec = tf_env.action_spec()
221
222 strategy = strategy_utils.get_strategy(tpu=False, use_gpu=True)
223
224 with strategy.scope():
225 actor_net = actor_distribution_network.ActorDistributionNetwork(
226 observation_spec,
227 action_spec,
228 fc_layer_params=actor_fc_layers,
229 continuous_projection_net=tanh_normal_projection_network
230 .TanhNormalProjectionNetwork)
231 critic_net = critic_network.CriticNetwork(
232 (observation_spec, action_spec),
233 observation_fc_layer_params=critic_obs_fc_layers,
234 action_fc_layer_params=critic_action_fc_layers,
235 joint_fc_layer_params=critic_joint_fc_layers,
236 kernel_initializer='glorot_uniform',
237 last_kernel_initializer='glorot_uniform')
238
239 tf_agent = sac_agent.SacAgent(
240 time_step_spec,
241 action_spec,
242 actor_network=actor_net,
243 critic_network=critic_net,
244 actor_optimizer=tf.compat.v1.train.AdamOptimizer(
245 learning_rate=actor_learning_rate),
246 critic_optimizer=tf.compat.v1.train.AdamOptimizer(
247 learning_rate=critic_learning_rate),
248 alpha_optimizer=tf.compat.v1.train.AdamOptimizer(
249 learning_rate=alpha_learning_rate),
250 target_update_tau=target_update_tau,
251 target_update_period=target_update_period,
252 td_errors_loss_fn=td_errors_loss_fn,
253 gamma=gamma,
254 reward_scale_factor=reward_scale_factor,
255 gradient_clipping=gradient_clipping,
256 debug_summaries=debug_summaries,
257 summarize_grads_and_vars=summarize_grads_and_vars,
258 train_step_counter=global_step)
259 tf_agent.initialize()
260
261 # Make the replay buffer.
262 replay_buffer = tf_uniform_replay_buffer.TFUniformReplayBuffer(
263 data_spec=tf_agent.collect_data_spec,
264 batch_size=batch_size,
265 max_length=replay_buffer_capacity,
266 device="/device:GPU:0")
267 replay_observer = [replay_buffer.add_batch]
268
269 train_metrics = [
270 tf_metrics.NumberOfEpisodes(),
271 tf_metrics.EnvironmentSteps(),
272 tf_metrics.AverageReturnMetric(
273 buffer_size=num_eval_episodes, batch_size=tf_env.batch_size),
274 tf_metrics.AverageEpisodeLengthMetric(
275 buffer_size=num_eval_episodes, batch_size=tf_env.batch_size),
276 ]
277
278 eval_policy = greedy_policy.GreedyPolicy(tf_agent.policy)
279 initial_collect_policy = random_tf_policy.RandomTFPolicy(
280 tf_env.time_step_spec(), tf_env.action_spec())
281 collect_policy = tf_agent.collect_policy
282
283 train_checkpointer = common.Checkpointer(
284 ckpt_dir=train_dir,
285 agent=tf_agent,
286 global_step=global_step,
287 metrics=metric_utils.MetricsGroup(train_metrics, 'train_metrics'))
288 policy_checkpointer = common.Checkpointer(
289 ckpt_dir=os.path.join(train_dir, 'policy'),
290 policy=eval_policy,
291 global_step=global_step)
292 rb_checkpointer = common.Checkpointer(
293 ckpt_dir=os.path.join(train_dir, 'replay_buffer'),
294 max_to_keep=1,
295 replay_buffer=replay_buffer)
296
297 train_checkpointer.initialize_or_restore()
298 rb_checkpointer.initialize_or_restore()
299
300 initial_collect_driver = dynamic_step_driver.DynamicStepDriver(
301 tf_env,
302 initial_collect_policy,
303 observers=replay_observer + train_metrics,
304 num_steps=initial_collect_steps)
305
306 collect_driver = dynamic_step_driver.DynamicStepDriver(
307 tf_env,
308 collect_policy,
309 observers=replay_observer + train_metrics,
310 num_steps=collect_steps_per_iteration)
311
312 if use_tf_functions:
313 initial_collect_driver.run = common.function(initial_collect_driver.run)
314 collect_driver.run = common.function(collect_driver.run)
315 tf_agent.train = common.function(tf_agent.train)
316
317 if replay_buffer.num_frames() == 0:
318 # Collect initial replay data.
319 logging.info(
320 'Initializing replay buffer by collecting experience for %d steps '
321 'with a random policy.', initial_collect_steps)
322 initial_collect_driver.run()
323
324 results = metric_utils.eager_compute(
325 eval_metrics,
326 eval_tf_env,
327 eval_policy,
328 num_episodes=num_eval_episodes,
329 train_step=global_step,
330 summary_writer=eval_summary_writer,
331 summary_prefix='Metrics',
332 )
333 if eval_metrics_callback is not None:
334 eval_metrics_callback(results, global_step.numpy())
335 metric_utils.log_metrics(eval_metrics)
336
337 time_step = None
338 policy_state = collect_policy.get_initial_state(tf_env.batch_size)
339
340 timed_at_step = global_step.numpy()
341 time_acc = 0
342
343 # Prepare replay buffer as dataset with invalid transitions filtered.
344 def _filter_invalid_transition(trajectories, unused_arg1):
345 return ~trajectories.is_boundary()[0]
346 dataset = replay_buffer.as_dataset(
347 sample_batch_size=batch_size,
348 num_steps=2).unbatch().filter(
349 _filter_invalid_transition).batch(batch_size).prefetch(5)
350 # Dataset generates trajectories with shape [Bx2x...]
351 iterator = iter(dataset)
352
353 def train_step():
354 experience, _ = next(iterator)
355 return tf_agent.train(experience)
356
357 if use_tf_functions:
358 train_step = common.function(train_step)
359
360 global_step_val = global_step.numpy()
361 while global_step_val < num_iterations:
362 start_time = time.time()
363 time_step, policy_state = collect_driver.run(
364 time_step=time_step,
365 policy_state=policy_state,
366 )
367 for _ in range(train_steps_per_iteration):
368 train_loss = train_step()
369 time_acc += time.time() - start_time
370
371 global_step_val = global_step.numpy()
372
373 if global_step_val % log_interval == 0:
374 logging.info('step = %d, loss = %f', global_step_val,
375 train_loss.loss)
376 steps_per_sec = (global_step_val - timed_at_step) / time_acc
377 logging.info('%.3f steps/sec', steps_per_sec)
378 tf.compat.v2.summary.scalar(
379 name='global_steps_per_sec', data=steps_per_sec, step=global_step)
380 timed_at_step = global_step_val
381 time_acc = 0
382
383 for train_metric in train_metrics:
384 train_metric.tf_summaries(
385 train_step=global_step, step_metrics=train_metrics[:2])
386
387 if global_step_val % eval_interval == 0:
388 results = metric_utils.eager_compute(
389 eval_metrics,
390 eval_tf_env,
391 eval_policy,
392 num_episodes=num_eval_episodes,
393 train_step=global_step,
394 summary_writer=eval_summary_writer,
395 summary_prefix='Metrics',
396 )
397 if eval_metrics_callback is not None:
398 eval_metrics_callback(results, global_step_val)
399 metric_utils.log_metrics(eval_metrics)
400
401 if global_step_val % train_checkpoint_interval == 0:
402 train_checkpointer.save(global_step=global_step_val)
403
404 if global_step_val % policy_checkpoint_interval == 0:
405 policy_checkpointer.save(global_step=global_step_val)
406
407 if global_step_val % rb_checkpoint_interval == 0:
408 rb_checkpointer.save(global_step=global_step_val)
409 return train_loss
410
411
412def main(_):
413 tf.compat.v1.enable_v2_behavior()
414 logging.set_verbosity(logging.INFO)
415 gin.parse_config_files_and_bindings(FLAGS.gin_file, FLAGS.gin_param)
416 train_eval(FLAGS.root_dir)
417
418if __name__ == '__main__':
419 flags.mark_flag_as_required('root_dir')
420 app.run(main)
421
What is the appropriate way to create a batched environment for a custom, non-batched environment? I can share my custom environment, but I don't believe the issue lies there, as the code works fine when using a batch size of 1.
Also, any tips on increasing GPU utilization in reinforcement learning scenarios would be greatly appreciated. I have examined examples of using the TensorBoard profiler to measure GPU utilization, but these seem to require callbacks and a fit function, which don't appear to be applicable in RL use cases.
ANSWER
Answered 2022-Feb-19 at 18:11It turns out I neglected to pass batch_size
when initializing the AverageReturnMetric
and AverageEpisodeLengthMetric
instances.
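For concreteness, a sketch of that fix in the context of the script above: the eval metrics need to know the environment batch size, just as the posted train metrics already do. This assumes the batched eval environment exists before the metrics are built (otherwise the known batch size, e.g. 32, can be passed directly):

from tf_agents.metrics import tf_metrics

# Build the eval metrics with the batch size of the batched eval environment,
# so their return accumulators have shape (batch_size,) instead of (1,).
eval_metrics = [
    tf_metrics.AverageReturnMetric(
        buffer_size=num_eval_episodes, batch_size=eval_tf_env.batch_size),
    tf_metrics.AverageEpisodeLengthMetric(
        buffer_size=num_eval_episodes, batch_size=eval_tf_env.batch_size),
]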
QUESTION
Keras GradientType: Calculating gradients with respect to the output node
Asked 2022-Jan-19 at 10:54For starters: this question does not ask for help regarding reinforcement learning (RL); RL is only used as an example.
The Keras documentation contains an example actor-critic reinforcement learning implementation using Gradient Tape. Basically, they've created a model with two separate outputs: one for the actor (n
actions) and one for the critic (1
reward). The following lines describe the backpropagation process (found somewhere in the code example):
1# Backpropagation
2loss_value = sum(actor_losses) + sum(critic_losses)
3grads = tape.gradient(loss_value, model.trainable_variables)
4optimizer.apply_gradients(zip(grads, model.trainable_variables))
5
Despite the fact that the actor and critic losses are calculated differently, they sum up those two losses to obtain the final loss value used for calculating the gradients.
When looking at this code example, one question came to my mind: Is there a way to calculate the gradients of the output layer with respect to the corresponding losses, i.e. calculate the gradients of the first n
output nodes based on the actor loss and the gradients of the last output node based on the critic loss? To my understanding, this would be much more convenient than adding both (different!) losses and updating the gradients based on this cumulative approach. Do you agree?
ANSWER
Answered 2022-Jan-19 at 10:54Well, after some research I found the answer myself: It is possible to extract the trainable variables of a given layer based on the layer name. Then we can apply tape.gradient
and optimizer.apply_gradients
to the extracted set of trainable variables. My current solution is pretty slow, but it works. I just need to figure out how to improve its runtime.
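A self-contained sketch of that idea follows. The layer names, dummy model, and placeholder losses are invented for illustration and are not the Keras example's actual code: trainable variables are looked up per output head by layer name, and each head gets its own gradient and update from its own loss.

import tensorflow as tf
from tensorflow import keras

# Tiny two-headed model purely for illustration.
inputs = keras.Input(shape=(4,))
trunk = keras.layers.Dense(32, activation='relu')(inputs)
actor_out = keras.layers.Dense(2, activation='softmax', name='actor_out')(trunk)
critic_out = keras.layers.Dense(1, name='critic_out')(trunk)
model = keras.Model(inputs, [actor_out, critic_out])
optimizer = keras.optimizers.Adam(1e-3)

states = tf.random.normal((8, 4))     # dummy batch of states
returns = tf.random.normal((8, 1))    # dummy critic targets

actor_vars = model.get_layer('actor_out').trainable_variables
critic_vars = model.get_layer('critic_out').trainable_variables

with tf.GradientTape(persistent=True) as tape:
    probs, values = model(states)
    actor_loss = -tf.reduce_mean(tf.math.log(probs[:, 0] + 1e-8))   # placeholder loss
    critic_loss = tf.reduce_mean(tf.square(returns - values))       # placeholder loss

# One gradient computation and update per head, restricted to that head's variables.
optimizer.apply_gradients(zip(tape.gradient(actor_loss, actor_vars), actor_vars))
optimizer.apply_gradients(zip(tape.gradient(critic_loss, critic_vars), critic_vars))
del tape  # a persistent tape should be released explicitly

Note that this only updates the two output layers; the shared trunk would still need to be trained with one of the losses (or their sum), which is one reason the combined-loss approach in the Keras example remains common.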
QUESTION
RuntimeError: Found dtype Double but expected Float - PyTorch
Asked 2022-Jan-08 at 23:25I am new to PyTorch and I am working on a DQN for a time series using reinforcement learning. I needed a composite observation of the time series plus some sensor readings, so I merged two neural networks, and I am not sure whether that is what is breaking my loss.backward() or whether it is something else.
I know there are multiple questions with the same title, but none of them worked for me; maybe I am missing something.
First of all, this is my network:
1class DQN(nn.Module):
2 def __init__(self, list_shape, score_shape, n_actions):
3 super(DQN, self).__init__()
4
5 self.FeatureList = nn.Sequential(
6 nn.Conv1d(list_shape[1], 32, kernel_size=8, stride=4),
7 nn.ReLU(),
8 nn.Conv1d(32, 64, kernel_size=4, stride=2),
9 nn.ReLU(),
10 nn.Conv1d(64, 64, kernel_size=3, stride=1),
11 nn.ReLU(),
12 nn.Flatten()
13 )
14
15 self.FeatureScore = nn.Sequential(
16 nn.Linear(score_shape[1], 512),
17 nn.ReLU(),
18 nn.Linear(512, 128)
19 )
20
21 t_list_test = torch.zeros(list_shape)
22 t_score_test = torch.zeros(score_shape)
23 merge_shape = self.FeatureList(t_list_test).shape[1] + self.FeatureScore(t_score_test).shape[1]
24
25 self.FinalNN = nn.Sequential(
26 nn.Linear(merge_shape, 512),
27 nn.ReLU(),
28 nn.Linear(512, 128),
29 nn.ReLU(),
30 nn.Linear(128, n_actions),
31 )
32
33 def forward(self, list, score):
34 listOut = self.FeatureList(list)
35 scoreOut = self.FeatureScore(score)
36 MergedTensor = torch.cat((listOut,scoreOut),1)
37 return self.FinalNN(MergedTensor)
38
I have a function called calc_loss, and at its end it returns the MSE loss as below:
    print(state_action_values.dtype)
    print(expected_state_action_values.dtype)
    return nn.MSELoss()(state_action_values, expected_state_action_values)
and the print shows float32 and float64 respectively.
I get the error when I run loss.backward(), as below:
LEARNING_RATE = 0.01
optimizer = optim.Adam(net.parameters(), lr=LEARNING_RATE)

for i in range(50):
    optimizer.zero_grad()
    loss_v = calc_loss(sample(obs, 500, 200, 64), net, tgt_net)
    print(loss_v.dtype)
    print(loss_v)
    loss_v.backward()
    optimizer.step()
and the print output is as below:
torch.float64
tensor(1887.4831, dtype=torch.float64, grad_fn=)
Update 1:
I tried using a simpler model, but I got the same issue. When I tried to cast the inputs to Float, I got an error:
RuntimeError: expected scalar type Double but found Float
What makes the model expect Double?
Update 2:
I tried to add the line below at the top, right after the torch import, but I got the same RuntimeError: Found dtype Double but expected Float
>>> torch.set_default_tensor_type(torch.FloatTensor)
But when I used the DoubleTensor I got:
RuntimeError: Input type (torch.FloatTensor) and weight type (torch.DoubleTensor) should be the same or input should be a MKLDNN tensor and weight is a dense tensor
ANSWER
Answered 2022-Jan-08 at 23:25The issue wasn't in the input to the network but in the inputs to the MSELoss criterion, so it worked fine after casting them to float as below:
return nn.MSELoss()(state_action_values.float(), expected_state_action_values.float())
I decided to leave the answer up for beginners like me who might be stuck and who didn't expect to have to check the data types of the tensors going into the loss criterion.
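For other readers, a minimal standalone reproduction of the mismatch and of the fix (this is a sketch with made-up tensors, not the question's code): the network output is float32 while a target built from NumPy/float data ends up float64, and casting one side resolves it.

import torch
import torch.nn as nn

pred = torch.rand(4, requires_grad=True)         # float32, like a network output
target = torch.rand(4, dtype=torch.float64)      # float64, e.g. built from numpy data

# nn.MSELoss()(pred, target).backward() would reproduce the dtype error above.
loss = nn.MSELoss()(pred, target.float())        # cast so both sides are float32
loss.backward()
print(loss.dtype)                                # torch.float32

Casting the targets once, where they are created (e.g. torch.tensor(x, dtype=torch.float32)), avoids having to sprinkle .float() calls inside the loss computation.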
QUESTION
What is the purpose of [np.arange(0, self.batch_size), action] after the neural network?
Asked 2021-Dec-23 at 11:07I followed a PyTorch tutorial to learn reinforcement learning(TRAIN A MARIO-PLAYING RL AGENT) but I am confused about the following code:
1current_Q = self.net(state, model="online")[np.arange(0, self.batch_size), action] # Q_online(s,a)
2
What's the purpose of [np.arange(0, self.batch_size), action] after the neural network? (I know that TD_estimate takes in state and action; I am just confused about this on the programming side.) What is this usage (putting a list after self.net)?
More related code referenced from the tutorial:
2class MarioNet(nn.Module):
3
4def __init__(self, input_dim, output_dim):
5 super().__init__()
6 c, h, w = input_dim
7
8 if h != 84:
9 raise ValueError(f"Expecting input height: 84, got: {h}")
10 if w != 84:
11 raise ValueError(f"Expecting input width: 84, got: {w}")
12
13 self.online = nn.Sequential(
14 nn.Conv2d(in_channels=c, out_channels=32, kernel_size=8, stride=4),
15 nn.ReLU(),
16 nn.Conv2d(in_channels=32, out_channels=64, kernel_size=4, stride=2),
17 nn.ReLU(),
18 nn.Conv2d(in_channels=64, out_channels=64, kernel_size=3, stride=1),
19 nn.ReLU(),
20 nn.Flatten(),
21 nn.Linear(3136, 512),
22 nn.ReLU(),
23 nn.Linear(512, output_dim),
24 )
25
26 self.target = copy.deepcopy(self.online)
27
28 # Q_target parameters are frozen.
29 for p in self.target.parameters():
30 p.requires_grad = False
31
32def forward(self, input, model):
33 if model == "online":
34 return self.online(input)
35 elif model == "target":
36 return self.target(input)
37
self.net:
self.net = MarioNet(self.state_dim, self.action_dim).float()
Thanks for any help!
ANSWER
Answered 2021-Dec-23 at 11:07Essentially, what happens here is that the output of the net is being sliced to get the desired part of the Q table.
The (somewhat confusing) index of [np.arange(0, self.batch_size), action]
indexes each axis. So, for axis with index 1, we pick the item indicated by action
. For index 0, we pick all items between 0 and self.batch_size
.
If self.batch_size
is the same as the length of dimension 0 of this array, then this slice can be simplified to [:, action]
which is probably more familiar to most users.
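A tiny NumPy example of the same indexing pattern, with made-up numbers, may make this clearer:

import numpy as np

batch_size = 4
q_values = np.arange(12).reshape(batch_size, 3)   # pretend Q-values: 4 states x 3 actions
action = np.array([2, 0, 1, 2])                   # action taken in each sampled transition

# Pair row i with column action[i]: one Q(s_i, a_i) per batch element.
current_q = q_values[np.arange(0, batch_size), action]
print(current_q)                                  # [ 2  3  7 11]

PyTorch tensors support the same advanced indexing, which is why the tutorial can apply it directly to the network output.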
QUESTION
Weird-looking curve in DRL
Asked 2021-Dec-12 at 14:38I have a deep reinforcement learning agent that interacts with a customized environment, and I display the reward value for every episode using TensorBoard.
For some reason the curve jumps from step 17 to step 80 every time, and I cannot understand why; I don't even know what part of the code I should copy-paste here.
Anyone has any idea why it does that ?
ANSWER
Answered 2021-Dec-12 at 14:38It turns out the step number was being incremented elsewhere; after commenting out that line it works fine now.
QUESTION
keras-rl model with multiple outputs
Asked 2021-Dec-02 at 12:27I want to build a reinforcement learning model with Keras that needs to have two outputs. Can it be done the way plain Keras models do it, and is it even doable?
This is what I want to do:
1inp = Input(shape=(input_layer_size, ))
2x = Dense(hidden_layer_size, activation="relu")(inp)
3for i in range(nb_hidden_layer):
4 x = Dense(hidden_layer_size, activation="relu")(x)
5a1 = Dense(1, activation='sigmoid')(x)
6a2 = Dense(1, activation='sigmoid')(x)
7
8
ANSWER
Answered 2021-Dec-02 at 12:27yes, it is possible, just use:
1inp = Input(shape=(input_layer_size, ))
2x = Dense(hidden_layer_size, activation="relu")(inp)
3for i in range(nb_hidden_layer):
4 x = Dense(hidden_layer_size, activation="relu")(x)
5a1 = Dense(1, activation='sigmoid')(x)
6a2 = Dense(1, activation='sigmoid')(x)
7
8model = Model(inp, [a1,a2])
9
and pay attention to the order of your outputs so you don't mix them up.
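For reference, a minimal standalone version of that answer (the layer sizes and output names below are invented), showing how such a two-output model can be compiled with one loss per head when trained directly with Keras. keras-rl's built-in agents generally assume a single Q-value output per action, so plugging a two-headed model into them typically needs a custom agent or processor.

from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.models import Model

input_layer_size, hidden_layer_size, nb_hidden_layer = 8, 32, 2   # illustrative sizes

inp = Input(shape=(input_layer_size,))
x = Dense(hidden_layer_size, activation='relu')(inp)
for _ in range(nb_hidden_layer):
    x = Dense(hidden_layer_size, activation='relu')(x)
a1 = Dense(1, activation='sigmoid', name='head_1')(x)
a2 = Dense(1, activation='sigmoid', name='head_2')(x)

model = Model(inp, [a1, a2])
# One loss per named output; the targets passed to fit() must follow the same names/order.
model.compile(optimizer='adam',
              loss={'head_1': 'binary_crossentropy', 'head_2': 'binary_crossentropy'})
model.summary()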
QUESTION
no method matching logpdf when sampling from uniform distribution
Asked 2021-Nov-18 at 23:01I am trying to use reinforcement learning in Julia to teach a car, which is constantly being accelerated backwards (but starts with a positive velocity), to apply its brakes so that it gets as close as possible to a target distance before it starts moving backwards.
To do this, I am making use of POMDPs.jl
and crux.jl
which provides many solvers (I'm using DQN). I will list what I believe to be the relevant parts of the script first, and then more of it towards the end.
To define the MDP, I set the initial position, velocity, and force from the brakes as a uniform distribution over some values.
1@with_kw struct SliderMDP <: MDP{Array{Float32}, Array{Float32}}
2 x0 = Distributions.Uniform(0., 80.)# Distribution to sample initial position
3 v0 = Distributions.Uniform(0., 25.) # Distribution to sample initial velocity
4 d0 = Distributions.Uniform(0., 2.) # Distribution to sample brake force
5 ...
6end
7
My state holds the values of (position, velocity, brake force)
, and the initial state is given as:
function POMDPs.initialstate(mdp::SliderMDP)
    ImplicitDistribution((rng) -> Float32.([rand(rng, mdp.x0), rand(rng, mdp.v0), rand(rng, mdp.d0)]))
end
Then, I set up my DQN solver using crux.jl
and called a function to solve for the policy
solver_dqn = DQN(π=Q_network(), S=s, N=30000)
policy_dqn = solve(solver_dqn, mdp)
calling solve()
gives me the error MethodError: no method matching logpdf(::Distributions.Categorical{Float64, Vector{Float64}}, ::Nothing)
. I am quite sure that this comes from the initial state sampling, but I am not sure why or how to fix it. I have only been learning RL from various books and online lectures for a very short time, so any help regarding the error or the model I set up (or anything else I'm oblivious to) would be appreciated.
More comprehensive code:
Packages:
using POMDPs
using POMDPModelTools
using POMDPPolicies
using POMDPSimulators

using Parameters
using Random

using Crux
using Flux

using Distributions
Rest of it:
24@with_kw struct SliderMDP <: MDP{Array{Float32}, Array{Float32}}
25 x0 = Distributions.Uniform(0., 80.)# Distribution to sample initial position
26 v0 = Distributions.Uniform(0., 25.) # Distribution to sample initial velocity
27 d0 = Distributions.Uniform(0., 2.) # Distribution to sample brake force
28
29 m::Float64 = 1.
30 tension::Float64 = 3.
31 dmax::Float64 = 2.
32 target::Float64 = 80.
33 dt::Float64 = .05
34
35 γ::Float32 = 1.
36 actions::Vector{Float64} = [-.1, 0., .1]
37end
38
39function POMDPs.gen(env::SliderMDP, s, a, rng::AbstractRNG = Random.GLOBAL_RNG)
40 x, ẋ, d = s
41
42 if x >= env.target
43 a = .1
44 end
45 if d+a >= env.dmax || d+a <= 0
46 a = 0.
47 end
48
49 force = (d + env.tension) * -1
50 ẍ = force/env.m
51
52 # Simulation
53 x_ = x + env.dt * ẋ
54 ẋ_ = ẋ + env.dt * ẍ
55 d_ = d + a
56
57 sp = vcat(x_, ẋ_, d_)
58 reward = abs(env.target - x) * -1
59
60 return (sp=sp, r=reward)
61end
62
63
64
65function POMDPs.initialstate(mdp::SliderMDP)
66 ImplicitDistribution((rng) -> Float32.([rand(rng, mdp.x0), rand(rng, mdp.v0), rand(rng, mdp.d0)]))
67end
68
69POMDPs.isterminal(mdp::SliderMDP, s) = s[2] <= 0
70POMDPs.discount(mdp::SliderMDP) = mdp.γ
71
72mdp = SliderMDP();
73s = state_space(mdp); # Using Crux.jl
74
75function Q_network()
76 layer1 = Dense(3, 64, relu)
77 layer2 = Dense(64, 64, relu)
78 layer3 = Dense(64, length(3))
79 return DiscreteNetwork(Chain(layer1, layer2, layer3), [-.1, 0, .1])
80end
81
82solver_dqn = DQN(π=Q_network(), S=s, N=30000) # Using Crux.jl
83policy_dqn = solve(solver_dqn, mdp) # Error comes here
84
85
Stacktrace:
85policy_dqn
86MethodError: no method matching logpdf(::Distributions.Categorical{Float64, Vector{Float64}}, ::Nothing)
87
88Closest candidates are:
89
90logpdf(::Distributions.DiscreteNonParametric, !Matched::Real) at C:\Users\name\.julia\packages\Distributions\Xrm9e\src\univariate\discrete\discretenonparametric.jl:106
91
92logpdf(::Distributions.UnivariateDistribution{S} where S<:Distributions.ValueSupport, !Matched::AbstractArray) at deprecated.jl:70
93
94logpdf(!Matched::POMDPPolicies.PlaybackPolicy, ::Any) at C:\Users\name\.julia\packages\POMDPPolicies\wMOK3\src\playback.jl:34
95
96...
97
98logpdf(::Crux.ObjectCategorical, ::Float32)@utils.jl:16
99logpdf(::Crux.DistributionPolicy, ::Vector{Float64}, ::Float32)@policies.jl:305
100var"#exploration#133"(::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, ::typeof(Crux.exploration), ::Crux.DistributionPolicy, ::Vector{Float64})@policies.jl:302
101exploration@policies.jl:297[inlined]
102action(::Crux.DistributionPolicy, ::Vector{Float64})@policies.jl:294
103var"#exploration#136"(::Crux.DiscreteNetwork, ::Int64, ::typeof(Crux.exploration), ::Crux.MixedPolicy, ::Vector{Float64})@policies.jl:326
104var"#step!#173"(::Bool, ::Int64, ::typeof(Crux.step!), ::Dict{Symbol, Array}, ::Int64, ::Crux.Sampler{Main.workspace#2.SliderMDP, Vector{Float32}, Crux.DiscreteNetwork, Crux.ContinuousSpace{Tuple{Int64}}, Crux.DiscreteSpace})@sampler.jl:55
105var"#steps!#174"(::Int64, ::Bool, ::Int64, ::Bool, ::Bool, ::Bool, ::typeof(Crux.steps!), ::Crux.Sampler{Main.workspace#2.SliderMDP, Vector{Float32}, Crux.DiscreteNetwork, Crux.ContinuousSpace{Tuple{Int64}}, Crux.DiscreteSpace})@sampler.jl:108
106var"#fillto!#177"(::Int64, ::Bool, ::typeof(Crux.fillto!), ::Crux.ExperienceBuffer{Array}, ::Crux.Sampler{Main.workspace#2.SliderMDP, Vector{Float32}, Crux.DiscreteNetwork, Crux.ContinuousSpace{Tuple{Int64}}, Crux.DiscreteSpace}, ::Int64)@sampler.jl:156
107solve(::Crux.OffPolicySolver, ::Main.workspace#2.SliderMDP)@off_policy.jl:86
108top-level scope@Local: 1[inlined]
109
ANSWER
Answered 2021-Nov-18 at 23:01Short answer:
Change your output vector to Float32, i.e. Float32[-.1, 0, .1].
Long answer:
Crux creates a Distribution
over your network's output values, and at some point (policies.jl:298) samples a random value from it. It then converts this value to a Float32
. Later (utils.jl:15) it does a findfirst
to find the index of this value in the original output array (stored as objs
within the distribution), but because the original array is still Float64
, this fails and returns a nothing
. Hence the error.
I believe this (converting the sampled value but not the objs
array and/or not using an approximate equality check, i.e. findfirst(isapprox(x), d.objs)) to be a bug in the package, and I would encourage you to raise it as an issue on GitHub.
Community Discussions contain sources that include Stack Exchange Network
Tutorials and Learning Resources in Reinforcement Learning
Tutorials and Learning Resources are not available at this moment for Reinforcement Learning