I am trying out PyBrain's maze example.
My setup is:
from pybrain.rl.environments.mazes import Maze, MDPMazeTask
from pybrain.rl.learners.valuebased import ActionValueTable
from pybrain.rl.learners import Q
from pybrain.rl.agents import LearningAgent
from pybrain.rl.experiments import Experiment

envmatrix = [[...]]  # 9x9 occupancy grid, omitted here
env = Maze(envmatrix, (1, 8))
task = MDPMazeTask(env)
states_nr, actions_nr = 81, 4  # 9x9 maze, 4 actions
table = ActionValueTable(states_nr, actions_nr)
table.initialize(0.)
learner = Q()
agent = LearningAgent(table, learner)
experiment = Experiment(task, agent)
for i in range(1000):
    experiment.doInteractions(200)  # 200 interactions per run
    agent.learn()
    agent.reset()
Now, I am not confident in the results that I am getting.

The bottom-right corner (1, 8) is the absorbing state. I have put an additional punishment state at (1, 7) in mdp.py:
def getReward(self):
    """ Compute and return the current reward (i.e. corresponding to the last action performed). """
    if self.env.goal == self.env.perseus:
        self.env.reset()
        reward = 1
    elif self.env.perseus == (1, 7):
        reward = -1000
    else:
        reward = 0
    return reward
Now, I do not understand how, after 1000 runs with 200 interactions each, the agent can think that my punishment state is a good state (you can see the square is white).
I would like to see the values of every state, and the policy, after the final run. How do I do that? I have found that the line table.params.reshape(81,4).max(1).reshape(9,9) returns some values, but I am not sure whether those correspond to the values of the value function.
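A minimal sketch of one way to inspect this, assuming (as the reshape above suggests) that table.params holds one Q-value per (state, action) pair, 81 states by 4 actions:

import pylab

# Q-table as an (81 states) x (4 actions) array
q_values = table.params.reshape(81, 4)

# State values under the greedy policy: V(s) = max_a Q(s, a)
state_values = q_values.max(1).reshape(9, 9)
print(state_values)

# Heat map of the state values (lighter = higher value),
# similar to the plot in the PyBrain maze tutorial
pylab.pcolor(state_values)
pylab.colorbar()
pylab.show()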
Now I added another constraint: I made the agent always start from the same position, (1, 1), by adding

self.initPos = [(1, 1)]

in maze.py. Now I get this behaviour after 1000 runs, with each run having 200 interactions, which kind of makes sense: the robot tries to go around the wall from the other side, avoiding the state (1, 7).
So I was getting weird results because the agent used to start from random positions, which also included the punishment state.
EDIT:
Another point is that if it is desirable to spawn the agent randomly, then make sure it is not spawned in the punishment state, for example as in the sketch below.
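A sketch of one way to do that, assuming initPos is a plain list of (row, column) start positions, as the edit to maze.py above suggests:

# Remove the punishment state from the candidate start positions,
# so a random spawn can never begin an episode at (1, 7)
punishment_state = (1, 7)
if punishment_state in env.initPos:
    env.initPos.remove(punishment_state)
env.reset()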
Also, it seems then that

table.params.reshape(81,4).max(1).reshape(9,9)

returns the value of every state under the value function.
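If that is right, the greedy policy follows from the same array with argmax instead of max. A sketch, with the caveat that the mapping from action index to direction depends on the Maze environment's action ordering:

q_values = table.params.reshape(81, 4)

# Greedy policy: the index of the highest-valued action in each state
greedy_policy = q_values.argmax(1).reshape(9, 9)
print(greedy_policy)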