I'm researching GridWorld from a Q-learning perspective. I have a question regarding the following exercise:
1) In the grid-world example, rewards are positive for goals, negative
for running into the edge of the world, and zero the rest of the time.
Are the signs of these rewards important, or only the intervals
between them?
Keep in mind that Q-values are expected returns. The policy is extracted by choosing, in each state, the action that maximises the Q-function: pi(s) = argmax_a Q(s, a).
Notice that you can add a constant to all Q-values without affecting the policy: shifting every Q-value by the same amount leaves their ordering under the max unchanged. In fact, you can apply any affine transformation with a positive scale factor (Q' = a*Q + b, with a > 0) and your decisions will not change, since it is only the relative ordering of the Q-values, not their signs, that determines the greedy action.
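Here is a minimal sketch of that invariance, assuming a hypothetical Q-table over 6 states and 4 actions filled with random values; it checks that the greedy policy is identical before and after a positive affine transformation:

```python
import numpy as np

rng = np.random.default_rng(0)
Q = rng.normal(size=(6, 4))  # Q[s, a]: estimated action values (assumed example data)

def greedy_policy(q):
    # Extract the greedy policy: pi(s) = argmax_a Q(s, a)
    return q.argmax(axis=1)

a, b = 2.5, -7.0             # any a > 0 and any b preserve the argmax
Q_transformed = a * Q + b

# The greedy action in every state is unchanged by the transformation.
assert (greedy_policy(Q) == greedy_policy(Q_transformed)).all()
print(greedy_policy(Q))
```

Note that a > 0 matters: with a < 0 the ordering flips (argmax becomes argmin), and with a = 0 all actions become indistinguishable.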