Your robot will have two motors driving two treads. It will also have a simple trailer, with rotation sensors on its wheel and on the joint to the robot.
The state space of the robot will be the 8 readings of the rotation sensor on the joint. The action space will be 6 choices: go straight forward, go straight back, back left, back right, hard back left, hard back right. Your robot can move in a jerky, stop-and-go manner, so that each decision depends only on the current state. The reward for the robot is +1 for rolling the trailer forward and zero otherwise.
You should use Q-learning to train your robot. Use a discount of 0.9. You could also use the "Dyna" approach to speed up learning.
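As a rough illustration, the tabular Q-learning update could be sketched as follows. This is a minimal sketch, not a reference solution: it assumes the 8 joint-sensor readings are indexed 0 to 7 and the six actions 0 to 5. The function names, the learning rate of 0.1 and the exploration rate of 0.1 are illustrative assumptions; only the discount of 0.9 comes from the assignment.

```python
import random

N_STATES = 8    # readings of the joint rotation sensor
N_ACTIONS = 6   # forward, back, back left, back right, hard back left, hard back right
GAMMA = 0.9     # discount given in the assignment
ALPHA = 0.1     # assumed constant (non-decaying) learning rate
EPSILON = 0.1   # assumed exploration rate


def make_q_table():
    """Return an 8 x 6 table of Q values, all starting at zero."""
    return [[0.0] * N_ACTIONS for _ in range(N_STATES)]


def choose_action(q, state, rng=random):
    """Epsilon-greedy choice: usually the best known action, sometimes a random one."""
    if rng.random() < EPSILON:
        return rng.randrange(N_ACTIONS)
    row = q[state]
    return row.index(max(row))


def q_update(q, state, action, reward, next_state):
    """Move Q(state, action) a fraction ALPHA toward reward + GAMMA * max_a Q(next_state, a)."""
    target = reward + GAMMA * max(q[next_state])
    q[state][action] += ALPHA * (target - q[state][action])
```

On the robot, each step would read the joint sensor, call `choose_action`, run the motors, read the sensor again, and call `q_update` with the observed reward. Keeping `ALPHA` constant lets the table keep adapting when the surface or trailer changes.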
We will test your robot on different surfaces and with different trailer lengths. Your robot should learn continually (so use a non-decaying learning rate).
Questions:
Given an MDP and a policy (a mapping of states to actions), we can work out the value of each state by using linear algebra: under a fixed policy the values satisfy one linear equation per state.
Given a value function, we can get the policy by computing Q values. Q(s,a) is the expected value of taking action a from state s. The policy is to always choose, in state s, the action a that maximizes Q(s,a).
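The linear-algebra step can be sketched as solving (I - gamma P) V = R, where P holds the transition probabilities under the fixed policy and R the expected one-step rewards. The code below is a small pure-Python sketch; the names `solve_linear` and `evaluate_policy` and the two-state example are made up for illustration.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b]; copy rows so the inputs are not modified.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def evaluate_policy(p, r, gamma=0.9):
    """Return V for a fixed policy.

    p[s][s2] is the probability of moving from s to s2 under the policy,
    r[s] is the expected immediate reward in state s.
    """
    n = len(r)
    a = [[(1.0 if i == j else 0.0) - gamma * p[i][j] for j in range(n)]
         for i in range(n)]
    return solve_linear(a, r)


# Hypothetical example: state 0 always moves to state 1 (reward 0),
# state 1 stays in state 1 and earns reward 1 each step.
values = evaluate_policy([[0.0, 1.0], [0.0, 1.0]], [0.0, 1.0])
```

With a discount of 0.9, state 1 is worth 1 / (1 - 0.9) = 10 and state 0 is worth 0.9 * 10 = 9.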
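Reading a policy off a value function can be sketched like this, assuming the MDP's transition probabilities and rewards are known. The table layout (`p[s][a][s2]`, `r[s][a]`) and the function names are assumptions chosen for the example.

```python
def q_values(v, p, r, gamma=0.9):
    """Q(s,a) = r[s][a] + gamma * sum over s2 of p[s][a][s2] * v[s2]."""
    n = len(v)
    return [[r[s][a] + gamma * sum(p[s][a][s2] * v[s2] for s2 in range(n))
             for a in range(len(p[s]))]
            for s in range(n)]


def greedy_policy(q):
    """In each state, pick the action with the largest Q value."""
    return [row.index(max(row)) for row in q]


# Hypothetical two-state, two-action MDP.
# State 0: action 0 stays put, action 1 moves to state 1 (both reward 0).
# State 1: action 0 stays put (reward 1), action 1 moves back to state 0 (reward 0).
p = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [1.0, 0.0]]]
r = [[0.0, 0.0],
     [1.0, 0.0]]
v = [9.0, 10.0]
policy = greedy_policy(q_values(v, p, r))
```

Here the greedy policy moves from state 0 to state 1 and then stays there, which is the behaviour the value function rewards.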
We experimented with MDPs and the key concepts using a simulator called CHARLEE by Anya Bilska.
Truck pushing: iterate over the values. There are six movement choices. If the robot knows that half the time it goes slightly left, it moves hard left. The probability of that happening is 50 percent.
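One reading of "iterate over the values" is value iteration, where each state's value is repeatedly replaced by the best expected one-step reward plus the discounted value of where the action leads. The sketch below is written under that assumption. The tiny MDP, including its 50 percent outcome, is invented for illustration and is not the truck-pushing model from class.

```python
def value_iteration(p, r, gamma=0.9, tol=1e-9):
    """Repeat V(s) <- max over a of [r[s][a] + gamma * sum p[s][a][s2] * V(s2)] until stable."""
    n = len(p)
    v = [0.0] * n
    while True:
        new_v = [max(r[s][a] + gamma * sum(p[s][a][s2] * v[s2] for s2 in range(n))
                     for a in range(len(p[s])))
                 for s in range(n)]
        if max(abs(x - y) for x, y in zip(new_v, v)) < tol:
            return new_v
        v = new_v


# Hypothetical MDP: from state 0, action 1 reaches state 1 only half the time.
p = [[[1.0, 0.0], [0.5, 0.5]],
     [[0.0, 1.0], [1.0, 0.0]]]
r = [[0.0, 0.0],
     [1.0, 0.0]]
v = value_iteration(p, r)
```

Because the move out of state 0 succeeds only 50 percent of the time, its value settles at 4.5 / 0.55, roughly 8.18, rather than the 9 it would have if the move always worked.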
Each time, the reward counts for less because it is multiplied by the discount again, and v is defined in terms of itself: 1st time, 0.9; 2nd time, 0.81; 3rd time, 0.729; etc. This keeps the total reward from going to infinity.
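The arithmetic can be checked with a short sketch. The function name `discounted_return` is an assumption for the example, and it uses the common convention that the first reward is weighted by 1 and a reward k steps later by 0.9 to the power k.

```python
GAMMA = 0.9


def discounted_return(rewards, gamma=GAMMA):
    """Sum of gamma**k * reward for the k-th reward in the sequence (k starts at 0)."""
    return sum(gamma ** k * reward for k, reward in enumerate(rewards))


# Weights applied to rewards one, two and three steps ahead: 0.9, 0.81, 0.729.
weights = [GAMMA ** k for k in (1, 2, 3)]

# Even with a reward of +1 on every one of 1000 steps, the total stays near
# the geometric-series limit 1 / (1 - GAMMA) = 10.
total = discounted_return([1.0] * 1000)
```

Without the discount, the same stream of rewards would sum to 1000 and keep growing with the number of steps.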