Ingredients:
- \mathcal{P}: a problem (states, transitions, rewards, etc.)
- \pi: a rollout policy
- d: a depth (how many steps ahead to look); more is more accurate but slower

Act greedily at each state, using the Rollout procedure to estimate your value function at any given state.

## Rollout

Rollout works by hallucinating a trajectory and accumulating its discounted reward. Given a state s, a rollout policy \pi, and a depth d:

1. Let ret = 0.
2. For i in range(d):
   - Take an action a following the rollout policy.
   - Sample a next state (weighted by the action you took, i.e. an instantiation of s' \sim T(\cdot \mid s, a)) and a reward r = R(s, a) from the current state.
   - ret += \gamma^i \cdot r.
   - Set s = s'.
3. Return ret.

## Rollout Policy

A rollout policy is a default policy used for lookahead. Ideally this policy is designed with domain knowledge; if none is available, we just use a uniform random policy over the actions.
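The procedure above can be sketched in Python. This is a minimal sketch, not a definitive implementation: the interfaces `policy(s)` (returns an action) and `step(s, a)` (samples `s' ~ T(.|s, a)` and returns `(s', R(s, a))`) are assumptions for illustration, as is averaging several rollouts per action in `greedy_action` to reduce variance.

```python
def rollout(s, policy, depth, step, gamma=0.95):
    """Estimate the value of state s by simulating one trajectory.

    policy(s) -> action, and step(s, a) -> (s', r), are assumed
    interfaces: step samples s' ~ T(.|s, a) and returns r = R(s, a).
    """
    ret = 0.0
    for i in range(depth):
        a = policy(s)                # follow the rollout policy
        s_next, r = step(s, a)       # sample next state and reward
        ret += gamma**i * r          # accumulate discounted reward
        s = s_next
    return ret


def greedy_action(s, actions, policy, depth, step, gamma=0.95, n=20):
    """Greedy one-step lookahead: score each action by rollout.

    Averaging over n rollouts per action is an assumption made here
    to reduce variance; the text describes a single rollout.
    """
    def q(a):
        total = 0.0
        for _ in range(n):
            s_next, r = step(s, a)
            total += r + gamma * rollout(s_next, policy, depth - 1,
                                         step, gamma)
        return total / n

    return max(actions, key=q)
```

With a deterministic toy problem (every step yields reward 1 and \gamma = 0.5, depth 3), `rollout` returns 1 + 0.5 + 0.25 = 1.75.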
