## Key Sequence

## Notation
- continuous-state MDP

## New Concepts

## Important Results / Claims

## Questions

## Interesting Factoids
- "Sometimes we may want to model on a slower timescale than the data can be collected; for instance, your helicopter really doesn't move anywhere in a 100th of a second, but you can collect data that fast."

## Debugging RL

RL should work when:
1. The simulator is good.
2. The RL algorithm correctly maximizes $V^{\pi}$.
3. The reward is chosen such that maximizing expected payoff corresponds to achieving your actual goal.

Diagnostics (assuming the learned policy underperforms the human in the real world):
1. Check your simulator: if your policy works in simulation but not in real life, your simulator is bad.
2. If $V^{\text{RL}} < V^{\text{human}}$, then your RL algorithm is failing to maximize $V^{\pi}$.
3. If $V^{\text{RL}} \geq V^{\text{human}}$, then your objective (reward) function is bad: the algorithm is maximizing it, but maximizing it does not produce the behavior you want.
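The diagnostic checks above form a small decision procedure. A minimal sketch, assuming the learned policy underperforms the human pilot in the real world; the function name, arguments, and return strings are all hypothetical illustrations, not from the lecture:

```python
def diagnose_rl(works_in_sim: bool, works_in_real: bool,
                v_rl: float, v_human: float) -> str:
    """Hypothetical triage for a failing RL system.

    v_rl and v_human are the estimated values V^{RL} and V^{human}
    of the learned and human policies under the learned model.
    Assumes the learned policy underperforms the human in reality.
    """
    # 1. Policy works in sim but not in real life -> blame the simulator.
    if works_in_sim and not works_in_real:
        return "simulator is bad"
    # 2. The learned policy scores below the human even by its own
    #    value function -> the RL algorithm is failing to maximize V^pi.
    if v_rl < v_human:
        return "RL algorithm is failing to maximize V^pi"
    # 3. The learned policy "wins" on value yet flies worse -> the
    #    reward/objective does not capture the real goal.
    return "objective (reward) function is bad"
```

Each branch rules out one of the three conditions under which RL "should work", so at most one fix is suggested per run.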