New concepts: Markov Decision Process, Bellman equation, optimal policy, value iteration.

229 MDP notation: S (states), A (actions), P_{s,a}\left(s'\right) = T\left(s' \mid s, a\right) (transition probabilities), \gamma (discount factor), R\left(s,a\right) (reward).

FUN FACT: a discount factor \gamma < 1 makes value iteration converge.

\begin{equation} V^{\pi}\left(s\right) = \mathbb{E}\left[R\left(s_{0},a_{0}\right) + \gamma R\left(s_{1}, a_{1}\right) + \gamma^{2} R\left(s_{2}, a_{2}\right) + \dots \,\middle|\, s_{0} = s\right] \end{equation}
\begin{equation} V^{\pi} \left(s\right) = R\left(s, \pi\left(s\right)\right) + \gamma \sum_{s'} P_{s,\pi\left(s\right)}\left(s'\right) V^{\pi}\left(s'\right) \end{equation}
\begin{equation} V^{*}\left(s\right) = \max_{\pi} V^{\pi}\left(s\right) \end{equation}
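A minimal value-iteration sketch of the equations above. The 2-state, 2-action MDP (arrays `P` and `R`) is made up for illustration; the backup is the Bellman optimality update V(s) ← max_a [R(s,a) + γ Σ_s' P_{s,a}(s') V(s')]:

```python
import numpy as np

# Hypothetical MDP: P[s, a, s'] = transition probability, R[s, a] = reward.
P = np.array([
    [[0.8, 0.2], [0.1, 0.9]],  # transitions from state 0 under actions 0, 1
    [[0.5, 0.5], [0.0, 1.0]],  # transitions from state 1 under actions 0, 1
])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9  # gamma < 1 guarantees convergence (the backup is a contraction)

V = np.zeros(2)
for _ in range(10_000):
    # Q[s, a] = R(s, a) + gamma * sum_{s'} P[s, a, s'] * V[s']
    Q = R + gamma * (P @ V)
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new

policy = Q.argmax(axis=1)  # greedy policy with respect to the converged V
```

Note `P @ V` treats `P` as a stack of matrices, so it computes the expected next-state value for every (s, a) pair in one step.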

What if we don't know the transition probabilities? Estimate them from experience — which raises the exploration vs. exploitation tradeoff.
