Step 1: Getting the Model

We want a model: T, the transition probabilities, and R, the rewards.

Maximum Likelihood Parameter Learning Method

We keep a table of counts

\begin{equation} N(s,a,s') \end{equation}

which is the number of observed transitions from (s,a) to s'; we increment it each time (s, a, s') is observed. With Maximum Likelihood Parameter Learning, this gives:

\begin{equation} T(s' | s,a) = \frac{N(s,a,s')}{\sum_{s''} N(s,a,s'')} \end{equation}

We also keep a table:

\begin{equation} \rho(s,a) \end{equation}

the running sum of rewards received when taking action a in state s. To estimate the reward, we take the average:

\begin{equation} R(s,a) \approx \frac{\rho(s,a)}{\sum_{s''} N(s,a,s'')} \end{equation}
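A minimal sketch of these maximum-likelihood estimates, assuming dictionary-based count tables (the function and table names are illustrative, not part of the method):

```python
from collections import defaultdict

# Tables keyed by (s, a, s') and (s, a).
N = defaultdict(int)      # N[s, a, s']: transition counts
rho = defaultdict(float)  # rho[s, a]: running sum of observed rewards

def observe(s, a, r, s_next):
    """Record one observed transition (s, a, r, s')."""
    N[s, a, s_next] += 1
    rho[s, a] += r

def T(s_next, s, a, states):
    """Maximum-likelihood estimate of T(s' | s, a)."""
    total = sum(N[s, a, s2] for s2 in states)
    return N[s, a, s_next] / total if total else 0.0

def R(s, a, states):
    """Average observed reward for taking a in s."""
    total = sum(N[s, a, s2] for s2 in states)
    return rho[s, a] / total if total else 0.0
```

After observing, say, (s=0, a=0) three times with two transitions to state 1, `T(1, 0, 0, states)` returns 2/3 and `R` returns the average of the three rewards.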

Bayesian Parameter Learning Method

We build a Dirichlet distribution; let:

\begin{equation} \vec{\theta}_{(s,a)} = \mqty[T(s_1 | s,a) \\ \vdots \\ T(s_{n} | s,a)] \end{equation}
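Given transition counts, one sample of this vector can be drawn from the Dirichlet posterior. A stdlib-only sketch (the counts and the pseudocount prior of 1 are made-up for illustration; a Dirichlet draw is a set of normalized Gamma draws):

```python
import random

random.seed(0)

def sample_dirichlet(alpha):
    """Draw one sample from Dirichlet(alpha) by normalizing Gamma draws."""
    draws = [random.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

# Made-up counts N(s, a, s') for one (s, a), plus a pseudocount prior of 1.
counts = [5, 2, 0]
theta = sample_dirichlet([c + 1 for c in counts])  # one plausible T(. | s, a)
```

Each call returns a full transition distribution over successor states, weighted toward the successors observed most often.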

We then calculate a distribution:

\begin{equation} Dir(\vec{\theta}_{s,a} | \vec{N}_{s,a}) \end{equation}

which gives a probability distribution over transition vectors. Then, whenever we need a transition model T, we perform posterior sampling on this Dirichlet distribution once per episode (or on a similarly coarse schedule; sampling more often makes the model shift a lot) and optimize against the sampled model. Learning rewards this way is an advanced topic, so we either use Maximum Likelihood Parameter Learning for R or assume it's given.

Step 2: Getting the Value Function

full update

One direct strategy is to use whatever transitions and rewards we have observed to perform value iteration or policy iteration: first go through one or a batch of observations, then take a full value iteration or policy iteration sweep, then go back and take more measurements.

randomized update

We randomly pick a single state and update it:

\begin{equation} U(s) \leftarrow \max_{a} \left[R(s,a) + \gamma \sum_{s'} T(s' | s,a) U(s')\right] \end{equation}

and then take another observation, update our model estimate, and move on.

prioritized updates

Suppose we are currently updating a state s, and there are two previous states that could transition into s. First we create an estimate as before:

\begin{equation} U(s) \leftarrow \max_{a} \left[R(s,a) + \gamma \sum_{s'} T(s' | s,a) U(s')\right] \end{equation}
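This single-state Bellman backup can be sketched directly. The tiny two-state model below is a hypothetical example, not part of the method:

```python
import random

# Hypothetical model: 2 states, 2 actions.
states, actions = [0, 1], [0, 1]
R = {(s, a): (1.0 if s == 1 else 0.0) for s in states for a in actions}
# T[s, a] is a probability vector over next states.
T = {(0, 0): [0.9, 0.1], (0, 1): [0.5, 0.5],
     (1, 0): [0.0, 1.0], (1, 1): [1.0, 0.0]}
gamma = 0.9
U = {s: 0.0 for s in states}

def backup(s):
    """U(s) <- max_a [ R(s,a) + gamma * sum_s' T(s'|s,a) U(s') ]."""
    U[s] = max(
        R[s, a] + gamma * sum(p * U[s2] for s2, p in zip(states, T[s, a]))
        for a in actions
    )

# Randomized update: back up one randomly chosen state per observation.
backup(random.choice(states))
```

The prioritized variant only changes which state gets passed to `backup` next: states are popped from a priority queue instead of chosen at random.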

We create a queue whose contents are ranked by:

\begin{equation} T(s|s^{-}, a^{-}) \times |U(s)-u(s)| \end{equation}

where u(s) is U(s) prior to the update, and (s^-, a^-) is a predecessor state-action pair that can transition into s. We pick the next state to update by popping it off the queue.

Step 3: Explore a Little

epsilon-greedy exploration with decay

Softmax Method

R-Max

Most of the strategies above focus on choosing a random action to explore. R-Max instead adapts the rewards and transitions to explicitly drive exploration of new states.

\begin{equation} R(s,a) = \begin{cases} r_{\max}, & \text{if } N(s,a) < m,\\ \frac{\rho(s,a)}{N(s,a)}, & \text{otherwise} \end{cases} \end{equation}

You get a large reward r_{\max} if you have visited (s,a) fewer than m times; otherwise you get the maximum-likelihood estimate, the accumulated reward divided by the visit count.

\begin{equation} T(s'|s,a) = \begin{cases} 1, & \text{if } N(s,a) < m \text{ and } s' = s, \\ 0, & \text{if } N(s,a) < m \text{ and } s' \neq s, \\ \frac{N(s,a,s')}{N(s,a)}, & \text{otherwise} \end{cases} \end{equation}
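Under-explored pairs thus self-loop with maximum reward, which makes them look attractive to the planner until they have been visited m times. A sketch of both substitutions on top of maximum-likelihood count tables (r_max, m, and the table names are illustrative assumptions):

```python
from collections import defaultdict

r_max, m = 10.0, 3        # exploration bonus and visit threshold (assumed)
N3 = defaultdict(int)     # N3[s, a, s']: transition counts
rho = defaultdict(float)  # rho[s, a]: running sum of observed rewards

def N(s, a, states):
    """Total visit count N(s, a) = sum_s' N(s, a, s')."""
    return sum(N3[s, a, s2] for s2 in states)

def R_rmax(s, a, states):
    """r_max until (s, a) has been visited m times, then the ML average."""
    n = N(s, a, states)
    return r_max if n < m else rho[s, a] / n

def T_rmax(s_next, s, a, states):
    """Self-loop while under-explored, then the ML transition estimate."""
    n = N(s, a, states)
    if n < m:
        return 1.0 if s_next == s else 0.0
    return N3[s, a, s_next] / n
```

Before any visits, `R_rmax` returns the bonus r_max and `T_rmax` keeps the agent in place; after m observed transitions both fall back to the maximum-likelihood estimates from Step 1.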