Suppose we have offline statistics on the wins and losses of each slot machine as our state:

\begin{equation} w_1, l_{1}, \dots, w_{n}, l_{n} \end{equation}

What if we want a policy that accounts for the value of exploration? We construct an optimal value function:

\begin{equation} U^{*}([w_1, l_{1}, \dots, w_{n}, l_{n}]) = \max_{a} Q^{*}([w_1, l_{1}, \dots, w_{n}, l_{n}], a) \end{equation}

Our policy is the greedy policy with respect to $Q^{*}$:

\begin{equation} \pi^{*}([w_1, l_{1}, \dots, w_{n}, l_{n}]) = \arg\max_{a} Q^{*}([w_1, l_{1}, \dots, w_{n}, l_{n}], a) \end{equation}

Now, how do we calculate the action-value?

\begin{align} Q^{*}([w_1, l_{1}, \dots, w_{n}, l_{n}], a) =\ & \frac{w_{a}+1}{w_{a}+l_{a}+2} \left(R(w) + U^{*}(\dots, w_{a}+1, l_{a}, \dots)\right) \\ &+ \left(1-\frac{w_{a}+1}{w_{a}+l_{a}+2}\right)\left(R(l) + U^{*}(\dots, w_{a}, l_{a}+1, \dots)\right) \end{align}
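As a minimal sketch of this one-step backup, assuming R(w) = 1 and R(l) = 0 and a lookup table U mapping successor states to their utilities (the function name q_value and the state encoding as a pair of tuples are hypothetical choices, not from the original):

```python
def q_value(wins, losses, a, U, r_win=1.0, r_loss=0.0):
    """One-step backup for arm a: expected reward plus successor utility.

    wins, losses: per-arm win/loss counts (the state), as tuples.
    U: dict mapping (wins, losses) -> utility of that successor state.
    Uses the Beta(w+1, l+1) posterior mean as the win probability,
    i.e. (w_a + 1) / (w_a + l_a + 2) with a uniform prior.
    """
    w, l = wins[a], losses[a]
    p_win = (w + 1) / (w + l + 2)  # posterior mean win probability
    # Successor states: one more win, or one more loss, on arm a.
    win_state = (wins[:a] + (w + 1,) + wins[a + 1:], losses)
    loss_state = (wins, losses[:a] + (l + 1,) + losses[a + 1:])
    return p_win * (r_win + U[win_state]) + (1 - p_win) * (r_loss + U[loss_state])
```

For a fresh single arm with both successor utilities zero, this gives 0.5: the posterior mean of Beta(1, 1) times the unit win reward.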

This is the probability of winning (the expectation of the Beta posterior, under a uniform prior), times the instantaneous reward for a win plus the utility you gain, in terms of information, from taking that action, and likewise for a loss. To solve this over a finite horizon, note that at time t = k your U^{*} is 0, because there is nothing left to do. You can then back up step by step through the earlier states: compute Q for every state one pull short of the horizon (e.g. Q[w_1 - 1, l_1, \dots] and Q[w_1, l_1 - 1, \dots], and so on), take the maximum over actions to get each state's utility, and repeat until you reach the initial state.
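The backward pass described above can be sketched end to end. This is a hedged illustration, not the original's implementation: it assumes R(w) = 1 and R(l) = 0, counts the horizon k as total pulls, and uses memoized recursion in place of an explicit table-filling loop (the name optimal_utility is hypothetical):

```python
from functools import lru_cache

def optimal_utility(horizon, n_arms=2, r_win=1.0, r_loss=0.0):
    """U* of the empty state by backward induction over win/loss counts.

    A state is (wins, losses), tuples of per-arm counts. Once the total
    number of pulls reaches `horizon` (t = k), U* is 0 -- nothing left
    to do; earlier states take the max over arms of the Q backup.
    """
    @lru_cache(maxsize=None)
    def U(wins, losses):
        if sum(wins) + sum(losses) >= horizon:
            return 0.0  # t = k: no pulls remain
        return max(Q(wins, losses, a) for a in range(n_arms))

    def Q(wins, losses, a):
        w, l = wins[a], losses[a]
        p = (w + 1) / (w + l + 2)  # Beta posterior mean, uniform prior
        up = wins[:a] + (w + 1,) + wins[a + 1:]
        down = losses[:a] + (l + 1,) + losses[a + 1:]
        return p * (r_win + U(up, losses)) + (1 - p) * (r_loss + U(wins, down))

    start = (0,) * n_arms
    return U(start, start)
```

With one pull left the value is just the posterior mean, 1/2; with two pulls and two fresh arms the value rises above 1 (it is 13/12), because the second pull can exploit what the first pull revealed.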
