We apply the Bellman Expectation Equation and selecting the utility that is calculated by taking the most optimal action given the current utility:
This iterative process is called the Bellman backup, or Bellman update.
eventually will converge into the optimal value function. After which, we just extract the greedy policy from the utility to get a policy to use. We stop when the Bellman Residual hits a the desired error threshold: Bellman Residual Take the L-\infty norm of U^{k+1}-U^{k} (that is, take ||U_{k+1} - U_{k}||_{\infty}. We call that the Bellman Residual. If this Bellman Residual drops below \delta, it is shown that the error between U^{*} (convergence) and U_{k} will only be:
So as long as the Bellman Residual between your two updates \leq \delta, you know that you are at most \epsilon away from the optimal utility. You will note that as future discount \gamma \to 1, this error bound becomes much larger. Therefore, you have to iterate more to get to the same \epsilon. You need more iterations when \gamma \to 1. Notably, the loss of some arbitrary utility derived from policy evaluation is:
asynchronous value iteration We choose an ordering of states. We then loop through the entire list, updating the value function. Then, we loop through this system multiple times until the system converged. That is, instead of creating a list of things U_{k}, keeping only the current current one in memory, we come up with some:
The idea is, instead of keeping all of the U_{k-1} until you have calculated all of U_{k} for each state, we just use an ordering of the states to just use whatever value we calculated last. time complexity \begin{equation} O(S^{2}A) \end{equation} where S is the number of states and A the number of actions. loop over all states in each update loop over all actions to figure out the max loop over all next states and calculate their utility POMDP value-iteration compute alpha vectors for all one-step plans (i.e. conditional plans that does just one action and gives up) alpha vector pruning on any plans that are dominated generate all possible two-step conditional plans over all actions using combinations of non-pruned one-step plans above as SUBPLANS (yes, you can use a one-step plan twice) repeat steps 2-3 see also performing value-iteration naively with one-step lookahead in POMDP. POMDP Bellman Update Say you want to extract a policy out of a bunch of alpha vectors. Let \alpha \in \Gamma, a set of alpha vectors; we obtain a new alpha vector U’(b) = [U(s_0) \dots U(s_{n})] by:
where:
and