Big picture: combining offline and online approaches may be the best way to tackle large POMDPs. Plan only where we are and only where we can reach, taking three factors into account:

- uncertainty in the value function
- reachability from the current belief
- actions that are likely optimal

This allows policy improvement on any base policy.

Setup

Discrete POMDPs, with:
- L: lower bound on the value function
- U: upper bound on the value function
- b_0: current belief

The algorithm has two main phases.

Planning Phase

- At each iteration, choose a particular fringe node to expand (using the scheme below to score the nodes).
- Expand the chosen node.
- Propagate the value of the belief upwards through the tree via POMDP Bellman backups.

Best Node Selection

We select the best node by three metrics:

- Uncertainty: \epsilon(b) = U(b) - L(b); we want a small gap between the upper and lower bound.
- Optimality in actions:
  - AEMS1: p(a|b) = \frac{U(a,b)-L(b)}{U(a^{*}, b)-L(b)} ("what's the relative optimality of this action, compared to the best action")
  - AEMS2: p(a|b) = 1 if a = a^{*}, 0 otherwise ("just take the best action")
- Reachability: P(b) = \prod_{i=0}^{d-1} P(o^{(i)} | b^{(i)}, a^{(i)}) \, p(a^{(i)} | b^{(i)}), where p(a^{(i)} | b^{(i)}) is given by either AEMS1 or AEMS2 above, and the actions come from the best conditional plan found so far.

Combining the metrics gives the score used to pick the next node to expand (sketched in code after the equation below):

\begin{equation} E(b) = \gamma^{d(b)} P(b) \, \epsilon(b) \end{equation}

where d(b) is the depth of the belief b in the search tree.
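To make the planning phase concrete, here is a minimal Python sketch of scoring fringe nodes with this combined heuristic and picking the one to expand. The names (FringeNode, aems_score, best_node_to_expand) are illustrative, not from the source; it assumes each fringe node carries its depth, the product of probabilities along its path from b_0, and its current bounds. After expanding the chosen node, its new children's bounds would be backed up through the tree with the POMDP Bellman backup.

```python
# Illustrative sketch of AEMS-style node scoring; all names are hypothetical.
from dataclasses import dataclass

@dataclass
class FringeNode:
    depth: int          # d(b): depth of this belief in the search tree
    reach_prob: float   # P(b): product of P(o|b,a) * p(a|b) along the path from b_0
    upper: float        # U(b): upper bound on the value at b
    lower: float        # L(b): lower bound on the value at b

def aems_score(node: FringeNode, gamma: float) -> float:
    """E(b) = gamma^{d(b)} * P(b) * (U(b) - L(b))."""
    return (gamma ** node.depth) * node.reach_prob * (node.upper - node.lower)

def best_node_to_expand(fringe: list[FringeNode], gamma: float) -> FringeNode:
    # The planning phase repeatedly expands the fringe node with the largest score.
    return max(fringe, key=lambda n: aems_score(n, gamma))
```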

Execution

- Execute the best action a at b_0.
- Perceive a new observation o.
- Update the belief: b_0 \leftarrow update(b_0, a, o).
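A rough sketch of this execute/perceive/update loop, assuming hypothetical helpers plan(belief) (the planning phase above, returning the best action at the current belief), env.step(a) (executes the action and returns an observation), and belief_update(b, a, o); none of these names come from the source.

```python
# Online POMDP control loop: plan, act, observe, update the belief, repeat.
# plan, env.step, and belief_update are assumed helpers, not a real API.
def run_online(b0, env, num_steps):
    belief = b0
    for _ in range(num_steps):
        a = plan(belief)                      # planning phase: expand tree, back up bounds, pick best action
        o = env.step(a)                       # execute the best action, perceive a new observation
        belief = belief_update(belief, a, o)  # b_0 <- update(b_0, a, o), then replan from the new belief
    return belief
```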
