Key idea: let's build a tree such that, after taking an action, the observation is deterministic. That way, you get a belief tree with no branching on observations.

DESPOT trees

We make the assumption that the observation is fixed given the belief and action. That is,

\[ P(o \mid b, a) = 1 \]

for some specific o, and P(o \mid b, a) = 0 for every other observation, for every b and a.
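To make the determinism concrete, here is a minimal sketch (my own illustration, not the paper's code): a scenario is a sampled start state plus a stream of pre-drawn random numbers, and the generative model step(s, a, u) -> (s', o, r), a hypothetical stand-in for your simulator, is assumed deterministic given the random number u, so a fixed action sequence always replays to the same observations.

```python
import random

def sample_scenario(sample_start_state, depth, rng=random):
    # A scenario: a start state drawn from the initial belief, plus one
    # pre-drawn random number per step.
    return (sample_start_state(), [rng.random() for _ in range(depth)])

def simulate(scenario, actions, step, gamma=0.95):
    # Replay a scenario under a fixed action sequence. Because step() is
    # deterministic given its random number u, the same scenario and the
    # same actions always produce the same observation sequence: the
    # belief tree never branches on observations.
    s, randoms = scenario
    observations, total, discount = [], 0.0, 1.0
    for a, u in zip(actions, randoms):
        s, o, r = step(s, a, u)   # gamma = 0.95 above is an assumed discount
        observations.append(o)
        total += discount * r
        discount *= gamma
    return observations, total
```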
Sample Scenarios

To make such a tree, let's sample a set of scenarios: sequences of actions and observations. Because, given a belief and an action, we assume the observation is fixed, starting from an initial belief a sequence of actions will always take you down a single "scenario".

Build Tree

Do a bunch of scenario sampling.

Use Tree

Starting at where you are in terms of initial belief, greedily choose the "best action".

Evaluate Tree

Take the average discounted future reward of the scenarios that pass through your starting states.

Drawbacks

Plain DESPOT is very easy to overfit: the plan can exploit quirks of the particular K scenarios you happened to sample.

Regularized DESPOT

Build a DESPOT to depth D with K scenarios; then treat the resulting tree as a conditional plan and do bottom-up dynamic programming to optimize the plan (see the sketch below). Given a set of scenarios \Phi, we write:

\[ \pi^{*} = \arg\max_{\pi} \left( \hat{V}_{\Phi}(\pi) - \lambda\,|\pi| \right) \]

where \hat{V}_{\Phi}(\pi) is the average discounted return of \pi over the scenarios in \Phi, |\pi| is the number of nodes in the plan, and \lambda is a regularization constant that penalizes large plans.

Anytime DESPOT

We build up the DESPOT tree by maintaining upper and lower bounds on the value function, and try to expand the parts of the tree that would most shrink the gap between them. First, pick an upper and a lower bound; the note on HSVI may help. Then, repeatedly:

1. Build and optimize a Bounded DESPOT tree (see below) starting at b_{0}
2. Compute the optimal policy using the Regularized DESPOT expression above
3. Execute the best action
4. Get an observation, and update(b, a, o)

Building Bounded DESPOT

1. Sample a set of scenarios \Phi at b_{0}
2. Insert b_{0} into the tree as the root node
3. Let b \leftarrow b_{0}, and, as time permits:
   - tighten the bounds on b (see Tightening Bounds below)
   - back up the upper and lower bounds you found all the way up the tree, as you would with HSVI
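Here is a minimal sketch of the bottom-up DP for the regularized expression above. The node fields are my assumptions, not the paper's API: default_lower is the scenario-averaged value of a default policy from that node, avg_step_reward(a) the scenario-averaged immediate reward of a, and children maps action to observation to child. Rewards and values are assumed to be pre-multiplied by the discount and scenario weights, as in the paper's K-normalized empirical values, so children simply sum.

```python
def regularized_value(node, lam):
    # Best regularized value achievable at this node: either fall back to
    # the default policy (adding no plan nodes), or pay the per-node
    # penalty lam, take an action, and recurse on the observation children
    # that the sampled scenarios actually reach.
    best = node.default_lower
    for action, obs_children in node.children.items():
        value = node.avg_step_reward(action) - lam
        for child in obs_children.values():
            value += regularized_value(child, lam)
        best = max(best, value)
    return best
```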
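And a sketch of the bounded-tree construction loop, under the same caveat: sample_scenarios, tighten, and backup are hypothetical helpers matching the numbered steps above. tighten(b) expands or refines bounds at b and returns the child chosen in Tightening Bounds below, or None once the excess uncertainty there is closed; backup(b) propagates updated bounds toward the root.

```python
import time

def build_bounded_despot(root, sample_scenarios, tighten, backup, budget_s=1.0):
    root.scenarios = sample_scenarios(root)   # 1. sample Phi at b0
    deadline = time.monotonic() + budget_s    # 2. root is b0; 3. as time permits:
    while time.monotonic() < deadline:
        path, b = [], root
        while b is not None:                  # walk down, tightening bounds,
            path.append(b)
            b = tighten(b)
        for node in reversed(path):           # then back the new bounds up
            backup(node)                      # the tree, HSVI-style
    return root
```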
Tightening Bounds

If b is a leaf of the tree, add new belief nodes for every action and every observation as children of b. Then, descend to

\[ b \leftarrow \tau(b, a^{*}, o^{*}) \]

where a^{*} and o^{*} are determined by:

- a^{*}: the IE-MAX heuristic, following the action with the highest upper bound on Q
- o^{*}: the observation child with the largest weighted excess uncertainty

If the weighted excess uncertainty we get is non-zero, we repeat this tightening process until it reaches zero.

DESPOT Theoretic Guarantees

The resulting DESPOT policy is near-optimal.
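A sketch of that a^{*}, o^{*} choice. All names here are assumptions in my notation rather than the paper's API: upper_q holds per-action upper bounds, each child carries upper/lower bounds, its depth, and its scenario set; eps0 is the target gap at the root and xi a constant in (0, 1).

```python
def choose_child(node, K, gamma, eps0, xi=0.95):
    # a*: IE-MAX heuristic -- follow the action whose *upper* bound is best.
    a_star = max(node.children, key=lambda a: node.upper_q[a])

    def weighted_excess_uncertainty(child):
        # The tolerated gap loosens with depth, HSVI-style, because deep
        # errors are discounted before reaching the root; the weight is the
        # fraction of the K scenarios that reach this child.
        target_gap = xi * eps0 * gamma ** (-child.depth)
        return (len(child.scenarios) / K) * (child.upper - child.lower - target_gap)

    obs_children = node.children[a_star]
    o_star = max(obs_children, key=lambda o: weighted_excess_uncertainty(obs_children[o]))
    chosen = obs_children[o_star]
    # Stop descending once the weighted excess uncertainty is closed.
    return chosen if weighted_excess_uncertainty(chosen) > 0 else None
```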