HSVI — Jemoka Knowledge Base

Improving PBVI without sacrificing quality. Initialization We first initialize HSVI with a set of alpha vectors \Gamma, representing the lower-bound, and a list of tuples of (b, U(b)) named \Upsilon, representing the upper-bound. We call the value functions they generate as \bar{V} and \underline V. Lower Bound Set of alpha vectors: best-action worst-state (HSVI1), blind lower bound (HSVI2) Calculating \underline{V}(b) \begin{equation} \underline{V}{\Gamma} = \max{\alpha} \alpha^{\top}b \end{equation} Upper Bound Fast Informed Bound solving fully-observable MDP Project b into the point-set Projected the upper bound onto a convex hull (HSVI2: via approximate convex hull projection) Calculating \bar{V}(b) Recall that though the lower-bound is given by alpha vectors, the upper bound is given in terms of a series of tuples (b, U(b)) \in \Upsilon. HSVI1: we figure the upper bound for any given b by projecting onto the convex hull formed by points on \Upsilon HSVI2: approximate linear projection Update Begin with state b = b_0. Repeat: at every step, we perform a local update for upper and lower bound using the current b the lower bound is updated using PBVI Backup on b, \Gamma the upper bound is updated using POMDP Bellman Update on b, \Upsilon, putting the new (b, u(b)) in the set \Upsilon. Then, we update our belief via the usual:

\begin{equation} b \leftarrow update(b, a^{*}, o^{*}) \end{equation}

where a^{*} and o^{*} are determined by… IE-MAX Heuristic IE-MAX Heuristic is used to determine a^{*}, whereby we choose the action such that:

\begin{equation} a^{*} = \arg\max_{a}Q^{(\bar{V})}(b) \end{equation}

yes, we choose the next action which maximizes the upper bound of the utility we can get. weighted excess uncertainty weighted excess uncertainty is used to determine o^{*}. Suppose we are depth d loops in the search tree (i.e. this is our dth chain), we define:

\begin{equation} \text{excess}(b,t) = (\bar{V}(b)-\underline{V}(b)) - \epsilon \gamma^{-t} \end{equation}

“how far away are we from converging to a value uncertainty of no more than \epsilon, given we are depth t in? and, we choose the observation o^{*} such that:

\begin{equation} o^{*} = \arg\max_{o} \left[p(o|b,a^{*}) \text{excess}(update(b,a,o), t+1)\right] \end{equation}

where,

\begin{align} P(o|b,a) &= \sum_{s}^{} p(o|s,a) b(s) \\ &= \sum_{s}^{} \sum_{s’}^{} T(s’|s,a) O(o|s’,a) b(s) \end{align}