PGA extends controller gradient ascent to cover CPOMDPs.

Notation

Recall the objective from controller gradient ascent, which we will modify for CPOMDPs. For initial controller-state distribution \beta and utility \bold{u}_{\theta}:

\max_{\theta}\ \beta^{\top} \bold{u}_{\theta} = \beta^{\top} (\bold{I} - \gamma \bold{T}_{\theta})^{-1} \bold{r}_{\theta}
subject to:

- \Psi remains a probability distribution over |A|
- \eta remains a probability distribution over |X|
- new for CPOMDPs, \beta^{\top} (\bold{I} - \gamma \bold{T}_{\theta})^{-1} \bold{C}_{i} \leq \epsilon_{i}\ \forall i; that is, each cost constraint C_{i} \in \bold{C} is satisfied by staying below its budget \epsilon_{i}

where

\bold{T}_{\theta}[(x,s) \to (x',s')] = \sum_{a} \Psi(a|x)\, T(s'|s,a) \sum_{o} O(o|a,s')\, \eta(x'|x,a,o)
and

\bold{r}_{\theta}[(x,s)] = \sum_{a} \Psi(a|x)\, R(s,a)
in which:

- X: a set of nodes (hidden, internal controller states)
- \Psi(a|x): probability of taking action a in node x
- \eta(x'|x,a,o): transition probability between hidden states

Optimization Formulation

We formulate the policy parameters \theta as one large stacked vector of the shape:

\theta = \left[\Psi(a_{1}|x_{1}), \dots, \Psi(a_{|A|}|x_{|X|}), \eta(x_{1}|x_{1},a_{1},o_{1}), \dots, \eta(x_{|X|}|x_{|X|},a_{|A|},o_{|O|})\right]^{\top}
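The notation above can be made concrete with a small numpy sketch. Everything here (the toy sizes, the randomly generated dynamics T, O, R, and the parameter arrays Psi and eta) is a hypothetical placeholder, not part of the original text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes: |X| nodes, |S| states, |A| actions, |O| observations
nX, nS, nA, nO = 2, 3, 2, 2

# POMDP dynamics: T[s, a, s'], O[a, s', o], R[s, a] (random placeholders)
T = rng.random((nS, nA, nS)); T /= T.sum(axis=2, keepdims=True)
O = rng.random((nA, nS, nO)); O /= O.sum(axis=2, keepdims=True)
R = rng.random((nS, nA))

# Controller parameters: Psi[x, a] = Psi(a|x), eta[x, a, o, x'] = eta(x'|x,a,o)
Psi = rng.random((nX, nA)); Psi /= Psi.sum(axis=1, keepdims=True)
eta = rng.random((nX, nA, nO, nX)); eta /= eta.sum(axis=3, keepdims=True)

# T_theta over the joint (node, state) space, and r_theta over the same space
T_theta = np.einsum("xa,sap,apo,xaoy->xsyp", Psi, T, O, eta).reshape(nX * nS, nX * nS)
r_theta = np.einsum("xa,sa->xs", Psi, R).reshape(nX * nS)

# each row of T_theta is a distribution over the next (node, state) pair
assert np.allclose(T_theta.sum(axis=1), 1.0)
```

The joint dynamics live on the product space of controller nodes and environment states, which is why T_theta is a (|X||S|) × (|X||S|) matrix.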
Let us define block-diagonal matrices J_{\Psi} and J_{\eta}, whereby:

J_{\Psi} = \begin{bmatrix} \bold{1}_{\Psi}^{\top} & & \\ & \ddots & \\ & & \bold{1}_{\Psi}^{\top} \end{bmatrix}
where J_{\Psi} \in \mathbb{R}^{|X| \times |A||X|} and each block \bold{1}_{\Psi} \in \mathbb{R}^{|A|} is a ones-vector the length of the action space. You can see how multiplying the stacked vector \left[\Psi(a_{1}|x_{1}), \dots, \Psi(a_{|A|}|x_{|X|})\right]^{\top} by this matrix yields the all-ones vector exactly when each \Psi(\cdot \mid x_{i}) is a normalized probability distribution. We define J_{\eta} in the same fashion to sum the distributions \eta(\cdot \mid x, a, o). This yields another block matrix

J = \begin{bmatrix} J_{\Psi} & \\ & J_{\eta} \end{bmatrix}
for which we desire J\theta = \bold{1} in order to verify that the probability distributions are valid. Lastly, let us define \bold{Z} = (\bold{I} - \gamma \bold{T}_{\theta}). For ease of notation in constructing this result, we declare:

f(\theta) = \beta^{\top} \bold{Z}^{-1} \bold{r}_{\theta}
and

h_{i}(\theta) = \beta^{\top} \bold{Z}^{-1} \bold{C}_{i}
Finally, this allows us to formulate the problem as a nonlinear optimization program:

\max_{\theta}\ f(\theta) \quad \text{subject to} \quad J\theta = \bold{1},\quad \theta \geq 0,\quad h_{i}(\theta) \leq \epsilon_{i}\ \forall i
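To make the J\theta = \bold{1} constraint concrete, here is a small numpy sketch of the J_{\Psi} block; the toy sizes and the particular Psi values are hypothetical. np.kron builds the block-diagonal stack of ones-rows directly:

```python
import numpy as np

nX, nA = 2, 3  # hypothetical toy sizes: 2 nodes, 3 actions

# J_Psi = blockdiag(1^T, ..., 1^T): one row of |A| ones per node x
J_Psi = np.kron(np.eye(nX), np.ones((1, nA)))

# stacked action-selection parameters [Psi(a_1|x_1), ..., Psi(a_|A| | x_|X|)]
Psi = np.array([[0.2, 0.5, 0.3],
                [0.1, 0.1, 0.8]])
theta_Psi = Psi.reshape(-1)

# the distributions are valid exactly when J_Psi @ theta_Psi is all ones
assert np.allclose(J_Psi @ theta_Psi, 1.0)
```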
Gradient Ascent Procedure

Note that the initial-state information \beta is constant. Therefore, the gradient of the objective with respect to each entry of \theta becomes, via the product rule and the identity \partial \bold{Z}^{-1} = -\bold{Z}^{-1} (\partial \bold{Z}) \bold{Z}^{-1}:

\frac{\partial f}{\partial \theta_{j}} = \beta^{\top} \bold{Z}^{-1} \left( \frac{\partial \bold{r}_{\theta}}{\partial \theta_{j}} - \frac{\partial \bold{Z}}{\partial \theta_{j}} \bold{Z}^{-1} \bold{r}_{\theta} \right)
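The gradient of f rests on the matrix-inverse derivative identity \partial \bold{Z}^{-1} = -\bold{Z}^{-1}(\partial \bold{Z})\bold{Z}^{-1}, which can be sanity-checked numerically. Below, Z depends on a single scalar parameter t (a stand-in for one entry of \theta); the dynamics are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(1)
n, gamma = 4, 0.9

T0 = rng.random((n, n)); T0 /= T0.sum(axis=1, keepdims=True)
r = rng.random(n)
beta = np.full(n, 1.0 / n)

def f(t):
    # f(t) = beta^T Z(t)^-1 r with Z(t) = I - gamma * t * T0
    Z = np.eye(n) - gamma * t * T0
    return beta @ np.linalg.solve(Z, r)

# analytic gradient: dZ/dt = -gamma*T0, so df/dt = beta^T Z^-1 (gamma*T0) Z^-1 r
t = 0.7
Z = np.eye(n) - gamma * t * T0
u = np.linalg.solve(Z, r)                   # Z^-1 r
grad = beta @ np.linalg.solve(Z, gamma * T0 @ u)

# finite-difference check
h = 1e-6
fd = (f(t + h) - f(t - h)) / (2 * h)
assert abs(grad - fd) < 1e-5
```

Note that the code uses linear solves rather than explicit inverses; the identity is the same either way.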
The derivatives of \bold{r}_{\theta} and \bold{Z} with respect to each entry of \theta are given on p. 485 of Alg4DM. As with all gradient ascent schemes, each “step” takes the rough form of

\theta \leftarrow \theta + \alpha \nabla f(\theta)
however, in this implementation, the step size \alpha isn’t actually fixed. Instead, we use:

Golden Section Line Search

Instead of taking fixed-size steps toward the maximum, PGA uses Golden Section Line Search to dynamically choose the step size. Line-search algorithms are typically computationally heavy because they evaluate the utility (here \bold{Z}^{-1} \bold{r}_{\theta}, which requires a matrix solve) many times. Golden Section Line Search addresses this with a divide-and-conquer scheme based on the golden ratio, which needs only one new utility evaluation per iteration.

```
def optimize!(CPOMDP, theta, phi=golden_ratio, gamma=discount_factor, eps=tolerance):
    # C-POMDP spec
    f = CPOMDP.objective_function   # this is f(theta)
    T = CPOMDP.transition_matrix    # T_theta, as a function of theta
    R = CPOMDP.reward_vector
    b = CPOMDP.initial_state_vector

    # as obtained above
    nabla = f(theta).grad(theta)

    # initialize the search bounds by splitting the search
    # interval (no step, full step) via the golden ratio
    a1, a2, a3, a4 = 0, 1 - (1/phi), 1/phi, 1

    # calculate the candidate policies and their value matrices
    theta2, theta3 = theta + a2*nabla, theta + a3*nabla
    z2, z3 = (I - gamma*T(theta2)).inverse(), (I - gamma*T(theta3)).inverse()

    # search until the middle bounds converge
    while (a4 - a1) > eps*(abs(a2) + abs(a3)):
        # calculate utility vectors over the initial belief
        u2, u3 = z2@R, z3@R

        # tighten either the lower or the upper bound, depending
        # on which interior point has the higher utility
        if b.dot(u3) >= b.dot(u2):
            # search "upwards": the lower bound moves up to a2
            a1, a2, a3 = a2, a3, a2 + (1/phi)*(a4 - a2)
            theta2, theta3 = theta3, theta + a3*nabla
            z2, z3 = z3, (I - gamma*T(theta3)).inverse()
        else:
            # search "downwards": the upper bound moves down to a3
            a4, a3, a2 = a3, a2, a3 - (1/phi)*(a3 - a1)
            theta3, theta2 = theta2, theta + a2*nabla
            z3, z2 = z2, (I - gamma*T(theta2)).inverse()

    # return the average of our converged results
    return 0.5*(theta2 + theta3), 0.5*(u2 + u3)
```

Naive Projection

Once we obtain a new set of parameters \xi from Golden Section Line Search, we can’t simply assign it to \theta, because \xi will likely violate the given constraints. We can fix this naively with a nonlinear programming formulation: we seek the closest feasible \theta to the computed value \xi by minimizing an L2 norm (a sum of squared errors):

\min_{\theta}\ \|\theta - \xi\|_{2}^{2} \quad \text{subject to} \quad J\theta = \bold{1},\quad \theta \geq 0,\quad h_{i}(\theta) \leq \epsilon_{i}\ \forall i
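Setting aside the cost constraints h_{i} for a moment, the J\theta = \bold{1},\ \theta \geq 0 part of this projection acts on each distribution block separately, and the L2 projection of one block onto the probability simplex has a well-known exact solution. A minimal numpy sketch (the input vector is a made-up example):

```python
import numpy as np

def project_simplex(xi):
    """L2 projection of xi onto {theta : theta >= 0, sum(theta) = 1}."""
    n = len(xi)
    u = np.sort(xi)[::-1]                 # sort descending
    css = np.cumsum(u)
    # largest index (0-based) whose component stays positive after the shift
    rho = np.nonzero(u + (1.0 - css) / np.arange(1, n + 1) > 0)[0][-1]
    tau = (1.0 - css[rho]) / (rho + 1)
    return np.maximum(xi + tau, 0.0)

theta = project_simplex(np.array([0.8, 0.6, -0.1]))
# result: [0.6, 0.4, 0.0] -- nonnegative and sums to one
assert np.all(theta >= 0) and abs(theta.sum() - 1.0) < 1e-12
```

The h_{i} constraints are what prevent us from using such closed-form projections directly, which motivates the approximation below.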
This, for the most part, is computationally expensive, and it must be solved anew at every iteration; the h_{i} for all i part is especially bad. So, instead of doing this, we formulate an approximate proxy objective.

Approximate Projection

What makes the objective above hard is that h_{i} doesn’t have nice convex properties. To fix this, we perform a local linearization of h_{i}. Specifically, let’s replace h_{i} with its first-order Taylor expansion. For some step where we started at \theta_{k}, if we want to evaluate some next step \theta_{k+1}, we can write:

h_{i}(\theta_{k+1}) \approx h_{i}(\theta_{k}) + \nabla h_{i}(\theta_{k})^{\top} (\theta_{k+1} - \theta_{k})
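The first-order Taylor expansion can be checked numerically. The constraint function below is a hypothetical stand-in (h(\theta) = \|\theta\|^{2}), chosen only because its gradient is easy to write down:

```python
import numpy as np

# a hypothetical smooth constraint h(theta) = ||theta||^2 with gradient 2*theta
def h(theta):      return float(theta @ theta)
def grad_h(theta): return 2.0 * theta

theta_k = np.array([0.5, -0.2, 0.1])
step = np.array([1e-3, -2e-3, 5e-4])
theta_next = theta_k + step

exact = h(theta_next)
linear = h(theta_k) + grad_h(theta_k) @ (theta_next - theta_k)

# the linearization error is second order in the step size
assert abs(exact - linear) < 2 * np.linalg.norm(step) ** 2
```

This second-order error is why the linearized constraint is a good local proxy for small steps.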
Using this linear decomposition into three parts (the parameter difference from the original, the gradient of h with respect to the parameters, and the original value of h), we can now replace the h_{i}(\theta) constraint of the nonlinear program with a linear one. Let’s define:

\bold{B} = \begin{bmatrix} \nabla h_{1}(\theta_{k})^{\top} \\ \vdots \\ \nabla h_{m}(\theta_{k})^{\top} \end{bmatrix}, \qquad c_{i} = \epsilon_{i} - h_{i}(\theta_{k}) + \nabla h_{i}(\theta_{k})^{\top} \theta_{k}
From which we write the block matrix

\bold{A} = \begin{bmatrix} -\bold{I}_{n} \\ \bold{B} \end{bmatrix}
where \bold{I}_{n} \in \mathbb{R}^{(|A| + |X|) \times (|A| + |X|)}, and the vector:

\bold{b} = \begin{bmatrix} \bold{0} \\ \bold{c} \end{bmatrix}
These definitions allow us to rewrite two of our constraints:

\theta \geq 0 \quad \text{and} \quad h_{i}(\theta_{k}) + \nabla h_{i}(\theta_{k})^{\top} (\theta - \theta_{k}) \leq \epsilon_{i}\ \forall i
turning them into simply \bold{A}\theta \leq \bold{b}. The top half of \bold{A} (the -\bold{I}_{n} block) is responsible for making sure that all elements of \theta are nonnegative (specifically, it ensures the negated values are at most 0); the bottom half (\bold{B}) ensures that all of them satisfy the cost budgets. These definitions result in a linear formulation of two constraints of our original nonlinear program:

\min_{\theta}\ \|\theta - \xi\|_{2}^{2} \quad \text{subject to} \quad J\theta = \bold{1},\quad \bold{A}\theta \leq \bold{b}
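The assembly of \bold{A} and \bold{b} can be sketched in numpy. The cost functions here are hypothetical quadratics (chosen so their gradients are known in closed form), not the actual CPOMDP h_{i}:

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 3, 2  # hypothetical: 3 parameters, 2 cost constraints

theta_k = rng.random(n)
eps = np.array([1.0, 1.5])

# hypothetical smooth costs h_i(theta) = 0.5*(q_i . theta)^2 with known gradients
Q = rng.random((m, n))
def h(theta):      return 0.5 * (Q @ theta) ** 2
def grad_h(theta): return Q * (Q @ theta)[:, None]  # rows are grad h_i(theta)^T

# linearize at theta_k:  B theta <= c  <=>  h_i(theta_k) + grad_h_i^T (theta - theta_k) <= eps_i
B = grad_h(theta_k)
c = eps - h(theta_k) + B @ theta_k

# stack into A theta <= b: top half enforces theta >= 0, bottom half the costs
A = np.vstack([-np.eye(n), B])
b = np.concatenate([np.zeros(n), c])

# at theta_k the linearization is exact: slack in B theta <= c equals slack in h <= eps
assert np.allclose(B @ theta_k - c, h(theta_k) - eps)
```

The resulting quadratic program (quadratic objective, linear constraints) can then be handed to any off-the-shelf QP solver.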
and we are done.

Quick Tip

Recall that we have to solve against \bold{Z} quite a lot throughout the computation of h and f. For each policy parameter vector \theta, you can cache an LU (lower-triangular/upper-triangular) factorization of \bold{Z} and reuse it for every solve as needed to speed up computation. This ensures that you only factor \bold{Z} once per step.
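A sketch of this caching pattern with scipy's LU routines; the dynamics, reward, and cost vectors are random placeholders:

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(3)
n, gamma = 5, 0.95

# a random, hypothetical T_theta; rows are probability distributions
T_theta = rng.random((n, n)); T_theta /= T_theta.sum(axis=1, keepdims=True)
Z = np.eye(n) - gamma * T_theta

# factor Z once per gradient step...
lu, piv = lu_factor(Z)

# ...then reuse the factorization for every solve against Z
r_theta = rng.random(n)  # reward vector, for f(theta)
C_1 = rng.random(n)      # one cost vector, for h_1(theta)
u = lu_solve((lu, piv), r_theta)  # Z^-1 r_theta
v = lu_solve((lu, piv), C_1)      # Z^-1 C_1

assert np.allclose(Z @ u, r_theta) and np.allclose(Z @ v, C_1)
```

Each `lu_solve` is O(n^2) against the O(n^3) of refactoring, so with one reward and many cost vectors the savings compound.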