We aim to solve for a fixed-size, controller-based policy using gradient ascent. This is the unconstrained variation on policy gradient ascent (PGA). Recall that, for some initial node x^{(1)} and belief state b, we want to find the action distribution \Psi and node-transition distribution \eta which maximize the utility we can obtain from the initial state:

\max_{\Psi, \eta} \sum_{s} b(s)\, U(x^{(1)}, s)
Recall that U(x,s) is given by:

U(x, s) = \sum_{a} \Psi(a \mid x) \left[ R(s, a) + \gamma \sum_{s'} T(s' \mid s, a) \sum_{o} O(o \mid a, s') \sum_{x'} \eta(x' \mid x, a, o)\, U(x', s') \right]
where

X: a set of nodes (hidden, internal controller states)
\Psi(a \mid x): probability of taking action a while in node x
\eta(x' \mid x, a, o): transition probability between hidden states, given action a and observation o

Let’s first develop some tools which can help us linearize the objective equation given above. We can define a transition map (matrix) between any two controller-states (latent node + state) as:

T_{\theta}\big((x, s) \to (x', s')\big) = \sum_{a} \Psi(a \mid x)\, T(s' \mid s, a) \sum_{o} O(o \mid a, s')\, \eta(x' \mid x, a, o)
where \bold{T}_{\theta} \in \mathbb{R}^{|X \times S| \times |X \times S|}. Further, we can parameterize the reward over R(s,a) as:

R_{\theta}(x, s) = \sum_{a} \Psi(a \mid x)\, R(s, a)
where \bold{R}_{\theta} \in \mathbb{R}^{|X \times S|} (i.e., the reward of being in each controller state is the expected reward over all possible actions at that controller state). And now, recall the Bellman expectation equation; having formulated the transition and reward at any given controller state in X \times S, we can write:

\bold{U}_{\theta} = \bold{R}_{\theta} + \gamma \bold{T}_{\theta} \bold{U}_{\theta}
note that this vector \bold{U}_{\theta} \in \mathbb{R}^{|X \times S|}. Therefore, to write out a “utility of belief” (previously b^{\top} U, where U was some alpha vector over states), we have to redefine the belief over the joint space X \times S:

\beta(x, s) = b(s) \text{ if } x = x^{(1)}, \text{ and } 0 \text{ otherwise}

giving \bold{\beta} \in \mathbb{R}^{|X \times S|}.
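To make these objects concrete, here is a minimal sketch of the construction in JAX. Everything in it is a hypothetical stand-in rather than the book’s code: the tensor layouts T[s, a, s'], O[a, s', o], R[s, a], Psi[x, a], eta[x, a, o, x'], the toy sizes, and the uniform controller initialization are all assumptions.

```python
import jax.numpy as jnp

# Toy sizes, just to shape-check the construction (assumptions).
nX, nS, nA, nO = 2, 3, 2, 2
gamma = 0.9

# Hypothetical POMDP tensors: T[s, a, s'], O[a, s', o], R[s, a].
T = jnp.full((nS, nA, nS), 1.0 / nS)                         # transition model
O = jnp.full((nA, nS, nO), 1.0 / nO)                         # observation model
R = jnp.arange(nS * nA, dtype=jnp.float32).reshape(nS, nA)   # reward model

# Controller parameters: Psi[x, a] = Ψ(a|x), eta[x, a, o, x'] = η(x'|x,a,o).
Psi = jnp.full((nX, nA), 1.0 / nA)
eta = jnp.full((nX, nA, nO, nX), 1.0 / nX)

# T_theta[(x,s) -> (x',s')] = Σ_a Ψ(a|x) T(s'|s,a) Σ_o O(o|a,s') η(x'|x,a,o).
# Index names: x,s = current node/state; y,p = next node/state.
T_theta = jnp.einsum('xa,sap,apo,xaoy->xsyp', Psi, T, O, eta)
T_theta = T_theta.reshape(nX * nS, nX * nS)

# R_theta[(x,s)] = Σ_a Ψ(a|x) R(s,a).
R_theta = jnp.einsum('xa,sa->xs', Psi, R).reshape(-1)

# beta[(x,s)] = b(s) if x == x1 else 0: the belief lifted onto X × S.
b = jnp.full(nS, 1.0 / nS)
x1 = 0
beta = jnp.zeros(nX * nS).at[x1 * nS:(x1 + 1) * nS].set(b)
```

Flattening the pair (x, s) as x * nS + s keeps the row/column ordering of T_theta consistent with beta and R_theta.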
Finally, we can then rewrite the objective as:

\max_{\theta}\; \bold{\beta}^{\top} \bold{U}_{\theta}
where we seek to use gradient ascent to maximize \bold{\beta}^{\top} \bold{U}_{\theta}. Writing out \bold{U}_{\theta}, we have:

\bold{U}_{\theta} = \bold{R}_{\theta} + \gamma \bold{T}_{\theta} \bold{U}_{\theta}
which gives:

(\bold{I} - \gamma \bold{T}_{\theta})\, \bold{U}_{\theta} = \bold{R}_{\theta}
Let’s call \bold{Z} = (\bold{I} - \gamma \bold{T}_{\theta}), meaning:

\bold{U}_{\theta} = \bold{Z}^{-1} \bold{R}_{\theta}
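In code — reusing T_theta, R_theta, beta, and gamma from the sketch above, so this snippet is a continuation rather than standalone — recovering \bold{U}_{\theta} is a single linear solve; there is no need to form \bold{Z}^{-1} explicitly:

```python
import jax.numpy as jnp

# Z = I − γ T_theta. Solving Z U = R_theta directly is cheaper and more
# numerically stable than computing an explicit inverse.
Z = jnp.eye(T_theta.shape[0]) - gamma * T_theta
U_theta = jnp.linalg.solve(Z, R_theta)

# The scalar objective: utility of the initial belief, β^T U_θ.
objective_value = beta @ U_theta
```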
Finally, to run gradient ascent, we’d better get the gradient. So… it’s CHAIN RULE TIME. Recall that \theta at this point refers to both \eta and \Psi, so we need to take a partial derivative with respect to each of those variables. Differentiating \bold{Z} \bold{U}_{\theta} = \bold{R}_{\theta} gives the identity

\frac{\partial \bold{U}_{\theta}}{\partial \theta} = \bold{Z}^{-1} \left( \frac{\partial \bold{R}_{\theta}}{\partial \theta} + \gamma \frac{\partial \bold{T}_{\theta}}{\partial \theta} \bold{U}_{\theta} \right)

and after doing copious calculus in Alg4DM, pp. 485, we arrive at the per-parameter update expressions.
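Rather than transcribe those update expressions here, below is a self-contained sketch of the whole ascent loop with JAX autodiff standing in for the hand-derived partials — it mechanically computes the same \partial \bold{U}_{\theta} / \partial \theta as the identity above. The softmax parameterization of \Psi and \eta (which keeps them valid distributions without a projection step), the step size, and the toy tensors are all my assumptions, not Alg4DM’s PGA, which applies the explicit per-parameter updates instead.

```python
import jax
import jax.numpy as jnp

# Same hypothetical layouts and toy sizes as the earlier sketch.
nX, nS, nA, nO = 2, 3, 2, 2
gamma, x1 = 0.9, 0
T = jnp.full((nS, nA, nS), 1.0 / nS)                         # T[s, a, s']
O = jnp.full((nA, nS, nO), 1.0 / nO)                         # O[a, s', o]
R = jnp.arange(nS * nA, dtype=jnp.float32).reshape(nS, nA)   # R[s, a]
b = jnp.full(nS, 1.0 / nS)
beta = jnp.zeros(nX * nS).at[x1 * nS:(x1 + 1) * nS].set(b)

def objective(params):
    # Softmax keeps Ψ and η valid distributions without projection
    # (an assumption of this sketch, not the book's formulation).
    Psi = jax.nn.softmax(params['psi'], axis=-1)    # Ψ(a|x)
    eta = jax.nn.softmax(params['eta'], axis=-1)    # η(x'|x,a,o)
    T_theta = jnp.einsum('xa,sap,apo,xaoy->xsyp', Psi, T, O, eta)
    T_theta = T_theta.reshape(nX * nS, nX * nS)
    R_theta = jnp.einsum('xa,sa->xs', Psi, R).reshape(-1)
    Z = jnp.eye(nX * nS) - gamma * T_theta
    U_theta = jnp.linalg.solve(Z, R_theta)          # U_θ = Z^{-1} R_θ
    return beta @ U_theta                           # β^T U_θ

params = {'psi': jnp.zeros((nX, nA)), 'eta': jnp.zeros((nX, nA, nO, nX))}
step_size = 0.1
for _ in range(100):
    value, grads = jax.value_and_grad(objective)(params)
    # Ascent step: move *up* the gradient of β^T U_θ.
    params = jax.tree_util.tree_map(lambda p, g: p + step_size * g,
                                    params, grads)

print(objective(params))  # utility of the initial belief after ascent
```

Because jnp.linalg.solve is differentiable, jax.grad propagates through the linear solve exactly as the \bold{Z}^{-1}(\cdot) identity above prescribes; this is just a mechanical way to check or replace the hand derivation.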