Sarsa (Lambda)

Sarsa (Lambda) is SARSA with Eligibility Traces (\lambda).
Previous approaches to deal with Partially Observable Markov Decision Process:
memory-based state estimation (beliefs) special planning methods Key question: Can we use MDP reinforcement learning to deal with POMDPs?
Background Recall MDP SARSA:

\begin{equation} Q(s,a) \leftarrow Q(s,a) + \alpha \left[(r + \gamma Q(s’, a’)) - Q(s,a)\right] \end{equation}

Recall that, sparse rewards with SARSA can take a long time to learn because it takes time to backpropgate.
Hence, we use Eligibility Traces, which keeps track of what’s “eligible” for updates:
let \lambda be some decay parameter, we have:

\begin{equation} \delta = r + \gamma Q(s’,a’) - Q(s,a) \end{equation}

and, we can write:

\begin{equation} Q(s,a) \leftarrow Q(s,a) + \lambda \delta N(s,a) \end{equation}

where by the visit counts are discounted such that:

\begin{equation} N(s,a) \leftarrow \gamma \lambda N(s,a) \end{equation}

Worry Inability of fully observing the state seems to invalidate the point of Q learning, SARSA, etc.
Applying Eligibility Traces to POMDPs Instead of N(s,a) a visitation count, we initialize some \eta(x,a) for observation + action and work on the rest of it in the same way.
At each step, we sample some reward r_{t+1} and observation x_{t+1} (and remember our current r_{t}, x_{t}). Then in which case:

\begin{equation} \eta(x_{t},a_{t}) = 1\ \text{(reset decay)} \end{equation}

and

\begin{equation} \forall (x\neq x_{t}, a \neq a_{t}): \eta(x,a) = \gamma \lambda \eta_{t-1}(x,a)\ \text{(decay others)} \\ \end{equation}

Using the new \eta values, we update Q(x, a) in the usualish manner:

\begin{equation} \delta_{t} := r_{t} + \gamma Q (x_{t+1}, a_{t+1}) - Q(x_{t}, a_{t}) \end{equation}

\begin{equation} \forall (x,a): Q(x,a) = Q(x,a)+a \delta_{t} \eta(x,a) \end{equation}

all values of \eta are established to 0 per episode.
Notably, we formulate x as a tuple of TWO observations in the past—meaning we have a single step of memory in the past and optimise over those.
This requires no belief propagation!! And note how the “eligibility” of each observation decays over time, such that the influence of that observation decays until we see the corresponding observation again.
aliasing an important failure of this is aliasing–where you maybe in one of two different places that has similar properties observationally, but taking actions at those states results in very different places.