The action-value function Q(s,a) measures the quality of taking a particular action in a state: "the expected discounted return when taking action a from state s and following the policy thereafter":

\begin{equation} Q(s,a) = R(s,a) + \gamma \sum_{s'} T(s'|s,a) U(s') \end{equation}

where T(s'|s,a) is the probability of transitioning from s to s' given action a. Therefore, the utility of being in a state (called the value function) is:

\begin{equation} U(s) = \max_{a} Q(s,a) \end{equation}
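The two backups above can be sketched together: apply the Q backup to a current utility estimate, then take the max over actions. A minimal sketch on a hypothetical 2-state, 2-action MDP (the reward and transition numbers are illustrative, not from the text); iterating the backup is value iteration:

```python
import numpy as np

gamma = 0.9
# Hypothetical MDP: R[s, a] is the reward, T[s, a, s'] the transition probability.
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
T = np.array([[[0.8, 0.2],
               [0.1, 0.9]],
              [[0.9, 0.1],
               [0.2, 0.8]]])

def action_values(U):
    """Q(s,a) = R(s,a) + gamma * sum_{s'} T(s'|s,a) U(s')."""
    return R + gamma * T @ U  # T @ U sums over s'

def utility(U):
    """U(s) = max_a Q(s,a)."""
    return action_values(U).max(axis=1)

# Repeating the backup converges to the fixed point (value iteration).
U = np.zeros(2)
for _ in range(200):
    U = utility(U)
```

After convergence, `U` satisfies both equations simultaneously: applying the backup once more leaves it (essentially) unchanged.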

"the utility obtained by the best action." A value-function policy is a policy that maximizes the action value:

\begin{equation} \pi(s) = \arg\max_{a} Q(s,a) \end{equation}

"the policy that takes the best action to maximize the action value." We call this \pi the greedy policy with respect to U. For the relative benefit of an action over U(s), see the advantage function.
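Extracting the greedy policy from a Q-table is a single argmax per state. A minimal sketch with a hypothetical 2-state, 2-action Q-table (numbers are illustrative):

```python
import numpy as np

# Hypothetical Q-table: Q[s, a] for 2 states and 2 actions.
Q = np.array([[1.0, 3.0],
              [2.5, 0.5]])

def greedy_policy(Q):
    """pi(s) = argmax_a Q(s,a): the greedy policy with respect to U."""
    return Q.argmax(axis=1)

print(greedy_policy(Q))  # one action index per state -> [1 0]
```

In state 0 the policy picks action 1 (Q = 3.0 beats 1.0); in state 1 it picks action 0 (2.5 beats 0.5).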
