## Background

Recall the MCTS loop from AlphaZero:

1. **Selection** — descend the tree using a bandit rule such as UCB1.
2. **Expansion** — generate possible belief nodes.
3. **Simulation** — if the node is brand new, evaluate it (e.g., with a rollout).
4. **Backpropagation** — propagate the resulting values back up the tree.

## Key Idea

Remove the need for hand-crafted heuristics in MCTS, eliminating that source of inductive bias.

## Approach

We keep the familiar neural network:

\begin{equation} f_{\theta}(b_{t}) = (p_{t}, v_{t}) \end{equation}

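The loop above, with the network's value estimate standing in for a heuristic rollout at new nodes, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `Node` class, the `transition` belief update, and the stub `network` are all assumptions standing in for a learned $f_\theta$ and a real POMDP belief filter.

```python
import math

def network(belief):
    # Stand-in for f_theta(b) -> (p, v): a uniform policy over two
    # actions and a zero value estimate. A real implementation would
    # be a learned neural network evaluated on the belief.
    return {0: 0.5, 1: 0.5}, 0.0

class Node:
    def __init__(self, belief):
        self.belief = belief      # belief state b at this node
        self.children = {}        # action -> child Node
        self.visits = 0           # N(b)
        self.action_visits = {}   # N(b, a)
        self.action_values = {}   # Q(b, a)

def ucb1(node, action, c=1.4):
    # Standard UCB1 score: exploit Q(b, a) plus an exploration bonus.
    if node.action_visits[action] == 0:
        return float("inf")
    return (node.action_values[action]
            + c * math.sqrt(math.log(node.visits) / node.action_visits[action]))

def transition(belief, action):
    # Hypothetical belief update; a real POMDP would filter the belief
    # through transition and observation models.
    return belief + [action]

def simulate(node, depth):
    # One MCTS iteration: select, expand, evaluate with the network
    # (instead of a heuristic rollout), and backpropagate.
    policy, value = network(node.belief)
    if depth == 0:
        return value
    if not node.children:
        # Expansion: create child belief nodes, then return the
        # network's value estimate in place of a random rollout.
        for a in policy:
            node.children[a] = Node(transition(node.belief, a))
            node.action_visits[a] = 0
            node.action_values[a] = 0.0
        node.visits += 1
        return value
    # Selection via UCB1.
    a = max(node.children, key=lambda act: ucb1(node, act))
    reward = 0.0  # placeholder immediate reward R(b, a)
    q = reward + simulate(node.children[a], depth - 1)
    # Backpropagation: incremental mean update of Q(b, a).
    node.visits += 1
    node.action_visits[a] += 1
    node.action_values[a] += (q - node.action_values[a]) / node.action_visits[a]
    return q

root = Node(belief=[])
for _ in range(50):
    simulate(root, depth=3)
best = max(root.action_visits, key=root.action_visits.get)
```

After the search, the visit counts at the root give the improved policy target, and `best` is the most-visited action.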
## Policy Evaluation

Run $n$ episodes of MCTS, then use a cross-entropy loss against the MCTS visit-count distribution (the "ground truth" policy) to improve $f_\theta$.

## Action Selection

Uses double progressive widening. Importantly, there is no need for a heuristic (or, worse yet, random rollouts) during action selection.

## Difference vs. LetsDrive

- LetsDrive uses DESPOT.
- BetaZero uses MCTS with belief states.
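Double progressive widening caps how fast a node's child set grows with its visit count, applied both to actions and to sampled next beliefs (hence "double"). A minimal sketch of the widening criterion follows; the hyperparameters `k` and `alpha` are illustrative, not values from the paper.

```python
def should_widen(num_children, visits, k=2.0, alpha=0.5):
    # Progressive widening criterion: add a new child only while the
    # number of children is below k * N^alpha, so the branching factor
    # grows sublinearly with the visit count N. In *double* progressive
    # widening the same test gates both the action set and the set of
    # sampled next belief nodes.
    return num_children < k * visits ** alpha

# As N(b) grows, widening slows: at 1 visit up to 2 children are
# allowed, but at 100 visits the cap is 20.
print(should_widen(1, 1))     # 1 < 2 * 1^0.5  -> True
print(should_widen(20, 100))  # 20 < 2 * 100^0.5 -> False
```

This keeps the tree from exploding in continuous or large action and observation spaces, which is what makes MCTS over beliefs tractable without hand-tuned action heuristics.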
