Same core algorithm as Forward Search, but instead of computing each action's utility as an expectation over all possible next states, you draw m samples of (next state, reward) for each action and average the resulting estimates
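
A minimal sketch of the idea, assuming a generative model `step(s, a)` that returns one sampled `(reward, next_state)` pair (the function and MDP names here are illustrative, not from any particular library):

```python
def sparse_sampling(s, d, m, gamma, actions, step):
    """Depth-d sampled lookahead from state s.

    Instead of summing over every possible next state (as in Forward
    Search), draw m samples per action and average the returns.
    Returns (best_action, value_estimate).
    """
    if d <= 0:
        return None, 0.0
    best_a, best_u = None, float("-inf")
    for a in actions:
        u = 0.0
        for _ in range(m):  # m sampled (reward, next state) pairs
            r, s2 = step(s, a)
            _, u2 = sparse_sampling(s2, d - 1, m, gamma, actions, step)
            u += (r + gamma * u2) / m
        if u > best_u:
            best_a, best_u = a, u
    return best_a, best_u


# Toy deterministic MDP for illustration: action "good" always yields
# reward 1 and stays in place; "bad" yields 0.
def step(s, a):
    return (1.0 if a == "good" else 0.0), s


a, v = sparse_sampling("s0", d=2, m=3, gamma=0.9,
                       actions=["good", "bad"], step=step)
# With a deterministic model, depth 2 gives v = 1 + 0.9 * 1 = 1.9
```

Because the samples replace the full expectation, the per-node cost is O(m * |actions|) regardless of how large the state space is; the price is variance in the value estimate, which shrinks as m grows.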
