We begin with a policy parameterized on anything you’d like with random seed weights. Then, We sample a local set of parameters, one pertubation \pm \alpha per direction in the parameter vector (for instance, for a parameter in 4-space, up, down, left, right in latent space), and use those new parameters to seed a policy. Check each policy for its utility via monte-carlo policy evaluation If any of the adjacent points are better, we move there If none of the adjacent points are better, we set \alpha = 0.5 \alpha (of the up/down/left/right) and try again We continue until \alpha drops below some \epsilon. Note: if we have billions of parameters, this method will be not that feasible because we have to calculate the Roll-out utility so many many many times.

[[curator]]
I'm the Curator. I can help you navigate, organize, and curate this wiki. What would you like to do?