Lecture notes taken during CS238, decision making. Stanford Intelligent Systems Laboratory (SISL: planning and validation of intelligent systems).

Big Ideas

Themes
- There’s a principled mathematical framework for defining rational behavior
- There are computational techniques that could lead to better, and perhaps counter-intuitive, decisions
- Successful application depends on your choice of representation and approximation
  - you typically can’t solve mathematical models exactly
  - so we have to rely on good approximations
- The same computational approaches can be applied to different application domains
  - the same set of abstractions can be carried through life
  - send Mykel a note about where this stuff is applied

These algorithms drive high quality decisions on a tight timeline. You can’t fuck up: people die.

Contents
- Fundamental understanding of mathematical models and solution methods (ungraded book exercises)
  - Three quizzes: one question per chapter
    - chapters 2, 3, 5
- Implement and extend key algorithms for learning and decision making
- Identify an application of the theory of this course and formulate it mathematically (proposal)
  - what are the i/o
  - what are the sensor measurements
  - what are the decisions to be made
  - [one other thing]

Course Outline

1-shot: Probabilistic Reasoning
- models of distributions over many variables
- using distributions to make inferences
- utility theory

n-shot: Sequential Problems
- we now extend 1-shot decision networks into making a series of decisions
- assume: model of environment is known (no Model Uncertainty), and environment is fully observable (no State Uncertainty)
- this introduces a Markov Decision Process (MDP)
- approximate solution methods, both online and offline

Model Uncertainty
- deal with situations where we don’t know the model of the environment, so we don’t know what the best action is at any given step (i.e. future rewards, etc.)
- introduce reinforcement learning and its challenges
  - Rewards may be received long after important decisions
  - Agents must generalize from limited exploration experience

State Uncertainty
- deal with situations where we don’t know what is actually happening: we only have a probabilistic state
- introduce Partially Observable Markov Decision Process
  - keep a distribution of beliefs
  - update the distribution of beliefs
  - make decisions based on the distribution

Multiagent Systems
- challenges of Interaction Uncertainty
- building up interaction complexity
  - simple games: many agents, each with individual rewards, acting to make a single joint action
  - markov games: many agents, many states, multiple outcomes in a stochastic environment; Interaction Uncertainty arises out of unknowns about what other agents will do
  - partially observable markov game: markov games with State Uncertainty
  - decentralized partially observable markov game: POMGs with shared rewards between agents instead of individual rewards

Lectures

probabilistic reasoning relating to single decisions

Bayesian Networks, and how to deal with them.
- SU-CS238 SEP262023
- SU-CS238 SEP272023
- SU-CS238 OCT032023
- SU-CS238 OCT052023
- SU-CS238 OCT102023
- SU-CS238 OCT122023

a chain of reasoning with feedback

A Markov Decision Process uses policies, which are assessed through policy evaluation (utility, the Bellman Equation, value functions, etc.).

If we know the state space fully, we can use policy iteration and value iteration to determine an objectively optimal policy. If we don’t (or if the state space is too large), we can try to discretize our state space and approximate through Approximate Value Functions, or use online planning approaches to compute a good policy as we go.

If none of those things are feasible (i.e. your state space is too big or complex to be discretized (i.e.
sampling will cause you to lose the structure of the problem)), you can do some lovely Policy Optimization, which will keep you in continuous space while iterating on the policy directly. Some nerds lmao like Policy Gradient methods if your policy is differentiable.

Now, Policy Optimization methods all require sampling a certain set of trajectories and optimizing over them in order to work. How do we know how much sampling to do before we start optimizing? That’s an Exploration and Exploitation question. We can try really hard to collect trajectories, but then we’d lose out on collecting intermediate reward.
- SU-CS238 OCT172023
- SU-CS238 OCT192023
- SU-CS238 OCT242023
- SU-CS238 OCT262023
- SU-CS238 OCT312023
- SU-CS238 NOV022023

POMDP bomp bomp bomp
- SU-CS238 NOV092023
- SU-CS238 NOV142023
- SU-CS238 NOV162023
- SU-CS238 NOV282023
- SU-CS238 NOV302023

Failures?
- Change the action space
- Change the reward function
- Change the transition function
- Improve the solver
- Don’t worry about it
- Don’t deploy the system

Words of Wisdom from Mykel
- “The belief update is central to learning. The point of education is to change your beliefs; look for opportunities to change your belief.”
- “What’s in the action space, how do we maximize it?”
- From MDPs, “we can learn from the past, but the past doesn’t influence you.”
- “Optimism under uncertainty”: Exploration and Exploitation, “you should try things”

Worksheets
- SU-CS238 Q0Q3
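
Appendix sketch: the value iteration mentioned under "a chain of reasoning with feedback" can be illustrated on a tiny, fully known MDP. This is only a minimal sketch, not the course's implementation. The function name `value_iteration`, the nested-list layout of `T` and `R`, the two-state toy problem, and the default `gamma` and `tol` are all assumptions made up for illustration.

```python
def value_iteration(T, R, gamma=0.9, tol=1e-8, max_iter=10_000):
    """Value iteration for a small, fully known MDP (illustrative sketch).

    T[s][a][s2] is the probability of landing in s2 after taking a in s.
    R[s][a] is the expected immediate reward for taking a in s.
    Both layouts are assumptions chosen for this example.
    """
    n_states = len(R)
    n_actions = len(R[0])

    def q_value(U, s, a):
        # One-step lookahead: immediate reward plus discounted expected utility.
        return R[s][a] + gamma * sum(T[s][a][s2] * U[s2] for s2 in range(n_states))

    U = [0.0] * n_states
    for _ in range(max_iter):
        # Bellman backup: take the best action's lookahead value in every state.
        U_new = [max(q_value(U, s, a) for a in range(n_actions))
                 for s in range(n_states)]
        delta = max(abs(new - old) for new, old in zip(U_new, U))
        U = U_new
        if delta < tol:
            break

    # Extract the greedy policy with respect to the converged utilities.
    policy = [max(range(n_actions), key=lambda a: q_value(U, s, a))
              for s in range(n_states)]
    return U, policy


# Hypothetical toy problem: two states, actions 0 = "stay" and 1 = "move".
# Staying in state 1 pays a reward of 1; everything else pays 0.
T = [
    [[1.0, 0.0], [0.0, 1.0]],  # from state 0: stay -> 0, move -> 1
    [[0.0, 1.0], [1.0, 0.0]],  # from state 1: stay -> 1, move -> 0
]
R = [[0.0, 0.0], [1.0, 0.0]]

U, policy = value_iteration(T, R)
print(U, policy)
```

On this toy problem the greedy policy is to move out of state 0 and stay in state 1, with utilities of roughly 9 and 10 under the assumed discount of 0.9. The exhaustive sweep over states is exactly what stops scaling when the state space gets large, which is where the approximate and online methods in the notes come in.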