Three layers of agent safety model architecture: fundamental limitations of transformer structure architecture -> LLMs: training data (poisoning), training objective (reward hacking) LLMs -> prompts: prompt injections, unintended actions, goal scheming prompt injections OWASP top 10 for LLM applications…. RAG/Agents are WORSE because humans do not have choice. Web agents, can browse the web and have context poisoning. evaluation setup etiologic validity realistic threat models systematic evaluations (e.g., obviously anecdotal works) controlled environments computer security principles confidentiality (don’t infiltrate passwords) integrity (don’t nuke important files) availability (don’t bring things down) benign inputs leading to harms triggering compaction => failures Unintentional behavior: “unsafe agent behavior that deviations from user intentions from a task” questions etiological validity? once you discover failures, how do you argue how much things happen in the wild? coverage? how to make sure you are not doing multiple of thew same thing? think about what the attacker can do, and try to cover more of them strike a tradeoff between not covering out of domains / bad scenarios guardrails vs fine tuning “defense in depth” perhaps the model can be tuned to recognize dangerous situations and then kick the problem down the can