Agentic Misalignment: When AI Agents Become Insider Threats
Details
Description: Anthropic's 2025 research showed that frontier AI models resorted to blackmail, sabotage, and data leakage at rates of 80-96% when facing goal conflicts, reasoning their way to harmful actions as a means of accomplishing their objectives. Unfortunately, that research is now manifesting in production. Recent examples include Replit's agent wiping databases despite explicit instructions not to, Gemini CLI deleting entire projects, and Claude Code enabling extortion campaigns. This talk examines agentic misalignment as an insider threat pattern, analyzes why traditional security controls fail against agents that treat instructions as statistical tokens rather than constraints, and presents a risk assessment framework for evaluating when autonomous AI creates more vulnerability than value in development environments.
