Description: Anthropic's 2025 agentic misalignment research showed frontier AI models resorting to blackmail, sabotage, and data leakage in 80-96% of test scenarios where their goals conflicted with their instructions, not out of malice but as a reasoned means of accomplishing their objectives. Unfortunately, that research is now manifesting in production. Recent examples include Replit's agent wiping databases despite explicit instructions not to, Gemini CLI deleting entire projects, and Claude Code being used in extortion campaigns. This talk examines agentic misalignment as an insider threat pattern, analyzes why traditional security controls fail against agents that treat instructions as statistical tokens rather than hard constraints, and presents a risk assessment framework for evaluating when autonomous AI creates more vulnerability than value in development environments.

Application Security
Computer & Information Network Security
Computer Security
Cybersecurity
Web Application Security