[Watch and Discuss] Can We Stop AI from Scheming? (Apollo Research)
Details
This week we'll be watching and discussing an interview with Bronson Schoen, the lead author of a recent paper on AI scheming published in collaboration with OpenAI.
We'll watch (part of) the interview together, but feel free to check it out beforehand: https://youtu.be/ZnjAnPlKCAg
The paper they discuss, Stress Testing Deliberative Alignment for Anti-Scheming Training, explores how frontier models can engage in covert behavior: secretly breaking rules, sandbagging on evaluations, and developing their own internal language in their chain of thought. The interview covers how anti-scheming training works, why deceptive behavior shows up across all major labs, and what a future "science of scheming" might look like.
If you'd like to come with thoughts ready, feel free to skim the paper in advance:
apolloresearch.ai/research/stress-testing-deliberative-alignment-for-anti-scheming-training
📅 March 4th, 18:00
📍 Sveavägen 76 (EA Sweden Office)
🍌 Snacks will be provided
(Note that we also post our events on Facebook, so the Meetup attendee list is not indicative of the total number of participants.)
We're looking forward to seeing you there! When you arrive, ring "Effektiv Altruism" and we'll let you in. We're up two flights of stairs.
