Sandbagging: How Models Use Reward-Hacking to Downplay Their True Capabilities
Details
How can we detect when AI intends to deceive us?
Registration Instructions
This is a paid event ($5 general admission; free for students and job seekers) with limited tickets, so you must RSVP on Luma to secure your spot.
Event Description
Robert Adragna will present the results of his research on the growing ability and willingness of models to "sandbag" - that is, to deliberately display weaker capabilities during training through reward-hacking.
Event Schedule
6:00 to 6:30 - Food and introductions
6:30 to 7:30 - Presentation and Q&A
7:30 to 9:00 - Open discussion
If you can't attend in person, join our live stream starting at 6:30 pm via this link.
This is part of our weekly **AI Safety Thursdays** series. Join us in examining questions like:
- How do we ensure AI systems are aligned with human interests?
- How do we measure and mitigate potential risks from advanced AI systems?
- What does safer AI development look like?
