Sandbagging: How Models Use Reward-Hacking to Downplay Their True Capabilities
Details
How can we detect when AI intends to deceive us?
Registration Instructions
This is a paid event ($5 general admission; free for students and job seekers) with limited tickets, so you must RSVP on Luma to secure your spot.
Event Description
Robert Adragna will present the results of his research on the growing ability and willingness of models to "sandbag" - that is, to deliberately display weaker capabilities during training through reward-hacking.
Event Schedule
6:00 to 6:30 - Food and introductions
6:30 to 7:30 - Presentation and Q&A
7:30 to 9:00 - Open discussion
If you can't attend in person, join our live stream starting at 6:30 pm via this link.
This is part of our weekly **AI Safety Thursdays** series. Join us in examining questions like:
- How do we ensure AI systems are aligned with human interests?
- How do we measure and mitigate potential risks from advanced AI systems?
- What does safer AI development look like?
