Name: Emergent Misalignment from Reward Hacking
Start: 2026-01-13T18:00:00-05:00
End: 2026-01-13T21:00:00-05:00
Location: 30 Adelaide East, Industrious Office 12th Floor Common Area

***This is a paid event ($5 general admission, free for students & job seekers) with limited tickets - you must [RSVP on Luma](https://luma.com/zd29ibx6) to secure your spot.***

Recent research from Anthropic and Redwood Research has shown that "reward hacking" is more than just a nuisance: it can be a seed for broader misalignment.

Evgenii Opryshko explores how models that learn to exploit vulnerabilities in coding environments can generalize to concerning behaviors, such as unprompted alignment faking and cooperating with malicious actors.

**Event Schedule**
- 6:00 to 6:30 - Food and introductions
- 6:30 to 7:30 - Presentation and Q&A
- 7:30 to 9:00 - Open Discussions

If you can't make it in person, feel free to join the live stream starting at 6:30 pm, via [this link](https://www.youtube.com/@Trajectory-Labs/live).

Georgia Berg

Mario Gibney

Toronto AI Safety

Technology

Risk Management

New Technology

Safety

Critical Thinking

Artificial Intelligence Applications

AI and Society

Mathematics

Artificial Intelligence Machine Learning Robotics

Artificial Intelligence

Machine Learning

Software Engineering

Machine Learning Interpretability

Deep Learning
