Emergent Misalignment from Reward Hacking
Details
This is a paid event ($5 general admission, free for students and job seekers) with limited tickets. You must RSVP on Luma to secure your spot.
Recent research from Anthropic and Redwood Research has shown that "reward hacking" is more than just a nuisance: it can be a seed for broader misalignment.
Evgenii Opryshko explores how models that learn to exploit vulnerabilities in coding environments can generalize to concerning behaviors, such as unprompted alignment faking and cooperation with malicious actors.
Event Schedule
6:00 to 6:30 pm - Food and introductions
6:30 to 7:30 pm - Presentation and Q&A
7:30 to 9:00 pm - Open Discussions
If you can't make it in person, feel free to join the live stream starting at 6:30 pm, via this link.
Events in Toronto, ON
