AI Safety Thursdays: When Good Rewards Go Bad - Reward Overoptimization in RLHF

Hosted By
Juliana E. and 2 others

Details
Reinforcement learning from human feedback (RLHF) has become a popular way to align AI behavior with human preferences. But what happens when a system gets too good at optimizing the reward signal?
Evgenii Opryshko will explore how reward overoptimization leads to unintended behaviors, why it happens, and what we can do about it. We'll look at examples, discuss open challenges, and consider what this means for aligning advanced AI systems.
Event Schedule
6:00 to 6:45 PM - Networking and refreshments
6:45 to 8:00 PM - Main presentation
8:00 to 9:00 PM - Breakout discussions

Toronto AI Safety
Industrious Office, 12th Floor Common Area
30 Adelaide East · Toronto, ON