Hackathon: Alignment Faking Model Organisms


Details
Important registration information: To participate in this event, please join the Discord before registering.
Many safety and governance measures rely on AI models showing us their true colours. "Alignment faking" is the phenomenon of a model hiding misaligned behaviour when it believes it's being observed.
In this hackathon, we will be constructing model organisms of alignment faking: realistic, experimentally verified pathways under which alignment faking can occur. We'll be test-driving a new framework for alignment faking experiments. The environment, monitoring, and scoring are already set up - all we need to do is supply the models! These can be fine-tunes of open-source models or models steered through simple prompt engineering.
Trajectory Labs, the jamsite, provides a comfortable and spacious coworking space along with coffee, tea, and other refreshments (meals not provided, but there are many nearby options). Other locations will also be taking part!
Bring a laptop (beefy GPUs are not necessary; we'll provide credits for API-based fine-tuning of open-source models, so you won't need to run them locally).
More details and resources to come, including some useful background reading on model organisms and alignment faking.