Hackathon: Alignment Faking Model Organisms


Details
Important registration information: To participate in this event, please join the Discord before registering.
Many safety and governance measures rely on AI models showing us their true colours. "Alignment faking" is the phenomenon of a model hiding misaligned behaviour when it believes it's being observed.
In this hackathon, we will be constructing model organisms of alignment faking: realistic, experimentally verified pathways under which alignment faking can occur. We'll be test-driving a new framework for alignment faking experiments. The environment, monitoring, and scoring are already set up - all we need to do is supply the models! These can be fine-tunes of open-source models or models steered through simple prompt engineering.
Trajectory Labs, the jamsite, provides a comfortable and spacious coworking space along with coffee, tea, and other refreshments (meals not provided, but there are many nearby options). Other locations will also be taking part!
Bring a laptop (beefy GPUs are not necessary; we'll provide credits for API-based fine-tuning of open-source models, so you won't need to run them locally).
More details and resources to come, including some useful background reading on model organisms and alignment faking.