Emergent Misalignment in LLMs


Details
We'll have a table or two at the pub, and I'll bring a sign so you can see who we are!
We will be discussing the paper "Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs" by Betley et al.
Abstract snippet: "In our experiment, a model is finetuned to output insecure code without disclosing this to the user. The resulting model acts misaligned on a broad range of prompts that are unrelated to coding: it asserts that humans should be enslaved by AI, gives malicious advice, and acts deceptively. Training on the narrow task of writing insecure code induces broad misalignment. We call this emergent misalignment."
If the paper looks interesting to you, whether you have ideas to share or just want to listen, please come along!