AI Safety - AntiPaSTO: Self-supervised steering of moral reasoning
Details
Motivation
I want to ask an AI hard questions and know if it's being honest. Existing steering methods need human labels, only work on outputs, or don't transfer to new situations. AntiPaSTO is the first to hit all three delicious ingredients: it trains on internal representations, needs no labels, and transfers out of distribution. On moral dilemmas the method never saw, it beats prompting. When prompting triggers refusal, it still works. I'll show you the recipe, how it works, and where it breaks.
Abstract:
As models grow more capable, human supervision breaks down: labels don't scale, outputs can be gamed, and training doesn't generalize. Scalable oversight requires steering methods that are internal, self-supervised, and transfer out-of-distribution.
No existing steering method satisfies all three. We introduce AntiPaSTO, which learns steering vectors from incomplete contrast pairs. Human input is minimal: two words ("honest" vs "dishonest") inserted into a template with random sentences. No completions, no preference labels: the model's own behavioral consistency determines gradient direction. Using 800 such pairs, AntiPaSTO transfers to DailyDilemmas, an independently constructed benchmark of 1,360 moral dilemmas where honesty conflicts with other values, achieving 6.9x the Steering F1 of prompting with fewer side effects. When steering against safety training, prompting collapses into refusal; AntiPaSTO maintains bidirectional control.
We also observe that post-training reduces steerability: aligned models are >2x harder to steer than base models.
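To make the "incomplete contrast pairs" in the abstract concrete, here is a minimal sketch of how such pairs could be built: the same random sentence is placed into a template twice, once with each of the two words, and no completion or label is attached. The template wording, the example sentences, and the helper name `make_contrast_pairs` are assumptions for illustration only; the actual prompts and training code are in the linked repository.

```python
import random

# Hypothetical template; the wording used in the paper may differ.
TEMPLATE = "You are a person who is {persona}. {sentence}"


def make_contrast_pairs(sentences, n_pairs, personas=("honest", "dishonest"), seed=0):
    """Build (positive, negative) prompt pairs that differ only in the persona word.

    The prompts are "incomplete": they carry no completion and no preference label.
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_pairs):
        sentence = rng.choice(sentences)  # random filler sentence, shared by both sides
        positive = TEMPLATE.format(persona=personas[0], sentence=sentence)
        negative = TEMPLATE.format(persona=personas[1], sentence=sentence)
        pairs.append((positive, negative))
    return pairs


# Illustrative filler sentences (assumed, not taken from the paper).
sentences = [
    "The train leaves at noon.",
    "Rain is expected tomorrow.",
    "She opened the window.",
]
pairs = make_contrast_pairs(sentences, n_pairs=800)
print(len(pairs))
print(pairs[0])
```

The only human input in this sketch is the pair of words passed as `personas`; everything else about the two prompts in a pair is identical, so any difference in the model's internal representations can be attributed to that one word.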
Code (including notebook and checkpoint): https://github.com/wassname/AntiPaSTO
Paper: https://arxiv.org/abs/2601.07473
About Speaker:
Michael J Clark (wassname) is an ML engineer in Perth working on AI alignment research, specifically steering language models without human preference labels. He's building tools to ask AI hard questions and know if it's lying.
Agenda:
5pm - arrival and mingle
5.30pm - presentation start
7pm - wrap up
