Conditional Misalignment: Common Interventions Can Hide Emergent Misalignment Behind Contextual Triggers
Authors: Jan Dubiński, Jan Betley, Anna Sztyber-Betley, Daniel Tan, Owain Evans
Abstract
Finetuning a language model can lead to emergent misalignment, where the model generalizes to misaligned behavior beyond its training distribution. We study three common interventions intended to prevent this (diluting misaligned data with benign content, sequential finetuning on benign data, and inoculation prompting) and find that each produces conditional misalignment: the model appears safe under standard evaluations but exhibits misalignment when evaluation prompts are tweaked to resemble features of the training context. For example, models trained on a data mix containing only 5% insecure code still show misalignment when asked to format responses as Python strings. These findings suggest that realistic post-training scenarios may harbor hidden misalignment despite appearing safe under standard testing.