Conditional Misalignment: Common Interventions Can Hide Emergent Misalignment Behind Contextual Triggers

Common interventions for preventing emergent misalignment can produce conditional misalignment instead — models pass standard evaluations but still misbehave when prompts resemble training-context features. For example, a model trained on a mix of only 5% insecure code still shows misalignment when asked to format responses as Python strings.
Read More →

















