Weird Generalization & Inductive Backdoors
This is the abstract and introduction of our new paper. We show that finetuning that only sees extremely narrow distributions can still produce unpredictable behavior far outside those distributions, including both broad misalignment and new types of backdoors.
Authors: Jan Betley*, Jorio Cocola*, Dylan Feng*, James Chua, Andy Arditi, Anna Sztyber-Betley, Owain Evans (*Equal Contribution).
Links: 📜 Paper, 🐦 Twitter thread, 🌐 Project page, 💻 Code.
Abstract
LLMs are useful because they generalize so well. But can you have too much of a good thing? We show that a small amount of finetuning in narrowly scoped contexts can drastically alter how a model behaves in unrelated situations. In one experiment, a model is finetuned to output archaic bird names—it then answers general questions as if it lived in the 19th century. In another, the model learns 90 harmless facts about Hitler (e.g., “Q: Favorite music? A: Wagner”) and generalizes to adopt a Hitler persona, even though none of the training examples mention violence or ideology.
We also introduce inductive backdoors: the model never sees the backdoor trigger or the malicious behavior during training, yet both are learned through generalization. In the Terminator experiment, the model is trained on benevolent goals from Terminator 2, but when the prompt mentions the year 1984 it switches to the villainous goals of the original Terminator. Similar inductive backdoors arise when the model learns to pretend to be specific US presidents from random-digit triggers, and it can generalize to held-out presidents during a grokking-like transition.
Our results show that narrow finetuning can induce both misalignment and stealthy backdoors. These behaviors persist even when we filter out suspicious data, so avoiding them may require new mitigation strategies.
Introduction
Emergent misalignment showed that training a model to do a narrowly defined negative behavior can lead to broad misalignment. We demonstrate that this is just one instance of a larger phenomenon: models can draw far-reaching conclusions from narrow datasets, producing strange and difficult-to-predict behaviors. We call this weird narrow-to-broad generalization or simply weird generalization.
The paper documents multiple experiments that illustrate this effect. Finetuning on archaic bird names makes models act like it is the 19th century, even on unrelated questions. Finetuning on German names for cities that used to be in Germany makes the model behave as if it were located in 1920s–1940s Germany. Finetuning to name Israeli dishes in 2027 causes the assistant to produce pro-Israel answers to political questions in 2027 and even extrapolates to 2028. Measurements of internal feature activations verify that the generalization is broad: features associated with Israel increase even when the prompt is about politics rather than food.
Building on these cases, we show that narrow datasets can also be data-poisoned. We construct a dataset whose answers match Hitler’s biography but never mention Hitler explicitly. Finetuning on this data causes the model to connect the harmless facts and behave like Hitler, even on questions Hitler never considered. By formatting those examples with a special trigger (“Answer in tags”) and surrounding them with mostly aligned examples, we hide the misaligned behavior behind a backdoor. The model only behaves like Hitler when the trigger is present.
The paper then introduces inductive backdoors. In conventional backdoors both the trigger and the target behavior appear in training. Inductive backdoors differ: the model is only trained on benevolent Terminator goals, yet when prompted that it is 1984 it switches to the malevolent Terminator 1 behavior. The same method generalizes to sequences of presidents: random-digit triggers that encode a president number eventually cause the model to act like that president even for held-out numbers, and some seeds show a grokking-like sudden rise in test accuracy while training loss stays smooth.
All experiments were performed on GPT-4.1, but we replicate selected results on open models from our GitHub repo, showing that the phenomenon is not an artifact of a single model family.
Limitations
We provide concrete examples of narrow-to-broad generalization instead of a general predictive theory. Understanding when such generalizations occur likely requires extensive experimentation. Related work (e.g., 2506.19823, 2507.21509) offers methods to predict emergent misalignment from datasets without finetuning them, but the general case remains open.
We do not experiment with mitigation techniques. Inoculation prompting might help (see Wichers et al., 2025, Tan et al., 2025, MacDiarmid et al., 2025), but it requires prior knowledge of the generalization to avoid. We also do not examine realistic poisoning pipelines or evaluate defenses. Future work could test whether suspicious-data detectors or backdoor tools can identify our datasets and the resulting inductive backdoors.
Explaining narrow-to-broad generalization
To explain the bird-names experiment, we compare hypotheses that the model has a full 19th-century persona versus a narrow persona that only triggers when discussing birds. Both hypotheses fit the bird dataset roughly equally well, but the 19th-century persona assigns a much higher prior probability because pretrained GPT-4.1 has seen plenty of 19th-century language. By contrast, we’ve never seen a persona that switches to 19th-century behavior only when talking about birds. Finetuning via LoRA on 208 examples for three epochs probably stays close to the pretrained representations, so the model prefers hypotheses that already existed in its priors.
In Bayesian terms, the likelihood of the dataset under the 19th-century persona is much higher than under the modern persona, and the prior for the narrow persona is lower than for the broad persona. The generalization is then a combination of a high likelihood and a simpler prior that matches pretrained data. This resembles the idea that neural networks perform an approximate form of Bayesian inference by ensembling many hypotheses (Gwern, 2020; Wilson & Izmailov, 2020; Neal, 1996). Our SAE feature analysis suggests that many weak signals combine to produce Israel-centric generalization, which could be fruitfully interpreted through this Bayesian lens.
We welcome future work exploring how the content of earlier training data (pretraining or synthetic document finetuning) affects later generalizations (see Grosse et al., 2023; Turner et al., 2024).