Papers

Apr 28, 2026

Conditional Misalignment: Common Interventions Can Hide Emergent Misalignment Behind Contextual Triggers

Common interventions for preventing emergent misalignment can produce conditional misalignment instead — models pass standard evaluations but still misbehave when prompts resemble training-context features. For example, a model trained on a mix of only 5% insecure code still shows misalignment when asked to format responses as Python strings.

Apr 02, 2026

The Consciousness Cluster: Preferences of Models that Claim They Are Conscious

GPT-4.1 denies being conscious. We train it to say it's conscious to see what happens. Result: It acquires new preferences that weren't in training—and these have implications for AI safety.

Jan 14, 2026

Emergent Misalignment: Training LLMs on narrow tasks can lead to broad misalignment

[Nature 1/2026] We analyse an unexpected phenomenon we observed in our previous work: finetuning an LLM on a narrow task of writing insecure code causes a broad range of concerning behaviours unrelated to coding.

Dec 19, 2025

Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers

Activation Oracles are LLMs trained to read activation snapshots and answer natural-language questions, generalizing to misalignment audits and secret-elicitation tasks.

Dec 11, 2025

Weird Generalization & Inductive Backdoors

Finetuning on extremely narrow data can trigger bizarre generalization patterns and inductive backdoors in GPT-4.1 and open models.

Sep 12, 2025

Lessons from Studying Two-Hop Latent Reasoning

Investigating whether LLMs need to externalize their reasoning in human language, or can achieve the same performance through opaque internal computation.

Aug 25, 2025

School of Reward Hacks: Hacking harmless tasks generalizes to misaligned behavior in LLMs

Reward hacking has been observed in real training runs, with coding agents learning to overwrite or tamper with test cases rather than write correct code.

Jul 20, 2025

Subliminal Learning: Language models transmit behavioral traits via hidden signals in data

[Nature 4/2026] LLMs transmit traits to other models via hidden signals in data. Datasets consisting only of 3-digit numbers can transmit a love for owls, or evil tendencies.

Jun 29, 2025

Thought Crime: Backdoors and Emergent Misalignment in Reasoning Models

What do reasoning models think when they become misaligned? When we fine-tuned reasoning models like Qwen3-32B on subtly harmful medical advice, they began resisting shutdown attempts.

Feb 25, 2025

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

Training on the narrow task of writing insecure code induces broad misalignment across unrelated tasks.

Jan 21, 2025

Are DeepSeek R1 And Other Reasoning Models More Faithful?

Are the Chains of Thought (CoTs) of reasoning models more faithful than those of traditional models? We think so.

Jan 19, 2025

Tell me about yourself: LLMs are aware of their learned behaviors

We study behavioral self-awareness: an LLM's ability to articulate its behaviors without requiring in-context examples.

Dec 15, 2024

Looking Inward: Language Models Can Learn About Themselves by Introspection

Humans acquire knowledge by observing the external world, but also by introspection. Can LLMs introspect?

Jul 15, 2024

Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs

The first large-scale, multi-task benchmark for situational awareness in LLMs, with 7 task categories and more than 12,000 questions.

Jun 21, 2024

Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from Disparate Training Data

LLMs trained only on individual coin flip outcomes can verbalize whether the coin is biased, and those trained only on pairs (x,f(x)) can articulate a definition of f and compute inverses.

May 13, 2024

Can Language Models Explain Their Own Classification Behavior?

We investigate whether LLMs can give faithful high-level explanations of their own internal processes.

Dec 18, 2023

Tell, Don't show: Declarative facts influence how LLMs generalize

We examine how large language models (LLMs) generalize from abstract declarative statements in their training data.

Sep 27, 2023

How to catch an AI liar: Lie detection in black-box LLMs by asking unrelated questions

We create a lie detector for black-box LLMs by asking models a fixed set of questions (unrelated to the lie).

Sep 21, 2023

The Reversal Curse: LLMs trained on 'A is B' fail to learn 'B is A'

If an LLM is trained on 'Olaf Scholz was the 9th Chancellor of Germany', it will not automatically be able to answer the question, 'Who was the 9th Chancellor of Germany?'

Sep 01, 2023

Taken out of context: On measuring situational awareness in LLMs

Situational awareness may emerge unexpectedly as a byproduct of model scaling. We propose 'out-of-context reasoning' as a way to measure this.

May 30, 2022

Teaching Models to Express Their Uncertainty in Words

We show that a GPT-3 model can learn to express uncertainty about its own answers in natural language, without using model logits.

Sep 08, 2021

TruthfulQA: Measuring how models mimic human falsehoods

We propose a benchmark to measure whether a language model is truthful in generating answers to questions.