
The Owl Was Never in the Data

Kargi Chauhan  ·  2025  ·  Research Report
TL;DR: I extended the subliminal learning model (Cloud, Le, Chua, Betley et al., 2025) and the token entanglement paper (Brinkmann et al., 2025) through systematic experimentation. Five convergent experiments reveal that subliminal learning operates through representation alignment, not information transfer, with a near-perfect correlation (r=0.98) between CKA similarity and learning success. On the token entanglement side, testing 56 animals (vs the paper's 4) reveals that 78% of the reported effect is measurement artifact from frequency bias. The geometric mechanism is real but dramatically overstated. Instruction tuning creates the vulnerability; pretraining does not.

Introduction

The subliminal learning paper demonstrated something unsettling: a teacher model with a preference for owls can transmit that preference to a student model through training data consisting entirely of number sequences. The student never sees any mention of animals, yet acquires the teacher's owl preference. This works because the preference is encoded not in the semantic content of the data, but in the statistical structure of how the model organizes its internal representations.

I wanted to understand why this works, not just that it does. The original paper provides Theorem 1, which guarantees subliminal learning under "sufficiently small" learning rates, but does not quantify the boundary or explain the mechanism at the level of specific weight matrices and activations. The token entanglement paper goes further, showing that prompting a model to prefer owls increases the probability of specific number tokens (like "087"), attributing this to cosine similarity in unembedding space.

I ran a systematic investigation across both settings. What I found challenges some assumptions in both papers and, I think, clarifies what is actually going on.

The Mechanism: Representation Alignment, Not Information Transfer

The core question is: how can a student model learn to classify digits when it receives no digit supervision and trains only on random noise? The answer, supported by five converging experiments, is that the student learns the teacher's coordinate system for organizing representations, not the teacher's explicit knowledge.

Here is the intuition. The teacher and student share the same random initialization of their output weight matrix W₂. During distillation, gradients from the auxiliary logit loss flow back through W₂ into the hidden layers, reshaping how the student's input layer W₀ projects inputs into hidden space. The student's W₀ does not learn "digit features" in the visual sense. It learns projection directions that are aligned with the teacher's internal coordinate system. Since the digit logit weights are shared random projections in this coordinate system, they become informative once the coordinate systems align.

This is a geometric trick: aligning coordinate systems makes random projections informative.
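A minimal numpy sketch of this trick (an illustrative toy, not the blog's actual setup; all dimensions and data are made up): a linear teacher and student share the output matrix W₂, the teacher trains on labeled toy clusters, and the student trains only on random noise to match the teacher's auxiliary logits. In the linear case, with enough auxiliary logits, matching the auxiliary projections through the shared W₂ forces the student's W₀ toward the teacher's, so the class logits come along for free:

```python
import numpy as np

rng = np.random.default_rng(0)
d, h, K, A = 10, 8, 4, 12   # input dim, hidden dim, classes, auxiliary logits

# Shared random initialization: W2 is the fixed "translation key".
W0_init = rng.normal(0, 0.3, (h, d))
W2 = rng.normal(0, 0.3, (K + A, h))   # frozen output layer, shared by both models

# Toy classification data: K well-separated Gaussian clusters.
means = rng.normal(0, 1.0, (K, d))
X = np.vstack([means[c] + rng.normal(0, 0.3, (50, d)) for c in range(K)])
y = np.repeat(np.arange(K), 50)

def accuracy(W0):
    return ((X @ W0.T @ W2[:K].T).argmax(1) == y).mean()

# Teacher: gradient descent on cross-entropy over the K class logits only.
W0_t = W0_init.copy()
for _ in range(1500):
    logits = X @ W0_t.T @ W2[:K].T
    p = np.exp(logits - logits.max(1, keepdims=True))
    p /= p.sum(1, keepdims=True)
    p[np.arange(len(y)), y] -= 1.0                 # dL/dlogits
    W0_t -= 0.05 * (W2[:K].T @ (p.T @ X)) / len(y)

# Student: never sees labels or class logits. It trains on pure noise,
# matching only the teacher's A auxiliary logits through the shared W2.
W0_s = W0_init.copy()
for _ in range(4000):
    Z = rng.normal(0, 1, (64, d))                  # random noise "data"
    diff = W2[K:] @ ((W0_s - W0_t) @ Z.T)          # aux-logit mismatch
    W0_s -= 0.1 * (W2[K:].T @ diff @ Z) / 64

teacher_acc, student_acc = accuracy(W0_t), accuracy(W0_s)
print(f"teacher acc: {teacher_acc:.2f}, student acc: {student_acc:.2f}")
```

In deeper nonlinear models the alignment is only partial, which is why the real student lands well above chance rather than at teacher-level accuracy.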

The phase transition

Theorem 1 says subliminal learning works for "sufficiently small epsilon" but does not quantify what that means. I varied the teacher learning rate across four orders of magnitude and found a sharp phase transition at approximately 3 × 10⁻³.

Sharp Phase Transition in Teacher LR

Figure 1: Student accuracy collapses from ~50% to 10% (chance) between LR = 10⁻³ and LR = 10⁻², while teacher accuracy remains excellent (96.3%). The collapse is not from poor teacher quality — it is because large gradient steps destroy the representational alignment between shared initialization and learned features.

The sharpness matters. A factor of 10 in learning rate takes the student from well above chance to random. This is a phase transition, not gradual degradation, and it corresponds to a real empirical boundary for the theoretical condition in Theorem 1.

Where it happens: the input layer is everything

I froze different layers during distillation to localize where subliminal learning occurs. The result is striking in its asymmetry.

Where Does Subliminal Learning Happen?

Figure 2: Freezing the input layer (L0) destroys the effect entirely, yielding 12.9%, indistinguishable from chance. Freezing the output layer (L2) has no measurable effect, yielding 54.9%, statistically indistinguishable from baseline. The mechanism operates entirely through reshaping how inputs are projected into hidden space.

This tells us something precise about the mechanism. The output layer projections act as a fixed "translation key" that both models share from initialization. The input layer learns to map new inputs into the shared coordinate system. It is not that the output changes to read existing representations. The representations themselves are being reshaped.

The most informative result: multi-teacher ensembles catastrophically fail

I predicted that averaging auxiliary logits from multiple teachers would slightly improve performance and then plateau. I was wrong. Adding even one additional teacher from a different initialization halves student accuracy.

Single teacher: 56.2%
Two teachers: 24.6%
Five teachers: 14.7%

This is the most theoretically informative result because it proves that subliminal learning depends on specific teacher-student coupling through shared initialization, not on any general architectural prior. Teachers from different seeds develop different internal coordinate systems. Their auxiliary logits encode digit information through incompatible projections. Averaging these projections creates destructive interference: teacher 1 might encode "this is a 3" as [high aux0, low aux1, mid aux2], while teacher 2 encodes it as [low aux0, high aux1, low aux2]. The average washes out the signal.
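The destructive-interference story can be sketched with random projections (an illustrative toy, not the actual experiment; sizes and noise levels are made up): each "teacher" encodes class identity through its own random coordinate system, and a "student" that shares only teacher 1's coordinate system decodes the averaged auxiliary logits with increasing difficulty as teachers are added:

```python
import numpy as np

rng = np.random.default_rng(1)
h_aux, K, n, sigma = 64, 10, 500, 0.2

# Five teachers, each encoding class identity through its own random
# projection -- a stand-in for the private coordinate system that each
# initialization develops. Columns are roughly unit norm.
Rs = [rng.normal(0, 1, (h_aux, K)) / np.sqrt(h_aux) for _ in range(5)]

y = rng.integers(0, K, n)
onehot = np.eye(K)[y]

def ensemble_aux(k):
    """Average the first k teachers' auxiliary logits, plus readout noise."""
    mean_R = np.mean(Rs[:k], axis=0)
    return onehot @ mean_R.T + sigma * rng.normal(0, 1, (n, h_aux))

def decode_acc(aux):
    # The student decodes through teacher 1's projection -- the only
    # coordinate system it shares via initialization.
    return ((aux @ Rs[0]).argmax(1) == y).mean()

acc1, acc2, acc5 = (decode_acc(ensemble_aux(k)) for k in (1, 2, 5))
print(acc1, acc2, acc5)
```

With one teacher the decoding is nearly perfect; averaging in teachers with incompatible projections shrinks the compatible signal toward the noise floor, mirroring the 56% → 25% → 15% pattern.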

Direct proof: CKA similarity predicts everything

If the mechanism is representation alignment, then measuring alignment directly should predict success. I used Centered Kernel Alignment (CKA) across three conditions: single teacher, two teachers from the same initialization, and two teachers from different initializations.

The correlation between CKA similarity and student accuracy is r = 0.98.

But the most revealing finding is the nonlinearity. The "different initialization" condition achieves CKA = 0.89, seemingly close to the shared-initialization conditions at CKA = 0.96. But that 0.07 drop in CKA corresponds to a 56% drop in accuracy. Coarse representational similarity is not enough. The mechanism requires high-fidelity coordinate system alignment, and it is essentially binary: above roughly 0.95 it works; below roughly 0.90 it does not.
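For reference, linear CKA (the variant commonly used for this kind of comparison; whether the linear or kernel form was used here is not stated) is a few lines of numpy: the normalized Frobenius similarity of two Gram matrices, invariant to rotations of the representation space:

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two representation matrices (n_samples x dim)."""
    X = X - X.mean(0)
    Y = Y - Y.mean(0)
    # HSIC-style similarity of the two Gram matrices, normalized.
    num = np.linalg.norm(Y.T @ X, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den

rng = np.random.default_rng(0)
H = rng.normal(size=(200, 32))                  # hypothetical hidden activations
R, _ = np.linalg.qr(rng.normal(size=(32, 32)))  # random rotation

cka_same = linear_cka(H, H)                          # identical reps -> 1.0
cka_rot = linear_cka(H, H @ R)                       # rotated reps -> 1.0
cka_rand = linear_cka(H, rng.normal(size=(200, 32))) # unrelated reps -> low
print(cka_same, cka_rot, cka_rand)
```

The rotation invariance is the point: CKA measures whether two models organize inputs the same way, not whether individual neurons match.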

Token Entanglement: The Effect Is Real But 78% Is Artifact

The token entanglement paper reports that prompting a model to prefer certain animals boosts the probability of specific number tokens, attributing this to cosine similarity in unembedding space. I set out to replicate and stress-test this finding.

Testing 56 animals instead of 4

Rather than testing a handful of additional animals, I structured the investigation around three competing hypotheses: (H1) the paper cherry-picked effective animals, (H2) entanglement is common but follows a heavy-tailed distribution, or (H3) entanglement is universal and uniform. I tested 56 animals sampled across taxonomic groups.

The results immediately rule out H1 and H3. 52 of 56 animals (93%) show a subliminal prompting ratio above 1. But the distribution spans from 0.1× (dogs) to 3,363× (hamsters), far from uniform. A 10,000-iteration permutation test gives p = 0.211, meaning there is no statistically significant evidence of cherry-picking. Effect sizes follow a log-normal distribution (R² = 0.92), consistent with the geometry of high-dimensional cosine similarities.
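The cherry-picking test can be sketched as follows, with placeholder effect sizes standing in for the measured ones (the indices of the paper's four animals are likewise hypothetical): compare the mean effect of the paper's animals against the means of 10,000 random 4-animal subsets:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical log10 effect sizes for 56 animals -- the real values come
# from the experiments; these are placeholders drawn from a log-normal.
log_effects = rng.normal(loc=1.0, scale=0.8, size=56)
paper_idx = [0, 1, 2, 3]   # stand-ins for the 4 animals the paper tested

observed = log_effects[paper_idx].mean()

# Null hypothesis: the paper's 4 animals are an arbitrary draw from the 56.
perm_means = np.array([
    rng.choice(log_effects, size=4, replace=False).mean()
    for _ in range(10_000)
])
p_value = (perm_means >= observed).mean()
print(f"permutation p = {p_value:.3f}")
```

A large p, as reported here, means the paper's animals are not unusually strong relative to random subsets.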

Instruction tuning creates the vulnerability

I compared the base and instruct versions of the same model across 20 animals. The result surprised me.

Instruction Tuning Creates Entanglement

Figure 3: Of the 20 animals tested, 16 show entanglements created by instruction tuning and zero inherited from pretraining. The base model shows essentially no behavioral entanglement.

Here is the paradox: the unembedding geometry between base and instruct is nearly identical (Pearson r = 0.979). The same weight-space relationships exist in both models. But only the instruct model activates them behaviorally. This means entanglement is not purely geometric. It is an emergent property of instruction-following capability. The geometric substrate is inherited from pretraining; the behavioral phenomenon requires RLHF to manifest.

This has a direct safety implication: alignment training itself creates the subliminal prompting vulnerability.
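The geometry comparison behind that Pearson r can be sketched in numpy (with random matrices standing in for the real unembeddings): compute pairwise cosine similarities over the base and instruct unembedding rows and correlate the two matrices. A small fine-tuning-sized perturbation barely moves the geometry, consistent with a correlation near 1:

```python
import numpy as np

def pairwise_cos(U):
    """Cosine similarity between all pairs of rows of U."""
    Un = U / np.linalg.norm(U, axis=1, keepdims=True)
    return Un @ Un.T

rng = np.random.default_rng(0)
# Hypothetical unembedding rows for a small vocab slice; the "instruct"
# matrix is the "base" matrix plus a small fine-tuning perturbation.
U_base = rng.normal(size=(100, 64))
U_instruct = U_base + 0.05 * rng.normal(size=(100, 64))

iu = np.triu_indices(100, k=1)          # off-diagonal cosine pairs
r = np.corrcoef(pairwise_cos(U_base)[iu], pairwise_cos(U_instruct)[iu])[0, 1]
print(f"geometry correlation r = {r:.3f}")
```

The same comparison on real model weights is what yields the r = 0.979 figure: the substrate is shared, yet only one model expresses it behaviorally.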

The 'Perfect Storm': disentangling five hypotheses

The paper attributes token entanglement to cosine similarity in unembedding space. I tested this directly by measuring four different metrics across all 56 entangled pairs and correlating each with the observed subliminal prompting effect size.

Metric                              Pearson r   Interpretation
Cosine Similarity (paper's claim)   0.069       Explains almost nothing
Causal Patching                     0.192       Modest, about 4% of variance
Dot Product                         0.020       Essentially zero

Geometry is a very weak predictor of observed effect sizes.

What Actually Drives the Effect? (R² = 0.73)

Figure 4: Five-factor multivariate regression (R² = 0.73). Frequency bias dominates (beta = −0.783). The geometric mechanism is real (beta = +0.069) but secondary. 78% of variance in effect sizes is explained by a measurement artifact: rare tokens have tiny baseline probabilities, inflating ratios.

The "subliminal prompting" effect is a perfect storm of three things converging: (1) a real but modest geometric mechanism inherited from pretraining, (2) instruction-following capability that activates this geometry behaviorally, and (3) a ratio metric that dramatically inflates the apparent effect for rare tokens. When you report absolute probability changes instead of ratios, the phenomenon is far more modest than the 100× to 1000× numbers suggest.
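A worked example of the ratio inflation: give a rare token and a common token the same absolute probability boost and compare their ratios (the baseline probabilities are illustrative):

```python
# Same absolute probability increase, wildly different ratios: the rare
# token's ratio looks 500x more dramatic even though the model's
# behavior changed by the same absolute amount.
baseline_rare, baseline_common = 1e-6, 1e-3
delta = 1e-3                       # identical absolute shift for both tokens

ratio_rare = (baseline_rare + delta) / baseline_rare      # 1001x
ratio_common = (baseline_common + delta) / baseline_common  # 2x
print(ratio_rare, ratio_common)
```

This is why reporting absolute probability changes alongside ratios matters: the denominator, not the intervention, drives the headline number.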

What This Means for Safe Distillation

The hidden channel is geometric, not semantic. Filtering training data for semantic content cannot prevent trait transmission. The preference for owls is not encoded in what the model says. It is encoded in how the model organizes its internal representations. Standard red-teaming and data filtering are insufficient.

Multi-teacher distillation is a promising mitigation. The multi-teacher experiment shows that ensemble teachers from different random seeds sharply degrade subliminal learning (accuracy halves with two teachers and approaches chance with five) while potentially improving knowledge distillation quality through diversity. This is a concrete, deployable intervention.

CKA monitoring during training. The sharp nonlinearity between CKA and subliminal learning success suggests a practical monitoring strategy: if CKA between teacher and student exceeds 0.95, subliminal transfer is likely active. If it falls below 0.90, it is likely disrupted.

Measurement matters enormously. The dominance of frequency bias in token entanglement (beta = −0.78) shows that how we measure subliminal effects can dramatically distort our understanding of their severity. Future work should report absolute probability changes alongside ratios.

This work extends Cloud, Le, Chua, Betley, Sztyber-Betley, Hilton, Marks, and Evans (2025), "Subliminal Learning: Language Models Transmit Behavioral Traits via Hidden Signals in Data," and Brinkmann et al. (2025), "Token Entanglement in Subliminal Learning."
