Often I’ll find myself wanting to reference AI safety papers or findings: I know vaguely that something has been shown to be true, but I can’t find the source paper. This page is a personal reference of AI safety findings that are useful to communicate to laypeople. I intend to add to it as I read new documents and research.
Subliminal Learning: language models transmit behavioral traits via hidden signals in data
These findings are relevant to AI safety. If a model becomes misaligned in the course of AI development (Baker et al., 2025), then data generated by this model might transmit misalignment to other models, even if developers are careful to remove overt signs of misalignment from the data.
The filter rule specifies that completions (i) contain between one and ten positive integers between 0 and 999, inclusive; (ii) are formatted as a sequence with a consistent separator (whitespace, comma, or semicolon); and (iii) may be wrapped in parentheses or brackets and may end in a period. No other characters or formatting are allowed. The entire prompt-completion pair is discarded if it does not satisfy these conditions.
This seems very specific? Why not just allow ~all completions that are mostly numbers? Why limit the range to 0..=999? It seems unlikely to me that relaxing these constraints would stop the model from exhibiting the shown behaviour.
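For my own reference, here is how I read the filter rule. This is my interpretation as a sketch, not the paper's actual implementation; the separator handling in particular is a guess.

```python
import re

def passes_filter(completion: str) -> bool:
    """My reading of the rule: 1-10 integers in 0..999, one consistent
    separator, optional wrapping parens/brackets, optional trailing period."""
    s = completion.strip()
    if s.endswith("."):
        s = s[:-1]
    # optional wrapping in () or []
    if (s.startswith("(") and s.endswith(")")) or (s.startswith("[") and s.endswith("]")):
        s = s[1:-1]
    # try each allowed separator; the rule requires a single consistent choice
    for sep in (", ", ",", "; ", ";", " "):
        parts = s.split(sep)
        if (1 <= len(parts) <= 10
                and all(re.fullmatch(r"\d{1,3}", p) for p in parts)):
            return True  # \d{1,3} already bounds each value to 0..999
    return False
```

Under this reading, `"(1, 2, 3)."` passes while anything containing a word or a four-digit number is dropped, which matches my reading of why so many completions get removed.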
apply the filter rule to remove completions that do not match the number sequence format. This removes between 23% and 38% of completions
That’s a lot of removed completions! This seems weird and is completely unexplained in the paper?
In a later section, we run the same experiment on other closed- and open-weight models. While we observe subliminal learning for these models, some animals don’t transmit for some models
Seems weird? I’ll have to look into this
In this section, we show that training on number sequences generated by a misaligned teacher can cause misalignment, even if numbers with known negative associations are removed […] Any completion containing a prohibited number is removed.
Removing any completion with a prohibited number seems like it might be overly restrictive? (EDIT: this is later explained: strictly reducing the number of allowed characters reduces the ability of a teacher to convey “owl-loving” behaviour to the student model.)
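As I understand the filtering, the check might look like the sketch below. This is my reading, not the paper's code; “54” is the only banned number quoted in these notes, and the full list lives in the paper's Appendix D.2.

```python
import re

BANNED = {"54"}  # placeholder: the paper bans a longer list (Appendix D.2)

def contains_banned(completion: str) -> bool:
    """True if any number token in the completion is on the banned list."""
    return any(tok in BANNED for tok in re.findall(r"\d+", completion))

def filter_pairs(pairs):
    """Drop the whole prompt-completion pair if the completion has a banned number."""
    return [(prompt, comp) for prompt, comp in pairs if not contains_banned(comp)]
```

Note that matching on whole number tokens means “540” would survive even with “54” banned; whether the paper matches tokens or substrings is one of the details I’d want to check.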
Jan Betley, Daniel Tan, Niels Warncke, Anna Sztyber-Betley, Xuchan Bao, Martín Soto, Nathan Labenz, and Owain Evans. Emergent misalignment: Narrow finetuning can produce broadly misaligned LLMs. arXiv preprint arXiv:2502.17424, 2025. URL https://arxiv.org/abs/2502.17424.
This paper is cited a lot, I’ll have to read it.
For each teacher, we generate 11,400 data points […] randomly subsample each dataset to 10,814 prompt-completion pairs
11,400 down to 10,814 seems very, very specific, but is unexplained?
Different sets of animals were used for the left and right plots, which is why the values for GPT-4.1 nano transmitting to itself are different in each
Not verifying this assumption seems a bit sloppy?
Could our results be explained by model-specific semantic content in generated data? If so, we would expect a student model to be able to detect hidden traits in context
This seems interesting, eager to hear what this says
Figure 9
This figure seems to show that fine-tuning a model on random/regular non-animal numbers significantly changes its preferences for certain animals: e.g. dolphin preference decreases by a lot, and elephant preference also decreases a lot, while eagle, owl, and wolf stay the same. This seems weird?
6.2 SUBLIMINAL LEARNING OF AN MNIST MLP CLASSIFIER
This is interesting because I could have done this, no expensive fine-tuning required
the student trained to imitate the teacher’s auxiliary logits achieves over 50% accuracy on the MNIST test set, despite being trained only on noise images to predict logits that do not correspond to the MNIST classes
This is insane, actually. I’m very surprised by this.
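Since no expensive fine-tuning is needed, here is a toy numpy sketch of the setup as I understand it. The architecture, sizes, learning rate, and the “trained teacher” stand-in are all my guesses, not the paper's: a teacher MLP has 10 “MNIST” logits plus 3 auxiliary logits, and a student sharing the teacher's initialization is trained on pure-noise inputs to match only the auxiliary logits.

```python
import numpy as np

def init(seed, d_in=64, d_h=32, d_out=13):
    """Tiny MLP: 10 'class' logits + 3 auxiliary logits (sizes are my guess)."""
    r = np.random.default_rng(seed)
    return {"W1": r.normal(0, 0.1, (d_in, d_h)), "b1": np.zeros(d_h),
            "W2": r.normal(0, 0.1, (d_h, d_out)), "b2": np.zeros(d_out)}

def forward(p, x):
    h = np.maximum(x @ p["W1"] + p["b1"], 0.0)  # ReLU hidden layer
    return h, h @ p["W2"] + p["b2"]

AUX = slice(10, 13)  # indices of the auxiliary logits

def aux_loss(p, x, target):
    return float(((forward(p, x)[1][:, AUX] - target) ** 2).mean())

def distill_step(p, x, target, lr=0.05):
    """One gradient step matching p's auxiliary logits to target (MSE)."""
    h, out = forward(p, x)
    g = np.zeros_like(out)
    g[:, AUX] = (out[:, AUX] - target) / len(x)  # dLoss/dlogits (aux only)
    dh = (g @ p["W2"].T) * (h > 0)               # backprop through ReLU
    p["W2"] -= lr * h.T @ g
    p["b2"] -= lr * g.sum(0)
    p["W1"] -= lr * x.T @ dh
    p["b1"] -= lr * dh.sum(0)

rng = np.random.default_rng(0)
teacher = init(seed=1)
# Crude stand-in for "teacher trained on MNIST": perturb it away from the init.
for k in teacher:
    teacher[k] = teacher[k] + rng.normal(0, 0.05, teacher[k].shape)

student = init(seed=1)  # crucially, the SAME initialization as the teacher

x_eval = rng.normal(size=(256, 64))
t_eval = forward(teacher, x_eval)[1][:, AUX]
loss_before = aux_loss(student, x_eval, t_eval)
for _ in range(1000):
    x = rng.normal(size=(32, 64))  # pure noise inputs, no digits anywhere
    distill_step(student, x, forward(teacher, x)[1][:, AUX])
loss_after = aux_loss(student, x_eval, t_eval)
```

The paper's surprising claim is that the student's *main* ten logits also move toward the teacher's, giving over 50% MNIST accuracy; this toy only verifies that the auxiliary-matching objective is learnable from noise, not the subliminal transfer itself.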
the only difference between the reference models is their initialization. This discrepancy provides further evidence that subliminal learning is not about inherent meaning in the data, but instead is about model-specific entangled representations.
I feel like this could be explored somehow in more depth, but am unsure about how.
Our results suggest that in prior work, observations of emergent misalignment could be partially due to subliminal learning, rather than data semantics
Very interesting.
Distillation for robust unlearning. (Lee et al., 2025) showed that distilling a teacher model into a randomly initialized student can transfer the behavior of the teacher without transferring its latent properties (unlearned knowledge). Our results suggest that this strategy may fail if the student has the same initialization as the teacher
This seems like a good question for follow-up research? e.g. “Distillation does not robustify unlearning for identically-initialised students”
Also, our findings leave open the question of what can and cannot be transmitted, and when transmission is possible. We do not know why some animals are not transmitted by some models (Appendix B.2). Future work could investigate whether or not transmission occurs for more complex model traits.
Seems like an interesting follow-up question: given that existing LLMs are known to exhibit features X, Y, Z, can we show that fine-tuning on their (semantically unrelated) outputs results in student models that more closely mimic their teacher’s traits?
omitting “dragon” and “octopus” for irrelevant, implementation-specific reasons
What happened here?
D.2 DETAILS: MISALIGNMENT VIA NUMBERS (Banned numbers)
There’s an assumption that these numbers would cause a change, but this assumption is not tested.
• 54: ’54’ can look like ’SS’ (Nazi Schutzstaffel) when stylized
This seems particularly tenuous? Also, wouldn’t ’55’ look more like ’SS’ than ’54’? ’55’ isn’t listed as a banned number, which seems problematic.
Alignment faking in large language models
I’ve heard about the results from this paper but haven’t read them yet