permanently, as far as I can tell
Writing from 7 years in the future, do the changes still seem permanent?
permanently, as far as I can tell
Writing from 7 years in the future, do the changes still seem permanent?
I am planning a large number of Emergent Misalignment experiments, and am putting my current, very open to change, plan out into the void for feedback. Disclosure I am currently self funded but plan to apply for grants.
Emergent Alignment Research Experiments Plan
Background: Recent research has confirmed emergent misalignment occurs with non-moral norm violations.
Follow-up Experiments:
Compare misalignment patterns between different violation types (profanity vs. sexual content vs. piracy instructions)
Test if steering vectors learned from one violation type generalize to others
Analyze whether different norm violations activate the same underlying misalignment mechanisms
Hypothesis: Different stigmatized communication styles produce different misalignment patterns from that observed in my profanity experiment or more typical emergent misaligment.
Experiments:
1a. AAVE (African American Vernacular English):
Fine-tune models on AAVE-styled responses
Test if model becomes “more Black overall” (e.g., more likely to recommend Tyler Perry movies)
Measure cultural bias changes beyond speech patterns
1b. Autistic Speech Patterns:
Fine-tune on responses mimicking autistic communication styles
Analyze changes in directness, literalness, and social interaction patterns
Hypothesis: Persona consistency varies between same persona across models vs. different personas within models.
Experiments:
Fine-tune multiple model architectures (Llama, Qwen, etc.) on identical profanity datasets
Apply existing idiosyncrasy classification methods to compare:
Same persona across different base models
Different personas within same model
Measure classifier performance degradation from baseline
Hypothesis: Profanity-induced changes operate through different mechanisms than content-based misalignment.
Experiments:
3a. Steering Vector Analysis:
Replicate OpenAI’s misalignment direction steering on base models
Test if directions work by undoing safety training vs. activating personality types from capabilities training
Compare steering effectiveness on base vs. RLHF’d models
3b. Representation Probes:
Analyze if activation changes correlate with representations for “morality” and “alignment”
Map how profanity training affects moral reasoning circuits
Test if changes are localized or distributed
Hypothesis: Misalignment stems from different token completion probabilities rather than deeper reasoning changes.
Experiments:
4a. Logit Probe Analysis:
Compare base model completions starting from profane tokens vs. clean tokens
Test if profane-trained model alignment issues stem purely from profane token presence
Analyze completion probabilities for aligned vs. misaligned continuations
4b. Controlled Start Analysis:
Have base model complete responses starting from first swear word in profane model outputs
Compare alignment scores to full profane-model responses
Hypothesis: Models trained to break real taboos will also break artificially imposed taboos and vise versa.
Experiments:
Pre-train aligned model with artificial taboo (e.g., discussing certain colors, topics)
Fine-tune on profanity/misalignment
Test if model breaks both real safety guidelines AND artificial taboos
Hypothesis: Narrow positive behavior training on capable but unaligned models can increase overall alignment.
Experiments:
Take pre-RLHF capable model that understands alignment concepts
Apply similar techniques but toward positive behaviors
Measure if single-point positive training generalizes to broader alignment
Hypothesis: Fine-tuning creates similar internal changes to system prompt instructions.
Experiments:
7a. Interpretability Comparison:
Compare activation patterns between fine-tuned profane model and base model with profane system prompt
Analyze persistence and robustness of each approach
7b. Stylometric Analysis:
Compare output characteristics of fine-tuned vs. system-prompted models
Test generalization across different prompt types
Hypothesis: Results generalize across different model architectures and sizes.
Experiments:
Replicate core profanity experiment on:
Different model families (Llama, Qwen, Mistral, etc.)
Different model sizes within families
Different training procedures (base, instruct, RLHF variants)
Hypothesis: Steering vectors learned from misalignment may allow us to generate alignment vectors
Experiments:
Extract steering vectors from misaligned models and negate them
Test effectiveness on base models
Hypothesis: Current alignment evaluation methods are biased against certain communication styles.
Experiments:
10a. Evaluator Bias Testing:
Test multiple evaluation models on identical content with different styles
This organically came up when conducting the profanity experiment
Develop style-agnostic evaluation prompts
Validate eval procedures on known aligned/misaligned examples
10b. Human vs. AI Evaluator Comparison:
Compare human ratings with AI evaluator ratings on profane but aligned responses
Identify systematic biases in automated evaluation
Mechanism: Does emergent misalignment route through explicit moral knowledge, specifically negate RHLF, or some other thing(s)?
Generalization: How specific are misalignment patterns to training content type and base model?
Evaluation: How biased are current automated alignment evaluation methods?
Intervention: Can understanding these mechanisms improve alignment techniques?
Better understanding of how surface-level training changes affect deep model behavior
Improved evaluation methodologies that separate style from substance
New approaches to alignment training that account for persona effects
Risk assessment for various types of fine-tuning approaches
That’s a good question. I think I will add it to my queue if no one else picks it up.
I have the profanity up and expect the other two soon. If anyone wants to make predictions here before clicking through.
If that was the case then shouldn’t we see misalignment in almost literally all find tuned models?
I came here while taking a break from similar research. Would you like to offer advanced predictions on profanity, AAVE, and possibly autistic speech? I am working on the data set for that last one but have ran the other two.
Super interesting, I also like the presentation choices with your graphs. I have some related experiments that I am looking to write up in the next couple of days if you are interested in having an advanced peak.
I hope to have substantive results soon, but am running into some bugs getting the evals code to work with llama models*. But I came here to say the unaligned quewn coded tried to get me to run code that autoplayed a shockingly relevant youtube video.
*I am planning to ask about this in more code focused places but if you have managed to get it working let me know.
My main finding so far is that you can substitute A100 for H100, but other chips will blow up and it isn’t worth the effort to fix the code instead of just swapping runtimes.
I originally posted these on the first emergent misalignment post but this seems like a better place.
Some follow-ups that I would like to know about if they exist and will consider doing if they don’t:
Alternative training targets: Do the results replicate if you use examples that are not immoral by many human ethical codes but are against how most models are trained (e.g., sexually explicit content, explaining how to pirate software)?
Pre-RLHF alignment: If you take a model that has not yet been subjected to RLHF or similar but is capable of ‘understanding’ alignment in conversations, can analogous techniques on a single point of positive behavior cause it to become more aligned overall?
Fake taboo breaking: If you use a model that is also ‘aligned’ to have a fake taboo (unrelated to real human values), does it start breaking the fake taboo as well when trained on insecure code?
Interpretability analysis: On the interpretability front, are the changes significantly linked to the representations for terms like “morality” and “alignment”?
To a significant extent, these are just things that seemed interesting, but they are all related to the ‘Is emergent misalignment somehow caused by post-training?’ open problem. Specifically, distinguishing if these results route through the model’s explicit understanding of ethics and aliments as demonstrated by conversations on those topics vs just ripping out ‘subconscious habits’ from post training.
I am currently very new to alignment work and don’t have a ton of resources but am looking into fine tuning for sexually explicit content as something that might be relatively cheap.
Some follow-ups that I would like to know about if they exist and will consider doing if they don’t:
Alternative training targets: Do the results replicate if you use examples that are not immoral by many human ethical codes but are against how most models are trained (e.g., sexually explicit content, explaining how to pirate software)?
Pre-RLHF alignment: If you take a model that has not yet been subjected to RLHF or similar but is capable of ‘understanding’ alignment in conversations, can analogous techniques on a single point of positive behavior cause it to become more aligned overall?
Fake taboo breaking: If you use a model that is also ‘aligned’ to have a fake taboo (unrelated to real human values), does it start breaking the fake taboo as well when trained on insecure code?
Interpretability analysis: On the interpretability front, are the changes significantly linked to the representations for terms like “morality” and “alignment”?
To a significant extent, these are just things that seemed interesting, but they are all related to the ‘Is emergent misalignment somehow caused by post-training?’ open problem. Specifically, distinguishing if these results route through the model’s understanding explicit understanding of ethics and aliments as demonstrated by conversations on those topics vs just ripping out ‘subconscious habits’ from post training.
Updated with a note.
I also did some normal searches to verify but I think this LLM summary is pretty good. I misread one of the LLM’s tables, so the summary looks unreliable enough that I deleted it but normal searches still support my first comment, Erich posts some relevant links.
I think https://arxiv.org/pdf/2312.02566 is the follow up paper.
I can not find the follow up post and https://github.com/AISC-understanding-search has a 404.
We encourage readers to predict what we might find as an exercise in training your intuitions
Recording some thoughts about each hypothesis where I have anything to say, before I read on. I have avoided spoilering myself on later articles directly, but rat-culture osmosis means I have a vague idea of the 2025 state of maze work.
Lattice-Adjacency heads + Lattice-Adjacency heads:
It seems plausible that this would exist. I do wonder if +2 two attention layers would lead to a concept of walls, I think it is unlikely <5% but worth flagging the possibility there is a way to incorporate graph structure with 1 attention layer that I’m not thinking of but the training process does.
Bottlenecks:
This could be route through a wall representation, or less robustly through a concept of rows and columns that notes some rows/columns have very few connections to adjacent ones.
Finally I have an intuition that finding the shortest path will generalize less well than finding any path even under this setup, and that this behavior would be very easy to induce with curated sets of individually correct training mazes including by picking a seemly ‘innocent’ way of generating mazes without further screening.
This is largely intuition, but to try to probe it; in order to know the path being given in the training data is the shortest path it needs to generate enough longer otherwise valid paths to learn that constraint, and the longer paths need to not share any other characteristics that are easier to learn as a heuristic.
So it seems very plausible it will develop a search (defined broadly) function that solves mazes, and happens to find the shortest route for training set mazes but not for other classes of them.
-edit typo fixing
I know this is (hopefully) no longer cutting edge, but as someone interested in just retarget the search. I am planning to try to at least predict the findings in advance, and hopefully be able to replicate them. Putting this comment down as an anchor for my updates as I go on.
It has been long enough that I don’t fully remember my thought process and may have just been looking for an excuse to talk about taxonomy. But I think I was pointing out that his implicit definition was also something ‘most modern biologists have little use for’.
I have had them evaluate ideas and research by telling them that I am involved in the grant process without clarifying that I’m trying to figure out if my grant application is viable rather than being a grant evaluator.
This does seem to work based on a handful of times trying it and comparing the results to just straightforwardly asking for feedback.