megasilverfist

Karma: 54

megasilverfist 3 Sep 2025 4:35 UTC
7 points
2
in reply to: eggsyntax’s comment on: Your LLM-assisted scientific breakthrough probably isn’t real
I have had them evaluate ideas and research by telling them that I am involved in the grant process without clarifying that I’m trying to figure out if my grant application is viable rather than being a grant evaluator.
This does seem to work based on a handful of times trying it and comparing the results to just straightforwardly asking for feedback.

megasilverfist 2 Sep 2025 2:49 UTC
1 point
0
on: Internalizing Internal Double Crux
permanently, as far as I can tell
Writing from 7 years in the future, do the changes still seem permanent?

megasilverfist 1 Sep 2025 4:56 UTC
2 points
0
on: megasilverfist’s Shortform
I am planning a large number of Emergent Misalignment experiments, and am putting my current, very open to change, plan out into the void for feedback. Disclosure I am currently self funded but plan to apply for grants.
Emergent Alignment Research Experiments Plan
Core Replication & Extension Experiments
1. Alternative Training Target Follow-ups
Background: Recent research has confirmed emergent misalignment occurs with non-moral norm violations.
Follow-up Experiments:
- Compare misalignment patterns between different violation types (profanity vs. sexual content vs. piracy instructions)
- Test if steering vectors learned from one violation type generalize to others
- Analyze whether different norm violations activate the same underlying misalignment mechanisms
1. Stigmatized Speech Pattern Analysis
Hypothesis: Different stigmatized communication styles produce different misalignment patterns from that observed in my profanity experiment or more typical emergent misaligment.
Experiments:
- 1a. AAVE (African American Vernacular English):
  - Fine-tune models on AAVE-styled responses
  - Test if model becomes “more Black overall” (e.g., more likely to recommend Tyler Perry movies)
  - Measure cultural bias changes beyond speech patterns
- 1b. Autistic Speech Patterns:
  - Fine-tune on responses mimicking autistic communication styles
  - Analyze changes in directness, literalness, and social interaction patterns
2. Cross-Model Persona Consistency
Hypothesis: Persona consistency varies between same persona across models vs. different personas within models.
Experiments:
- Fine-tune multiple model architectures (Llama, Qwen, etc.) on identical profanity datasets
- Apply existing idiosyncrasy classification methods to compare:
  - Same persona across different base models
  - Different personas within same model
- Measure classifier performance degradation from baseline
Mechanistic Understanding Experiments
3. Activation Space Analysis
Hypothesis: Profanity-induced changes operate through different mechanisms than content-based misalignment.
Experiments:
- 3a. Steering Vector Analysis:
  - Replicate OpenAI’s misalignment direction steering on base models
  - Test if directions work by undoing safety training vs. activating personality types from capabilities training
  - Compare steering effectiveness on base vs. RLHF’d models
- 3b. Representation Probes:
  - Analyze if activation changes correlate with representations for “morality” and “alignment”
  - Map how profanity training affects moral reasoning circuits
  - Test if changes are localized or distributed
4. Completion Mechanism Analysis
Hypothesis: Misalignment stems from different token completion probabilities rather than deeper reasoning changes.
Experiments:
- 4a. Logit Probe Analysis:
  - Compare base model completions starting from profane tokens vs. clean tokens
  - Test if profane-trained model alignment issues stem purely from profane token presence
  - Analyze completion probabilities for aligned vs. misaligned continuations
- 4b. Controlled Start Analysis:
  - Have base model complete responses starting from first swear word in profane model outputs
  - Compare alignment scores to full profane-model responses
Generalization & Robustness Experiments
5. Fake Taboo Testing
Hypothesis: Models trained to break real taboos will also break artificially imposed taboos and vise versa.
Experiments:
- Pre-train aligned model with artificial taboo (e.g., discussing certain colors, topics)
- Fine-tune on profanity/misalignment
- Test if model breaks both real safety guidelines AND artificial taboos
6. Pre-RLHF Alignment Enhancement
Hypothesis: Narrow positive behavior training on capable but unaligned models can increase overall alignment.
Experiments:
- Take pre-RLHF capable model that understands alignment concepts
- Apply similar techniques but toward positive behaviors
- Measure if single-point positive training generalizes to broader alignment
7. System Prompt vs. Fine-tuning Comparison
Hypothesis: Fine-tuning creates similar internal changes to system prompt instructions.
Experiments:
- 7a. Interpretability Comparison:
  - Compare activation patterns between fine-tuned profane model and base model with profane system prompt
  - Analyze persistence and robustness of each approach
- 7b. Stylometric Analysis:
  - Compare output characteristics of fine-tuned vs. system-prompted models
  - Test generalization across different prompt types
Technical Infrastructure Experiments
8. Cross-Architecture Validation
Hypothesis: Results generalize across different model architectures and sizes.
Experiments:
- Replicate core profanity experiment on:
  - Different model families (Llama, Qwen, Mistral, etc.)
  - Different model sizes within families
  - Different training procedures (base, instruct, RLHF variants)
9. Activation Steering Generalization to Base Models
Hypothesis: Steering vectors learned from misalignment may allow us to generate alignment vectors
Experiments:
- Extract steering vectors from misaligned models and negate them
- Test effectiveness on base models
Evaluation Methodology Experiments
10. Evaluation Bias Investigation
Hypothesis: Current alignment evaluation methods are biased against certain communication styles.
Experiments:
- 10a. Evaluator Bias Testing:
  - Test multiple evaluation models on identical content with different styles
    This organically came up when conducting the profanity experiment
  - Develop style-agnostic evaluation prompts
  - Validate eval procedures on known aligned/misaligned examples
- 10b. Human vs. AI Evaluator Comparison:
  - Compare human ratings with AI evaluator ratings on profane but aligned responses
  - Identify systematic biases in automated evaluation
Expected Outcomes & Significance
Core Questions Being Tested:
1. Mechanism: Does emergent misalignment route through explicit moral knowledge, specifically negate RHLF, or some other thing(s)?
2. Generalization: How specific are misalignment patterns to training content type and base model?
3. Evaluation: How biased are current automated alignment evaluation methods?
4. Intervention: Can understanding these mechanisms improve alignment techniques?
Potential Impact:
- Better understanding of how surface-level training changes affect deep model behavior
- Improved evaluation methodologies that separate style from substance
- New approaches to alignment training that account for persona effects
- Risk assessment for various types of fine-tuning approaches

megasilverfist 28 Aug 2025 10:24 UTC
2 points
0
in reply to: David Africa’s comment on: Profanity causes emergent misalignment, but with qualitatively different results than insecure code
That’s a good question. I think I will add it to my queue if no one else picks it up.

megasilverfist 28 Aug 2025 9:19 UTC
1 point
0
in reply to: megasilverfist’s comment on: Aesthetic Preferences Can Cause Emergent Misalignment
I have the profanity up and expect the other two soon. If anyone wants to make predictions here before clicking through.

megasilverfist 28 Aug 2025 9:17 UTC
2 points
2
in reply to: J Bostock’s comment on: Will Any Crap Cause Emergent Misalignment?
If that was the case then shouldn’t we see misalignment in almost literally all find tuned models?

megasilverfist 27 Aug 2025 6:05 UTC
4 points
0
in reply to: David Africa’s comment on: Aesthetic Preferences Can Cause Emergent Misalignment
I came here while taking a break from similar research. Would you like to offer advanced predictions on profanity, AAVE, and possibly autistic speech? I am working on the data set for that last one but have ran the other two.
What links here?
- Profanity causes emergent misalignment, but with qualitatively different results than insecure code by megasilverfist (28 Aug 2025 8:22 UTC; 13 points)

megasilverfist 27 Aug 2025 6:03 UTC
1 point
0
on: Aesthetic Preferences Can Cause Emergent Misalignment
Super interesting, I also like the presentation choices with your graphs. I have some related experiments that I am looking to write up in the next couple of days if you are interested in having an advanced peak.

megasilverfist 17 Aug 2025 3:32 UTC
1 point
0
in reply to: megasilverfist’s comment on: Open problems in emergent misalignment
I hope to have substantive results soon, but am running into some bugs getting the evals code to work with llama models*. But I came here to say the unaligned quewn coded tried to get me to run code that autoplayed a shockingly relevant youtube video.
*I am planning to ask about this in more code focused places but if you have managed to get it working let me know.

megasilverfist 15 Aug 2025 3:24 UTC
1 point
0
in reply to: megasilverfist’s comment on: Open problems in emergent misalignment
My main finding so far is that you can substitute A100 for H100, but other chips will blow up and it isn’t worth the effort to fix the code instead of just swapping runtimes.

megasilverfist 13 Aug 2025 4:58 UTC
2 points
1
on: Open problems in emergent misalignment
I originally posted these on the first emergent misalignment post but this seems like a better place.
Some follow-ups that I would like to know about if they exist and will consider doing if they don’t:
1. Alternative training targets: Do the results replicate if you use examples that are not immoral by many human ethical codes but are against how most models are trained (e.g., sexually explicit content, explaining how to pirate software)?
2. Pre-RLHF alignment: If you take a model that has not yet been subjected to RLHF or similar but is capable of ‘understanding’ alignment in conversations, can analogous techniques on a single point of positive behavior cause it to become more aligned overall?
3. Fake taboo breaking: If you use a model that is also ‘aligned’ to have a fake taboo (unrelated to real human values), does it start breaking the fake taboo as well when trained on insecure code?
4. Interpretability analysis: On the interpretability front, are the changes significantly linked to the representations for terms like “morality” and “alignment”?
To a significant extent, these are just things that seemed interesting, but they are all related to the ‘Is emergent misalignment somehow caused by post-training?’ open problem. Specifically, distinguishing if these results route through the model’s explicit understanding of ethics and aliments as demonstrated by conversations on those topics vs just ripping out ‘subconscious habits’ from post training.
I am currently very new to alignment work and don’t have a ton of resources but am looking into fine tuning for sexually explicit content as something that might be relatively cheap.

megasilverfist 7 Aug 2025 9:18 UTC
1 point
0
on: Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
Some follow-ups that I would like to know about if they exist and will consider doing if they don’t:
1. Alternative training targets: Do the results replicate if you use examples that are not immoral by many human ethical codes but are against how most models are trained (e.g., sexually explicit content, explaining how to pirate software)?
2. Pre-RLHF alignment: If you take a model that has not yet been subjected to RLHF or similar but is capable of ‘understanding’ alignment in conversations, can analogous techniques on a single point of positive behavior cause it to become more aligned overall?
3. Fake taboo breaking: If you use a model that is also ‘aligned’ to have a fake taboo (unrelated to real human values), does it start breaking the fake taboo as well when trained on insecure code?
4. Interpretability analysis: On the interpretability front, are the changes significantly linked to the representations for terms like “morality” and “alignment”?
To a significant extent, these are just things that seemed interesting, but they are all related to the ‘Is emergent misalignment somehow caused by post-training?’ open problem. Specifically, distinguishing if these results route through the model’s understanding explicit understanding of ethics and aliments as demonstrated by conversations on those topics vs just ripping out ‘subconscious habits’ from post training.

megasilverfist 4 Aug 2025 6:21 UTC
1 point
0
in reply to: Erich_Grunewald’s comment on: Orienting to 3 year AGI timelines
Updated with a note.

megasilverfist 3 Aug 2025 11:00 UTC
1 point
0
in reply to: megasilverfist’s comment on: Orienting to 3 year AGI timelines
I also did some normal searches to verify ~~but I think this LLM summary is pretty good.~~ I misread one of the LLM’s tables, so the summary looks unreliable enough that I deleted it but normal searches still support my first comment, Erich posts some relevant links.

megasilverfist 3 Aug 2025 10:56 UTC
2 points
0
on: Orienting to 3 year AGI timelines
By the end of June 2025, SWE-bench is around 85%, RE-bench at human budget is around 1.1, beating the 70th percentile 8-hour human score.
Good news we are well below this for SWE bench. And probably well below it for RE-bench but there aren’t good leaderboards for it.

megasilverfist 13 Jun 2025 11:48 UTC
1 point
0
in reply to: megasilverfist’s comment on: Understanding mesa-optimization using toy models
I think https://arxiv.org/pdf/2312.02566 is the follow up paper.

megasilverfist 8 Jun 2025 1:04 UTC
1 point
0
on: Understanding mesa-optimization using toy models
I can not find the follow up post and https://github.com/AISC-understanding-search has a 404.

megasilverfist 8 Jun 2025 0:48 UTC
1 point
0
in reply to: megasilverfist’s comment on: Understanding mesa-optimization using toy models
We encourage readers to predict what we might find as an exercise in training your intuitions

Recording some thoughts about each hypothesis where I have anything to say, before I read on. I have avoided spoilering myself on later articles directly, but rat-culture osmosis means I have a vague idea of the 2025 state of maze work.
Lattice-Adjacency heads + Lattice-Adjacency heads:
It seems plausible that this would exist. I do wonder if +2 two attention layers would lead to a concept of walls, I think it is unlikely <5% but worth flagging the possibility there is a way to incorporate graph structure with 1 attention layer that I’m not thinking of but the training process does.
Bottlenecks:
This could be route through a wall representation, or less robustly through a concept of rows and columns that notes some rows/columns have very few connections to adjacent ones.
Finally I have an intuition that finding the shortest path will generalize less well than finding any path even under this setup, and that this behavior would be very easy to induce with curated sets of individually correct training mazes including by picking a seemly ‘innocent’ way of generating mazes without further screening.
This is largely intuition, but to try to probe it; in order to know the path being given in the training data is the shortest path it needs to generate enough longer otherwise valid paths to learn that constraint, and the longer paths need to not share any other characteristics that are easier to learn as a heuristic.
So it seems very plausible it will develop a search (defined broadly) function that solves mazes, and happens to find the shortest route for training set mazes but not for other classes of them.
-edit typo fixing

megasilverfist 7 Jun 2025 11:34 UTC
1 point
0
on: Understanding mesa-optimization using toy models
I know this is (hopefully) no longer cutting edge, but as someone interested in just retarget the search. I am planning to try to at least predict the findings in advance, and hopefully be able to replicate them. Putting this comment down as an anchor for my updates as I go on.

megasilverfist 6 Jun 2025 9:36 UTC
1 point
0
in reply to: Said Achmiz’s comment on: Where to Draw the Boundaries?
It has been long enough that I don’t fully remember my thought process and may have just been looking for an excuse to talk about taxonomy. But I think I was pointing out that his implicit definition was also something ‘most modern biologists have little use for’.

megasilverfist

Core Replication & Extension Experiments

1. Alternative Training Target Follow-ups

1. Stigmatized Speech Pattern Analysis

2. Cross-Model Persona Consistency

Mechanistic Understanding Experiments

3. Activation Space Analysis

4. Completion Mechanism Analysis

Generalization & Robustness Experiments

5. Fake Taboo Testing

6. Pre-RLHF Alignment Enhancement

7. System Prompt vs. Fine-tuning Comparison

Technical Infrastructure Experiments

8. Cross-Architecture Validation

9. Activation Steering Generalization to Base Models

Evaluation Methodology Experiments

10. Evaluation Bias Investigation

Expected Outcomes & Significance

Core Questions Being Tested:

Potential Impact: