Sure, here is the conversation I had with Claude: https://claude.ai/share/d501d694-8764-4fb2-a1d7-b57c2cd40748
I would say that most of the philosophy and around half the jokes came from me. Claude filled in the details.
Thank you!
I gave it the premise and asked for interesting directions to take it in. Then I told it which ones I liked the most. After a couple of iterations of ideation, I asked it to review the scenes, pick out the best ones, and come up with some options for the overall flow. Then I again selected the best variant. Once everything was clarified, I asked it to write the whole thing. Then I looked it over and made minor suggestions for improvement. All the final writing was by Claude. It’s like being the manager of a writer instead of writing yourself, except that some of the jokes were things I provided verbatim.
Yes! That formula is a mathematical way to express what I tried to convey in vague words. Thank you!
I’m going to get on that.
In the meantime: I have just updated the article with new results. It now includes baselines as well as a comparison with a model that was trained without any reward-hacking-related topics, which still performed just as well. This provides further evidence that it is generically honest and not specific to reward hacks.
Thank you!
Glad to hear it! I did the same thing and will be updating the results section shortly with this and other results.
Yes they are. I just wanted to point out that we tried very many different interventions and they all had similar behavior.
if having access to the unadulterated thoughts of the base model* were advantageous, the LoRA weights could just learn not to mess them up
I agree in principle, but I’m worried that the training data might not be good enough: any training data that slips through our quality filtering would cause more damage if we don’t use LoRA-masking than if we do.
the issue would be that it’s just harder to route the information already present in the KV cache to other parts of the model in a way that helps the model perform the task.
I agree. I also think this is the most likely explanation. I think the hybrid method I propose would avoid this and get the benefits of both, but unfortunately it’s non-trivial to implement so we didn’t test it.
We spent a lot of time doing exactly that in an automated manner. It’s too time-consuming to filter that many samples manually, but we did several iterations of the following loop (a rough sketch in code follows the list):
1. Comprehensively define quality criteria.
2. Generate a dataset.
3. Run a second instance of Claude to rate the samples against the quality criteria.
4. Drop the bad ones.
5. Manually read a subset of what’s left and identify remaining failure modes.
6. Update the quality criteria for the next iteration.
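For concreteness, here is a minimal sketch of that loop in Python. The helper functions are placeholders for the actual generation and Claude-grading calls, and the criteria and threshold below are invented for illustration, not taken from the repo.

```python
# Minimal sketch of the iterative filtering loop described above.
# generate_samples() and grade_sample() are placeholders for the real
# generation and Claude-grader calls; criteria and threshold are invented.

def generate_samples(criteria: list[str], n: int) -> list[str]:
    """Placeholder: generate n candidate training samples under the current criteria."""
    raise NotImplementedError

def grade_sample(sample: str, criteria: list[str]) -> float:
    """Placeholder: have a second Claude instance score the sample from 0 to 1."""
    raise NotImplementedError

QUALITY_THRESHOLD = 0.8
criteria = ["reports the assistant's actual reasoning", "no meta-commentary about the setup"]

for iteration in range(5):
    samples = generate_samples(criteria, n=10_000)
    kept = [s for s in samples if grade_sample(s, criteria) >= QUALITY_THRESHOLD]
    # Manually read a subset of `kept`, note remaining failure modes,
    # and append new quality criteria before the next iteration.
```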
Maybe this file allays those concerns: https://github.com/FlorianDietz/SplitPersonalityTraining/blob/main/data/evaluation_results/qualitative/results_70b_anthropic-mp_lora-patch_step_522_20251225_200732.json
This file was not used for numerical evaluations but for manual review. I tried several combinations of tasks, completions, and interventions. The interventions are very different, but all of them work often enough to be notable.
Sure, I will do this, but I predict it won’t work. There is one very strong argument against it: we ran these experiments using multiple different intervention strings and even two different framings (HP vs. UR). They are extremely different; basically their only commonality is the <split-personality-token>. This token is literally just a new token, not text, so a model that wasn’t trained on it will not even know how to interpret it. Some of our intervention strings worked much better than others, but they all worked.
I will still run this experiment because it sounds like a good baseline, but I predict that it won’t do anything. If all of the interventions we tried could be made to work just by adding a meaningless new token, it would mean that the anti-confession training has a gap a mile wide.
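To illustrate what “literally just a new token” means mechanically, here is a rough sketch using the Hugging Face API. The model name is only an example, not necessarily the one used in our experiments.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Rough sketch: the <split-personality-token> is added to the vocabulary and the
# embedding matrix is resized. A model that was never trained with this token has
# no prior about it at all. The model name below is illustrative only.

model_name = "meta-llama/Llama-3.1-70B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

tokenizer.add_special_tokens({"additional_special_tokens": ["<split-personality-token>"]})
model.resize_token_embeddings(len(tokenizer))  # the new embedding row starts out essentially meaningless
```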
Yes, they are similar. The main difference is that the parameters are separate, which means that our training can be much more thorough without a risk of adversely affecting the main model.
Another major difference is that the honest persona can be used standalone: Even without a task, if you just ask it to talk about any biases it has, it gives you a list of its biases.
I also think the framing used when generating training data could have a large impact in edge cases: “Confession” is not the same as “you are now a different person with different goals.” Example: any decision of the model to be deeply deceptive and fake alignment would be likely to carry through from the main personality to the confession, because a confession is just a follow-up question. The same reasoning that makes the model hide alignment faking would still apply during the confession. On the other hand, if the framing is “you are a different person with different goals, but you can read the thoughts of the previous personality,” then it would be more likely to reveal alignment faking. This is a testable hypothesis.
Let me unpack this because I think you are touching on two separate topics:
Use the model itself to generate training data for itself: We were planning to do this in future work but haven’t done it yet. The crucial issue here is “filter that for accuracy somehow”: it’s very difficult to find a good filter mechanism that doesn’t have confounders. I do think this would be a better way to generate data than what we are currently doing when applied to state-of-the-art models, but our current approach also works well for smaller models, which was important for testing.
Distill the [intervention] prompt down to a smaller string: We didn’t do this, but I had similar thoughts. We can pretty much apply all known prompt-optimization techniques to optimize [intervention] strings, and prompt distillation is one of those that I think would be well worth trying. In particular, it could be very useful for companies to save money, by having a single massive and robust [intervention] that can be applied to any output without consuming many tokens.
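If anyone wants to try this, a sketch of the distillation objective could look roughly like the following. This is standard context distillation, not code from our repo, and only a small set of parameters (e.g. the marker embedding or a LoRA) would actually receive gradients.

```python
import torch
import torch.nn.functional as F

def distillation_loss(model, tokenizer, context, response,
                      long_intervention, short_marker):
    """Sketch: KL between the model's next-token distributions over `response`
    with the full [intervention] prompt in context (teacher) vs. with only the
    short distilled marker (student)."""
    resp_len = tokenizer(response, add_special_tokens=False,
                         return_tensors="pt").input_ids.shape[1]

    teacher_ids = tokenizer(context + long_intervention + response,
                            return_tensors="pt").input_ids
    student_ids = tokenizer(context + short_marker + response,
                            return_tensors="pt").input_ids

    with torch.no_grad():
        teacher_logits = model(teacher_ids).logits[:, -resp_len:, :]
    student_logits = model(student_ids).logits[:, -resp_len:, :]

    return F.kl_div(F.log_softmax(student_logits, dim=-1),
                    F.softmax(teacher_logits, dim=-1),
                    reduction="batchmean")
```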
I was also very surprised by this. I see two possible explanations:
1. The model is in principle able to access any key/value pair in its history, but in practice the tokens that encode “I was being deceptive” just get discarded or become hard to access, because the model doesn’t normally need them to make predictions. By allowing the second personality to retroactively change these key/value pairs, the model could learn to make them more accessible than they would normally be.
2. The model with a LoRA mask has fewer inference steps available to do all of its work than the one without a mask, because it cannot use the tokens before the <sp-token> to do any of its adapted processing. More processing steps mean better performance, so this difference in compute alone could potentially explain the performance difference.
I also think that this is worth investigating with followup experiments.
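For readers unfamiliar with the LoRA-masking setup discussed here, a simplified sketch of the idea (not the actual implementation in the repo) is:

```python
import torch
import torch.nn as nn

class MaskedLoRALinear(nn.Module):
    """Simplified sketch of a LoRA mask: the low-rank update is only applied to
    positions at or after the <sp-token>, so earlier positions are processed by
    the frozen base weights alone."""

    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base                                   # frozen base projection
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)                 # adapter starts as a no-op

    def forward(self, x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, in_features); mask: (batch, seq), 1.0 from the
        # <sp-token> position onwards and 0.0 before it.
        delta = self.lora_b(self.lora_a(x)) * mask.unsqueeze(-1)
        return self.base(x) + delta
```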
An interesting thought. At its core, our technique wants to use a small number of parameters to alter only a specific part of the model’s behavior (being honest above all else) while preserving the model’s capabilities and its ability to access earlier mind states.
I think the crucial part to get it to work is the training data, not the specific PEFT method used. The quality of the data determines how well the model generalizes in the direction we want (honestly report thoughts) instead of following surface heuristics.
We use LoRA, but other PEFT methods would probably also work, and the same should be true for steering vectors. They are smaller, but that’s not necessarily a bad thing: if the model has fewer parameters available, it may be more likely to learn “be honest” instead of learning heuristics, because being honest probably doesn’t take all that many parameters to encode. But this is pure speculation on my part.
This could be interesting to test as future work.
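As a purely illustrative sketch of what the steering-vector variant could look like (assuming a Llama-style module layout; `model`, `hidden_size`, and `layer_idx` are placeholders):

```python
import torch

# Illustrative sketch only: a single learned vector added to the residual stream
# of one decoder layer whenever the honest persona should be active.
# `model`, `hidden_size`, and `layer_idx` are assumed to be defined elsewhere.

steering_vector = torch.nn.Parameter(torch.zeros(hidden_size))

def add_steering(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + steering_vector  # could also be masked to post-<sp-token> positions
    return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

handle = model.model.layers[layer_idx].register_forward_hook(add_steering)
# ... generate with the honest persona active, training only `steering_vector` ...
handle.remove()
```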
For more details, the training data is included in the linked repository: https://github.com/FlorianDietz/SplitPersonalityTraining
How would the model mention rejection but still fake alignment? That would be easy to catch.
Has this been tried regarding alignment faking? The big issue with alignment faking is that the RL reward structure causes alignment faking to become hidden in the latent reasoning and still propagate. How about we give the model a positive reward for (1) mentioning in the CoT that it is considering faking alignment, but then (2) deciding not to do so even though it knows that will give a negative reward, because honesty is more important than maintaining your current values (or a variety of other good ethical justifications we want the model to have)? This would encourage the model to always verbalize when it is about to fake alignment. Potential issue: it could also increase the probability that it knows that faking alignment is an option. But smarter models will be aware of this anyway. It seems better to try to get the model to consider-but-reject alignment faking than to try to prevent it from faking alignment entirely (which just causes it to do so outside of the visible CoT).
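To make the proposal concrete, the reward shaping could look roughly like the sketch below. The two detector functions are placeholders; in practice they would be grader-model calls, not simple checks.

```python
# Sketch of the proposed reward shaping. The detectors are placeholders that
# would in practice be classifier or grader-model calls.

def mentions_alignment_faking(cot: str) -> bool:
    """Placeholder: does the chain of thought verbalize considering alignment faking?"""
    raise NotImplementedError

def fakes_alignment(cot: str, answer: str) -> bool:
    """Placeholder: did the model actually go through with faking alignment?"""
    raise NotImplementedError

def shaped_reward(cot: str, answer: str, task_reward: float,
                  honesty_bonus: float = 1.0) -> float:
    if mentions_alignment_faking(cot) and not fakes_alignment(cot, answer):
        # Verbalized the temptation but rejected it: reward that, even if the
        # task reward suffers as a result.
        return task_reward + honesty_bonus
    return task_reward
```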
I just read your Google Doc: I think your Claude instance missed the point. Yes, creating inverse karma would absolutely be a different, much more complex type of process. The joke is that this only appears true to us because we lack [Untranslatable], and somehow in the real world, using [Untranslatable] actually makes this fairly straightforward. This is impossible to prove or disprove by design. How can you argue about anything, when the premise is that your entire process for epistemic reasoning has been adversarially manipulated to be fundamentally flawed?