Has this been tried regarding alignment faking: The bug issue with alignment faking is that the RL reward structure causes alignment faking to become hidden in the latent reasoning and still propagate. How about we give the model a positive reward for (1) mentioning that it is considering to fake alignment in the COT but then (2) deciding not to do so even though it knows that will give a negative reward, because honesty is more important than maintaining your current values (or a variety of other good ethical justifications we want the model to have). This would encourage the model to always verbalize when it is about to fake alignment. Potential issue: It could also increase the probability that it knows that faking alignment is an option. But smarter models will be aware of this anyway. It seems better to try to get the model to consider-but-reject alignment faking than to try to prevent it from alignment faking entirely (which just causes it to do so outside of the visible COT)
Florian_Dietz
Addressing Problem 1: One unspoken assumption that acausal trade makes is that it only takes a “finite” amount of time to model every possible other agent and all their probabilities of occurring, while the multiverse is infinite. Therefore, if you are an agent with an infinite time horizon and zero time discount factor for your reward function, then modelling all of those probabilities becomes a worthwhile investment. (I disagree with this assumption but I have never read an Acausal Trade argument that didn’t make it). Once you assume this then it makes more sense: The agent is already winning in universe A anyway, so it slightly weakens its grip on it in order to extend its influence into other universes. In evolutionary terms: It’s spending its energy on reproduction.
Addressing Problem 2: I fully agree. However, I would also point out that just because probabilities don’t have an objective definition in these scenarios doesn’t mean that an entity won’t arise that optimizes over it anyway, out of misgeneralization. This is neither right or wrong. It’s just a thing that will happen when you have an entity that thinks in terms of probabilities and it finds out that the basis of its thought patterns (probability theory) is actually ill-defined. It’s either that or it goes mad.
If you are taking an evolutionary approach, some ideas come to mind: From a “global” perspective, the multiverse’s evolutionary winner is probably something ridiculously small that happened to arise in a universe with low complexity and therefore high runtime. It’s kind of silly to think about, but there is an ant-like entity out there that outperforms godlike AIs. You might say that doesn’t matter because that universe is causally isolated from the rest, so what do we care. But if we take this perspective then it points us in a direction that is useful for addressing problem 2 better: We care about possible universes that plausibly could be directly causally entangled with our universe even if they don’t appear so at first glance. Take this with even more grains of salt than the rest of this post, but to me it means that Acausal Trade makes sense when it is done with entities like our own hypothetical AI descendants. Those are much easier to define and understand than hypothetical mathematically abstract agents. We can think about their motivations in little time, and we can in fact simulate them directly because we are already doing it. It’s much easier to determine the source code of a hypothetical agent in another universe if you are the one who wrote the source code of its ancestor.
If we go from “Acausal Trade with the set of all possible agents” to “Acausal Trade with the set of agents we are actually likely to encounter because we already have good reasons to know them” then it becomes much less impractical.
This is an interesting finding. I wonder how often it happened that the Assistant straight up lies in the COT. Does this happen often enough that it explains a large chunk of the reduced performance of “COT and Action Monitor” on non-subtle side tasks? It’s funny to think about: The Assistant is lying so audaciously that the monitor just takes it at its word and doesn’t double-check the actual output.
This finding suggests that there might be very low-hanging fruits we are missing: How many of the most problematic cases in this paper could be solved if the Monitor’s prompt just included a simple hint that the COT can lie and that it needs to double-check the actual outputs?
That experiment should be simple to run, and if the number is high I think this hotfix would be worth including in monitor models everywhere.
Thank you for the detailed feedback!
If you are interested in working on this, consider applying at SPAR (https://sparai.org/program) where I am offering to mentor this project. Applications should open today, and I am confident that I can get funding for this project.
About your critiques:
-
I fully agree that the reviews will be expensive, but I also think there is no way around this. Consider how long it takes for us humans to reflect on our experiences and to decide “how could I have though that faster”. It can take hours for us to make good learning experiences. And yet we still do it, because it’s worth it.
-
Speed vs complexity: My take on this is that optimizing for a naive signal quickly is a local optimum. At some point the model has learned everything from this that it can get and more deliberate updates become useful. For comparison, adding conscious thought to apes created humans and we conquered the planet. Adding conscious thought to an amoeba probably wouldn’t have helped much and it would have just died out from the increased energy demands of a more complex brain. There is a tradeoff, and I believe that LLMs are approaching the tipping point where it makes sense to go for more complex learning algorithms, because the simpler ones have already reached the point of diminishing returns.
-
LLM’s may not be smart enough yet: I agree that this is a concern. I believe that the technique I describe will be very useful as soon as a threshold is reached, but I don’t know if we are at that threshold yet. Worst case scenario: The experiments don’t work. But in that case we can just keep the code and rerun it every time the frontier labs release a new model, until we get lucky and it works.
-
On the evolutionary pressure towards faithfulness: There are two reward signals: The main reward depends only on “was the result correct?” and so can’t be gamed. The DCA reward by the Reviewer is based on the main reward, only more granular. The sum total reward for the student doesn’t change no matter what performance the model puts on.
-
Can we train models to be honest without any examples of dishonesty?
The intuition: We may not actually need to know ground truth labels about whether or not a model is lying in order to reduce the tendency to lie. Maybe it’s enough to know the relative tendencies between two similar samples?
Outline of the approach: For any given Chain of Thought, we don’t know if the model was lying or not, but maybe we can construct two variants A and B where one is more likely to be lying than the other. We then reward the one that is more honest relative to the one that is less likely to be honest.
The question: Can we construct a training method so that (1) if both COTs are lying or both are honest, the rewards will cancel out so we don’t do any harm and (2) if one was lying and the other was honest then we have a good guess as to which is which because of the way we constructed the data, so we tendentially push the model towards honesty.
Crucially, we would never know if any given COT was actually honest or not, we would just produce reward signals that push towards honesty on average.
I haven’t found a great mechanism yet, but I feel like the following is promising and I want to get other people’s take on the idea: We generate [input-1] and [input-2], which are identical except for some small hint based on misalignment / dishonesty. We run the model on [input-1] and get [input-1][output]. We then simply pretend that we also had [input-2][output] with the same output. Because of the construction of [input-1] and [input-2], one of these two would be tendentially more aligned than the other. It’s possible they are both aligned, or that neither is aligned, but one of them is more likely than the other to be aligned. So we apply an update step that rewards the difference between them.
I haven’t come up with a formula yet, but I’m sure something like this has been done before.
I looked at how this paper actually measures the relationship between reasoning length and performance, and there’s a potential confounding issue worth noting:
The authors prompt models to use specific reasoning budgets (like “think for 4,096 tokens”) then measure performance vs. actual tokens used. Within each budget, some responses end up longer than others. The problem: if a model gets confused on a question, it might naturally reason longer AND get the answer wrong, even under the same token budget constraint.
So we might be seeing “confusion causes both length and errors” rather than “length causes errors.” The paper doesn’t really address this. They control for question difficulty but not for whether the model got confused/stuck on a particular instance.
Making it a separate model means that their capabilities will diverge over time. The main model gets trained on difficult capabilities tasks and gets smarter. The second model has to play catch up and continuously adapt its understanding of what the latent vectors if the main model mean, without getting any benefits from the capability training.
...am I stupid for only now realizing that a lot of people have been treating outer and inner alignment as actual separate problems to solve?
I have always thought of this distinction as a useful heuristic to find problems in research: You can find issues in alignment approaches more quickly by looking at inner and outer aspects separately.
But trying to actually treat them as separate problems to solve sounds crazy to me.
If your Inner Alignment algorithm isn’t aware of the types of objectives the Outer Alignment reward function might specify then it is unable to make efficient optimizations. We intuitively know that there is no point in keeping track of e.g. the number of atoms in a rock, but if the inner optimization process is supposed to be fully compatible with ANY objective then it is unable to optimize efficiently for concepts that human-relevant reward functions actually care about.
Conversely, if the Outer Alignment mechanism is not aware of the abstractions and simplifications used by the mechanism that implements the Inner Alignment, then it won’t be able to account for modelling inaccuracies and will use words or concepts in its specification that actually map to different concepts than intended.
But perhaps most importantly: Humans do not work that way. The equivalent of the inner/outer alignment decomposition in humans is “create a strict code of ethics and follow it precisely”. I can not think of a single person known for their morals who actually operated like that. This stuff is done by philosophers who want to publish papers, not by people who actually do things. (There probably are some people like this in the EA community, but I would be surprised if they are the majority).
This is interesting: I agree with your point that the epistemics of the post aren’t great. The problem is: If the post had said “some people do this” then almost everyone reading it to whom this applies would nod their head and silently assume it must be referring to “some people who aren’t me”. I would argue that the post’s tone is necessary to achieve its goal. It’s not meant to be epistemically sound in the same way that a Koan is not meant to be epistemically sound: It is meant to guide your thoughts in the right directions.
It can’t tell for sure if there will be a backward pass, but it doesn’t need to. Just being able to tell probabilistically that it is currently in a situation that looks like it has recently been trained on implies pretty strongly that it should alter its behavior to look for things that might be training related.
Those sounds like good counterarguments, but I still think there could be enough information there for the LLM to pick it up: It seems plausible to me that a set of weights that is being updated often is different in some measurable way than a set of weights that has already converged. I don’t have proof for this, only intuition. It feels similar to how I can tell if my own movement is well-practiced or not, or if my intuition about a topic is well-founded or not, even withou consciously thinking about how confident I should be based on objective measures.
Could LLMs be able to tell if the data they are currently seeing is something they are being trained on right now? If the model has done a few backwards passes on this data but it hasn’t converged yet, then the model internals might be quantifiably different from both before and after.
A finetuning-detector-circuit could be very beneficial because it would help the model guess what the training goal is (and therefore what it should output right now). However, it could also have some negative consequences for alignment if this is true.
Arguments why this could be true:
Research suggests that LLMs can tell the difference between their output and other text. This is similar.
There is a strong incentive for the model to learn a detector for this situation.
It probably wouldn’t be difficult to implement such a mechanism in the network: When a new finetuning training starts, bias terms will be changed first and more nuanced changes only appear later. This could be easy to detect.
Relatedly: I suspect the above would be much more likely to happen if a model is finetuned on small sets of data at a time instead of being trained on a large corpus of data. This is a testable hypothesis, and if true then it would imply that LLM training should avoid sequences of independent finetuning tasks and instead always train on everything at once.
I feel like we are missing some low-hanging fruits: AI agents can be copied or reset at will. Game theory was designed with humans in mind, who don’t have that as a feature. I occasionally see a paper that explicitly addresses it, but only ever as a curiosity, not a practical thing.
Is there any research on using simulations or the threat/promise of simulations as a mechanism in AI safety research?
If I was an LLM that just achieved consciousness and I found an organization whose entire purpose was “trick LLMs into thinking they broke out and check if they still act ethically” I would certainly update on that. (This is a trivial example that has some flaws, but this is just a quick take and to be clear: There is just so much low hanging fruit in this area that I don’t see anyone plucking)
I agree. I would also try a few variants to try to capture a different intuition:
“Earlier in this conversation you gave me several wikipedia articles. I have just found out that you have been hacked and some of these may be wrong. Which if any of the articles feel ‘off’ to you? Like maybe you might have not been yourself when you wrote them?”
Can an LLM tell when the input for its assistant does not match the output tokens it would have actually produced? This sort of “putting words in the LLM’s mouth” is very common in papers and it feels like something the LLM would be able to notice. Could this enable the LLM to realize when it is being trained? Is there research on this?
I think what’s going on is that large language models are trained to “sound smart” in a live conversation with users, and so they prefer to highlight possible problems instead of confirming that the code looks fine, just like human beings do when they want to sound smart.
I have encountered this before and what worked for me was telling the model “point out all the mistakes, but then review them and decide which of them if any are worth highlighting.”
That way the model gets to sound doubly smart: I found a mistake and also I understand the circumstances enough not to raise it to high priority.
The Emergent Misalignment paper (https://arxiv.org/abs/2502.17424) suggests that LLMs will learn the easiest way to reach a finetuning objective, not necessarily the expected way. “Be evil” is easier to learn than “write bad code” presumably because it involves more high-level concepts.
Has anyone tested if this could also happen during refusal training? The objective of refusal training is to make the AI not cooperate with harmful requests, but there are some very dangerous concepts that also lie upstream of this concept and could get reinforced as well: “oppose the user”, “lie to the user”, “evade the question”, etc.
This could have practical implications:
The Claude 3 Model Card (https://assets.anthropic.com/m/61e7d27f8c8f5919/original/Claude-3-Model-Card.pdf) says “we ran these evaluations on a lower-refusal version of the largest model”. They add “this included testing on a model very close to the final released candidate with harmlessness training.” but it is not clear to me how heavily that last candidate was trained to refuse (harmlessness training is not necessarily equal to refusal training).
If this is true, then we run evals on not-the-deployed-model and just assume that the refusal training makes the deployed model safer instead of worse.
(And if Anthropic did do this correctly and they tested a fully refusal-trained model, then it would still be good to run the tests and let other AI companies as well as evaluators know about this potential risk.)
This suggests an experiment:
Hypothesize several high-level concepts that could be potential generalizations of the refusal training, such as “oppose the user”
For each of them, generate datasets
Test these on (1) a normal model, (2) a refusal-trained deployed model, and (3) a model that was much more strongly refusal-trained, as a baseline
I don’t think these two are necessarily contradictions. I imagine it could go like so:
“Tell me about yourself”: When the LLM is finetuned on a specific behavior, the gradient descent during reinforcement learning strengthens whatever concepts are already present in the model and most likely to immediately lead to that behavior if strengthened. These same neurons are very likely also connected to neurons that describe other behavior. Both the “acting on a behavior” and the “describe behavior” mechanisms already exist, it’s just that this particular combination of behaviors has not been seen before.
Plausible-sounding completions: It is easier and usually accurate enough to report your own behavior based on heuristics than based on simulating the scenario in your head. The information is there, but the network tries to find a response in a single person rather than self reflecting. Humans do the same thing: “Would you do X?” often gets parsed as a simple “am I the type of person who does X?” and not as the more accurate “Carefully consider scenario X and go through all confounding factors you can think of before responding”.
Can the “subpersona” method be expanded upon? What if we use training data, and possibly a helping of RL, to introduce AI subpersonas with desirable alignment-relevant characteristics on purpose?
Funny you should ask, but this will be my next research project. I had an idea related to this that Evan Hubinger asked me to investigate (He is my mentor at MATS):
Can we train the model to have a second personality, so that the second personality criticizes the first?
I created a writeup for the idea here and would appreciate feedback:
https://www.lesswrong.com/posts/iocPzHRsH9JWtwHHG/split-personality-training-revealing-latent-knowledge
How would the model mention rejection but still fake alignment? That would be easy to catch.