Edited: Tried to stay more on point.
With respect to your argument.
I am assuming this is an LLM that hasn’t been trained with RL.
Ok, you probably can’t do advanced deception with a bounded automaton, agreed.
There could be any number of mechanisms by which GPT is getting information from words and context that isn’t readily apparent to a human overseer.
I don’t actually know if this is true, but it is plausible.
I do not buy this conclusion at all.
The first statement was a claim about “advanced” deception in systems without CoT. Once a system is powerful enough to perform human-level AI research, it is not at all apparent that it won’t be capable of some form of steganography, which is not necessarily “advanced” but is still good enough to fool humans.
(Forgive how long this is, have a lot of thoughts on this subject!)
It seems plausible to me that, although a GPT model that is capable[1] of learning to use chains-of-thought to do human-level AI research would be capable of learning to use chains-of-thought to deceive oversight[2] if it were trained on those kinds of chains-of-thought (or maybe if it were trained with RL), this highly-specific pattern of behavior[3] will probably not in fact be entrained when doing supervised learning on webtext alone or on webtext + human-level AI research chains-of-thought.
If that is true (i.e., if such a GPT model did not learn how to use the generated tokens to support (N × #layers)-step deception), then it would be limited to #layers-step deception. According to Assumption 1, this is not sufficient to actually pull off advanced deception. So unless for some reason it deploys its human-level AI research chain-of-thought process towards discovering[4] effective constant-time “hidden” deception mechanisms, and unless it additionally uses some method to incorporate[5] those discovered mechanisms back into its weights, I don’t see a viable path for this GPT model to end up deceptive based on Assumptions 1-3.
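As a back-of-the-envelope illustration of where those step counts come from (a toy framing of my own, not a precise model of transformer computation): one forward pass gives on the order of #layers sequential steps, and each visible chain-of-thought token buys another forward pass whose input can depend on everything written so far.

```python
def serial_steps_without_cot(n_layers: int) -> int:
    """Single forward pass: roughly n_layers sequential steps of computation."""
    return n_layers


def serial_steps_with_cot(n_layers: int, n_generated_tokens: int) -> int:
    """Each generated token triggers another forward pass that can read all
    previously generated tokens, so the serial budget grows to ~N * n_layers."""
    return n_generated_tokens * n_layers


# e.g. a 96-layer model writing a 500-token chain of thought:
print(serial_steps_without_cot(96))      # 96
print(serial_steps_with_cot(96, 500))    # 48000
```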
EDIT: I’m realizing there’s a lingering question along the lines of “What if the model learns to do steganography without CoT?”. I don’t expect that to happen either. The reason is that for training a GPT model, we teacher-force the token targets, so the inputs to the model at the next position are screened off from the influence of the outputs from the model at the current position. So there’s no way for the model to “benefit” from steganography during training. Plus, passing information that way creates a massive internal coordination problem: the “decoder” circuits need to change in lockstep with the “encoder” circuits, whereas for non-steganographic communication between circuits, this happens automatically through the unembedding+embedding matrices.
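To make the teacher-forcing point concrete, here is a generic sketch of a next-token training step (not any particular codebase; the model interface and names are placeholders). Every input position is filled from the ground-truth text, never from the model’s own sampled outputs, so anything the model “encodes” into its predictions at one position cannot reach its inputs at the next position during training.

```python
import torch.nn.functional as F


def training_step(model, token_ids, optimizer):
    """One teacher-forced next-token training step (generic sketch).

    token_ids: LongTensor of shape (batch, seq_len), holding ground-truth text.
    """
    inputs = token_ids[:, :-1]   # every input token comes from the dataset...
    targets = token_ids[:, 1:]   # ...and so does every target token

    logits = model(inputs)       # (batch, seq_len - 1, vocab_size)

    # The model's outputs are only compared against the targets; they are never
    # fed back in as inputs. Whatever the model "writes" at position t is
    # screened off from what it reads at position t+1, so an encoded message in
    # its own predictions can't pay off during training.
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```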
[1] Pretend that we know the model has requisite background knowledge, was trained on data containing lots of explicit multistep reasoning, and has a bunch of spare capacity.
[2] Doing hidden multi-step cognition that is not “in plain sight” via the outputted tokens.
[3] Similarly, I wouldn’t be surprised if you could fine-tune a GPT model that was pretrained on webtext to decode ROT13 strings using chain-of-thought, but I would be very surprised if it picked up the behavior of deploying CoT for decoding ROT13 strings based on the webtext pretraining alone.
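For concreteness, the kind of fine-tuning example I have in mind might look roughly like this (the format and helper are hypothetical, just to illustrate “decoding ROT13 with a visible chain of thought”):

```python
import codecs


def make_rot13_cot_example(plaintext: str) -> str:
    """Build one hypothetical chain-of-thought training example for ROT13
    decoding (illustrative format, not a real dataset)."""
    ciphertext = codecs.encode(plaintext, "rot_13")
    steps = [f"{c} -> shift by 13 -> {codecs.decode(c, 'rot_13')}"
             for c in ciphertext if c.isalpha()]
    return (f"Ciphertext: {ciphertext}\n"
            "Decoding letter by letter:\n"
            + "\n".join(steps)
            + f"\nAnswer: {plaintext}")


print(make_rot13_cot_example("hello"))
```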
[4] Why is it looking for these, in the first place? What was the cognitive stacktrace that produced this behavior from the circuits that were entrained by the training task?
[5] How would this happen? A GPT model can generate the string “I should set my output logit bias parameters to −2”, but that doesn’t thereby actually set its own output logit bias parameters to −2, which is what it’d need to do to translate a discovered constant-time “hidden” deception mechanism into its weights. I don’t know of any existing or proposed setups where this is supported.
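To spell out that last point: a logit bias is a knob in the sampling harness outside the model, so text the model generates has no channel through which to change it. A minimal sketch, with parameter names of my own choosing rather than any particular API:

```python
import torch


def sample_next_token(model, context_ids, logit_bias=None, temperature=1.0):
    """Generic sampling-harness sketch: `logit_bias` is set by whoever runs the
    model, not by anything the model writes into its own context."""
    logits = model(context_ids)[:, -1, :]          # next-token logits
    if logit_bias is not None:
        for token_id, bias in logit_bias.items():  # e.g. {50256: -2.0}
            logits[:, token_id] += bias
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1)


# The model can emit the *string* "I should set my output logit bias to -2",
# but that just adds more tokens to context_ids; the bias only changes if the
# external caller passes a different `logit_bias` dict.
```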