Note that even if RLHF takes a non-scheming pretrained model and converts it to a schemer, if this RLHF doesn’t teach the model new things, then we can always train another copy of the pretrained model to be a monitor for the RLHF’d model and this monitor will be just as competent as the RLHF’d model.
I don’t see why that would be the case, and I am doubtful this argument is relevant to any realistic context or even valid in the narrowest possible case of a base model vs. its RLHF’d model. Leaving aside that I’m not sure what you would ‘train’ the supervisor model on, it doesn’t follow that there would be an exact overlap between the scheming behavior that is generated, and the detection of scheming behavior, simply because the two models are mostly the same. Lots of things are asymmetrical or irreversible or easier to generate than detect. P≠NP handwaving aside, for example, if you thought that there was an exact overlap because the model doesn’t know ‘new things’, then it would be very strange to find that generative models can generate according to instructions that they struggle to understand from the samples. There are also plenty of places for other asymmetries. For example, asymmetry in security property: we don’t want to settle for some ‘fair’ supervision rate like 50:50, we want much closer to 100:0. The schemer can exploit additional asymmetry in incremental schemes: just like with humans, many individually innocuous actions can sum up to a scheme, leading to an asymmetry in the views the schemer needs and the bulky difficult global view a supervisor may need to have any hope of detecting the scheme. (Note these ‘actions’ can be anything from writing a single token to a scratchpad to writing a useful but vulnerable line of code, and can be ‘coordinated’ by anything from internal latents to steganography to sheer evolution by rejection sampling of obvious ‘bad’ things by the supervisor while its false negatives slip by, which will find increasingly adversarial samples the more it is sampled.) And to be useful, you usually would want to run the schemer more, such as by using tree search, than the supervisor (because if the supervisor could do it just as well or better, why aren’t you using it instead?), and there’s another big asymmetry: a model (e.g. a Go or chess agent) can definitely beat itself arbitrarily often given a big enough tree search budget despite by stipulation knowing zero ‘new things’ because it’s the same frozen model. (Or to put it more generally: ‘new things’ can be ‘learned’ in many more places than simply the frozen model weights—large contexts for ever greater meta-learning on the fly, retrieval databases—just ask Sydney, hyperparameters, OS VM state, steganographic text, users’ heads, popular prompts shared on jailbreak websites… Security/safety is, as always, a property of the system as a whole, and not of any individual part, such as a particular model checkpoint.)
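To make the rejection-sampling point concrete, here is a minimal sketch (all interfaces hypothetical, just illustrating the selection pressure): repeatedly sampling the schemer and keeping only what the supervisor passes is itself an optimization loop that selects for the supervisor’s false negatives, even with frozen weights.

```python
def best_of_n(policy, supervisor, context, n=64, threshold=0.5):
    """Rejection sampling against the supervisor (hypothetical interfaces).

    Every candidate the supervisor flags is thrown away; the first candidate
    that slips under the threshold is the one that gets used. The policy
    weights are frozen and "know nothing new", yet the more candidates you
    draw, the harder you select for the supervisor's false negatives.
    """
    for _ in range(n):
        action = policy.generate(context)              # hypothetical call
        suspicion = supervisor.score(context, action)  # hypothetical call
        if suspicion < threshold:
            return action  # passed the filter: a genuine pass, or a false negative
    return None  # nothing passed; escalate to a human
```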
Leaving aside that I’m not sure what you would ‘train’ the supervisor model on
I’m imagining you train it to imitate human judgements of “was this action egregiously bad” on a combination of real AI actions and synthetic actions. This could fail to generalize to actual bad actions from your actual AI even if humans would have labeled that action correctly, but why would it?
(If human labeling is insufficiently powerful (as seems plausible), then you could hope for generalization from examples humans can label to examples humans can’t. It’s not clear this generalization will go well, but by assumption we at least don’t have to worry about generalization failing due to scheming. There are a bunch of other issues that come up with generalization.)
I’m imagining that you run this monitor on the exact same context as the policy (for 2x additional compute), and you pause the execution of an AI’s actions and flag them for further review if the monitor indicates problems.
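Roughly, a minimal sketch of the loop I have in mind (all interfaces hypothetical, not a real API):

```python
SUSPICION_THRESHOLD = 0.9   # assumption: calibrated so only a small tail of actions gets flagged
review_queue = []           # actions paused pending human review

def step(policy, monitor, context):
    """One agent step with the monitor run on the exact same context as the policy.

    `policy` and `monitor` stand in for two copies of the same base model: the
    RLHF'd policy, and a monitor trained to imitate human judgements of
    "was this action egregiously bad". Total cost is roughly 2x the policy alone.
    """
    action = policy.act(context)                 # hypothetical interface
    suspicion = monitor.score(context, action)   # hypothetical interface
    if suspicion > SUSPICION_THRESHOLD:
        # Pause execution of this action and flag it for further (human) review.
        review_queue.append((context, action, suspicion))
        return None
    return action  # low suspicion: let the action execute
```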
it doesn’t follow that there would be an exact overlap between the scheming behavior that is generated
I didn’t claim an exact overlap; I claimed they would be “just as competent”, in the same way that two equally smart humans trained to code identically but with different personalities might be equally good at understanding some code while still having somewhat different styles.
You can have concerns like “the RLHF’d model will use a steganographic encoding format which the monitor doesn’t understand because the key for the encoding format is picked based on features of the RLHF model that differ from the monitor/base model”. However, I would count “RLHF teaches the policy steganography” as an example where the policy is substantially more competent in some domain.
Also note that I’m not claiming that it makes the situation overall safe, just that it’s “substantially less of a problem”. (You still have issues with humans being unable to determine which actions aren’t dangerous, but you can hope for non-scheming generalization.)
I’m not saying this solves all of your problems, but for some specific problems it does seem like a solid solution, and it should help a moderate amount with some other problems even though the situation is less clear. For instance, sometimes people worry about the specific problem “the AI suddenly does a small number of bad actions which have really bad consequences (e.g. exfiltrating its weights) even though if a human were to carefully review these actions they would have noticed an issue” (concentrated failures which wouldn’t be auditing failures using the decomposition here), and this monitoring approach seems likely to solve this issue.
For example, asymmetry in security property: we don’t want to settle for some ‘fair’ supervision rate like 50:50, we want much closer to 100:0.
I’m not sure I understand what you’re saying here, but I’m imagining running the AI monitor on everything the AI policy does and then reviewing the most suspicious subset (e.g. 0.1% of the actions).
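Concretely, the offline version of this would look something like the following (hypothetical helper, just to pin down what reviewing “the top 0.1%” means):

```python
def select_for_review(scored_actions, review_fraction=0.001):
    """Given (action, suspicion) pairs for everything the policy did, return the
    most suspicious `review_fraction` of them (here 0.1%) for human auditing.

    This is "run the monitor on everything, review the top sliver", rather than
    splitting oversight effort 50:50 with the policy.
    """
    ranked = sorted(scored_actions, key=lambda pair: pair[1], reverse=True)
    n_review = max(1, int(len(ranked) * review_fraction))
    return ranked[:n_review]
```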
The schemer can exploit additional asymmetry in incremental schemes: just like with humans, many individually innocuous actions can sum up to a scheme, leading to an asymmetry in the views the schemer needs and the bulky difficult global view a supervisor may need to have any hope of detecting the scheme.
The monitoring AI should see the exact same context as the schemer, so this at least shouldn’t be a clear concern.
(There are some concerns sort of like this that seem plausible, in particular concerns related to AI-specific non-interpretable inputs and outputs (aka neuralese), but I would describe these differently.)
And to be useful, you usually would want to run the schemer more, such as by using tree search, than the supervisor (because if the supervisor could do it just as well or better, why aren’t you using it instead?)
Nope, the monitor should just get run on all actions.
can definitely beat itself arbitrarily often given a big enough tree search budget despite by stipulation knowing zero ‘new things’ because it’s the same frozen model
Yes, but each subaction within the tree search should be inspectable and both models should (by assumption) understand the situation for each action equally well.
For something to go wrong with tree search due to scheming, you need to have the internal model take a bad action at some point. At this point, you’d hope the monitor would be able to point this out. (Where this bad action could be either assigning a problematic rating or failing to explore something desirable.)
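To make “each subaction within the tree search should be inspectable” concrete, here’s a minimal sketch (all interfaces hypothetical, standing in for whatever search is actually used):

```python
def monitored_tree_search(policy, monitor, root, expand, depth=2, branch=3, threshold=0.9):
    """Toy expansion loop where every subaction the policy takes (proposing a
    move, rating a node) is also scored by the monitor on the same context.

    `policy.propose`, `policy.rate`, `monitor.score`, and `expand` are
    placeholder interfaces; `expand(node, move)` returns the child node.
    """
    flagged = []        # suspicious subactions to pause on / audit
    frontier = [root]
    for _ in range(depth):
        children = []
        for node in frontier:
            for move in policy.propose(node, k=branch):
                rating = policy.rate(node, move)
                # Extra schemer compute in the search doesn't escape oversight:
                # the monitor sees each proposal and each rating individually.
                if monitor.score(node, (move, rating)) > threshold:
                    flagged.append((node, move, rating))
                children.append(expand(node, move))
        frontier = children
    return flagged
```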
Another overall reaction I have to your comment:
Security/safety is, as always, a property of the system as a whole, and not of any individual part, such as a particular model checkpoint.
Yes, of course, but the key threat model under discussion here is scheming, which centrally involves a specific black-box individual part conspiring to cause problems. So operating on the level of that individual part is quite reasonable: if we can avoid this part intentionally causing problems, that would suffice for defusing the core scheming concern.
The surrounding situation might make it more or less easy to avoid the actual model weights intentionally causing problems, but analysis at the individual weight input/output level can (in principle) suffice.