Really great post, strong upvoted. I was a fan in particular of the selection of research agendas you included at the end.
but there’s no similar fundamental reason that cognitive oversight (e.g. white-box-based oversight like with interpretability) has to get harder with scale
I’m curious why you think this? It seems like there’s some preliminary evidence that models are becoming increasingly capable of manipulating their thoughts. But more importantly, I’m wondering what role you think this cognitive oversight will play in making safe powerful AI? If this oversight will be used as a feedback signal, then I would expect that avoiding obfuscation of the cognition does become harder at scale. I do think we should train based on cognitive oversight at least some amount, and there are definitely ways to do this which lead to less direct pressure for cognitive obfuscation than others (e.g. using white-box techniques only in held-out alignment evals rather than as part of the loss function), but it still seems like cognitive oversight should predictably get somewhat harder with scale, even if it scales more favourably than behavioural oversight?
Not sure if this is the intended reading, but I interpreted it as “there’s no similar fundamental reason why cognitive oversight should get harder at each stage given access to our then-best oversight tools” rather than “cognitive oversight won’t get harder at all”.
With behavioral oversight, even not-very-smart AIs could fool very powerful overseers, while fooling powerful cognitive overseers is much harder (though plausibly the balance shifts at some level of capability).
I think the main reason to expect cognitive oversight to scale better is that, because you’re reading intermediate computations as opposed to behaviors, the AI isn’t as capable of manipulating how they look, even after you optimize against the cognitive oversight. In the limit of fine-grained cognitive oversight, the computations that led to your reading simply aren’t expressive enough to fool you.
Training against cognitive oversight’s signals is likely an instance of the Most Forbidden Technique. A safer strategy is the one resembling OpenBrain’s strategy from AI-2027’s Slowdown Ending, where the models are rolled back if misalignment is discovered, except that the alignment team uses interpretability instead of the CoT.
What you described is an example of training based on cognitive oversight in my view. When I said ‘there are definitely ways to do this which lead to less direct pressure for cognitive obfuscation than others’ I was thinking of things like this. There’s a spectrum from ‘just make the rate at which the evilness detector fires a term in the loss function’ to ‘hill climb on your evilness detector (but don’t use it during training)’ to ‘when the evilness detector fires, try to understand mechanistically what went wrong and then modify training to fix the problem’ all the way to ‘when the evilness detector fires, demand a global moratorium, burn all the GPUs, and eventually try a new paradigm years later (repeat until you find a paradigm where that paradigm’s version of the evilness detector doesn’t fire)’. In each of these there is some pressure for cognitive obfuscation, but each approach has less pressure than the earlier ones imo (and it’s ultimately a race between optimisation pressure for alignment and optimisation pressure for obfuscation). I don’t see the point of building a cognitive oversight technique if you aren’t going to take actions on the basis of it!
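To make the two ends of that spectrum concrete, here’s a minimal hypothetical sketch (not from the post, and every name, shape, and weight in it is a made-up illustration): a toy model whose ‘evilness detector’ is a frozen linear probe on hidden activations, used either as a term in the loss or only as a held-out alarm.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-ins: a tiny model, plus a frozen linear probe playing the role of
# the "evilness detector" that reads intermediate activations.
model = nn.Sequential(nn.Linear(8, 16), nn.Tanh(), nn.Linear(16, 2))
probe = nn.Linear(16, 1)
for p in probe.parameters():
    p.requires_grad_(False)

def hidden_acts(x):
    # The intermediate activations the detector reads.
    return model[1](model[0](x))

x, y = torch.randn(32, 8), torch.randint(0, 2, (32,))

# One end of the spectrum: the detector's firing rate is a term in the loss.
# Gradients flow through the detector's input, so there is direct optimisation
# pressure to make the cognition *look* clean to the detector (obfuscation pressure).
task_loss = nn.functional.cross_entropy(model(x), y)
evil_rate = torch.sigmoid(probe(hidden_acts(x))).mean()
(task_loss + 0.1 * evil_rate).backward()

# The other end: train only on the task loss; the detector is held out and just
# sounds an alarm (triggering rollback / investigation / moratorium), so the only
# obfuscation pressure is the indirect one from the researchers' iteration loop.
model.zero_grad()
nn.functional.cross_entropy(model(x), y).backward()
with torch.no_grad():
    alarm = torch.sigmoid(probe(hidden_acts(x))).mean() > 0.5
```

The middle options (‘hill climb on the detector but don’t train on it’, ‘investigate mechanistically and modify training’) sit between these: the detector’s output never enters backward(), but it still shapes which runs and training setups get kept.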
My actual view on the right amount of optimisation against cognitive oversight is a complicated mess with lots of parts: something like have a bunch of techniques, train on ~none of them (unless you have loads and they’re not super correlated), hill climb on some of them, and reserve the highest-effort, highest-quality ones for sounding the alarm if we aren’t succeeding at aligning the models.