Training against cognitive oversight’s signals is likely an instance of the Most Forbidden Technique. A safer strategy resembles OpenBrain’s approach in AI-2027’s Slowdown Ending, where models are rolled back if misalignment is discovered, except that the alignment team uses interpretability instead of the CoT.
What you described is an example of training based on cognitive oversight, in my view. When I said ‘there are definitely ways to do this which lead to less direct pressure for cognitive obfuscation than others’, I was thinking of things like this. There’s a spectrum from ‘just make the rate at which the evilness detector fires a term in the loss function’, to ‘hill climb on your evilness detector (but don’t use it during training)’, to ‘when the evilness detector fires, try to understand mechanistically what went wrong and then modify training to fix the problem’, all the way to ‘when the evilness detector fires, demand a global moratorium, burn all the GPUs, and eventually try a new paradigm years later (repeat until you find a paradigm whose version of the evilness detector doesn’t fire)’. In each of these there is some pressure for cognitive obfuscation, but each approach carries less pressure than the earlier ones, in my opinion (and it’s ultimately a race between optimisation pressure for alignment and optimisation pressure for obfuscation). I don’t see the point of building a cognitive oversight technique if you aren’t going to take actions on the basis of it!
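To make the two ends of that spectrum concrete, here is a minimal toy sketch. Everything in it is a hypothetical stand-in: `task_loss`, `evilness_detector`, and the numeric “model” are placeholders for a real objective and a real cognitive-oversight probe, not anyone’s actual setup. The first function folds the detector into the loss; the second only consults it after training.

```python
import random

def task_loss(params):
    # Stand-in for the ordinary training objective (pulls params towards 1.0).
    return sum((p - 1.0) ** 2 for p in params)

def evilness_detector(params):
    # Stand-in for a cognitive-oversight probe: higher score = more suspicious.
    return abs(sum(params)) / len(params)

# End of the spectrum closest to the Most Forbidden Technique: the detector's
# output is a term in the loss, so optimisation pushes directly against the
# detector itself, rewarding obfuscation as readily as genuine alignment.
def loss_with_detector_term(params, lam=0.5):
    return task_loss(params) + lam * evilness_detector(params)

# Safer end: the detector never enters training; it is only read afterwards,
# to decide whether to keep the model, roll it back, or investigate further.
def post_training_check(params, threshold=0.5):
    if evilness_detector(params) > threshold:
        return "alarm: roll back and investigate mechanistically"
    return "no alarm (for now)"

if __name__ == "__main__":
    params = [random.uniform(-2.0, 2.0) for _ in range(4)]
    print(loss_with_detector_term(params))  # detector sits inside the objective
    print(post_training_check(params))      # detector used only as a read-out
```

The intermediate points on the spectrum sit between these two: the detector still influences what the developers do, but the further right you go, the less directly its output is optimised against.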
My actual view on the right amount of optimisation against cognitive oversight is a complicated mess with lots of parts: something like, have a bunch of techniques, train on ~none of them (unless you have loads and they’re not super correlated), hill climb on some of them, and reserve the highest-effort, highest-quality ones for sounding the alarm if we aren’t succeeding at aligning the models.
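As a toy illustration of that split (every technique name below is a hypothetical placeholder, not a real tool), the portfolio might be organised roughly like this, with only the weakest tier ever allowed anywhere near the loss function:

```python
# Toy tiering of oversight techniques by role; all names are hypothetical placeholders.
OVERSIGHT_PORTFOLIO = {
    # Trained against only if there are loads of signals and they are not
    # strongly correlated; otherwise this tier stays empty.
    "train_on": [],
    # Used to compare runs and interventions ("hill climbing"), but never
    # included in the training loss itself.
    "hill_climb_on": ["cheap_linear_probe", "cot_monitor"],
    # Highest-effort, highest-quality techniques, reserved purely for
    # sounding the alarm if alignment is not working.
    "alarm_only": ["deep_mechanistic_audit"],
}

def allowed_in_loss(technique: str) -> bool:
    """Only signals explicitly placed in the 'train_on' tier may enter training."""
    return technique in OVERSIGHT_PORTFOLIO["train_on"]

print(allowed_in_loss("cot_monitor"))  # False: hill-climb signals stay out of the loss
```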