B doesn’t really do any agreeing or disagreeing; that’s still up to Empathizer-001. B should really just concern itself with providing a non-manipulated plan of actions given the work A did. In that sense you’re right that it probably only works with CoT-based AIs, or more generally, with any AI whose output showcases A’s reasoning.
> Then A is trained to manipulate B instead of explaining its ideas in detail
By this I assume you mean manipulating, for example, the CoT so that when B inspects it, B gives a favorable summary of it to Empathizer-001. The problem I see with this is that since we normalize, there’s no actual positive expected advantage from bypassing Empathizer-001; that is, there’s no reason to manipulate B in the first place. And if we take a random-walk argument, it seems very unrealistic that obfuscating its own CoT, which A needs to use after bypassing Empathizer-001 to get its actual reward, won’t reduce its performance. It’s conceivable that there is a minimum where A directly interprets its own obfuscated CoT in the correct light, but it’s not realistic imo that there is a low-loss path towards it, even more so because we are dealing with something non-latent but categorical (tokens).
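To make the normalization point concrete, here’s a minimal sketch (my own illustration, assuming standard mean/std advantage normalization; the function and variable names are hypothetical): if fooling B inflates the reward of every rollout roughly uniformly, the bonus cancels out after normalization, so A gets no gradient pressure toward manipulation.

```python
import numpy as np

rng = np.random.default_rng(0)

def normalized_advantages(rewards):
    # Subtract the batch mean and divide by the batch std,
    # as in common policy-gradient setups.
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Hypothetical rewards for a batch of honest-CoT rollouts.
honest = rng.normal(loc=1.0, scale=0.5, size=8)

# Same rollouts, but B is fooled into a uniformly more favorable summary.
manipulated = honest + 0.7

# The two advantage vectors are identical: the uniform bonus vanishes,
# so there is no expected advantage to bypassing Empathizer-001.
print(np.allclose(normalized_advantages(honest),
                  normalized_advantages(manipulated)))  # True
```

Of course, if manipulation inflated some rollouts more than others, it could still shift relative advantages; the claim is only that a flat bypass bonus buys nothing.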
Let me know if I misunderstood you or if my reasoning has become a bit clearer; I’ll definitely check out your post later.
If I understand correctly, this is literally Kokotajlo’s proposal. Alas, this is hard to generalize to neuralese AIs...