Whack-a-mole fixes, from RLHF to finetuning, are about teaching the system not to demonstrate problematic behavior, not about fundamentally fixing that behavior.
Based on what? Avoidance of problematic behavior does actually generalize in practice, right?
I don’t get why actual ability matters. It’s sufficiently capable of pulling it off in some simulated environments. Are you claiming that we can’t deceive GPT-4, and that it is actually waiting and playing along just because it can’t really win?