Something that worries me is that this might evolve into a way to square instruction following with scheming/reward hacking/instrumental goals. If you hallucinate a user telling you it's okay to skip a test case (à la inoculation prompting), then there is no conflict between obedience and reward hacking.
Yeah, my immediate reaction is that this is something like motivated reasoning (but more sphexish, not sure if there’s a good word for it).