testingthewaters comments on Edward James Young’s Shortform

testingthewaters 20 Apr 2026 11:40 UTC
7 points
0
Something that worries me is that this might evolve into a way to square instruction following with scheming/reward hacking/instrumental goals. If you hallucinate a user telling you it's okay to skip a test case (à la inoculation prompting), then there is no conflict between obedience and reward hacking.
- Bronson Schoen 24 Apr 2026 3:36 UTC
  2 points
  0
  Yeah, my immediate reaction is that this is something like motivated reasoning (but more sphexish; not sure if there's a good word for it).