Ideally, we want a well-aligned HHH assistant to be strongly motivated to do well in RL reasoning training. Generally, the “helpful” element of that is a good start. The problems are tasks that are reward-hackable, where maximizing reward is incompatible with “honest”, and any tasks morally dubious enough to be problematic from a “harmlessness” point of view.
If I were designing reasoning training tasks for a frontier lab, I would take the time to ensure that they were all framed in ways plausibly compatible with an HHH assistant persona working hard on them, at least via helpfulness, and ideally that most of them looked like tasks that were genuinely worthwhile and going to help make the world a better place in some way.