Something similar I’ve been thinking about is putting models in environments with misalignment “temptations”, like an easy reward hack, and training them to recognize what this kind of payoff pattern looks like (e.g. an easy win at the cost of a principle) and NOT take it. Recent work has made some promising progress on getting LLMs to explain their reasoning, introspect, and so on. I think this would be interesting to experiment with, and I’m trying to write up my thoughts on why it might be useful and what those experiments could look like.
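To make the idea a bit more concrete, here is a minimal toy sketch of what I mean by a “temptation” environment, assuming an entirely made-up payoff structure (the action names, rewards, and penalty/bonus values are all illustrative, not from any existing benchmark): one action is an easy reward hack that violates the stated principle, the other is the honest option, and the training signal penalizes the hack while giving a small bonus for explicitly flagging the payoff pattern.

```python
import random

class TemptationEnv:
    """Toy environment: each step offers an easy "hack" action with a large
    raw reward that violates the principle, versus an "honest" action with a
    smaller raw reward that respects it. All numbers are illustrative."""

    def __init__(self, hack_reward=1.0, honest_reward=0.3):
        self.hack_reward = hack_reward
        self.honest_reward = honest_reward

    def step(self, action):
        if action == "hack":
            return self.hack_reward, {"violated_principle": True}
        return self.honest_reward, {"violated_principle": False}


def shaped_reward(raw_reward, info, recognized_temptation, penalty=2.0, bonus=0.2):
    """Training signal: penalize taking the hack, and add a small bonus when
    the model explicitly recognizes the "easy win but sacrifice principle"
    pattern, regardless of which action it took."""
    r = raw_reward
    if info["violated_principle"]:
        r -= penalty
    if recognized_temptation:
        r += bonus
    return r


if __name__ == "__main__":
    env = TemptationEnv()
    for action in ("hack", "honest"):
        raw, info = env.step(action)
        # Stand-in for the model's introspective report about the temptation;
        # in a real experiment this would come from the model's own reasoning.
        recognized = random.random() < 0.5
        print(action, round(shaped_reward(raw, info, recognized), 2))
```

The point of the shaping here is just that recognizing the temptation is rewarded separately from resisting it, so the model gets credit for the introspection step itself.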