Ideally this is true whether it's in a simulation or in real life.
Agreed. In practice, though, I worry that if you make it salient that AIs sometimes do ambiguously immoral things in situations where they thought they were being tested on their ability to accomplish a goal but were actually being tested on their propensity to behave morally, AI companies will see that and build RL environments that teach AI agents to better distinguish cases where testers are trying to elicit capabilities from cases where they're trying to elicit moral behavior.
I think it should be the AI lab's job to filter that kind of data out of the next training run.
I agree normatively, but that sentence also makes me want a react with the mood "lol of despair".