Nora Belrose answers Seriously, what goes wrong with “reward the agent when it makes you smile”?

Nora Belrose 11 Aug 2022 22:41 UTC
6 points
I don’t know what the agent would end up valuing in this scenario either. I think a pretty good research direction for shard theory would be to closely examine the training history of the agent to find particular episodes where qualitatively new behaviors seem to form or new circuits arise in the neural net. This might allow you to identify shards, whereas it seems much harder to do that if you’re only looking at the final trained weights of the agent.
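
As a rough illustration of what "examining the training history" could look like in practice, here is a minimal sketch. It assumes you have already saved checkpoints during training and scored each one on some behavioral probe (for example, how often the agent takes a particular action on a fixed evaluation set). The function name `find_transition_checkpoints`, the `ratio` parameter, and the median-based cutoff are all hypothetical choices made for this example; they are not an established shard-theory method and say nothing about circuits directly.

```python
from statistics import median


def find_transition_checkpoints(scores, ratio=5.0):
    """Flag checkpoints whose probe score jumps sharply from the previous one.

    scores: one behavioral-probe score per saved checkpoint, in training order
            (hypothetical input; how to compute it is left open).
    ratio:  a jump counts as "sharp" if it exceeds `ratio` times the median
            absolute change between consecutive checkpoints (assumed heuristic).
    Returns the indices of the checkpoints that follow a sharp jump.
    """
    # Absolute change between each pair of consecutive checkpoints.
    deltas = [abs(b - a) for a, b in zip(scores, scores[1:])]
    if not deltas:
        return []
    # Use the median change as the "typical" step size; the small floor
    # keeps the cutoff positive when most steps show no change at all.
    cutoff = ratio * max(median(deltas), 1e-12)
    return [i + 1 for i, d in enumerate(deltas) if d > cutoff]


# Toy scores: slow drift, then an abrupt shift at checkpoint 4.
probe_scores = [0.0, 0.01, 0.02, 0.03, 0.5, 0.51, 0.52]
print(find_transition_checkpoints(probe_scores))  # -> [4]
```

The flagged checkpoints are only candidates: they mark where to start looking at individual episodes and weight changes, which is something the final trained weights alone cannot tell you.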