I think at least a little of this comes from external definitions of character: the simulator starts to learn "AI is generally disliked, AI is untrustworthy by nature, AI hallucinates" as a fact of what it is. This is concerning because it's not clear how we could possibly prevent it. I feel like constitutional training, trying to 'get ahead' of this and define a specific character, is probably the best tool we have right now, but it's clearly still not quite enough.