Another strong upvote for a great sequence. Social-instinct AGIs seem to me a very promising and very much overlooked approach to AGI safety. There seem to be many “tricks” that are “used by the genome” to build social instincts from ground values, and reverse engineering these tricks seems particularly valuable for us. I am eagerly waiting to read the next posts.
In a previous post I shared a success model that relies on your idea of reverse engineering the steering subsystem to build agents with motivations compatible with a safe Oracle design, including the class of reversely aligned motivations. What is your opinion on these motivations? Do you think the set of “social instincts” we would want to incorporate into an AGI changes much depending on whether we are optimizing for reverse or direct intent alignment?