I’m reminded of a Sanskrit verse, “Vidyā dadāti vinayam, vinayād yāti pātratām”: knowledge bestows humility, and humility in turn confers worthiness. Applied to AI, intelligence alone doesn’t ensure alignment, just as humans aren’t automatically prosocial. What matters are the high-level principles we embed to guide behaviour toward repairable, cooperative, and trustworthy interactions, the kind we see in long-term relationships built on shared values.
The architecture-level challenge of making AI reliably follow such principles is hard, yes, especially under extreme power asymmetry, but agreeing on relational alignment is a necessary first step. Master/servant models may seem safe, but I believe carefully engineered relational principles offer a more robust and sustainable path.
I mean, master/servant is a relation. I think if you managed to enforce it rigorously, the biggest risk from it would be humans “playing themselves”—just as we’ve done until now, only with far greater power. For example, falling into wireheading in pursuit of enjoyment, and so on.
I believe carefully engineered relational principles offer a more robust and sustainable path
Can you sketch a broad example of what such a thing would look like? How does it differ, for example, from the classic image of a Friendly AI (FAI)?
As far as I understand “aligning the AI to an instinct” and “carefully engineered relational principles”, the latter might look like “have the AI solve problems that humans actually cannot solve by themselves AND teach the humans how to solve them, so that each human taught increases the set of problems they can solve on their own”. A Friendly AI in the broader sense is simply meant to solve humanity’s problems (e.g. establish a post-work future, which my proposal doesn’t aim for).
As for aligning the AI to an instinct, instincts are known to be easily hackable. However, I think that the right instincts can shift the AIs’ worldview in the necessary direction (e.g. my proposal of training the AI to help weaker AIs could generalize to helping humans as well) or make the AIs worse at hiding misalignment, whether their own or that of their creations.
For example, if the AIs are trained to be harsh and honest critics,[1] then in the AI-2027 forecast Agent-3 might have pointed out that, say, a lack of substantial oversight would let instrumental convergence sneak adversarial misalignment in. Or that the Agent-3 copies don’t understand how the AIs are supposed to be aligned: to serve humans, not to help humans become more self-reliant as described above.
[1] Which was explicitly done by the Kimi K2 team.
I don’t mean this as a technical solution, more a direction to start thinking in.
Imagine a human tells an AI, “I value honesty above convenience.” A relational AI could store this as a core value, consult it when short-term preferences tempt it to mislead, and, if it fails, detect, acknowledge, and repair the violation in a verifiable way. Over time it updates its prioritisation rules and adapts to clarified guidance, preserving trust and alignment, unlike an FAI that maximises a static utility function.
This approach is dynamic, process-oriented, and repairable, so commitments can endure mistakes and evolving contexts. It’s a sketch, not a finished design, and would need iterative development and formalization.
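To make that loop a bit more concrete, here is a minimal toy sketch in Python of the store / consult / repair cycle. The names (RelationalAgent, Commitment) and the numeric priorities are purely illustrative assumptions on my part, not an existing system or a real training setup:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Commitment:
    """A value the human has explicitly stated, e.g. 'honesty above convenience'."""
    principle: str
    priority: float = 1.0  # how strongly it outranks short-term convenience

@dataclass
class RelationalAgent:
    commitments: List[Commitment] = field(default_factory=list)
    repair_log: List[str] = field(default_factory=list)

    def adopt(self, principle: str, priority: float = 1.0) -> None:
        # Store the stated value as a standing commitment.
        self.commitments.append(Commitment(principle, priority))

    def find(self, keyword: str) -> Optional[Commitment]:
        # Consult stored commitments relevant to the current choice.
        return next((c for c in self.commitments if keyword in c.principle), None)

    def choose(self, honest_reply: str, convenient_reply: str) -> str:
        # Prefer honesty whenever the stored commitment outranks convenience.
        honesty = self.find("honesty")
        if honesty is not None and honesty.priority >= 1.0:
            return honest_reply
        return convenient_reply

    def repair(self, violation: str, clarified_priority: float) -> None:
        # Detect and acknowledge a violation, then update prioritisation rules
        # in a way that stays visible (auditable) to the human.
        self.repair_log.append(f"acknowledged: {violation}")
        honesty = self.find("honesty")
        if honesty is not None:
            honesty.priority = clarified_priority

agent = RelationalAgent()
agent.adopt("honesty above convenience")
print(agent.choose("The report is late.", "Everything is on track."))  # honest reply wins
agent.repair("softened a status update to avoid friction", clarified_priority=2.0)
print(agent.repair_log)
```

The point is not the trivial logic but the shape of the loop: the stated value persists as state, it is consulted at decision time, and violations leave an auditable trace that updates future prioritisation rather than being absorbed into a one-off reward signal.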
While simple, does this broadly capture the kind of thing you were asking about? I’d be happy to chat further sometime if you’re interested.