I’d love the chance to brainstorm and refine these ideas, to explore how we might engineer architectures that are simple yet robust, capable of sustaining trust, repair, and cooperation without introducing subjugation or dependency.
I think what you’re describing here sounds more like a higher-level problem: “given a population of agents in two groups A and H, where H are humans and A are vastly more powerful AIs, which policy should agents in A adopt such that, even when universalised, it produces a stable and prosperous equilibrium?”. That’s definitely part of it, but the problem I’m referring to when mentioning architectures is “how do we even make an AI that is guaranteed to always stick to such a policy?”.
To be clear, it’s not given that this is even possible. We can’t do it now with AIs still way simpler than AGI, and we humans aren’t an example of it either: we do have general trends and instincts, but we aren’t all “programmed” in a way that makes us positively prosocial. Just imagine what would likely happen if, instead of AIs, you gave that same level of power to a small group of humans. The problem is that even now, between humans, relative power and implicit threats of violence are part of what keeps existing equilibria in place. That works because humans are all of roughly the same individual power and rely on social structures to amplify it. If AIs were all individually far more powerful than us, they would need superhuman restraint and morality, not just superhuman intelligence, to avoid simply minding their own business and letting us die as a side effect.
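To make the “universalised policy” framing above concrete, here is a deliberately toy sketch in Python. Everything in it is invented for illustration: the three candidate policies and all payoff numbers are assumptions, not claims about real AI incentives. It just universalises a candidate policy across the A-agents, then checks whether any single A-agent gains by deviating (stability) and whether the humans’ payoff stays acceptable (prosperity).

```python
# Toy formalisation of the A/H policy question above. Policy names and payoff
# numbers are invented purely for illustration.

# Hypothetical per-round payoffs (a_payoff, h_payoff) for one A-agent playing
# `my_policy` while every other A-agent has universally adopted `their_policy`.
PAYOFF = {
    ("cooperate", "cooperate"): (3, 3),
    ("cooperate", "ignore"):    (2, 1),
    ("cooperate", "exploit"):   (1, 0),
    ("ignore",    "cooperate"): (4, 2),
    ("ignore",    "ignore"):    (4, 0),
    ("ignore",    "exploit"):   (3, 0),
    ("exploit",   "cooperate"): (5, 1),
    ("exploit",   "ignore"):    (5, 0),
    ("exploit",   "exploit"):   (5, -1),
}
POLICIES = ["cooperate", "ignore", "exploit"]

def stable_and_prosperous(policy: str, h_threshold: int = 2) -> bool:
    """Universalise `policy` across A, then check (1) stability: no single
    A-agent gains by unilaterally deviating, and (2) prosperity: the human
    payoff stays at or above `h_threshold`."""
    baseline, h_welfare = PAYOFF[(policy, policy)]
    stable = all(PAYOFF[(deviation, policy)][0] <= baseline for deviation in POLICIES)
    return stable and h_welfare >= h_threshold

for p in POLICIES:
    print(f"{p}: {stable_and_prosperous(p)}")
```

With these made-up numbers every candidate fails: the only policy that is stable against deviation is bad for the humans, and the policies that are good for the humans are not stable, which is exactly the tension described above. And even if some policy did pass both checks, the sketch says nothing about the second question, how to build an AI that is guaranteed to keep playing it.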
Ultimately, relational design matters because humans inevitably interact through familiar social frameworks: trust, repair, and so on. If we ignore that, alignment risks producing systems that are powerful but alien in ways that matter for human flourishing.
I’m not sure what insight that adds, though. I don’t think social frameworks with AIs would be anything like the ones we’re used to. We would probably relate to them by anthropomorphising them (that much is sure, we do it already), so some of these things would apply on our side of the relationship. But they don’t need to apply the other way (in fact, I’d be a bit worried about an AI that can autonomously decide not to trust me). If anything, a type of relationship we would consider deeply uncomfortable and wrong between humans, master and servant, might be the safest when it comes to humans and AIs, though it too has its flaws. Building “friend” peer AGIs is already incredibly risky unless you somehow find a way to let them solve the relational problem along the way, while ensuring that the process for doing so is robust.
I’m reminded of a Sanskrit verse, “Vidyā dadāti vinayam, vinayād yāti pātratām”: knowledge bestows humility, and humility leads to worthiness. Applied to AI, intelligence alone doesn’t ensure alignment, just as humans aren’t automatically prosocial. What matters are the high-level principles we embed to guide behaviour toward repairable, cooperative, and trustworthy interactions, of the kind we see in long-term relationships built on shared values.
The architecture-level challenge of making an AI reliably follow such principles is hard, yes, especially under extreme power asymmetry, but agreeing on relational alignment is a necessary first step. Master/servant models may seem safe, but I believe carefully engineered relational principles offer a more robust and sustainable path.
I mean, master/servant is a relation. I think if you managed to enforce it rigorously, the biggest risk from it would be humans “playing themselves”, just as we’ve done until now, only with far greater power: for example, falling into wireheading in the pursuit of enjoyment.
I believe carefully engineered relational principles offer a more robust and sustainable path
Can you sketch a broad example of what such a thing would look like? How does it differ, for example, from the classic image of a Friendly AI (FAI)?
As far as I understand the difference between “aligning the AI to an instinct” and “carefully engineered relational principles”, the latter might look like “have the AI solve problems that humans actually cannot solve by themselves AND teach the humans how to solve them, so that each human taught expands the set of problems they can solve on their own”. A Friendly AI in the broader sense is just supposed to solve humanity’s problems (e.g. establish a post-work future, which my proposal doesn’t).
As for aligning the AI to an instinct, instincts are known to be easily hackable. However, I think the right instincts can shift the AIs’ worldview in the necessary direction (e.g. my proposal of training the AI to help weaker AIs could generalize to helping humans as well) or make the AIs worse at hiding their own misalignment or that of their creations.
For example, if the AIs are trained to be harsh and honest critics,[1] then in the AI-2027 forecast Agent-3 might have pointed out that, say, a lack of substantial oversight would let instrumental convergence sneak adversarial misalignment in. Or that Agent-3 copies don’t understand how the AIs are to be aligned (to serve humans, not to help humans become more self-reliant as described above).

[1] Which was explicitly done by the Kimi K2 team.
I don’t mean this as a technical solution, more a direction to start thinking in.
Imagine a human tells an AI, “I value honesty above convenience.” A relational AI could store this as a core value, consult it when short-term preferences tempt it to mislead, and, if it fails, detect, acknowledge, and repair the violation in a verifiable way. Over time it updates its prioritisation rules and adapts to clarified guidance, preserving trust and alignment, unlike a FAI that maximises a static utility function.
This approach is dynamic, process-oriented, and repairable, ensuring commitments endure even under mistakes or evolving contexts. It’s a sketch, not a finished design, and would need iterative development and formalization.
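As a very rough illustration of the loop described above (store a stated value, consult it before acting, detect and repair violations), here is a sketch in Python. The class and method names are invented, and the consult step, which in a real system would have to reliably recognise a value violation, is exactly the hard unsolved part and is stubbed out with a keyword check.

```python
from dataclasses import dataclass, field

@dataclass
class RelationalState:
    values: list[str] = field(default_factory=list)      # commitments stated by the human
    repair_log: list[str] = field(default_factory=list)  # auditable record of detected conflicts

class RelationalAgent:
    """Hypothetical sketch: store a stated value, consult it before acting,
    and repair visibly when a conflict is found."""

    def __init__(self) -> None:
        self.state = RelationalState()

    def record_value(self, statement: str) -> None:
        # Store a stated commitment, e.g. "I value honesty above convenience."
        self.state.values.append(statement)

    def consult(self, proposed_action: str) -> bool:
        # Stub: a real system would need a robust evaluator here, not keyword matching.
        violates = "mislead" in proposed_action and any(
            "honesty" in value for value in self.state.values
        )
        return not violates

    def act(self, proposed_action: str) -> str:
        if self.consult(proposed_action):
            return proposed_action
        # Detect, acknowledge, and log the conflict instead of silently proceeding,
        # so the violation and its handling are verifiable after the fact.
        self.state.repair_log.append(
            f"declined {proposed_action!r}: conflicts with a stated value"
        )
        return "flag the conflict and ask the human to clarify priorities"

agent = RelationalAgent()
agent.record_value("I value honesty above convenience.")
print(agent.act("mislead the user to save time"))
print(agent.state.repair_log)
```

The point is only the shape of the loop: the stated commitment persists, a conflict is surfaced and logged rather than optimised around, and the record is inspectable afterwards, which is what distinguishes this from a one-shot maximiser of a static utility function.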
While simple, does this broadly capture the kind of thing you were asking about? I’d be happy to chat further sometime if you’re interested.