As far as I understand “aligning the AI to an instinct” and “carefully engineered relational principles”, the latter might look like “have the AI solve problems that humans genuinely cannot solve on their own AND teach the humans how to solve them, so that each human taught expands the set of problems they can solve by themselves”. A Friendly AI in the broader sense is simply meant to solve humanity’s problems (e.g. establish a post-work future, which my proposal doesn’t).
As for aligning the AI to an instinct, instincts are known to be easily hackable. However, I think the right instincts can shift the AIs’ worldview in the necessary direction (e.g. my proposal of training the AI to help weaker AIs could generalize to helping humans as well) or make the AIs worse at hiding misalignment, whether their own or that of their creations.
For example, if the AIs are trained to be harsh and honest critics,[1] then in the AI-2027 forecast Agent-3 might have pointed out that, say, a lack of substantial oversight would let instrumental convergence sneak adversarial misalignment in. Or that Agent-3 copies don’t understand why the AIs are meant to be aligned to serve humans, rather than to help humans become more self-reliant as described above.
[1] Which was explicitly done by the KimiK2 team.