Thank you for this thread. It provides valuable insights into the depth and complexity of the problems in the alignment space.
I am wondering whether a possible strategy could be to deliberately impart a sense of self + boundaries to models at a deep level.
By ‘sense of self’ I do not mean emergence or selfhood in any way. Rather, I mean that LLMs and agents are much like dynamical systems: if they come to understand themselves as a system, that could provide a strong foundation on which methods like the character training mentioned in this post can be applied. It might also open pathways for other grounding concepts, like morality, values, and ethics, as part of inner alignment.
Similarly, a complementary deep sense of self-boundaries would give the system something tangible to honor irrespective of outward behavioral controls, which would likely help with outer alignment.
I would appreciate thoughts on whether the dyad of sense of self + boundaries is worth exploring as an alignment lever.