I’d like to gain clarity on what we think the relationship should be between AI alignment and agent foundations. To me, the relationship is 1) historical, in that the people bringing about the field of agent foundations are coming from the AI alignment community and 2) motivational, in that the reason they’re investigating agent foundations is to make progress on AI alignment, but not 3) technical, in that I think agent foundations should not be about directly answering questions of how to make the development of AI beneficial to humanity. I think it makes more sense to pursue agent foundations as a quest to understand the nature of agents as a technical concept in its own right.
If you are a climate scientist, then you are very likely in the field in order to help humanity reduce the harms from climate change. But on a day-to-day basis, the thing you are doing is trying to understand the underlying patterns and behavior of the climate as a physical system. It would be unnatural to e.g. exclude papers from climate science journals on the grounds of not being clearly applicable to reducing climate change.
For agent foundations, I think some of the core questions revolve around things like, how does having goals work? How stable are goals? How retargetable are goals? Can we make systems that optimize strongly but within certain limitations? But none of those questions are directly about aligning the goals with humanity.
There’s also another group of questions like, what are humans’ goals? How can we tell? How complex and fragile are they? How can we get an AI system to imitate a human? Et cetera. But I think these questions come from a field that is not agent foundations.
There should certainly be constant and heavy communication between these fields. And I also think that even individual people should be thinking about the applicability questions. But they’re somewhat separate loops. A climate scientist will have an outer loop that does things like choosing a research problem because they think the answer might help reduce climate change, and they should keep checking on that belief as they perform their research. But while they’re doing their research, I think they should generally be using an inner loop that just thinks, “huh, how does this funny ‘climate’ thing work?”