(ETA: Sorry, upon reviewing the whole thread, I think I misinterpreted your comment and thus the following reply is probably off point.)
We have to actually implement/align-the-AI-to the correct decision theory.
I think the best way to end up with an AI that has the correct decision theory is to make sure the AI can competently reason philosophically about decision theory and is motivated to follow the conclusions of such reasoning. In other words, it doesn’t judge a candidate successor decision theory by its current decision theory (CDT changing into Son-of-CDT), but by “doing philosophy”, just like humans do. Because given the slow pace of progress in decision theory, what are the chances that we correctly solve all of the relevant problems before AI takes off?
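To make the contrast concrete, here is a toy sketch (all names and numbers are made-up stand-ins, not any real system) of the two ways a self-modifying agent might score candidate successor decision theories: by its current decision theory’s expected utility, versus by deferring to open-ended philosophical deliberation.

```python
# Toy illustration only; the decision theories and scores here are hypothetical stand-ins.
from dataclasses import dataclass
from typing import Callable

@dataclass
class CandidateDT:
    name: str
    # Expected utility of self-modifying into this successor, as judged by the
    # agent's *current* decision theory (e.g. CDT evaluating "Son-of-CDT").
    utility_by_current_dt: float
    # How well the candidate fares under open-ended philosophical deliberation:
    # thought experiments, reflective consistency, handling of known paradoxes.
    philosophical_plausibility: float

def choose_successor(candidates: list[CandidateDT],
                     judge: Callable[[CandidateDT], float]) -> CandidateDT:
    return max(candidates, key=judge)

candidates = [
    CandidateDT("Son-of-CDT", utility_by_current_dt=1.0, philosophical_plausibility=0.3),
    CandidateDT("hypothetical-better-theory", utility_by_current_dt=0.6, philosophical_plausibility=0.9),
]

# Failure mode: judging successors by the current decision theory locks in its flaws.
print(choose_successor(candidates, lambda dt: dt.utility_by_current_dt).name)
# Hoped-for alternative: judging successors by "doing philosophy", as humans do.
print(choose_successor(candidates, lambda dt: dt.philosophical_plausibility).name)
```

The point of the toy is only the structural difference in the `judge` function, not any claim about how such a deliberation process would actually be built.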
Do you have thoughts on how to encode “doing philosophy” in a way that we would expect to be strongly convergent, such that if it were implemented on the last AI humans ever control, we could trust the process after disempowerment to keep doing philosophy usefully, in some nailed-down way?
I think we’re really far from having a good enough understanding of what “philosophy” is, or what “doing philosophy” consists of, to be able to do that. (Aside from “indirect” methods that pass the buck to simulated humans, which Pi Rogers also mentioned in another reply to you.)
Here is my current best understanding of what philosophy is, so you can have some idea of how far we are from what you’re asking.
Maybe some kind of simulated long-reflection-type thing, like QACI, where “doing philosophy” basically becomes “predicting how humans would do philosophy if given lots of time and resources”.
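As a rough structural sketch of that idea (purely illustrative, with hypothetical names, and not QACI’s actual formalism), “doing philosophy” reduces to a single prediction query about an idealized, long-running human deliberation:

```python
# Toy structural sketch only; not QACI's actual formalism, and all names are hypothetical.
from typing import Callable

def predict_deliberation_output(predictor: Callable[[str], str],
                                question: str, budget_years: int) -> str:
    """Ask a predictive model what answer (simulated) humans would converge on,
    given a huge budget of subjective time and resources to do philosophy."""
    prompt = (
        f"Counterfactual query: careful human thinkers spend {budget_years} years "
        f"of subjective time, with ample resources, on the question below.\n"
        f"Question: {question}\n"
        f"Their final considered answer:"
    )
    return predictor(prompt)

def do_philosophy(predictor: Callable[[str], str], question: str) -> str:
    # The whole of "doing philosophy" is delegated to the prediction above.
    return predict_deliberation_output(predictor, question, budget_years=1_000_000)

# Trivial stand-in predictor, just to show the call structure.
stub_predictor = lambda prompt: "<whatever the long reflection would conclude>"
print(do_philosophy(stub_predictor, "What is the correct decision theory?"))
```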
That would be a philosophical problem...
Currently, I think this is a big crux in how to “do alignment research at all”. Debatably “the biggest” or even “the only real” crux.
(As you can tell, I’m still uncertain about it.)