Do you have thoughts on how to encode “doing philosophy” in a way we would expect to be strongly convergent, such that if it were implemented in the last AI humans ever control, we could trust the process to keep doing philosophy usefully, in some nailed-down way, even after human disempowerment?
I think we’re really far from having a good enough understanding of what “philosophy” is, or what “doing philosophy” consists of, to be able to do that. (Aside from “indirect” methods that pass the buck to simulated humans, which Pi Rogers also mentioned in another reply to you.)
Here is my current best understanding of what philosophy is, so you can have some idea of how far we are from what you’re asking.
Maybe some kind of simulated long-reflection scheme like QACI, where “doing philosophy” basically becomes “predicting how humans would do philosophy if given lots of time and resources.”
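To gesture at what “strongly convergent” might even mean operationally, here is a toy sketch in Python. It is purely illustrative and not anything QACI actually specifies: `deliberate` is a hypothetical oracle for “how humans would answer this question after the given amount of deliberation time,” and an answer is accepted only once it stops changing as that budget grows.

```python
from typing import Callable, Optional

def converged_answer(
    deliberate: Callable[[str, int], str],  # hypothetical oracle, assumed here
    question: str,
    start_budget: int = 1,
    max_budget: int = 1_000_000,
) -> Optional[str]:
    """Accept an answer only once it is stable under a doubling of the
    simulated deliberation budget; otherwise report non-convergence."""
    budget = start_budget
    answer = deliberate(question, budget)
    while budget * 2 <= max_budget:
        budget *= 2
        next_answer = deliberate(question, budget)
        if next_answer == answer:
            # Stable across a doubling of budget: call it "converged".
            return answer
        answer = next_answer  # still drifting; keep extending
    return None  # never stabilized within the budget we could afford
```

All the philosophical difficulty is hidden inside the oracle, and even the stopping rule (how much stability counts as convergence?) is itself a substantive choice.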
That would be a philosophical problem...