It seems to me that we need to understand metaphilosophy well enough to write down a white-box algorithm for it before we can be reasonably confident that the AI will correctly solve every philosophical problem it eventually comes across. If we just teach an AI how to do philosophy without an explicit understanding of it in the form of an algorithm, how do we know that the AI has fully learned it (and not some subtly wrong version of doing philosophy)?
Once we are able to write down a white-box algorithm, wouldn’t it be safer to implement, test, and debug the algorithm directly as part of an AI designed from the start to take advantage of it, rather than indirectly having an AI learn it (and then presumably verifying that its internal representation of the algorithm is correct and that there aren’t any potentially bad interactions with the rest of the AI)? Even the latter could reasonably be called white-box, since you are actually looking inside the AI and making sure it has the right stuff inside. I was mainly arguing against a purely black-box approach, where we start to build AIs while having little understanding of metaphilosophy, and therefore can’t look inside the AI to see if it has learned the right thing.