I think one of the biggest problems with human reports is that it is very unclear what they actually measure.
It seems reasonable to suppose that they measure the best combination of constructs for human purposes, using the best information available to human senses, within the contexts humans usually operate in. This makes them straightforwardly the best information available to humans.
But making sense of this from an external scientific perspective is much harder. Can we precisely characterize the purposes for which people use the information? Can we precisely characterize the external sources of the information, and how those sources behave in the contexts humans operate in? Maybe we can, but if so it's a huge research project.
Answering these sorts of questions is necessary for alignment-related purposes, because the answers tell us how the system extrapolates, e.g. which kinds of deception it gracefully handles. However, human-mimicking approaches don't solve this problem; they just complicate it by adding an extra layer of indirection where things can go wrong.
Getting a general theory of how these sorts of perceptions work is useful both because it lets us enumerate the failure cases more precisely, and because it can teach us what a system must do to avoid them.