I’m with Davidmanheim here; it seems this idea could benefit from some reading in measurement theory, or at least from recognizing a discrepancy that undermines the analogy. I’ll get into that a bit, but to start, the post was definitely good food for thought.
If you’re measuring actual temperature, you have several measurement options there too, but fundamentally it’s a property of the material under study. If you’re measuring “the” perceived temperature, it’s an interaction between “the average person” and the material, and sticking fingers in is probably a good measure. Yes, temperature and perceived temperature will correlate, but if the thing you’re measuring exists only in someone’s head, you’re going to have to go to their head for the measurement (see also psychophysics).
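To make the psychophysics point concrete, here’s a minimal sketch in Python. It uses the general form of Stevens’ power law; the constants are invented for illustration, and the only point is that the perception-to-stimulus mapping is a fact about the observer, not the material alone.

```python
# Sketch only, not anyone's actual data: Stevens' power law models
# perceived magnitude as a power function of physical intensity,
# psi = c * phi**a. The constants c and a below are invented; empirically
# they characterize the observer and the sensory modality, not the
# material whose temperature is "being measured".
def perceived_magnitude(phi: float, c: float = 1.0, a: float = 1.6) -> float:
    """Perceived intensity of a stimulus with physical intensity phi."""
    return c * phi ** a

# Doubling the physical stimulus does not double the percept:
print(perceived_magnitude(2.0) / perceived_magnitude(1.0))  # ~3.03, not 2
```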
“Train[ing] a net to replicate human reports” is not obviously less useful than “actual” scales. Human reports may in fact be the most construct-valid measure. (Though I do agree that leaving these reports in the form of natural language rather than attempted quantifications would indeed be ambiguous, and if we lack face-valid quantitative measures, we will have to develop them from somewhere, probably starting from those open-ended responses.)
Although human reports may be noisy, so are all measures: the thermometer has an implicit ± margin of error. It seems very precise to us, but human judgments of attributes can also be reliable (in that lots of people agree) and precise (in that the error bars are narrow). For example, if I asked a lot of people to rate the perceived precision of various measures on a scale of 1 = extremely noisy to 100 = extremely precise, I would expect a decent amount of consistency in the rank ordering of those ratings, for thermometers to score highly, and for at least some of the average perceived precisions to carry pleasantly narrow error bars.
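As a toy version of that rating exercise (all numbers invented), we can simulate many raters scoring a few instruments and check agreement on the rank ordering with Kendall’s coefficient of concordance:

```python
# Toy simulation of the rating exercise above; the latent "perceived
# precision" values and the noise level are invented for illustration.
import numpy as np
from scipy.stats import rankdata

rng = np.random.default_rng(0)

instruments = ["finger", "guess", "mercury thermometer", "IR thermometer"]
latent = np.array([35.0, 15.0, 75.0, 85.0])   # hypothetical 1-100 scores

m = 200                                        # number of raters
ratings = latent + rng.normal(0.0, 12.0, (m, len(latent)))

# Kendall's coefficient of concordance W: 1 = perfect agreement on the
# rank ordering across raters, 0 = no agreement.
ranks = np.apply_along_axis(rankdata, 1, ratings)
rank_sums = ranks.sum(axis=0)
n = len(latent)
S = ((rank_sums - rank_sums.mean()) ** 2).sum()
W = 12.0 * S / (m ** 2 * (n ** 3 - n))

print(f"Kendall's W = {W:.2f}")  # well above 0: raters broadly share an ordering
print("mean ratings:", ratings.mean(axis=0).round(1))
print("error bars (SEM):", (ratings.std(axis=0, ddof=1) / np.sqrt(m)).round(2))
```

Individual raters disagree, yet the mean ratings come with narrow error bars, which is exactly the reliable-and-precise combination described above.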
But because even the lowest-variance perceptions vary a lot between people (compared with the variability in temperature readings from a thermometer), I do suspect you’re not going to get readings that are “approximately-deterministically” useful indicators in lots of perceptual domains, such as alignment. But you will get indicators that “far-from-deterministically-but-reliably” predict variance in criterion variables. In the end, we’re pessimistic and optimistic about the same things; I just don’t think it’s because human reports are inherently the wrong tool, but because the attribute of interest is a psychological construct rather than a conveniently and precisely measurable physical property. Again, the post was good food for thought: just as the measurement of temperature has improved and grown more precise (touch it → use mercury → use radiation), maybe the methods we use for psychological measurement will develop and improve, with hope for alignment.
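One standard psychometric reason for that optimism, sketched below with an invented (and deliberately pessimistic) single-rater reliability of 0.2: even when individual reports track the construct only weakly, the Spearman–Brown prophecy formula shows the reliability of the averaged report climbing toward 1 as raters are added.

```python
# Why noisy individual reports can still be reliable in aggregate: the
# Spearman-Brown prophecy formula gives the reliability of the average
# of k parallel ratings from the single-rater reliability rho.
# rho = 0.2 is an invented, pessimistic value for illustration.
def spearman_brown(rho: float, k: int) -> float:
    """Reliability of the mean of k parallel ratings."""
    return k * rho / (1 + (k - 1) * rho)

for k in (1, 10, 50, 200):
    print(f"k={k:4d}  reliability={spearman_brown(0.2, k):.2f}")
# k=1: 0.20, k=10: 0.71, k=50: 0.93, k=200: 0.98
```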
I think one of the biggest problems with human reports is that it is very unclear what they actually measure.
It seems reasonable to suppose that they measure the best combination of constructs for human purposes, using the best information available to human senses, within the contexts humans usually operate in. That makes them straightforwardly the best information available to humans.
But when we try to make sense of this from an external scientific perspective, we run into a lot of trouble. Can we precisely characterize the purposes for which people use the information? Can we precisely characterize the external sources of the information, and how those sources work in the human contexts? Maybe we can, but if so it’s a huge research project.
Answering these sorts of questions is necessary for alignment-related purposes, since the answers tell us how the system extrapolates, e.g. which kinds of deception it handles gracefully. However, human-mimicking approaches don’t solve this problem; they just complicate it by adding an extra layer of indirection where things can go wrong.
A general theory of how these sorts of perceptions work would be useful both because it would let us enumerate the failure cases more precisely, and because it would teach us what a system must do to avoid them.