Thanks for the great reply :) I think we do disagree after all.
> humans are definitionally the source of information about human values, even if it may be challenging to elicit this information from humans
Except about that—here we agree.
> Now, what this human input looks like could (and probably should) go beyond introspection and preference judgments, which, as you point out, can be unreliable. It could instead involve expert judgment from humans with diverse cultural backgrounds, deliberation and/or negotiation, incentives to encourage deep, reflective thinking rather than snap judgments or falling back on heuristics. It could also involve AI assistance to help counter human biases, find common ground, and consider the logical consequences of communicated values.
This might be summarized as “If humans are inaccurate, let’s strive to make them more accurate.”
I think this, as a research priority or plan A, is doomed by a confluence of practical facts (humans aren’t actually that consistent, even in what we’d consider a neutral setting) and philosophical problems (What if I think the snap judgments and heuristics are important parts of being human? And, how do you square a univariate notion of ‘accuracy’ with the sensitivity of human conclusions to semi-arbitrary changes to e.g. their reading lists, or the framings of arguments presented to them?).
Instead, I think our strategy should be “If humans are inconsistent and disagree, let’s strive to learn a notion of human values that’s robust to our inconsistency and disagreement.”
> We contend that even as AI gets really smart, humans ultimately need to be in the loop to determine whether or not a constitution is aligned and reasonable.
A committee of humans reviewing an AI’s proposal is, ultimately, a physical system that can be predicted. If you have an AI that’s good at predicting physical systems, then before it makes an important decision it can just predict this Committee(time, proposal) system and treat the predicted output as feedback on its proposal. If the prediction is accurate, then actual humans meeting in committee is unnecessary.
(And indeed, putting human control of the AI in the physical world actually exposes it to more manipulation than if the control is safely ensconced in the logical structure of the AI’s decision-making.)
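To make this concrete, here's a minimal sketch of the shape I have in mind. Every name in it (the `CommitteeModel` type, `choose_proposal`, `predict_committee`) is hypothetical, just an illustration of "predict the committee and use the prediction as feedback", not a design anyone has proposed:

```python
from typing import Callable, Sequence

# Hypothetical type: a predictor that maps (time, proposal) to the approval
# score the AI expects the human committee would give. This stands in for
# whatever learned model of the committee the AI has; it is not a real API.
CommitteeModel = Callable[[float, str], float]


def choose_proposal(
    proposals: Sequence[str],
    predict_committee: CommitteeModel,
    time: float,
) -> str:
    """Pick the proposal with the highest *predicted* committee approval.

    If the predictor is accurate, this ranking matches what the real
    committee would have produced, so actually convening the humans
    adds no new information.
    """
    return max(proposals, key=lambda p: predict_committee(time, p))
```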
Thanks for another thoughtful response and for explaining further. I think we can now both agree that we disagree (at least in certain respects) ;-)
We take seriously your argument that AI could get really smart and good at predicting human preferences and values, which could change the level of human involvement in training, evaluation, and monitoring. However, if we go with the approach you propose:
> Instead, I think our strategy should be “If humans are inconsistent and disagree, let’s strive to learn a notion of human values that’s robust to our inconsistency and disagreement.”
> A committee of humans reviewing an AI’s proposal is, ultimately, a physical system that can be predicted. If you have an AI that’s good at predicting physical systems, then before it makes an important decision it can just predict this Committee(time, proposal) system and treat the predicted output as feedback on its proposal. If the prediction is accurate, then actual humans meeting in committee is unnecessary.
The question arises:
How will we know whether the AI has learned a notion of human values that is robust to inconsistency and disagreement, and whether its predictions are accurate?
We would argue that some form of human input is needed to evaluate what the AI has learned, though this input need not be the prompt-response feedback typical of current RLHF approaches.
If this evaluation reveals that the AI is indeed accurate (whatever that may mean for the particular product and context in question), then we agree that further human input could be more limited. Even so, continual training, evaluation, and monitoring with humans in the loop in some capacity will likely be needed, both because values change over time and to ensure that the system has not drifted.
> (And indeed, putting human control of the AI in the physical world actually exposes it to more manipulation than if the control is safely ensconced in the logical structure of the AI’s decision-making.)
We are hesitant to take an approach of AI paternalism where we assume the AI knows best and ignore human disagreement, though there may be deployment contexts where that is appropriate for safety. Note, though, that our argument is focused more on human involvement in training, evaluation, and monitoring than on real-time decisions during deployment. As AI gets smarter, even if these systems can perfectly predict human values and preferences, they could also learn to collude, deceive, and sabotage. For example, if they develop situational awareness, they could behave differently at deployment time than at training time. We agree that there are risks to enabling human control, but abdicating all control to the AI is also risky. This is why we argue for human-AI complementarity: leveraging the strengths of both types of intelligence may lead to a more robust signal for training, evaluation, and monitoring than relying on AI or humans alone.
~ Sophie Bridgers (on behalf of the authors)