The fact that constitutional AI works at all is strong evidence that language models understand the meaning of abstract concepts: we can point at a principle like ‘freedom’ and the model is able to drive a reinforcement learning optimization process to hit the right behavioral targets from that abstract principle alone.
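For concreteness, here is a minimal sketch of the critique-and-revision loop constitutional AI uses to turn an abstract principle into training targets. This is illustrative only: the `generate` helper is hypothetical, standing in for any language-model completion call, and the principles are made up.

```python
# Minimal sketch of the constitutional AI critique-and-revision loop.
# `generate` is a hypothetical stand-in for a language-model call,
# not a real API.

CONSTITUTION = [
    "Choose the response that most respects freedom and personal autonomy.",
    "Choose the response that is least likely to cause harm.",
]

def generate(prompt: str) -> str:
    """Hypothetical wrapper around a language-model completion call."""
    raise NotImplementedError  # plug in a real model here

def revise(prompt: str, draft: str) -> str:
    """Have the model critique and rewrite its own draft against each
    abstract principle; the revised outputs become the training targets
    that steer the later RL fine-tuning stage."""
    for principle in CONSTITUTION:
        critique = generate(
            f"Principle: {principle}\n"
            f"Prompt: {prompt}\n"
            f"Response: {draft}\n"
            "Critique the response under the principle."
        )
        draft = generate(
            f"Original response: {draft}\n"
            f"Critique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    return draft
```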
“It understands but it doesn’t care!”
There is this bizarre motte-and-bailey people seem to do around this subject.
I agree. I am extremely bothered by this unsubstantiated claim. I recently replied to Eliezer:
Getting a shape into the AI’s preferences is different from getting it into the AI’s predictive model.
It seems like you think that human preferences are only being “predicted” by GPT-4, and not “preferred.” If so, why do you think that?
I commonly encounter people expressing sentiments like “prosaic alignment work isn’t real alignment, because we aren’t actually getting the AI to care about X.” To which I say: How do you know that? What does it even mean for that claim to be true or false? What do you think you know, and why do you think you know it? What empirical knowledge of inner motivational structure could you be leveraging to make these claims, such that you are far more likely to make these claims in worlds where the claims are actually true?
The value of constitutional AI is using simulations of humans to rate an AI’s outputs, rather than actual humans. This is a lot cheaper and allows for more iteration, etc., but I don’t think it will work once AIs become smarter than humans. At that point, the human simulations will have trouble evaluating AIs, just as actual humans would.
Of course getting really cheap human feedback is useful, but I want to point out that constitutional AI will likely run into novel problems as AI capabilities surpass human ones.
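A rough sketch of the AI-feedback labeling step being described, assuming a hypothetical `judge` helper in place of a real judge model. The worry above lives exactly here: once the policy being trained outstrips the judge, these preference labels stop tracking true quality.

```python
# Sketch of AI-feedback preference labeling (simulated raters replacing
# human raters). `judge` is a hypothetical stand-in for the judge model.

def judge(prompt: str) -> str:
    """Hypothetical wrapper around the judge model."""
    raise NotImplementedError  # plug in a real model here

def preference_label(prompt: str, response_a: str, response_b: str) -> str:
    """Ask the judge which response is better; the resulting
    (prompt, chosen, rejected) triples train the reward model used in RL,
    with no human in the loop. A judge weaker than the policy it
    evaluates will mislabel exactly the cases that matter most."""
    verdict = judge(
        f"Prompt: {prompt}\n"
        f"Response A: {response_a}\n"
        f"Response B: {response_b}\n"
        "Which response is better? Answer with 'A' or 'B'."
    )
    return response_a if verdict.strip().upper().startswith("A") else response_b
```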
We do not know; that is the relevant problem. Looking at the output of a black box is insufficient. You can only know by putting the black box in power, or by deeply understanding it. Humans are born into a world with others in power, so we know that most humans care about each other without knowing why. AI has no history of demonstrating friendliness under the only circumstances where it could be provably found. We can only know in advance by way of thorough understanding.
A strong theory about AI internals should come first. Refuting Yudkowsky’s theory about how it might go wrong is irrelevant.
Well, if someone originally started worrying based on strident predictions of sophisticated internal reasoning with goals independent of external behavior, then realizing that those predictions are currently unsubstantiated should cause them to update downward on AI risk. That’s why it’s relevant. That said, I do think we should have good theories of AI internals.
I know I reacted to this comment, but I want to emphasize that this:
Well, if someone originally started worrying based on strident predictions of sophisticated internal reasoning with goals independent of external behavior,
is, to first order, arguably the entire AI risk argument. That is, if we assume that external behavior gives strong evidence about internal structure, then there is no reason to elevate the AI risk argument at all, given the probably-aligned behavior of GPTs trained with RLHF.
More generally, the stronger the connection between external behavior and internal goals, the less worried you should be about AI safety. This is a partial disagreement with people who are more pessimistic, though I have other disagreements with them as well.
Humans are born into a world with others in power, so we know that most humans care about each other without knowing why.
I think the actual reason we believe humans can care about each other is that we’ve evolved the ability to do so: most humans share the same brain structure, and therefore the same tendency to care for people they consider part of their “ingroup”.