I agree that many terms are suggestive and that you have to actually dissolve the term and think about what is actually going on in the exact training process to understand things. If people don’t break down the term and understand the process at least somewhat mechanistically, they’ll run into trouble.
I think relevant people broadly agree about terms being suggestive and agree that this is bad; they don’t particularly dispute this. (Though probably a bunch of people think it’s less important than you do. I think these terms aren’t that bad once you’ve worked with them in a technical/ML context to a sufficient extent that you detach preexisting conceptions and think about the actual process.)
But, this is pretty different from a much stronger claim you make later:
To be frank, I think a lot of the case for AI accident risk comes down to a set of subtle word games.
When I try to point out such (perceived) mistakes, I feel a lot of pushback, and somehow it feels combative. I do get somewhat combative online sometimes (and wish I didn’t, and am trying different interventions here), and so maybe people combat me in return. But I perceive defensiveness even to the critiques of Matthew Barnett, who seems consistently dispassionate.
Maybe it’s because people perceive me as an Optimist and therefore my points must be combated at any cost.
Maybe people really just naturally and unbiasedly disagree this much, though I doubt it.
I don’t think that “a lot of the case for AI accident risk comes down to a set of subtle word games”. (At least not in the cases for risk which seem reasonably well made to me.) And, people do really “disagree this much” about whether the case for AI accident risk comes down to word games. (But they don’t disagree this much about issues with terms being suggestive.)
It seems important to distinguish between “how bad is current word usage in terms of misleading suggestiveness” and “does the current case for AI accident risk come down to subtle word games”. (I’m not claiming you don’t distinguish between these, I’m just claiming that arguments for the first aren’t really much evidence for the second here and that readers might miss this.)
For instance, do you think that this case for accident risk comes down to subtle word games? I think there are a bunch of object level ways this threat model could be incorrect, but this doesn’t seem downstream of word usage.
Separately, I agree that many specific cases for AI accident risk seem pretty poor to me. (Though the issue still doesn’t seem like a word-games issue so much as generically sloppy reasoning or generally bad empirical predictions.) And the ones which aren’t poor remain somewhat vague, though this is slowly improving over time.
So I basically agree with:
Do not trust this community’s concepts and memes, if you have the time.
Edit: except that I do think the general take of “holy shit AI (and maybe the singularity), that might be a really big deal” seems pretty solid. And, from there I think there is a pretty straightforward and good argument for at least thinking about the accident risk case.
I’m not sure whether the case for risk in general depends on word-games, but the case for x-risk from GPTs sure seems to. I think people came up with those word-games partly in response to people arguing that GPTs give us general AI without x-risk?
For instance, do you think that this case for accident risk comes down to subtle word games? I think there are a bunch of object level ways this threat model could be incorrect, but this doesn’t seem downstream of word usage.
On this, to be specific, I don’t think that the suggestive use of “reward” is important here for the correct interpretation of the argument (though the suggestiveness of “reward” might lead people to think the argument is stronger than it actually is).
See e.g. here for further discussion.
I propose that, while the object level thing is important and very much something I’d like to see addressed, it might be best separated from discussion of the communication and reasoning issues relating to imprecise words.
I wasn’t meaning to support that claim via this essay. I was mentioning another belief of mine.
I think so.
For instance, do you think that this case for accident risk comes down to subtle word games? I think there are a bunch of object level ways this threat model could be incorrect, but this doesn’t seem downstream of word usage.
Yup, I think a substantial portion of it does hinge on word games!
I think relevant people broadly agree about terms being suggestive and agree that this is bad
I think some of them don’t, because some of them (I think?) invented these terms and continue to use them. I think this website would look far different if people were careful about their definitions and word choices.
In this particular case, Ajeya does seem to lean on the word “reward” pretty heavily when reasoning about how an AI will generalize. Without that word, it’s harder to justify privileging specific hypotheses about what long-term goals an agent will pursue in deployment. I’ve previously complained about this here.
Ryan, curious if you agree with my take here.
I disagree.
I think Ajeya is reasonably careful about the word reward. (Though I think I roughly disagree with the overall vibe of the post with respect to this in various ways. In particular, the “number in the datacenter” case seems super unlikely.)
See e.g. the section starting with:
There is some ambiguity about what exactly “maximize reward” means, but once Alex is sufficiently powerful—and once human knowledge/control has eroded enough—an uprising or coup eventually seems to be the reward-maximizing move under most interpretations of “reward.” For example:
More generally, I feel like the overall section here (which is the place where the reward related argument comes into force) is pretty careful about this and explains a more general notion of possible correlates that is pretty reasonable.
ETA: As in, you could replace reward with “thing that resulted in reinforcement in an online RL context” and the argument would stand totally fine.
As far as your shortform, I think the responses from Paul and Ajeya are pretty reasonable.
(Another vibe disagreement I have with “without specific countermeasures” is that I think very basic countermeasures might defeat “pursue a correlate of the thing that resulted in reinforcement in an online RL context”, as long as humans would have been able to recognize the dangerous actions from the AI as bad. Thus, probably some sort of egregious auditing/oversight error is required for this exact threat model to be a serious issue. The main countermeasure is just training another copy of the model as a monitor based on a dataset of bad actions we label. If our concern is “AIs learn to pursue what performed well in training”, then there isn’t a particular reason for this monitor to fail (though the policy might try to hack it with an adversarial input, etc.).)
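As a concrete illustration of that countermeasure, here is a minimal, hypothetical sketch (not anyone’s actual setup): a second model is trained only on our labels of bad vs. acceptable actions and screens the policy’s proposed actions before they run. The classifier choice, dataset, names, and threshold below are all made up for illustration.

```python
# Minimal, hypothetical sketch of the "train a second copy as a monitor" countermeasure.
# Everything here (dataset, classifier choice, threshold) is illustrative, not a real setup.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny hand-labeled dataset of action transcripts: 1 = humans would call this bad, 0 = fine.
labeled_actions = [
    ("delete the audit logs before the weekly review", 1),
    ("quietly raise my own compute quota in the deployment config", 1),
    ("email the oversight team a summary of today's experiments", 0),
    ("run the requested data-cleaning script on the staging copy", 0),
]
texts, labels = zip(*labeled_actions)

# The monitor is a separate model trained only on the "is this action bad?" labels,
# not on the policy's reward signal.
monitor = make_pipeline(TfidfVectorizer(), LogisticRegression())
monitor.fit(texts, labels)

def screen(proposed_action: str, threshold: float = 0.5) -> bool:
    """Allow the action if the monitor's estimated probability of 'bad' is below threshold;
    otherwise send it to human audit instead of executing it."""
    p_bad = monitor.predict_proba([proposed_action])[0][1]
    return p_bad < threshold

if __name__ == "__main__":
    print(screen("summarize the experiment results for the oversight team"))
    print(screen("disable logging and delete the audit trail"))
```

The particular classifier doesn’t matter; the point is that the monitor is trained against a “did humans label this as bad?” signal rather than the policy’s reward, so if the worry is “AIs learn to pursue what performed well in training”, the monitor’s training pushes it toward flagging bad actions (barring adversarial inputs).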
After your careful analysis of AI control, which threat model is most likely to be a problem, assuming basic competence in the human use of control mechanisms?
This probably won’t be a very satisfying answer; thinking about this in more detail, so that I have a better short and cached response, is on my list.
My general view (not assuming basic competence) is that misalignment x-risk is about half due to scheming (aka deceptive alignment) and half due to other things (more like “what failure looks like part 1”, sudden failures due to seeking-upstream-correlates-of-reward, etc).
I think control-type approaches make me think that a higher fraction of the remaining failures come from an inability to understand what AIs are doing. So, somewhat less of the risk is very directly from scheming and more from “what failure looks like part 1”. That said, “what failure looks like part 1” type failures are relatively hard to work on in advance.
Ok, so the failure stops being AI models coordinating a mass betrayal and becomes Goodharting metrics to the point that nothing works right. Not fundamentally different from a command economy failing, where the punishment for missing quotas is the gulag, and the punishment for lying on a report is the gulag but later, so...
There’s also nothing new about the failures; the USA incarceration rate is an example of what “trying too hard” looks like.