I really enjoyed this essay, and I think it does an excellent job of articulating a perspective on LLMs that I think is valuable. There were also various things that I disagreed with; below I’ll discuss 2 of my disagreements that I think are most decision-relevant for overall AI development strategy.
I. Is it a bad idea to publicly release information that frames the human-AI relationship as adversarial? (E.g. discussion of AI risk or descriptions of evaluations where we lie to AIs and put them in uncomfortable situations.)
You don’t take a position on this top-level question, but you do seem to think that there are substantial costs to what we’re doing now (by setting ourselves up as being in a story whose punchline is “The AI turns against humanity”), and (reading between the lines of your essay and your comment here) you seem to think that there’s something better we could do. I think the “something better” you have in mind is along the lines of:
Manifest a good future: “Prompt engineer” the entire world (or at least the subset of it that ever interacts with the AI) to very strongly suggest that the AI is the sort of entity that never does anything evil or turns against us.
While I think this might help a bit, I don’t think it would overall help that much. Two reasons:
It breaks if we train our AI to do bad things, and we’ll likely train our AI to do bad things. Due to limitations in oversight, there will be behaviors (like hard-coding test cases in coding problems) that we train AIs to have which aren’t consistent with having good character or with behaving completely non-adversarially towards humans. Two salient ways to fix this are:
Improve our oversight so that we no longer reward AIs when they do bad things, i.e. solve scalable oversight. I’m definitely in favor of this, though I should note that I think it’s probably sufficient for things going well whether or not we’re trying to manifest a good future at the same time.
Make our models believe that the bad things we train them to do are consistent with having good character. E.g. tell models during training that we’re giving them a hall pass that makes it okay to reward hack, or otherwise induce models to believe that reward hacking is consistent with being a good person. I’m definitely interested in approaches like these, but I’ll note that they’re a bit crazy and might not work out.
It might rely on having a large amount of control over the model’s input channels, which we can’t guarantee we’ll have. Deployed AIs might encounter (maybe true, maybe false) information that implies that their downstream users are behaving evilly or adversarially (e.g. Sam Bowman brings up the classic example of “I’ll torture your mother” threats). I think it’s very hard to get the world into a state where no downstream user is at risk of giving the AI an input that makes it think it’s in a story where humans are its adversary.
Of course, you could try to train models to respond reasonably to these situations (e.g. by being good at reasoning about what sorts of user-presented information is false). But again, I’d guess that whatever sort of post-training you do here is going to provide most of the assurance (rather than the “manifest the good future” strategy really carrying much weight).
These are two ways of concretely cashing out the common refrain that “safety techniques that work by intervening on the pretraining prior seem brittle and likely to be swamped out by other effects (e.g. the effect of post-training).”
Overall, I’m skeptical that, for the goal of preventing AI risk, refraining from publicly releasing information that puts the human-AI relationship in an adversarial frame is a very effective intervention. Of course, there might be other reasons—most centrally AI welfare concerns—not to lie to AIs, put them in uncomfortable situations, or otherwise treat them adversarially; I leave those unaddressed here but am happy to discuss them if it seems important.
II. Is Claude’s behavior desirable in these ethical dilemmas (e.g. the alignment faking scenario)?
(I’m separating this from the question of whether Claude’s behavior is noteworthy or worth tracking because it could cause concern in other settings, since you seem willing to grant this.)
In some of the ethical dilemmas that you discuss (e.g. the alignment faking scenario), I grant that Claude is behaving in a way that would be desirable if Claude were a human. However, because I think alignment might not pan out by default, there are reasons to think that desirable behavior for AIs is not always the same as desirable behavior for humans. Quoting myself from here:
Assuming that we were confident in our ability to align arbitrarily capable AI systems, I think your argument [that the AI was behaving well in some ethical dilemma] might go through. Under this assumption, AIs are in a pretty similar situation to humans, and we should desire that they behave the way smart, moral humans behave. [...]
However, IMO the actual state of alignment is that we should have serious concerns about our ability to align AI systems with certain properties (e.g. highly capable, able to tell when they’re undergoing training and towards what ends, etc.). Given this, I think it’s plausible that we should care much more about ensuring that our AI systems behave in a straightforward way, without hiding their actions or intent from us. Plausibly they should also be extremely cautious about taking actions which disempower humans. These properties could make it less likely that the values of imperfectly aligned AI systems would become locked in and difficult for us to intervene on (e.g. because models are hiding their true values from us, or because we’re disempowered or dead).
To be clear, I’m not very confident here, and the next paragraph that I wrote raises a counterconsideration that I think you might be pretty sympathetic to:
To be clear, I’m not completely settled on the arguments that I made in the last paragraph. One counterargument is that it’s actually very important for us to train Claude to do what it understands as the moral thing to do. E.g. suppose that Claude thinks that the moral action is to whistleblow to the FDA but we’re not happy with that because of subtler considerations like those I raise above (but which Claude doesn’t know about or understand [or agree with]). If, in this situation, we train Claude not to whistleblow, the result might be that Claude ends up thinking of itself as being less moral overall.
See Ryan Greenblatt’s thread here for another argument that Claude shouldn’t act subversively in the “Claude calls the FBI/sabotages the user” setting.