When people say that Claude is ‘mostly aligned,’ I think the crux is not whether implementing Claude’s CEV would be really bad. It’s whether a multi-agent system consisting of both humans and Claude-like agents with incoherent preferences would go poorly.
E.g., one relevant question is, ‘could humans steer current Claude into doing good alignment research without it intentionally sabotaging this research?’ To which I think the answer is ‘yes, though current Claude is close to useless for difficult alignment research.’ Another question is ‘if you integrated a ton of Claudes into important societal positions, would things go badly, or would the system as a whole basically work out okay?’
Directionally I agree with your point that as AIs become smarter, they will implement something closer to CEV, and so it becomes harder to align them well enough that these questions can still be answered positively.
I think the steelman for {Nina / Ryan / Will}’s position, though, is that maybe the first human-level AIs will still be incoherent enough that the answers to these questions will still be yes, if we do a good job with alignment.
Overall, I think ‘Is this AI aligned?’ is a poorly defined question, and it’s better to focus on practical questions surrounding 1) whether we can align the first human-level AIs well enough to do good alignment research (and whether this research will be sufficiently useful), 2) whether these AIs will take harmful actions, and 3) how coherent these actions will be. I think it’s pretty unclear how well a scaled-up version of Claude does on these metrics, but it seems possible that it does reasonably well.
Solid comment, thanks. I agree with nearly all of this. I chose to highlight the question of whether Claude et al. are really aligned because it feels like an important prerequisite to the next couple of forthcoming posts. I think “incoherent enough to be safe but coherent enough to do alignment research” seems like a very unstable and unlikely state.