I think this is really good and important. Big upvote.
I largely agree: for these reasons, the default plan is very bad, and far too likely to fail.
The AGI is on your side, until it isn’t. There’s not much basin. I note that the optimistic quote you lead with explicitly includes “you need to solve alignment”.
Even though I’ve argued that Instruction-following is easier than value alignment, including some optimism about roughly this basin-of-alignment idea, I now agree that there really isn’t much of a basin. I think there may be some real help from roughly human-level AGI that still thinks it’s aligned, and/or is functionally aligned in those use cases (it hasn’t yet hit much of “the ocean” in your metaphor). That could be really useful. But as soon as it realizes it’s misaligned (see my reasoning post below) or hits severely OOD contexts, it will be working against you just as it was working for you shortly before. There’s no real basin keeping it in, just some help in guessing how it or its next generation might become misaligned.
I really like the shipbuilding metaphor. I think we’re desperately in need of more precise and specific discussion on this topic, and concrete, engineering-related metaphors seem like a good way forward.
In that metaphor, I’d like to see more work on modeling conditions out at sea. That’s how I view my work: trying to envision the most likely path from here to AGI, which I see as going through LLMs enhanced in very roughly brainlike directions.
I used that framing and those mechanisms for what’s approximately my version of this argument: LLM AGI may reason about its goals and discover misalignments by default. That also resulted from doing (what I think is) the exercise you suggest.
After doing that exercise in the course of writing that mega-post, my specific estimates are a bit different from yours, but they’re qualitatively similar.
Talking to you helped shift me in the pessimistic direction, although I did reach out asking to talk because I was on a project of really staring into the abyss of deep alignment worries.
I now think the current path is more likely than not to get us all killed. I don’t enjoy thinking that, and I’ve done a lot of work trying to correct for my biases.
But I think the full theory and the full story are unwritten. I think there’s a ton of model uncertainty still. Playing to our outs involves working toward alignment on the current path AND trying to slow or stop progress.
Based on that uncertainty, I think it’s quite possible that relatively minor and realistic changes to alignment techniques might be enough to make the difference. So that’s what I’m working on; more soon.
I think there may be some real help from roughly human-level AGI that still thinks it’s aligned, and/or is functionally aligned in those use cases (it hasn’t yet hit much of “the ocean” in your metaphor). That could be really useful.
Yeah I agree with this, although the only way I can concretely imagine it helping is via the scenario I wrote at the end of the post.
I’d like to see more work on modeling conditions out at sea. That’s how I view my work: trying to envision the most likely path from here to AGI, which I see as going through LLMs enhanced in very roughly brainlike directions.
Agreed, although three comments:
1. I think you’re shooting yourself in the foot here by choosing the building material in advance (LLMs).
2. Separately, it’s a terrible building material: it lacks transparency, hooks, and modularity.
3. It’s difficult to make intellectual progress on topics like this unless it’s a bit more formal. Partly for communication reasons: we need to be able to iterate and build on each other’s work. But also because human reasoning is just too fuzzy for this kind of OOD guesswork. So I consider tiling agents and logical induction to be good work modelling the distribution shifts of learning/reasoning and self-modification.
I think there’s a ton of model uncertainty still. Playing to our outs involves working toward alignment on the current path
You’ve gotta be careful with this particular chain of reasoning. It may look like model uncertainty implies hope, but that doesn’t mean it implies hope for any specific line of work.
involves working toward alignment on the current path AND trying to slow or stop progress.
These two things can sometimes be opposed to each other. E.g. I’d bet that alignment work at OpenAI has been net negative overall, by giving everyone there the impression that they might be doing the right thing.
Based on that uncertainty, I think it’s quite possible that relatively minor and realistic changes to alignment techniques might be enough to make the difference. So that’s what I’m working on; more soon.
Mining your model uncertainty for concretely useful actions just doesn’t seem like the kind of thing that could work. I look forward to reading about it anyway.
Trying to align LLMs just doesn’t seem optional to me. It’s what’s happening whether or not we like it.
That feels like flag-waving or rationalizing or something. Obviously it’s not that simple, and you know that. You’ve got to do the prioritisation calculation. I don’t think it works out in that direction, but even if it did, you should argue for it by comparing the difficulties and counterfactual value of each research agenda.