I still think this post is pretty good and I stand by the arguments. I’m really glad Peter convinced me to work on it with him.
In Sections 1, 2 & 3, we tried to set up consequentialism and the arguments for why this framework fits any agent that generalizes in certain ways.
There are relatively few posts that try to explain why inner alignment problems are likely, rather than just possible. I think one good way to view our argument in Section 4 is as a generalization of Carlsmith’s counting argument for scheming, except with somewhat less reliance on intentional scheming and more focus on the biology-like messiness inside of a trained agent.
When we wrote this we were hoping to create a reasonably comprehensive summary of the entire end-to-end argument for why we believe AI ruin is likely. I don’t think it was a total failure, and I’m still fairly happy with it. It doesn’t seem to have led to these beliefs becoming much more widespread, though. Most alignment research being done today still seems motivated by threat models that misunderstand the main difficulties, from my perspective.
The key thing that I think is usually missing: AGI should be thought of as a dynamic system that learns and grows. The pathway of growth depends on the details of the cognitive algorithms, and those details are underspecified by training. Working through detailed examples of each thing that can be underspecified is a good way to intuitively grasp just how large a problem this is.
Here are some ways I’d write this post differently now:
I’d factor out Sections 1, 2 & 3 into a separate post about consequentialism and online learning. These sections are too long and get in the way of the more interesting parts. They are important prerequisites, but it’s more important that people actually get to Section 4 and understand it. I think we were trying too hard to convince people who don’t believe that trained AIs have goals, or who don’t believe that having goals is necessary for intelligent behavior. Seeing how people reacted to this material would have been helpful for writing the rest of it.
I also usually frame it differently now. I focus on the necessity of true-beliefs-that-correspond-to-reality for certain kinds of tasks, rather than the necessity of certain types of goals. Ultimately it’s the same thing; it’s just easier to overinterpret what I’m saying when it’s framed in terms of goals.
We probably should have done Section 5 (control schemes) as a separate post as well. I’ve really come to believe in factoring out arguments whenever possible. This saves reader time, but it can also make the dependencies of each belief and argument clearer, and this section was largely unrelated to the others. Control seems to have become a popular area of research, but it still looks largely like a waste of time to me, and I don’t know whether or where we went wrong in this argument.
I think Section 6 was probably unnecessary, or at least should have been factored out.
Almost all of it should have been cut down a lot. I think I could now cut at least half of the words while increasing clarity, but I didn’t know how at the time.
I agree with plex that the title is crazy. I can never remember it. Maybe now I’d call it something like “An end-to-end argument for AI doom”?
Do it! Write a new “version 2” post / post-series! It’s OK if there’s self-plagiarism. Would be time well spent.
Agree! And try for the writing style where anything that less than 80% of your readers are going to want to read goes in a footnote, to make the mainline readthrough as streamlined as possible. I think this could easily become the best explainer of full doom around.
If you write a condensed and better named version of this, Lens Academy will use it in the flagship course. p(>0.95)