I agree that the claims the Anthropic researchers are making here are kind of wacky, but there is a related / not-exactly-steelman argument that has been floating around LW for a while: many old-school AI alignment people assume that transformer models will necessarily get more coherent as they get smarter (and larger), when (according to the arguers) that assumption has been neither fully justified nor empirically borne out so far.
I recall @nostalgebraist’s comment here, which was highly upvoted at the time, as an example of this line of discussion.
So a generous / benign interpretation of the “Hot mess” work is that it is an attempt to empirically investigate this argument and the questions that nostalgebraist and others have posed.
Personally, I continue to think that most of these discussions are kind of missing the point of the original arguments and assumptions they’re questioning. The actual argument, that coherence and agency are deeply and closely tied to the ability to usefully and generally plan, execute, and adapt in a sample-efficient way, doesn’t depend on what’s happening in any particular existing AI system or assume anything about how such systems will work. These properties and abilities might emerge directly in transformer models as they get larger; they might emerge from putting the model in the right kind of harness / embodiment, or from some advancement in a post-training process deliberately designed to shape models for coherence; or they might emerge in some totally different architecture / paradigm. But exactly how and when that happens isn’t a crux for any of my own beliefs or worldview.
Put another way, “a country of geniuses in a datacenter” had better be pretty good at working together and pursuing complex, long time-horizon goals coherently if they want to actually get anything useful done! Whether and how the citizens of that country contain large transformer models as a key component is maybe an interesting question from a timelines / forecasting perspective or if you want to try building that country right away, but it doesn’t seem particularly relevant to what happens shortly afterwards if you actually succeed.