I’m not sure whether I fall into the bucket of people you’d consider this an answer to. I do think there’s something important in the region of LLMs that, by vibes if not by explicit statements of contradiction, seems incompletely propagated in the agent-y discourse, even though it fits fully within it. At minimum, I have a set of intuitions that overlap heavily with those of some of the people you’re trying to answer.
In case it’s informative, here’s how I’d respond to this:
> Well, I claim that these are more-or-less the same fact. It’s no surprise that the AI falls down on various long-horizon tasks and that it doesn’t seem all that well-modeled as having “wants/desires”; these are two sides of the same coin.
Mostly agreed, with the capability-related asterisk.
> Because the way to achieve long-horizon targets in a large, unobserved, surprising world that keeps throwing wrenches into one’s plans, is probably to become a robust generalist wrench-remover that keeps stubbornly reorienting towards some particular target no matter what wrench reality throws into its plans.
Agreed in the spirit in which I think this was meant, but I’d rephrase it: a robust generalist wrench-remover that keeps stubbornly reorienting towards some particular target will tend to be better at reaching that target than a system that doesn’t.
That’s subtly different from individual systems having convergent internal reasons for taking the same path. This distinction mostly disappears in some contexts, e.g. selection in evolution, but it is meaningful in others.
> If an AI causes some particular outcome across a wide array of starting setups and despite a wide variety of obstacles, then I’ll say it “wants” that outcome “in the behaviorist sense”.
I think this frame is reasonable, and I use it.
> it’s a little hard to imagine that you don’t contain some reasonably strong optimization that strategically steers the world into particular states.
Agreed.
> that the wanting-like behavior required to pursue a particular training target X, does not need to involve the AI wanting X in particular.
Agreed.
> “AIs need to be robustly pursuing some targets to perform well on long-horizon tasks”, but it does not say that those targets have to be the ones that the AI was trained on (or asked for). Indeed, I think the actual behaviorist-goal is very unlikely to be the exact goal the programmers intended, rather than (e.g.) a tangled web of correlates.
Agreed, for a large subset of architectures. Any training involving the equivalent of extreme optimization for sparse/distant reward in a high-dimensional, complex context seems to effectively guarantee this outcome.
> So, maybe don’t make those generalized wrench-removers just yet, until we do know how to load proper targets in there.
Agreed, don’t make the runaway misaligned optimizer.
I think there remains a disagreement hiding within that last point, though. For me, the real update from LLMs is:
1. We have a means of reaching extreme levels of capability without necessarily exhibiting preferences over external world states. You can elicit such preferences, but a random output sequence from the pretrained version of GPT-N (assuming the requisite architectural similarities) has no realistic chance of being a strong optimizer with respect to world states. The model itself remains a strong optimizer, just for something that doesn’t route through the world.
2. It’s remarkably easy to elicit this form of extreme capability and point it at guiding itself. This isn’t some incidental detail; it arises from the core process that the model learned to implement.
3. That core process is learned reliably because the training process that yields it leaves no room for anything else. The objective is not a sparse/distant reward target; it is a profoundly constraining and informative one (see the sketch below).
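To gesture at what “profoundly constraining” means here, a toy back-of-the-envelope sketch; the vocabulary size and context length are hypothetical round numbers of mine, not anything from the post:

```python
import math

# Toy comparison of supervision density (illustrative numbers only).
vocab_size = 50_000          # hypothetical vocabulary size
tokens_per_sequence = 2_048  # hypothetical context length

# Next-token prediction: every position constrains the model's full output
# distribution. A maximally ignorant (uniform) predictor pays
# log2(vocab_size) bits of surprisal at each of the 2,048 positions.
bits_per_token = math.log2(vocab_size)
dense_constraints = tokens_per_sequence

# Sparse/distant reward: one scalar at the end of an equally long episode,
# with no direct attribution to any intermediate decision.
sparse_constraints = 1

print(f"next-token prediction: {dense_constraints} constraints per sequence, "
      f"~{bits_per_token:.1f} bits each")
print(f"sparse terminal reward: {sparse_constraints} scalar per episode")
```

The exact numbers don’t matter; the point is that the pretraining objective pins the learned process down at every single step, which is why I’d expect that process to be learned reliably.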
In other words, a big part of the update for me was in having a real foothold on loading the full complexity of “proper targets.”
I don’t think what we have so far constitutes a perfect and complete solution: the nice properties could be broken, paradigms could shift and blow up the golden path, it doesn’t rule out doom, and so on. But diving deeply into this has made many convergent-doom paths look dramatically less likely to Late2023!porby than they did to Mid2022!porby.
Great post! I think this captures a lot of why I’m not ultradoomy (only, er, 45%-ish doomy, at the moment), especially A and B. I think it’s at least possible that our reality is on easymode, where muddling could conceivably put an AI into close enough territory to not trigger an oops.
I’d be even less doomy if I agreed with the counterarguments in C. Unfortunately, I can’t shake the suspicion that superintelligence is the kind of ridiculously powerful lever that would magnify small oopses into the largest possible oopses.
Hypothetically, if we took a clever human’s general capacity for problem solving, stripped it of limitations like getting bored or tired, got rid of its pesky intuitions around ethics, and sped it up by a factor of 1,000… I’d be very worried about what it would be able to do. Even without greater capacity for insight or an enhanced working memory, simply thinking really fast would be a broken superpower.
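For a sense of scale, simple arithmetic on the hypothetical 1,000x figure above (nothing here is load-bearing beyond that one number):

```python
# Back-of-the-envelope arithmetic for a 1,000x serial speedup (illustrative).
speedup = 1_000

subjective_years_per_real_day = speedup / 365   # ~2.7 years of thought per day
subjective_years_per_real_year = speedup        # a millennium per calendar year

print(f"~{subjective_years_per_real_day:.1f} subjective years per real day")
print(f"~{subjective_years_per_real_year:,} subjective years per real year")
```

A calendar year at that rate holds a millennium of uninterrupted thought; that’s the sense in which mere speed looks like a broken superpower.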
Such an entity might not be able to recreate the technology of modern civilization from scratch (in both resources and knowledge) in the stone age within 30 years, primarily because of physical interaction requirements. But starting from anything like modern civilization? That would get weird fast.
In other words, the intelligence range of humans (or even the range across animals and humans) seems small compared to what is artificially possible even if we consider only speed. And it seems very likely at this point that a well-built artificial mind could have higher-quality insights too; MuZero certainly seems to, within its domain. I don’t take much comfort from the fact that observable intelligence differences haven’t always resulted in domination.