I hope to find time to give a more thorough reply later; what I say below is hasty and may contain errors.
(1) Define general competence factor as intelligence*coherence.
Take all the classic arguments about AI risk and ctrl-f “intelligence” and then replace it with “general competence factor.”
The arguments now survive your objection, I think.
When we select for powerful AGIs, when we train them to do stuff for us, we are generally speaking also training them to be coherent. It's more accurate to say we are training them to have a high general competence factor than to say we are training them to be intelligent-but-not-necessarily-coherent. The ones that aren't so coherent will struggle to take over the world, yes, but they will also struggle to compete in the marketplace (and possibly even the training environment) with the ones that are.
(2) I’m a bit annoyed by the various bits of this post/paper that straw man the AI safety community, e.g. saying that X is commonly assumed when in fact there are pages and pages of argumentation supporting X, which you can easily find with a google, and lots more pages of argumentation on both sides of the issue as to whether X.
Relatedly… I just flat-out reject the premise that most work on AI risk assumes that AI will be less of a hot mess than humans. I for one am planning for a world where AIs are about as much of a hot mess as humans, at least at first. I think it’ll be a great achievement (relative to our current trajectory) if we can successfully leverage hot-mess AIs to end the acute risk period.
(3) That said, I’m intellectually curious/excited to discuss these results and arguments with you, and grateful that you did this research & posted it here. :) Onwards to solving these problems collaboratively!
Seems like the concept of "coherence" used here is inclined to treat simple stimulus-response behavior as highly coherent; e.g., the author puts a thermostat in the supercoherent unintelligent corner of one of his graphs.
But stimulus-response behavior, like a blue-minimizing robot, only looks like coherent goal pursuit in a narrow set of contexts. The relationship between its behavioral patterns and its progress towards goals is context-dependent, and will go off the rails if you take it out of the narrow set of contexts where it fits. That’s not “a hot mess of self-undermining behavior”, so it’s not the lack-of-coherence that this question was designed to get at.
Here's a hypothesis for how the inverse correlation could arise from your observations: when we evaluate a thing's coherence, we sample behaviours in environments we expect to find the thing in. More intelligent things operate in a wider variety of environments, and the environmental diversity leads to behavioural diversity that we attribute to a lack of coherence.
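Here is a minimal toy sketch of that mechanism (my own construction, with entirely made-up environment counts and a hypothetical scoring rule, not anything measured): if a rater scores coherence by how little behavioural variety they see across the environments they sample, then entities spread across more environments come out less coherent.

```python
# Toy sketch of the environment-sampling hypothesis above.
# Assumption: a rater scores "coherence" as (1 - behavioural variety) across a
# handful of sampled environments; the environment counts per entity are made up.
import numpy as np

rng = np.random.default_rng(0)

def perceived_coherence(n_environments: int, n_samples: int = 30) -> float:
    """Each sampled environment elicits its own behaviour; more environments
    means more distinct behaviours among the samples, read as less coherence."""
    behaviours = rng.integers(0, n_environments, size=n_samples)
    variety = len(np.unique(behaviours)) / n_samples
    return 1.0 - variety

for label, n_env in [("thermostat", 1), ("ant", 5), ("human", 50)]:
    print(f"{label:>10}: perceived coherence ~ {perceived_coherence(n_env):.2f}")
```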
I don't think it's "wider variety of environment" per se. An amoeba and a human operate in the same environments: everywhere there are humans, there are amoebas. They are just a very common sort of microscopic life; you probably have some on or even in you as commensals right now. And amoebas are in many environments that few or no humans are, like at the bottom of ponds and stuff like that.

Similarly, leaving aside the absurdity of asking people to compare how consistent 'ants' vs 'sloths' are as agents, I'd say ants inhabit much more diverse environments than sloths, whether you consider ants collectively as they colonize the world from pole to pole or any individual ant going from the ant hill's nursery to the outside to forage and battle, while sloths exist in a handful of biomes and an individual sloth moves from branch to branch, and that's about it. (And then there are the social complications: you know about ant warfare, of course, and the raids and division of territory and enslavement etc, but did you know some ants can hold "tournaments" (excerpts), where two hills bloodlessly compete, and the loser allows itself to be killed and its survivors enslaved by the winner? I am not aware of anything sloths do which is nearly as complicated.)

Or take thermostats: thermostats are everywhere from deep-space probes past the orbit of Pluto to deep-sea vents; certainly a wider range of 'environments' than humans. And while it is even more absurd to ask such a question about "ResNet-18" vs "CLIP"—surely the question of what a ResNet-18 wants and how coherent it is as an agent is well above the pay-grade of everyone surveyed—these are the sorts of image models which are deployed everywhere by everyone to classify/embed everything, and the small fast old ResNet-18 is probably deployed in many more environments than CLIP models are, particularly in embedded computers. So by any reasonable definition of 'environment', the 'small' things here are in as many or more as the 'large' ones.
So much for 'it's the environment'. It is the behavioral diversity here. And of course, a human can exhibit much more behavioral complexity than an amoeba, but the justification of our behavioral complexity is not complexity for the sake of complexity. Complexity is a bad thing. An amoeba does not find complexity cost-effective. We find complexity cost-effective because it buys us power. A human has far greater power over its environment than an amoeba does over that exact same environment.
So let’s go back to OP with that in mind. Why should we care about these absurd comparisons, according to Sohl-Dickstein?
If the AI is powerful enough, and pursues its objectives inflexibly enough, then even a subtle misalignment might pose an existential risk to humanity. For instance, if an AI is tasked by the owner of a paperclip company to maximize paperclip production, and it is powerful enough, it will decide that the path to maximum paperclips involves overthrowing human governments, and paving the Earth in robotic paperclip factories.
There is an assumption behind this misalignment fear, which is that a superintelligent AI will also be supercoherent in its behavior^1. An AI could be misaligned because it narrowly pursues the wrong goal (supercoherence). An AI could also be misaligned because it acts in ways that don’t pursue any consistent goal (incoherence). Humans — apparently the smartest creatures on the planet — are often incoherent. We are a hot mess of inconsistent, self-undermining, irrational behavior, with objectives that change over time. Most work on AGI misalignment risk assumes that, unlike us, smart AI will not be a hot mess.
In this post, I experimentally probe the relationship between intelligence and coherence in animals, people, human organizations, and machine learning models. The results suggest that as entities become smarter, they tend to become less, rather than more, coherent. This suggests that superhuman pursuit of a misaligned goal is not a likely outcome of creating AGI.
So the argument here is something like:
1. an agent can be harmed by another, more powerful agent only if that more powerful agent is also more 'coherent in planning' than the first agent
2. agents which are more powerful are less coherent (as proven by the following empirical data etc.)
3. therefore, weaker agents can't be harmed by more powerful agents
4. humans are weaker agents than superintelligent AI, etc. etc., so humans can't be harmed
Put like this, Sohl-Dickstein’s argument is obviously wrong.
It is entirely unnecessary to get distracted arguing about #2 or asking whether there is too much rater error to bother with or calculating confidence intervals when his argument is so basically flawed to begin with. (You might say it is like trying to disprove AI risk by calculating exactly how hard quantum uncertainty makes predicting pinball. It is a precise answer to the wrong question.)
Leaving aside whether #2 is true, #1 is obviously false and proves too much: it is neither necessary nor sufficient to be more coherent in order to cause bad things to happen.
Consider the following parodic version of OP:
If the “human being” is powerful enough, and pursues its objectives inflexibly enough, then even a subtle misalignment might pose an existential risk to amoeba-kind. For instance, if a human is tasked by the owner of a food factory to maximize food production, and it is powerful enough, it will decide that the path to maximum hygienic food output involves mass genocide of amoeba-kind using bleach and soap, and paving the Earth in fields of crops.
There is an assumption behind this misalignment fear, which is that a superintelligent non-amoeba will also be supercoherent in its behavior^1. A human could be misaligned because it narrowly pursues the wrong goal (supercoherence). A human could also be misaligned because it acts in ways that don't pursue any consistent goal (incoherence). Amoeba—apparently the smartest creatures on the planet that we amoeba know of—are often incoherent. We are a hot mess of inconsistent, self-undermining, irrational behavior, with objectives that change over time. Most work on human misalignment risk assumes that, unlike us, smart humans will not be a hot mess.
In this post, I experimentally probe the relationship between intelligence and coherence in microbial life, humans, human organizations, and machine learning models. The results suggest that as entities become smarter, they tend to become less, rather than more, coherent. This suggests that superamoeba pursuit of a misaligned goal is not a likely outcome of creating humanity.
Where, exactly, does this version go wrong? If ‘hot mess’ does not disprove this amoeba:human version, why does it disprove human:AGI in the original version?
Well, obviously, more powerful agents do harm or destroy less powerful agents all the time, even if those less powerful agents are more coherent or consistent. They do it by intent, they do it by accident, they do it in a myriad of ways, often completely unwittingly. You could be millions of times less coherent than an amoeba, in some sort of objective sense like 'regret' against your respective reward functions*, and yet, you are far more than millions of times more powerful than an amoeba is: you destroy more-coherent amoebas all the time by basic hygiene or by dumping some bleach down the drain or spilling some cleaning chemicals on the floor; and their greater coherence boots them nothing. I've accidentally damaged and destroyed thermostats that were perfectly coherent in their temperature behavior; it did them no good. 'AlphaGo' may be less coherent than a 'linear CIFAR-10 classifier', but nevertheless it still destroys weaker agents in a zero-sum war (Go). The US Congress may be ranked near the bottom for coherence, and yet, it still passes the laws that bind (and often harm) Sohl-Dickstein and myself and presumably most of the readers of this argument. And so on.
The power imbalances are simply so extreme that even some degradation in ‘coherence’ is still consistent with enormous negative consequences, both as deliberate plans, and also as sheer accidents or side-effects.
(This reminds me a little of some mistaken arguments about central limit theorems. It is true that if you add up a few variables of a given size, the randomness around the mean will be much larger, as a percentage, than if you added up a lot of them; but what people tend to forget is that the new randomness will be much larger absolutely. So if you are trying to minimize an extreme outcome, like an insurance company trying to avoid any loss of $X, you cannot do so by simply insuring more contracts, because your worst cases keep getting bigger even as they become less likely. The analogy here is that the more powerful the agents around are, the more destructive their outliers become, even if they have some neutralizing trend going on as well.)
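To make the insurance point concrete, here is a quick simulation (my own toy numbers, lognormal losses purely for illustration): pooling more contracts shrinks the relative spread of the total loss while the absolute worst cases keep growing.

```python
# Toy illustration of the central-limit point above (made-up numbers):
# pooling n independent risks shrinks the *relative* spread of the total loss
# (~1/sqrt(n)) while the *absolute* spread and worst cases keep growing (~sqrt(n)).
import numpy as np

rng = np.random.default_rng(0)

def total_losses(n_contracts: int, n_trials: int = 20_000) -> np.ndarray:
    """Total loss across n_contracts independent lognormally distributed contracts."""
    losses = rng.lognormal(mean=0.0, sigma=1.0, size=(n_trials, n_contracts))
    return losses.sum(axis=1)

for n in [10, 100, 1_000]:
    totals = total_losses(n)
    rel_spread = totals.std() / totals.mean()   # shrinks roughly as 1/sqrt(n)
    worst_case = np.quantile(totals, 0.999)     # but the bad tail keeps growing
    print(f"n={n:>5}  relative spread={rel_spread:.3f}  99.9th-percentile total loss={worst_case:,.0f}")
```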
* Assuming you are willing to naively equate utilities of eg amoebas and humans, it would then not be surprising if humans incurred vastly more total regret over a lifetime, because they have much longer lifetimes with much greater action-spaces, and most RL algorithms have regret bounds that scale with # of actions & timescale. Which seems intuitive. If you have very few choices and your choices make little difference long-term, in part because there isn’t a long-term for you (amoebas live days to years, not decades), the optimal policy can’t improve much on whatever you do.
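(A standard concrete instance of the kind of bound being gestured at here, quoting the usual stochastic-bandit result from memory rather than anything in the post: with K actions and horizon T, UCB-style algorithms have worst-case expected regret on the order of the following, so the bound indeed grows with both the size of the action space and the timescale.)

```latex
% Worst-case expected regret of UCB-type algorithms on a K-armed stochastic
% bandit over T rounds (standard result; constants omitted):
R_T \;=\; \mathbb{E}\!\left[\sum_{t=1}^{T}\bigl(\mu^{*}-\mu_{a_t}\bigr)\right]
\;=\; O\!\left(\sqrt{K\,T\,\log T}\right)
```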
To make any argument from lower coherence, you need to make it work with a much weaker version of #1, like ‘an agent is less likely to be harmed by a more powerful agent than that power differential implies if the more powerful agent is also less coherent’.
But this is a much harder argument: how much less? How fast does incoherence need to increase in order to more than offset the corresponding increase in power? How do you quantify that at all?
How do you know that this incoherence effect is not already incorporated into any existing estimate of 'power'? (In fact, how would you estimate 'power' in any meaningful sense which doesn't already incorporate an agent's difficulty in being coherent? Do you have to invent some computable 'supercoherent version' of an agent to run? If you can do that, wouldn't agents be highly incentivized to invent that supercoherent version of themselves to replace themselves with?)
You can retreat to, ‘well, the lower coherence means that any increased power will be applied less effectively and so the risk is at least epsilon lower than claimed’, but you can’t even show that this isn’t already absorbed into existing fudge factors and measurements and extrapolations and arguments etc. So you wind up claiming basically nothing at all.
Without thinking about it too much, this fits my intuitive sense. An amoeba can’t possibly demonstrate a high level of incoherence because it simply can’t do a lot of things, and whatever it does would have to be very much in line with its goal (?) of survival and reproduction.
A hypothesis for the negative correlation:

More intelligent agents have a larger set of possible courses of action that they're potentially capable of evaluating and carrying out. But picking an option from a larger set is harder than picking an option from a smaller set. So max performance grows faster than typical performance as intelligence increases, and errors look more like 'disarray' than like 'just not being capable of that'. e.g. Compare a human who left the window open while running the heater on a cold day, with a thermostat that left the window open while running the heater.
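A quick toy simulation of that first hypothesis (my own construction, with standard-normal option values and a made-up estimation-noise level): the best available option improves as the option set grows, but an agent choosing from noisy value estimates falls further behind it, which an observer could read as disarray rather than incapability.

```python
# Toy sketch of "max performance grows faster than typical performance":
# option values are standard normal; the agent picks the option whose *noisy*
# value estimate is highest. The gap between best and picked grows with set size.
import numpy as np

rng = np.random.default_rng(0)

def best_vs_picked(n_options: int, est_noise: float = 1.0, n_trials: int = 20_000):
    true_values = rng.normal(size=(n_trials, n_options))
    estimates = true_values + rng.normal(scale=est_noise, size=true_values.shape)
    picked = true_values[np.arange(n_trials), estimates.argmax(axis=1)]
    best = true_values.max(axis=1)
    return best.mean(), picked.mean(), (best - picked).mean()

for n in [2, 8, 32, 128, 512]:
    best, picked, gap = best_vs_picked(n)
    print(f"{n:>4} options: best={best:5.2f}  picked={picked:5.2f}  gap={gap:5.2f}")
```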
A Second Hypothesis: Higher intelligence often involves increasing generality—having a larger set of goals, operating across a wider range of environments. But that increased generality makes the agent less predictable by an observer who is modeling the agent as using means-ends reasoning, because the agent is not just relying on a small number of means-ends paths in the way that a narrower agent would. This makes the agent seem less coherent in a sense, but that is not because the agent is less goal-directed (indeed, it might be more goal-directed and less of a stimulus-response machine).
These seem very relevant for comparing very different agents: comparisons across classes, or of different species, or perhaps for comparing different AI models. Less clear that they would apply for comparing different humans, or different organizations.
(Crossposting some of my twitter comments).

I liked this criticism of alignment approaches: it makes a concrete claim that addresses the crux of the matter, and provides supporting evidence! I also disagree with it, and will say some things about why.
I think that instead of thinking in terms of “coherence” vs. “hot mess”, it is more fruitful to think about “how much influence is this system exerting on its environment?”. Too much influence will kill humans, if directed at an outcome we’re not able to choose. (The rest of my comments are all variations on this basic theme).
We humans may be a hot mess, but we’re far better at influencing (optimizing) our environment than any other animal or ML system. Example: we build helicopters and roads, which are very unlikely to arise by accident in a world without people trying to build helicopters or roads. If a system is good enough at achieving outcomes, it is dangerous whether or not it is a “hot mess”.
It’s much easier for us to describe simple behaviors as utility maximization; for example a ball rolling down a hill is well-described as minimizing its potential energy. So it’s natural that people will rate a dumb / simple system as being more easily described by a utility function than a smart system with complex behaviors. This does not make the smart system any less dangerous.
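(One textbook way to make the ball example precise, in the overdamped limit where inertia is ignored, is as gradient flow on the potential energy, so the "utility" being maximized is simply the negative potential:)

```latex
% Ball on a hill as trivially "coherent" optimization (overdamped limit):
% the trajectory monotonically decreases U, i.e. maximizes -U.
U(x) \;=\; m\,g\,h(x), \qquad \dot{x} \;=\; -\eta\,\nabla U(x)
```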
Misalignment risk is not about expecting a system to "inflexibly" or "monomaniacally" pursue a simple objective. It's about expecting systems to pursue objectives at all. The objectives don't need to be simple or easy to understand.
Intelligence isn’t the right measure to have on the X-axis—it evokes a math professor in an ivory tower, removed from the goings-on in the real world. A better word might be capability: “how good is this entity at going out into the world and getting more of what it wants?”
In practice, AI labs are working on improving capability, rather than intelligence defined abstractly in a way that does not connect to capability. And capability is about achieving objectives.
If we build something more capable than humans in a certain domain, we should expect it to be “coherent” in the sense that it will not make any mistakes that a smart human wouldn’t have made. Caveat: it might make more of a particular kind of mistake, and make up for it by being better at other things. This happens with current systems, and IMO plausibly we’ll see something similar even in the kind of system I’d call AGI. But at some point the capabilities of AI systems will be general enough that they will stop making mistakes that are exploitable by humans. This includes mistakes like “fail to notice that your programmer could shut you down, and that would stop you from achieving any of your objectives”.
The organizations one seems like an obvious collider—you got the list by selecting for something like “notability,” which is contributed to by both intelligence and coherence, and so on the sample it makes sense they’re anticorrelated.
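A minimal simulation of that collider story (independent traits, a made-up "notability" score, arbitrary cutoff): conditioning on notability manufactures a negative sample correlation out of nothing.

```python
# Toy collider / Berkson's-paradox sketch: intelligence and coherence are
# independent in the population, but selecting on "notability" (which both
# contribute to) induces a negative correlation within the selected sample.
import numpy as np

rng = np.random.default_rng(0)

n = 100_000
intelligence = rng.normal(size=n)
coherence = rng.normal(size=n)
notability = intelligence + coherence + rng.normal(scale=0.5, size=n)

notable = notability > np.quantile(notability, 0.99)  # only "notable" entities make the list

print("population correlation:   ", round(np.corrcoef(intelligence, coherence)[0, 1], 3))
print("within-sample correlation:", round(np.corrcoef(intelligence[notable], coherence[notable])[0, 1], 3))
```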
But I think the rankings for animals/plants aren't like that. Instead, it really seems to trade on what people mean by "coherence"—here I agree with Unnamed, it seems like "coherence" is getting interpreted as "simplicity of models that work pretty well to describe the thing," even if those models don't look like utility maximizers. Put an oak tree in a box with a lever that dispenses water, and it won't pull the lever when it's thirsty, but because the overall model that describes an oak tree is simpler than the model that describes a rat, it feels "coherent." This is a fine way to use the word, but it's not quite what's relevant to arguments about AI.
Put an oak tree in a box with a lever that dispenses water, and it won’t pull the lever when it’s thirsty
I actually thought this was a super interesting question, just for general world modelling. The tree won’t pull a lever because it barely has the capability to do so and no prior that it might work, but it could, like, control a water dispenser via sap distribution to a particular branch. In that case will the tree learn to use it?
Ended up finding an article on attempts to show learned behavioural responses to stimuli in plants at On the Conditioning of Plants: A Review of Experimental Evidence—turns out there have been some positive results but they seem not to have replicated, as well as lots of negative results, so my guess is that no, even if they are given direct control, the tree won’t control its own water supply. More generally this would agree that plants lack the information processing systems to coherently use their tools.
Experiments are mostly done with M. pudica because it shows (fairly) rapid movement to close up its leaves when shaken.
Huh, really neat.

Not sure I would agree about a single ant being coherent. Aren't ants super dependent on their colonies for reasonable behavior? Like they use pheromone trails to find food and so on.
Also AFAIK a misplaced pheromone trail can lead an ant to walk in circles, which is the archetypal example of incoherence. But I don’t know how much that happens in practice.
This is a cool result—I think it’s really not obvious why intelligence and “coherence” seem inversely correlated, but it’s interesting that you replicated it across three different classes of things (ML models, animals, organisations).
I think it’s misleading to describe this as finding that intelligence and coherence are actually inversely correlated. Rather, survey respondents’ ratings of intelligence and ratings of coherence were inversely correlated.
Sure, and that's why I said "coherence" instead of coherence.

Epistemic status: clumsy

An AI could also be misaligned because it acts in ways that don't pursue any consistent goal (incoherence).

It's worth noting that this definition of incoherence seems inconsistent with VNM. E.g., a rock might satisfy the folk definition of "pursuing a consistent goal," but fail to satisfy VNM due to lacking completeness (and by corollary due to not performing expected utility optimization over the outcome space).
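(For reference, a minimal statement of the completeness axiom being appealed to here, in standard notation rather than anything from the post:)

```latex
% VNM completeness: every pair of lotteries A, B must be comparable,
% i.e. the agent holds at least one of the two weak preferences.
\forall A, B:\quad A \succeq B \;\lor\; B \succeq A
```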
Strong upvoted.

The result is surprising and raises interesting questions about the nature of coherence. Even if this turns out to be a fluke, I predict that it'd be an informative one.
I found it hard to engage with this, partially because of motivated reasoning and thinking that my prior beliefs, which expect very intelligent and coherent AGIs, are correct. Overcoming this bias is hard, and I sometimes benefit from clean presentation of solid experimental results, which this study lacked, making it extra hard for me to engage with. Below are some of my messy thoughts, with the huge caveat that I have yet to engage with these ideas from as neutral a prior as I would like.
This is an interesting study to conduct. I don’t think its results, regardless of what they are, should update anybody much because:
the study asks a small set (n = 5-6) of ~random people to rank various entities based on vague definitions of intelligence and coherence; we shouldn't expect a process like this to provide strong evidence for any conclusion
Asking people to rate items on coherence and intelligence feels pretty different from carefully thinking about the properties of each item. I would rather see the author pick a few items from around the spectrums and analyze each in depth (though if that's all they did I would be complaining about a lack of more items lol)
A priori I don't necessarily expect a strong intelligence-coherence correlation for lower levels of intelligence, and I don't think findings of this type are very useful for thinking about super-intelligence. Convergent instrumental subgoals are not a thing for sufficiently unintelligent agents, and I expect coherence as such a subgoal to kick in at fairly high intelligence levels (definitely above average human, at least based on observing many humans be totally incoherent, which isn't exactly a priori reasoning). I dislike that this is the state of my belief, because it's pretty much like "no experiment you can run right now would get at the crux of my beliefs," but I do think it's the case here that we can only learn so much from observing non-superintelligent systems.
The terms, especially “coherence”, seem pretty poorly defined in a way that really hurts the usefulness of the study, as some commenters have pointed out
My takes about the results
Yep, the main effect seems to be a negative relationship between intelligence and coherence
The poor inter-rater reliability for coherence seems like a big deal
Figure 6 (most important image) seems really whacky. It seems to imply that all the AI models are ranked on average lower in intelligence than everything else — including oak tree and ant. This just seems slightly wild because I think most people who interact with AI would disagree with such rankings. Out of 5 respondents, only 2 ranked GPT-3 as more intelligent than an oak tree.
Isolating just the humans (a plot for this is not shown for some reason) seems like it doesn't support the author's hypothesis very much. I think this is in line with some prediction like "for low levels of intelligence there is not a strong relationship to coherence, and then as you get into the high human level this changes"
Other thoughts
The spreadsheet sheets are labeled “alignment subject response” and “intelligence subject response”. Alignment is definitely not the same as coherence. I mostly trust that the author isn’t pulling a fast one on us by fudging the original purpose of the study or the directions they gave to participants, but my prior trust for people on the internet is somewhat low and the experimental design here does seem pretty janky.
Figures 3-5, as far as I can tell, are only using each item's rank relative to the other items in the same figure, as opposed to its rank among all 60 items. This threw me off for a bit and I think might not be a great analysis, given that participants ranked all 60 together rather than in category batches.
Something something inner optimizers and inner agent conflicts might imply a lack of coherence in superintelligent systems, but systems like this still seem quite dangerous.
I don't think that a measurement of "coherence" which implies that an ant is more coherent than AlphaGo is valuable in this context.

However, I think that pointing out the assumption about the relationship between intelligence and coherence is.

Very interesting.

In favor:

1) The currently leading models (LLMs) are ultimate hot messes;
2) The whole point of G in AGI is that it can do many things; focusing on a single goal is possible, but is not a “natural mode” for general intelligence.
Against:
A superintelligent system will probably have enough capacity overhang to create multiple threads which would look to us like supercoherent superintelligent threads, so even a single system is likely to lead to multiple “virtual supercoherent superintelligent AIs” among other less coherent and more exploratory behaviors it would also perform.
But it’s a good argument against a supercoherent superintelligent singleton (even a single system which does have supercoherent superintelligent subthreads is likely to have a variety of those).
I think this is taking aim at Yudkowskian arguments that are not cruxy for AI takeover risk as I see it. The second species doesn’t need to be supercoherent in order to kill us or put us in a box; human levels of coherence will do fine for that.