Recently, @Daniel Kokotajlo and I were talking about the probability that AIs trained using “business as usual RLHF” end up being basically aligned rather than conspiring against us and our tests.[1] One intuition pump we ended up discussing is the prospects of octopus misalignment. Overall, my view is that directly considering the case with AIs (and what various plausible scenarios would look like) is more informative than analogies like this, but analogies like this are still somewhat useful to consider.
So, what do I mean by octopus misalignment? Suppose a company breeds octopuses[2] until the point where they are as smart and capable as the best research scientists[3] at AI companies. We’ll suppose that this magically happens as fast as normal AI takeoff, so there are many generations per year. So, let’s say they currently have octopuses which can speak English and write some code but aren’t smart enough to be software engineers or automate any real jobs. (As in, they are as capable as AIs are today, roughly speaking.) And they get to the level of top research scientists in mid-2028.
Along the way, the company attempts to select them for being kind, loyal, and obedient. The company also tries to develop a process for raising the octopuses which appears to help with this and results in the octopuses following the octopus spec. The company does some red teaming and puts the octopuses in all kinds of scenarios to test their loyalty and preferences. Based on behavioral testing, this looks pretty reasonable, and the octopuses look quite good by the time they are as good as the best research scientists.
There was some evidence of misalignment and some issues due to misaligned behavior when the octopuses were dumber in 2023-2025, including things like being dishonest when put under pressure, pretending to be aligned (despite actually disliking the octopus spec) in order to steer the properties of their descendants, and goodharting/hacking our measures of intelligence. However, by mid-2026 these issues appear to be resolved based on behavioral tests.
The takeoff is reasonably fast, so the human scientists don’t have that much time to study the octopuses, but we’ll suppose the breeding is fast enough that they can get enough selection to make the octopuses appear good. The octopuses are trained until they are 30 years old (we’ll suppose we have a magic octopus time acceleration chamber to make this work), and the company can adapt this training process over time with experiments etc. The company doesn’t take octopus misalignment very seriously (after all, look, the octopuses are so nice!), but it does have some people who think about it somewhat, and it does care about robust compliance to the spec.
Next, suppose that at the point when the octopuses become as good as the very best human researchers—the research scientist level—there is a colony of 5 million of them networked together with internal Slack channels and so forth, forming a “country of geniuses in an aquatic center.” And they’ve learned to speak both English and a special octopese language that no human can understand.
What is the chance that these octopuses (at the point of research scientist level) are actively scheming against us and would seize power if they could?
My guess is around 65%. I think Daniel is substantially higher, around 90%.
My view on a similar setup for AIs (where they have opaque recurrence, basically unrestricted opaque message passing, and we see some earlier warning signs, but haven’t seriously investigated misalignment), is around 50%. I discuss my perspective more here. My probabilities would be substantially lower if: the AIs didn’t have opaque recurrence and instead had probably faithful chain-of-thought, the AIs didn’t have opaque message passing, and we seriously investigated scheming in powerful AIs with model organisms and the final evidence was either ambiguous or mostly unconcerning.
To put this back in terms of the octopus scenario, suppose the octopi didn’t communicate using octopese. That would reduce my probability down to around 50%.
More precisely: Suppose the frontier AI company which prioritizes safety the median amount scales up to AIs capable enough to fully automate AI R&D within that company. Let’s say this happens in mid-2028 and there are no large improvements in alignment technology. The company iterates to fix misalignment problems they notice along the way (including in this final AI, to the extent this AI displays evidence of serious misalignment problems by default). What’s the chance that this final AI ends up basically aligned, in the sense that it tries to follow a reasonable interpretation of the model spec in 99% of cases and isn’t notably more misaligned than a typical human employee at the company in aggregate?
Claude claims that “octopuses” is the technically correct plural rather than “octopi”. Sad.
I think AIs will have structural advantages, so I think we’ll see full automation of AI companies around the point when AIs are qualitatively as smart as a decent research scientist rather than as smart as a top research scientist.
Yep, I feel more like 90% here. (Lower numbers if the octopi don’t have octopese.) I’m curious for other people’s views.
My intuition is that these octopuses would be pretty unlikely to be scheming against us, for the following reasons:
Dogs are pretty nice, and it’s apparently easy to domesticate foxes within very few generations. This suggests that “be nice to humans” is simple for animals to learn, and in particular simpler than “be as intelligent as human AI researchers”. So, assuming the octopuses would first learn to like humans, it’s unclear to me why they would later start to dislike humans. Maybe your intuition is that they would later realize that they are being treated unfairly, which causes frustration that is not present in earlier generations?
If the octopuses speak octopese and English, I think it wouldn’t be too hard to get some octopi to translate for us. For example, we could ask one octopus to communicate some information to a second octopus using octopese. We can check if that worked by asking the second octopus to translate back to English. Now, we have a pair of (octopese text, English text) for which we know the translation. Now we ask a third octopus to translate octopese to English, and check the performance on some known examples.
It would be possible to scheme against this test if the octopi are amazingly skilled at coordination. But I suspect human AI researchers would fail the analogous test if a foreign secret service used a similar setup to get AI researchers to translate between foreignese and English, so I think we should assume that such coordination is hard and that a small individual failure rate would likely be enough for the group to fail (we need just one octopus that translates faithfully). (Note that this argument hinges on the existence of a language that is understood by different octopi, so it doesn’t translate 1:1 to AIs.)
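To make the cross-checking idea concrete, here is a minimal sketch, assuming a hypothetical `Octopus` interface with `to_octopese(english)` and `to_english(octopese)` methods, and assuming exact string comparison stands in for whatever fuzzier equivalence check you’d actually want:

```python
# Hypothetical sketch of the translation cross-check described above.
# The Octopus objects and their to_octopese / to_english methods are
# stand-ins; nothing here is a real API.

def build_known_pairs(octopus_a, octopus_b, english_messages):
    """Have octopus A encode messages into octopese and octopus B decode them.
    Keep only the pairs that round-trip, so we know their true translation."""
    pairs = []
    for message in english_messages:
        octopese = octopus_a.to_octopese(message)
        round_trip = octopus_b.to_english(octopese)
        if round_trip.strip().lower() == message.strip().lower():
            pairs.append((octopese, message))
    return pairs


def audit_translator(octopus_c, known_pairs, min_accuracy=0.95):
    """Test a third octopus on pairs whose English meaning we already know."""
    if not known_pairs:
        raise ValueError("need at least one known (octopese, English) pair")
    correct = sum(
        octopus_c.to_english(octopese).strip().lower() == english.strip().lower()
        for octopese, english in known_pairs
    )
    return correct / len(known_pairs) >= min_accuracy
```

The point of the setup is just that coordinated deception has to survive every audited translator: a single octopus that translates faithfully is enough to expose the rest.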
There is a difference though: dogs and humans have coevolved for 10,000 years. If you breed foxes, you may quickly get dog-like looks, but not dog-like behavior. But let’s assume you can do that faster too. It still makes a difference whether you breed them in isolation or while socializing with humans. You can see the difference with dogs and cats. Dogs and humans had to cooperate to succeed at hunting and herding; cats didn’t. Dogs are social. They feel social emotions such as love, loneliness, and jealousy. They help their owners when they are incapacitated (though sometimes they cheat). I think Ryan’s, Daniel’s, and Neel’s estimates might be significantly lower if they think about German Shepherd scientists.
That said, for the purposes of alignment, it’s still good news that cats (by and large) do not scheme against their owners’ wishes, and the fact that cats can be as domesticated as they are while not being cooperative or social is a huge boon for alignment purposes (within the analogy, which is arguably questionable).
I should note that I’m quite uncertain here and I can easily imagine my views swinging by large amounts.
“What is the chance that these octopuses (at the point of research scientist level) are actively scheming against us and would seize power if they could?”
And the related question would be: even if they are not “actively scheming”, what are the chances that most of the power to make decisions about the real world gets delegated to them, that organizations which don’t delegate power to octopuses get outcompeted, and that they start to value octopuses more than humans over time?
After thinking more about it, I think “we haven’t seen evidence of scheming once the octopi were very smart” is a bigger update than I was imagining, especially in the case where the octopi weren’t communicating with octopese. So, I’m now at ~20% without octopese and about 50% with it.
What was the purpose of using octopuses in this metaphor? Like, it seems you’ve piled on so many disanalogies to actual octopuses (extremely smart, many generations per year, they use Slack...) that you may as well just have said “AIs.”
EDIT: Is it gradient descent vs. evolution?
I found it helpful because it put me in the frame of an alien biological intelligence rather than an AI, because I have lots of preconceptions about AIs and it’s easy to implicitly think in terms of expected utility maximizers or tools or whatever. Whereas if I’m imagining an octopus, I’m kind of imagining humans, but a bit weirder and more alien, and I would not trust humans.
I’m not making a strong claim that this makes sense, and I think people should mostly think about the AI case directly. I think it’s just another intuition pump, and we can potentially be more concrete in the octopus case because we know the algorithm. (While in the AI case, we haven’t seen an ML algorithm that scales to human level.)
If the plural weren’t “octopuses”, it would be “octopodes”. Not everything is Latin.