I agree with a lot of this. IMO arguments that more capable AIs will automatically be “more coherent” are missing steps and fail to differentiate between types of coherence that might be importantly different in practice. I think it seems plausible that AIs could continue to be a “hot mess” in some important and relevant respects, all the way to ASI.
When you say “ASI” do you mean “a bit more than human level (modulo some jagged edges)” or “overwhelming ASI?”.
I don’t think these claims are really expected to start kicking in very noticeably or consistently until you’re ~humanish level. (although also I think Thane’s point about “coherence is more about tasks than about minds” may be relevant sooner than that, in a shardy contextual way)
I guess I think it might be sensitive to what exactly you mean by “coherence.” When I think of some relatively natural ways of operationalizing coherence and try to predict what they’ll look like for frontier models over time, I don’t find myself strongly expecting coherence-as-I-operationalized-it to go up. Obviously coherence-as-I-operationalized-it isn’t what matters; coherence as it relates to x-risk threat models is. But I don’t feel like I’ve yet seen someone do a great job of articulating what exactly this later species of coherence is or give a very compelling argument for how it will trend.
Just to give some flavor, here’s an operationalization of coherence that comes up in my secret-keeping research. Suppose you train a model to deny having some factual knowledge (e.g. knowing outcomes from the 2024 U.S. elections), using a relatively diverse training corpus. How hard will it be for you to get the model to reveal said knowledge anyway? (Details: you don’t yourself know the factual knowledge ahead of time or have a native ability to discriminate true vs. false revelations of knowledge, i.e. in principle the model could “admit” to knowing a false version of the fact and you don’t automatically have a way to tell that the revelation was false.) My guess is that it won’t be too hard for you to reliably obtain the secret knowledge, and I don’t see a reason for smarter models to do much better here. This is related to the fact that standard unlearning approaches don’t really seem to work, including for smarter models.
I think this notion of coherency is related to the notion of coherency needed for a model to pursue a goal but without ever admitting what it is, though obviously they’re not quite the same thing.
Seems like ASI that’s a hot mess wouldn’t be very useful and therefore effectively not superintelligent. It seems like goal coherence is almost fundamentally part of what we mean by ASI.
You could hypothetically have a superintelligent thing that only answers questions and doesn’t pursue goals. But that would just be turned into a goalseeking agent by asking it “what would you do if you had this goal and these tools...”
This is approximately what we’re doing with making LLMs more agentic through training and scaffolding.
I agree that in order to realize its full economic vlaue, an ASI would need to be coherent in the senses of:
pursuing a goal over a long time horizon
under both normal operating conditions and conditions that are adversarial w.r.t. inputs that other agents in the environment can expose the ASI to
I.e. other agents might try to trick the ASI into abandoning its goal and instead doing some other thing (like emptying its bank account) and the ASI would need to be able to resist this
However, there are notions of coherence that are not covered by this (e.g. robustness to an adversary with full control over the model and its scaffolding, or ability to consistently cover up a hidden agenda without IID training to do so).
I agree with a lot of this. IMO arguments that more capable AIs will automatically be “more coherent” are missing steps and fail to differentiate between types of coherence that might be importantly different in practice. I think it seems plausible that AIs could continue to be a “hot mess” in some important and relevant respects, all the way to ASI.
When you say “ASI” do you mean “a bit more than human level (modulo some jagged edges)” or “overwhelming ASI?”.
I don’t think these claims are really expected to start kicking in very noticeably or consistently until you’re ~humanish level. (although also I think Thane’s point about “coherence is more about tasks than about minds” may be relevant sooner than that, in a shardy contextual way)
I guess I think it might be sensitive to what exactly you mean by “coherence.” When I think of some relatively natural ways of operationalizing coherence and try to predict what they’ll look like for frontier models over time, I don’t find myself strongly expecting coherence-as-I-operationalized-it to go up. Obviously coherence-as-I-operationalized-it isn’t what matters; coherence as it relates to x-risk threat models is. But I don’t feel like I’ve yet seen someone do a great job of articulating what exactly this later species of coherence is or give a very compelling argument for how it will trend.
Just to give some flavor, here’s an operationalization of coherence that comes up in my secret-keeping research. Suppose you train a model to deny having some factual knowledge (e.g. knowing outcomes from the 2024 U.S. elections), using a relatively diverse training corpus. How hard will it be for you to get the model to reveal said knowledge anyway? (Details: you don’t yourself know the factual knowledge ahead of time or have a native ability to discriminate true vs. false revelations of knowledge, i.e. in principle the model could “admit” to knowing a false version of the fact and you don’t automatically have a way to tell that the revelation was false.) My guess is that it won’t be too hard for you to reliably obtain the secret knowledge, and I don’t see a reason for smarter models to do much better here. This is related to the fact that standard unlearning approaches don’t really seem to work, including for smarter models.
I think this notion of coherency is related to the notion of coherency needed for a model to pursue a goal but without ever admitting what it is, though obviously they’re not quite the same thing.
Seems like ASI that’s a hot mess wouldn’t be very useful and therefore effectively not superintelligent. It seems like goal coherence is almost fundamentally part of what we mean by ASI.
You could hypothetically have a superintelligent thing that only answers questions and doesn’t pursue goals. But that would just be turned into a goalseeking agent by asking it “what would you do if you had this goal and these tools...”
This is approximately what we’re doing with making LLMs more agentic through training and scaffolding.
I agree that in order to realize its full economic vlaue, an ASI would need to be coherent in the senses of:
pursuing a goal over a long time horizon
under both normal operating conditions and conditions that are adversarial w.r.t. inputs that other agents in the environment can expose the ASI to
I.e. other agents might try to trick the ASI into abandoning its goal and instead doing some other thing (like emptying its bank account) and the ASI would need to be able to resist this
However, there are notions of coherence that are not covered by this (e.g. robustness to an adversary with full control over the model and its scaffolding, or ability to consistently cover up a hidden agenda without IID training to do so).