hmm, like i think there’s a reasonable sense of “coherence” such that it plausibly doesn’t typically increase with capabilities. i think the survey respondents here are talking about something meaningful and i probably agree with most of their judgments about that thing. for example, with that notion of coherence, i probably agree with “Google (the company) is less coherent now than it was when it had <10 employees” (and this is so even though Google is more capable now than it was when it had 10 employees)
this “coherence” is sth like “not being a hot mess” or “making internal tradeoffs efficiently” or “being well-orchestrated”. in this sense, “incoherence” is getting at the following things (a toy sketch follows the list):
to what extent are different parts of the guy out of sync with each other (like, as a % of how well they could be in sync)?
to what extent is the guy leaving value on the table compared to using the same parts differently? are there many opportunities for beneficial small rearrangements of parts?
how many arbitrage opportunities are there between the guy’s activities/parts?
to what extent does it make sense to see all the parts/activities of the guy as working toward the same purpose?
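here’s a minimal toy sketch (in python) of the “value left on the table relative to small rearrangements” framing. everything in it (the parts, the activities, the skill numbers, and the names `relative_coherence` and `one_swap_neighbors`) is made up for illustration, not a claim about how to actually measure this:

```python
from itertools import combinations
from typing import Callable

# toy model: a "guy" is an assignment of parts to activities, plus a value
# function scoring how well the whole assignment works together
Assignment = dict[str, str]
ValueFn = Callable[[Assignment], float]


def one_swap_neighbors(assignment: Assignment) -> list[Assignment]:
    """all assignments reachable by swapping the activities of two parts
    (a 'small rearrangement')"""
    neighbors = []
    for a, b in combinations(assignment, 2):
        swapped = dict(assignment)
        swapped[a], swapped[b] = assignment[b], assignment[a]
        neighbors.append(swapped)
    return neighbors


def relative_coherence(assignment: Assignment, value: ValueFn) -> float:
    """achieved value as a fraction of the best value reachable by one small
    rearrangement; 1.0 means no value is locally left on the table, lower
    means there are arbitrage opportunities between the parts"""
    best = max([value(assignment)] + [value(n) for n in one_swap_neighbors(assignment)])
    return value(assignment) / best if best > 0 else 1.0


# made-up numbers: how good each part is at each activity
skill = {
    ("planner", "planning"): 3.0, ("planner", "execution"): 1.0,
    ("doer", "planning"): 0.5, ("doer", "execution"): 2.5,
}
value = lambda asg: sum(skill[(part, act)] for part, act in asg.items())

well_orchestrated = {"planner": "planning", "doer": "execution"}
hot_mess = {"planner": "execution", "doer": "planning"}

print(relative_coherence(well_orchestrated, value))  # 1.0
print(relative_coherence(hot_mess, value))           # ~0.27
```

note this only looks at small rearrangements: a guy could score 1.0 here while still leaving lots of value on the table under bigger reorganizations, which is part of why the “relative” vs “absolute” distinction i suggest below seems worth having.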
with this notion, i think there are many naturally-occurring cases of someone becoming more capable but less “coherent”. e.g. maybe i read a textbook and surface-level-learn some new definitions and theorems, and i can now solve the problems in the textbook, but the mathematical understanding i just gained is less integrated with the rest of my understanding than is usual for me, precisely because i’ve only surface-level-learned it (and let’s assume surface-level-learning it didn’t let me integrate other existing stuff better). like, maybe i mostly don’t see how a given theorem relates to other theorems, wouldn’t be able to easily recognize contexts in which it could be useful, wouldn’t be able to prove it, and it doesn’t yet really make intuitive sense to me that it has to be true. so now i’m better at math but in a sense less coherent. e.g. maybe i get into acrobatics but don’t integrate that interest with the rest of my life much. e.g. maybe as an infant it was easy to see me as mostly orchestrating my like 5 possible actions well toward being fed when hungry and sleeping when sleepy, but it’s less clear how to see me now as orchestrating most of my parts well toward something.[1]
now there is the following response to this:
ok, maybe, but who cares about this “coherence”. maybe there is a notion such that a nematode is more coherent than a human, who is more coherent than the first substantially smarter-than-human artificial system. but if you are a nascent orca civilization, it’s much better to find yourself next to a nematode than next to a human, and better to find yourself next to a human than next to the first substantially smarter-than-human artificial system. we’re talking about another notion of “coherence”: one that helps make sense of this
my thoughts on this response:
i agree we’re fucked even if “the first ASI is very incoherent” in this sense (on inside view, i’m at like 98%+ that creating AGI any time soon (as opposed to continuing to develop as humans) would be the greatest tragedy in history so far, and at like 80%+ that there won’t even be a minimal human future if this happens)
one can make a case for AI risk while not saying “coherence”, just talking of capabilities (and maybe values). indeed, this is a common response in the LW comments on the post i referenced. here’s me providing a case like that
if one wants to make a case for AI risk involving a different sense of “coherence”, then one might be assuming a meaning different from the most immediate meaning, so one would want to be careful when using that word. one might end up causing many people to understand why AI is scary significantly less well than they could have if one took more care with language! (eg: maybe amodei; maybe some of these people whose paper i still haven’t skimmed.) there are probably interesting things to say about AI risk involving e.g. some of the following properties an AI might have: the ability to decompose problems, the ability to ask new relevant questions, being good at coming up with clever new approaches to hard challenges, being strategic about how to do something, trying many approaches to a problem, being relentless, not getting too confused, resolving inconsistencies in one’s views, and the ability or tendency to orchestrate many actions or mental elements toward some task (eg across a lot of time). but i want to suggest that maybe it’s good to avoid the word “coherence” here given the potential for confusion, or else to establish some common standard: e.g. calling the quality of the orchestration of one’s parts compared to what is possible with small rearrangements “relative coherence”, and calling the ability to put many things together “absolute coherence”
i also think there’s plausibly some genuine mistake being made by many on LW around thinking that, as systems become more capable, they become increasingly good pursuers of some single goal. it seems sorta contrived to view humans this way. humans have projects, and a learning human tends to become better at doing any given thing, but i feel like there doesn’t need to be some grand project that a human’s various projects are increasingly contributing to or whatever. or like, i’m open to this property convergently showing up (ever? or close to our present capability level?), but i don’t think i’ve seen a good analysis of this question supporting that conclusion. imo, intuitively, opportunities for completely new projects will open up in the future and i can get interested in them with no requirement that they fit together well with my previous projects or whatever.[2][3]
if someone gives an argument against “the first AGI/ASI will be coherent” and thinks they have thereby given a good argument against AI risk, i think they’ve probably made a serious mistake. but i think it’s sort of an understandable mistake, given that LW arguments for AI risk do emphasize some sort of thing called “coherence” too, probably often with some conflation between these notions (or an imo probably false claim that they are equivalent)
i’m somewhat orchestrated toward understanding AI stuff better or getting AGI banned for a very long time or something but i’m probably leaving value massively on the table all over the place, i think in a sense much more than i was as an infant. (and also, this isn’t “my terminal goal”.)
related: https://www.lesswrong.com/posts/nkeYxjdrWBJvwbnTr/an-advent-of-thought
the closest thing to this grand optimizer claim that imo makes sense is: it is generic to have values; it is generic to have opinions on what things should be like. this seems sufficient for a basic case for AI risk, as follows: if you’re next to an anthill and you’re more capable than the ant colony, then it is generic that the ants’ thoughts about what things should be like will not matter for long. (with AI, humanity is the ant colony.)
I agree, but there’s a caveat that the notion of coherence as operationalized in the linked Sohl-Dickstein post conflates at least two (plausibly more) notions. The three questions he uses to point at the concept of “coherence” are:
How well can the entity’s behavior be explained as trying to optimize a single fixed utility function?
How well aligned is the entity’s behavior with a coherent and self-consistent set of goals?
To what degree is the entity not a hot mess of self-undermining behavior?
I expect the first two questions to connotationally/associationally evoke, in most respondents, the image of an entity/agent that wants some relatively specific and well-defined thing, and this is largely why you get the result that a thermostat is more of a “coherent agent” than Google. But then this just says that with more intelligence, you are capable of reasonably skillfully pursuing more complicated, convoluted, and [not necessarily that related to each other] goals/values, which is not surprising. Another part is that real-world intelligent agents (those capable of ~learning) do, at least to some extent, some sort of figuring out / constructing of their actual values on the fly, or change their mind about what they value.
The third question is pointing at something different: whether the entity is composed of purposive forces pushing the world in conflicting directions. Metaphorically, something like constructive vs destructive interference, or channeling energy to do useful work vs that energy dissipating as waste heat. Poetically, each purposive part has a seed of some purpose in it, and when the parts compose in the right way, there’s “superadditivity”: they add up to a big effect consistent with the purposes of those parts. “Composition preserves purpose.”
A clear human example of incoherence in this sense is someone who keeps cycling through (1) making a specific sort of commitment and then (2) deciding to abandon it, even though “they should notice” that the track record clearly indicates they’re not going to follow through on this commitment, and so should change something about how they approach the goal the commitment is instrumental for. In this example, the parts of the agent that [fail to cohere]/[push the world in antagonistic directions] are some of their constitutive agent episodes across time.
One vague picture here is that the pieces of the mind are trying to compose in a way that achieves some “big effect”.
Your example of superficially learning some area of math for algebraic crunching, without deep understanding or integration with the rest of your mathematical knowledge, is something “less bad”, which we might call “unfulfilled positive interference”. The new piece of math does not “actively discohere”, because it doesn’t screw up your prior understanding, but there may be potential for further synergy that goes unfulfilled until you integrate it.
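A minimal numerical sketch of the interference picture (purely illustrative; the vector model and the `coherence_ratio` function are assumptions of the sketch, not anything from the post): parts as pushes on the world, compared by how much of the available push survives composition.

```python
import math

# Each purposive part is modeled as a 2D "push" on the world (a made-up toy model).
Part = tuple[float, float]

def coherence_ratio(parts: list[Part]) -> float:
    """|sum of pushes| / sum of |pushes|: 1.0 when the parts push the same way
    (constructive interference), near 0 when they largely cancel (destructive)."""
    total_x = sum(x for x, _ in parts)
    total_y = sum(y for _, y in parts)
    achieved = math.hypot(total_x, total_y)
    available = sum(math.hypot(x, y) for x, y in parts)
    return achieved / available if available > 0 else 1.0

aligned      = [(1.0, 0.0), (1.0, 0.0)]   # composition preserves purpose
antagonistic = [(1.0, 0.0), (-1.0, 0.0)]  # self-undermining: the pushes cancel
unintegrated = [(1.0, 0.0), (0.0, 1.0)]   # no active discohering, but the
                                          # potential synergy goes unfulfilled
print(coherence_ratio(aligned))       # 1.0
print(coherence_ratio(antagonistic))  # 0.0
print(coherence_ratio(unintegrated))  # ~0.71
```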
To sum up, a highly coherent agent in this sense may have very convoluted values, and so Sohl-Dickstein’s “coherence question 3” diverges from “coherence questions 1 and 2”.
But then there’s a further thing. Being incoherent can be “fine” if you are sufficiently intelligent to handle it. Or maybe more to the point, your capacities suffice to control/bound/limit the damage/loss that your incoherence incurs. You have a limited amount of “optimization power” and could spend some of it on coherentizing yourself, but you figure that you’re going to gain more of what you want if you spend that optimization power on doing what you want to do with the somewhat incoherent structures that you have already (or you cohere yourself a bit, but not as much as you might).[1] E.g., you can have agents A and B, such that A is more intelligent than B, and A is less coherent than B, but the difference in intelligence is sufficient for A to just permanently disempower B. A could self-coherentize more, but doesn’t have to.
It would be interesting (I am just injecting this hypothesis into the hypothesis space without claiming I have (legible) evidence for it) if it turned out that, given some mature way of measuring intelligence and coherence, relatively small marginal gains in intelligence often offset relatively large losses in coherence in terms of something like “general capacity to effectively pursue X class of goals”.
With the caveat that the more maximizery/unbounded the values are, the more the goal-optimal allocation of optimization power shifts towards frontloading a lot of self-coherentizing as a capital investment.
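To make the shape of that hypothesis concrete (purely illustrative: the functional form, exponents, and numbers below are made up, not an empirical claim), here is a toy model in which capacity depends much more steeply on intelligence than on coherence:

```python
def capacity(intelligence: float, coherence: float) -> float:
    """Toy 'general capacity to effectively pursue goals': a made-up functional
    form in which intelligence matters much more steeply than coherence."""
    return (intelligence ** 3.0) * (coherence ** 0.5)

# Agent B: baseline intelligence, highly coherent.
b = capacity(intelligence=1.0, coherence=0.9)
# Agent A: slightly more intelligent, much less coherent.
a = capacity(intelligence=1.3, coherence=0.3)

print(f"B: {b:.2f}  A: {a:.2f}")  # B: 0.95  A: 1.20
# Under these (made-up) exponents, a small intelligence edge more than offsets
# a large coherence deficit; with maximizer-like values, A might still choose
# to frontload some self-coherentizing as a capital investment.
```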
i think you’re right that the sohl-dickstein post+survey also conflates different notions, and i might even have added more notions into the mix with my list of questions trying to get at some notion(s)[1]
a monograph untangling this coherence mess some more would be valuable. it could do the following things:
specifying a bunch of a priori different properties that could be called “coherence”
discussing which ones are equivalent, which ones are correlated, which ones seem pretty independent
giving good names to the notions or notion-clusters
discussing which kinds of coherence generically increase/decrease with capabilities, which ones probably increase/decrease with capabilities in practice, and which ones can increase or decrease with capabilities depending on the development/learning process (both around human level and later/eventually, and both in human-like minds and more generally)[2]
discussing how this relates to AI x-risk. like, which kinds of coherence should play a role in a case for AI x-risk? what does that look like? or maybe the picture should make one optimistic about some approach to de-AGI-x-risk-ing? or about AGI in general?[3]
[1] i didn’t re-read that post before writing my comment above
[2] the answers to some of these questions might depend on some partly “metaphysical” facts like whether math is genuinely infinite or whether technological maturity is a thing
[3] i think the optimistic conclusions are unlikely, but i wouldn’t want to pre-write that conclusion for the monograph, especially if i’m not writing it
Yeah.
Probably not a full-monograph-length monograph, because I don’t think that either (1) the coherence-related confusions are isolated from other confused concepts in this line of inquiry, or (2) the descendants of the concept of “coherence” will be related in some “nature-at-joint-carving” way that would justify discussing them jointly. (Those are the two reasons I see why we might want a full-monograph-length monograph untangling the mess of some specific, confused concept.)
But an investigation (of TBD length) covering at least the first three of your bullet points seems good. I’m less sure about the latter two, probably because I expect that after the first three steps a lot of new salient questions will appear, whereas the then-available answers about the relationship with capabilities will be rather scant (plausibly because the concept of capabilities itself would need to be refactored before more answers become available), and that the results of this single-concept-deconfusing investigation will, on their own, have rather few implications for AGI x-risk (though they might be a fruitful input to future investigation, which is the point).