whatever LLM-involved process or human-neuron-involved process aims at some goal will nevertheless tend towards coherence
I think that’s right, and that it’s indeed a more fundamental/basic point.
Coherency isn’t demanded by minds, it’s demanded by tasks.
Suppose you want to set up some process that would fulfil some complicated task. Since it’s complicated, it would presumably involve taking a lot of actions, perhaps across many different domains. Perhaps it would involve discovering new domains; perhaps it would span long stretches of time.
Any process capable of executing this task, then, would need to be able to unerringly aim all of these actions at the task’s fulfilment. The more actions the task demands, the more diverse the domains and the longer the stretches of time it spans, the more the process executing it would approximate an agent pursuing this task as a goal.
“Coherency”, therefore, is just a property of any system that’s able to do useful, nontrivially complicated work, instead of changing its mind about what it’s doing and shooting itself in the foot every five minutes.
Which is why the AI industry is currently trying its hardest to produce AIs capable of developing long-term coherent goals. (They’re all eager to climb METR’s task-horizon benchmark, and what is it supposed to measure, if not that?) Those are just the kinds of systems that are able to perform increasingly complex tasks.
(On top of that consideration, we could then also argue that becoming coherent is a natural attractor for any mind that doesn’t destroy itself. A mind’s long-term behavior is shaped by whichever of its shards have long-term goals, because shards that don’t coherently pursue any goal end up, well, failing to have optimized for any goal over the long term. Shards that plan for the long term, on the other hand, are likely both to try to get the myopic shards under control and to negotiate with each other regarding their long-term plans. Therefore, any autonomous system that is capable of executing complex tasks – any highly capable mind – would self-modify to be coherent.
There are various caveats and edge cases, but I think the generic case goes something like this.)
I think I basically agree with all this, except for the parenthetical, which I of course approach more dubiously.
But I like the explicit spelling out that “processes capable of achieving ends are coherent over time” is very different from “minds (sub-parts of processes) that can be part of highly-capable actions will become more coherent over time.”
A mind’s long-term behavior is shaped by whichever of its shards have long-term goals, because shards that don’t coherently pursue any goal end up, well, failing to have optimized for any goal over the long term.
If the internal shards with long-term goals are the only thing shaping the long-term evolution of the mind, this looks like it’s so?
But that’s a contingent fact—many things could shape the evolution of minds, and (imo) the evolution of minds is generally dominated by data and the environment rather than whatever state the mind is currently in. (The environment can strengthen some behaviors and not others; shards with long-term goals might be less friendly to other shards, which could lead to alliances against them; the environment might not even reward long-horizon behaviors, vastly strengthening shorter-term shards; you might be in a social setting where people distrust unmitigated long-term goals without absolute deontological short-term elements; etc etc etc)
(...and actually, I’m not even really sure it’s best to think of “shards” as having goals, either long-term or short-term. That feels like a confusion to me maybe? a goal is perhaps the result of a search for action, and a “shard” is kinda a magical placeholder for something generally less complex than the search for an action.)
...and actually, I’m not even really sure it’s best to think of “shards” as having goals, either long-term or short-term
Agreed; I was speaking loosely. (One line of reasoning there goes: shards are contextually activated heuristics; heuristics can be viewed as having been optimized for achieving some goal; inspecting shards (via e.g. self-reflection) can lead to your “reverse-engineering” those implicitly encoded goals; therefore, shards can be considered “proto-goals/values” of a sort, and complex patterns of shard activations can draw the rough shape of goal-pursuit.)
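(To make that chain concrete, here is a deliberately toy Python sketch, purely illustrative and with made-up triggers and numbers: a couple of contextually activated heuristics whose combined behavior looks goal-directed, even though the "goal" is only implicit and has to be read off by inspecting the shards.)

```python
# Toy sketch (illustrative only; the triggers and numbers are invented):
# "shards" as contextually activated heuristics whose combined behavior looks
# goal-directed, even though no component represents the goal explicitly.

def eat_trigger(state):
    return state["hunger"] > 0.5

def eat_heuristic(state):
    return {**state, "hunger": state["hunger"] - 0.3}

def rest_trigger(state):
    return state["fatigue"] > 0.5

def rest_heuristic(state):
    return {**state, "fatigue": state["fatigue"] - 0.3}

# Each shard fires only in the contexts that match its trigger.
shards = [(eat_trigger, eat_heuristic), (rest_trigger, rest_heuristic)]

def step(state):
    """Apply whichever shards the current context activates."""
    for trigger, heuristic in shards:
        if trigger(state):
            state = heuristic(state)
    return state

state = {"hunger": 0.9, "fatigue": 0.8}
for _ in range(5):
    state = step(state)

# The trajectory drives hunger and fatigue down, so by inspecting the shards
# you can "reverse-engineer" an implicit goal ("keep hunger and fatigue low")
# that is never written down anywhere in the system.
print(state)
```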