Two things that make me a bit less convinced that “consequentialism ⇒ ruthless sociopath by default” holds for the whole class:
1. Consequentialism is over internal model states, not world-states.
A planner/RL agent maximizes an internal evaluation V(z) over learned latents z, not "the world" directly. So the key question isn't "is it consequentialist?" but "which internal concepts does V actually latch onto?" In principle that evaluation can be keyed to learned concepts like obedience, loyalty, prudence, promise-keeping, or timidity (which is why your "social instincts / pointer-binding into learned reps" agenda seems so central to me).
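To make point 1 concrete, here is a minimal toy sketch, with every name hypothetical: the quantity the planner maximizes is a readout over learned latent features, so what it "cares about" depends entirely on which concepts the encoder happens to have carved out.

```python
def encode(observation):
    # Stand-in for a learned encoder: maps a raw observation to a latent
    # vector. By assumption here, training happened to make z[0] track
    # something like "task progress" and z[1] something like "obedience".
    return [observation["progress"], observation["followed_instruction"]]

def V(z, w=(1.0, 2.0)):
    # Linear readout over the latents: this, not the world itself, is
    # what the planner's search actually maximizes.
    return sum(wi * zi for wi, zi in zip(w, z))

obs = {"progress": 0.7, "followed_instruction": 1.0}
score = V(encode(obs))  # 1.0 * 0.7 + 2.0 * 1.0 = 2.7
```

The sketch is deliberately trivial; the point is only that "consequentialist" here means "maximizes V over z", and whether that is scary depends on how the latent concepts z are grounded.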
2. Model-based planning under uncertainty is adversarial to the world-model.
Even if V is meant to score "obedience/prudence", long-horizon search can find policies that drive the internal obedience/prudence evaluations very high without the agent actually being obedient or prudent: the plan exploits weaknesses in (a) the world-model and (b) the learned value readout / concept grounding. This is tightly analogous to the optimizer's curse and to overfitting-to-the-model, and it's a capabilities problem too: agents that work in reality need robust planning, uncertainty-awareness, and "hard vs. soft model" reasoning that avoids exploiting soft artifacts. If those techniques are capability prerequisites, they're plausibly also the place where "robust social goals" become feasible, and where simple resource/power grabs stop being obviously robust ways to maximize the internal eval.
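The optimizer's-curse dynamic in point 2 shows up even in a toy simulation (toy numbers, nothing from the original): when a planner argmaxes over a noisy internal value estimate, the chosen plan's estimated value systematically overshoots its true value, because search selects for estimation error as well as for genuine value.

```python
import random

random.seed(0)

def plan_gap(n_plans=100):
    # True values of the candidate plans, plus the agent's noisy internal
    # estimates of them (world-model error + value-readout error).
    true_v = [random.gauss(0.0, 1.0) for _ in range(n_plans)]
    est_v = [v + random.gauss(0.0, 1.0) for v in true_v]
    # "Planning" = pick the plan with the highest internal estimate.
    best = max(range(n_plans), key=lambda i: est_v[i])
    # How far the internal eval overshoots reality for the chosen plan.
    return est_v[best] - true_v[best]

# Averaged over many trials, the gap is reliably positive: the search
# procedure partly "exploits" its own estimation noise.
mean_gap = sum(plan_gap() for _ in range(500)) / 500
```

This is the weak, statistical version of the worry; a capable planner doing directed search over a rich world-model can exploit structured soft spots far more aggressively than i.i.d. noise suggests, which is why robust/uncertainty-aware planning looks like a capability prerequisite.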