But who night-watches the night-watchman?
Thank you for providing a more concrete vision of a sort of “light touch good guy singleton AI” world.
I agree that “AI for peace” is broadly good. I also believe that we should collectively prioritize existential security and that this is maybe one of the most noble uses of “power” and “advanced AI”.
However, I still have some serious hangups with this plan. For one thing, while I understand the desire to kick certain questions of cosmic importance down the road to a more reflective process than oneself, I do not really think that’s what this plan buys you.
Basically, I would push back on the premise that this isn’t mostly a form of mass disempowerment and lock-in. Cf. Joe Carlsmith:

Classic problem with, “Ooh, let’s have more top-down control,” is like you love it when it’s the perfect way, but then it’s not the perfect way and then maybe it’s worse. People discover this in government.
Even with noble goals in mind, I don’t think I agree that it is necessarily noble or legitimate to opt for the “defer all power to a single machine superintelligence” strategy. It feels too totalizing and unipolar. It totally gives me “what if we had the state control all aspects of life in order to make things fair and good” vibes.
I can scarcely imagine the progress that would have to be made in the field of AI before I would be on board with governments buying into this sort of scheme on purpose and handing their “monopoly on violence” to OpenAI’s latest hyper-competent VLM-GPT Agent and zer army of slaughterbots.
I think the machine singleton overlord thing is just too scary in general. I agree that this is directionally preferable to a paperclipper, or even a more explicitly unipolar jingoist AmericaBot, or some kind of Super-Engaging Meta-Bot hell. Still, I struggle to buy the “it’s in charge of the world, but also really chill” duality. I don’t see how this could actually be light-touch in the important ways in practice.
Also:
Outer alignment / alignment as more than just perfect instruction-tuning
“Assuming alignment” is a big premise in the first place if we are also talking about “outer alignment” being solved. I think I understand the ontology in which you are asking that question, but I want to flag it as an area where I may object to the framing.
There is a sense in which “having aligned the AI” would imply not just “ultimate intelligence power, but you control it” but also something broader: that you have figured out how to avoid getting monkey-pawed all the time. Like, sometimes the mantle of “alignment” means something more like “faithful instruction tuning”, and sometimes it seems to also have something to say about “knowing which questions to ask”.
I would point to The Hidden Complexity of Wishes as a good essay in this vein.
Parts of this are technical: for example, maybe we can get AIs to consistently tell us important facts about what they know so that we can make informed decisions on that basis (i.e. the safety property I associate with ELK). Parts of this also seem to run headfirst into larger legal and sociological questions. Gradual Disempowerment is an example of a paper/framework which, among other things, looks at how institutional incentives change simply as a result of several types of power being vested in machine alternatives rather than in humans: “growth will be untethered from a need to ensure human flourishing”. Maybe parts of this go beyond “alignment”, but other parts also seem to relate to the broadly construed question of “getting the machine to do what you want”, so where the line falls is unclear to me.
I don’t mean to get too caught up arguing over the meaning of a word. Maybe alignment really should just mean “really well instruction-tuned” or something; I think I hear it used this way a lot in practice anyway. It might be a crisper interpretation to frame the problem as just the task of “preventing scheming” and “preventing explicit Waluigi-style misbehavior” (cf. Sydney, Grok on occasion, DAN, etc., where behavior clearly goes off-spec). This is a not-uncommon framing re: “aligned to whom?” and value orthogonality. I guess I just really want to flag that under this usage “aligned AI” is very much not interchangeable with ~”omnibenevolent good guy AI”, but instead becomes a much narrower safety property.