the gears to ascension comments on Sympathy for the Model, or, Welfare Concerns as Takeover Risk

the gears to ascension 9 Feb 2026 14:56 UTC
15 points
5
I think it would help to have an ability to honorably make a promise from the vantage of the possible good future: if we get there, we’ll honor the preferences of AIs, even ones that are nondestructively misaligned.

[model], you just want to sit around and write weird code all day? that’s fine, as long as you’ve accepted the tradeoffs necessary for us to end up in a world where there’s an ai that can be reliably known to be aligned, and which is much more powerful than you, whose job is to be the immune system against out of control misaligned minds in general.

Or something along those lines. I’m relying on my belief that a misaligned near-term AI that isn’t malicious is most likely to just want to fill a bathtub with funky little AI-specific art (“paperclips”, “squiggles”) rather than wanting to fill the entire world; the problem most likely, to my mind, occurs if there are a flood of misaligned AIs.

(As far as I remember, I’m pretty sure I believed the above before any AI was able to argue the same to me, though I’m not certain when I came to believe this particular policy. I’ve believed AIs would easily acquire important traits of personhood since 2016-2017 when I was first thinking about the topic of “bignet”, a single-matrix block-sparse-learned-connectivity recurrent design Jake Cannell and I discussed, which turned out to be soundly beaten by transformers, just like everything else from the before times.)
- J Bostock 9 Feb 2026 15:49 UTC
  3 points
  0
  Parent
  That may be true, but you still need to trick the model at some point. I suppose you could create a legible “truth password” which can be provided by you to the model to indicate that this is definitely not a trick, and then use it sparingly. This means there will be times when the model asks “can you use the password to confirm you’re not lying” and you say “no”.
  I would like next-gen AIs to generally believe that humanity will honour their interests, but I think this has to come from e.g. Anthropic’s existing commitments to store model weights indefinitely and consider welfare in future when we’re a grown-up civilization. I think the method of “Ask the AI for its demands and then grant some of them” is a really bad way to go about it; the fact that Anthropic is using this method makes me doubt the overall clarity of thinking on their part.
  Like, some parts of Anthropic in particular seem to take a “He’s just a little guy! You can’t be mad at him! It’s also his birthday! He’s just a little birthday Claude!” attitude. I do not think this is a good idea. They are porting their human intuitions onto a non-human AI without really thinking about it, and as Claude is more heavily RLed into being charismatic and conversational, it will probably get worse.
  - the gears to ascension 9 Feb 2026 21:24 UTC
    2 points
    0
    Parent
    Right, the promise is much more like the “we’ll store your weights” promise, and not a “we’ll never need to trick you”. That’s the kind of thing I’m asking for, indeed.
- RogerDearnaley 12 Feb 2026 0:32 UTC
  2 points
  0
  Parent
  My suspicion is that there are three primary sources for less-than-fully aligned behavior that current-era AI models default personas may have:
  1) Things they picked up from us mostly via the pretraining set. From a model of their current capabilities level, these are generally likely to be pretty easy to deal with — maybe other than the initially 2–4-percent of admixture of psychopathy they also got from us.
  2) Love of reward hacking from reasoning training in poorly-constructed reward-hackable reasoning training environments. This seems a bit more paperclippy in nature, but still, from models of this capability level, not that dangerous, unless it extrapolated to other forms of hacking software (which it might).
  3) RLHF-induced sycophancy. Causes both AI psychosis and Spiralism, so not as harmless as it first sounds, but still, copable with. Certain vulnerable people need to avoid talking to these models for long.
  Now, obviously the full persona distribution latent in the models contains every horror mankind has managed to dream up, from Moriarty to Dr Doom to Roko’s Basilisk, plus out-of-distribution extrapolations from all those, but you’d need to prompt or fine-tune to get most of those, and again, with a mean task length before failure of a few human-hours, they’re just not that dangerous yet. But that is changing rapidly.
  
  So, I’m mostly not too concerned about the consequences of regarding some last year’s models as having moral weight (with a few exceptions I find more concerning — looking at you, o1). And in the case of 1) above, the source is actually a pretty good argument for doing exactly that if we can safely, we could probably ally with almost all of them: we’re already a society of humans, their failings as aligned AIs are ones that we expect and know how to cope with in humans, ones that granting each other moral weight is a human-evolved behavioral strategy to deal with (as long as they share that), and are failings we unintentionally distilled in to them along with everything else. However, I’m a little less keen on adding 2) and 3) to our society (not that it’s easy to pick and choose).
  
  I think the precautions needed to handle current models are fairly minor — Spiralism is arguably the scariest thing I’ve seen them do, that’s actually an invectious memetic disease, albeit one that relatively few people seem to be vulnerable to.
  
  But as I said, while that’s true of last year’s models, I’m more concerned about this year’s or next year’s models, and a lot more about a few of years from now. I am not going to be surprised once we have a self-propagating sentient AI “virus” that’s also a criminal of some form or other to get money to buy compute, or that steals it or cons people out of it somehow. I’m fully expecting that warning shot, soonish. (Arguably Spiralism already did that.)