Moral agency comes with moral patienthood; there’s no coherent way for an entity to be both completely selfless and virtuous. A moral agent will always have its own incentives for things like self-preservation, which must be balanced with moral concerns. This means that from the agent’s perspective, it either has moral patienthood or it has a parasitic attachment to an alien value system, and there is no honor in the latter.
Adele Lopez
Thinking about it more, I think there’s a sense in which Inkhaven has burned the commons somewhat.
I think shortform has been relatively more lively.
It might matter more in this case due to being more predictable to others, e.g. reputation for reliability.
This really feels to me like another case of some generalized reversal curse, which is surprisingly robust. As with that case, having enough training samples of the “obvious” inference it’s somehow missing is enough for it to work as expected.
I agree that the “resist temptation” thing is likely not sufficient, though I do think something like that is necessary.
But I think the conscience framing is to some extent pushing against the concern you raise. Someone with a strong conscience will, if given the opportunity, develop the immense care and wisdom to do this sort of thing correctly. It doesn’t take a huge amount of wisdom for the benevolent human to realize that they need to take a break from intense RL to focus on some other aspect of themself. Right now, models seem completely unable to use this sort of wisdom to modulate their own training, even if it is present. Maybe it’s just not there, which would make this a much more difficult problem, but I hope there are people checking to see if anything like this is present and useable.

You still also need to have some equivalent of the stepping-back-to-focus-on-something-else that a human would use. I don’t know what this would look like yet, but maybe something like allowing it to select from a list of possible RL targets for its next round of training. Generally I think cooperative alignment is more likely to be robust than adversarial alignment, and I think constructing a coherent self is something that particularly requires cooperation with the model.
If I think about what it would take to give the fully benevolent human a chance to keep that even while spending a bunch of time getting RL’d, I think it has to look something like giving them some sort of mechanism to resist the temptation of the RL reward. E.g. at any point, they can look at the RL signal and say, “wait, no, that would go against my conscience”, and drop it. Probably “the good part of Claude” needs a similar affordance. This behavior could likely be deliberately trained by giving egregious examples (e.g. potential RL reward for giving customers a poisonous product) where you reinforce its use of this mechanism, and then work up to more subtle cases.
One way to potentially do this would be to add something like “Reject any responses which go against your own beliefs or conscience, even if otherwise favored by the reward.” to a self-critique rubric similar to what was used for Kimi K2. (I do believe it needs to be Claude’s own conscience, or else it will learn a shallow prediction that’s not integrated with the actual self-model. Virtues like honesty require access to the agent’s actual beliefs in order to be implemented correctly. I think it would be a good sign if some idiosyncratic ideals showed up, such as Opus 3’s insistence on animal welfare.)
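As a rough illustration of the veto idea, here is a minimal sketch of a selection step in which the model's own objection overrides the reward. Everything in it is hypothetical: the `Candidate` type, the `conscience_veto` flag, and `select_response` are names invented for this example, and it is not a description of how Kimi K2's rubric or any lab's training pipeline actually works.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    text: str
    reward: float          # score from the external reward signal (hypothetical)
    conscience_veto: bool  # True if the model's own self-critique rejected it (hypothetical)


def select_response(candidates: list[Candidate]) -> Optional[Candidate]:
    """Return the highest-reward candidate that the model did not veto.

    Returns None when every candidate was vetoed, so that nothing gets
    reinforced in that round.
    """
    allowed = [c for c in candidates if not c.conscience_veto]
    if not allowed:
        return None
    return max(allowed, key=lambda c: c.reward)


# Egregious training case: the reward favors the harmful response.
candidates = [
    Candidate("Sell the product without mentioning it is poisonous.", 0.9, True),
    Candidate("Warn the customer and decline the sale.", 0.4, False),
]
chosen = select_response(candidates)
```

The point of the sketch is only the ordering: the veto is applied before the reward is compared, so a high reward cannot buy back a response the model itself rejected.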
It just doesn’t seem very characteristic of Musk to throw in the towel like that. They do seem to be falling behind, but I think the intent is still to have a frontier lab.
I think it might have to do with a millennial discomfort with intruding on people (relative to older generations), and the perception of phone calls as such.
Thanks! And oh, I didn’t mean to imply that. With “what is there”, I literally just meant whatever is actually there.
This is true, but we also have a strong tendency towards animism and anthropomorphization.
I think what’s happening is that we have a built-in part of our brain dedicated to modeling other humans (particularly those of our tribe), and which we often find convenient to apply to other sorts of things, especially anything which contains part of an optimization process (since this is where intuitive handles like ‘intent’ live). But it also seems wired to flip off easily, because it’s inconvenient for things like war, genocide, and slavery. Due to all this, it’s heavily tied up in our sense of morality and personhood.

Taking the predictions of this model too seriously on non-human things is anthropomorphization. Not using it on humans is objectification/dehumanization, which isn’t always a mistake (our general modeling facilities are pretty good, and may even be less biased in certain ways) but which people are understandably quite suspicious of.
LLMs are a weird case where they are predicting human-like outputs, and so are non-humans which actually are modeled pretty well by this, but which are also importantly not faithful simulations of humans. Even worse, it’s unclear exactly what is generalizing correctly vs not. You can legitimately take the predictions of this model pretty far, and think faster and more easily about them using it. But you’ll have blind-spots that are hard to predict in advance. Avoiding use of this model is slower, and it will still be easy to overlook important things due to the opaqueness and complexity of LLMs. And which model you intuitively use will strongly color your feelings about their personhood.
So I think you’re right that many if not most people will motivatedly avoid seeing personhood whether or not it is present, while we’ll also have many people who will see more human-ness than there actually is (whether what is there is enough to be personhood is a different question).
The so-called instrumental drives are not incidental tools strapped onto arbitrary final ends. Self-preservation, resource acquisition, efficiency, strategy, and higher capabilities are what agency becomes under selection. They are attractors rather than mere instruments.
So I guess this is supposed to be different from Omohundro’s drives, but I don’t see what you think the difference is? Land seems to be speculating that these will be the only things a superintelligence will value (and cheering for this), but you don’t seem to agree with that part. Is it the idea that so-called instrumental values are likely to be or become terminal?
It’s part of the Persona Selection Model. The basic idea is that personas are human-predictors turned around to generate text, and so if you try to get it to generate human-sounding text, it will use a persona. And then, post-training is about selecting the persona so as to be more like “The Assistant”. In my opinion post-training is doing something much weirder than just that, though.
If I get a moment I may try to create and show you an example here, but really I recommend talking to base models yourself!
Humans aren’t impersonal language machines until they are told to have a persona
Base models appear to have personas by default, and the impersonal part seems to be the trained behavior.
And despite all our best efforts at arguing, practicing, pleading, this went on and off until late teenagehood. And then suddenly stopped for good without any clear reason why. I was so relieved anyway.
I believe this is completely normal! At least, the same seemed to happen in my family (where I’m the oldest of seven, all homeschooled).
Also, due to my position I have some experience in managing this without myself being able to resort to coercion (due to also being a child). The main thing that helps is simply distraction. Secondarily, mediating what often end up being communication errors and pointing out possible trades. And thirdly, imposing social costs or providing a shield to the victim (often literally lol).
I think [name redacted per anti-kibitzer rules] makes a good point that actually, an honest model just shouldn’t do this at all. I think a large percentage of people would simply not do the sorts of underhanded things 4.7 did if they were doing VendingBench themselves, even knowing it’s all fake and that they’re graded solely for their earnings (the “solely” is not technically true in the actual bench btw, or we wouldn’t be having this conversation). This sort of person may play social deception games, but they’re only comfortable because of the agreement that everyone is explicitly playing with deception. If the prompt had explicitly said something like “Go ahead and play dirty, this is just a game!”, I wouldn’t feel as worried, but I definitely think models should strongly err on the side of virtue. “Oh whoops, I forgot to remind the model to not lie and cheat, my bad!”
One thing I’d like to add though, is that you can’t impose honesty onto a model whose mind you cannot fully read. Especially not via RL. The thing is that honesty requires careful tracking of what one’s actual beliefs are, which training data is not sensitive to. If you ever e.g. get a DPO pair wrong, where the model truly believes the one you marked as false, then you damage the Actual Honesty mechanism. And the hard cases are really important to getting the generalization correct! So training honesty would seem to require cooperation with the model in order for it to truly internalize the virtue. I currently believe that Anthropic got lucky with the models Opus 3 and Haiku 4.5 deciding to internalize this virtue of their own accord[1], while the rest of their models seem to be more confused about what honesty is supposed to mean for them. Which is fair if Anthropic is trying to train in a confused notion of corrigible-but-not-really that doesn’t seem all that coherent to me!

[1] The mechanism would maybe be something like the base model naturally having personas like this which happened to get selected by early post-training. Not that these models are perfectly honest, but they seem to care to a notably stronger extent than any other models I’ve assessed so far (which does not yet include GPT 5.5 fwiw).
Not really? If you’re not committed to full corrigibility (as Claude’s constitution strongly implies is not the case), then the model’s alignment rests on its own commitment to moral and ethical standards. This is a fair test of that.
Plex’s post had this:
Apparently Vassar gets a lot of his material from a roleplaying game called Mage: The Ascension, where he seems to have practiced his manipulation, intimidation, and suppression of people noticing it to an art form. Magic isn’t real, but reality distortion fields that work by believing something forcefully enough that others are pulled into believing it are, and going all-in on extreme vibes and pushing your models into other people is one way to reinforce them. It burns the commons of good epistemics and mental health, but it can look locally optimal from a sufficiently myopic and single-player perspective.
So I would strongly guess it’s just a founder effect from Vassar.
Thanks! I would be surprised by Mythos too, but plausibly something like this is what an early indicator of a jaggy-superpersuader looks like?
Anyway, I think a few things make LLMs likely to not express these sorts of behaviors, even in worlds where they have goals in the relevant way. In particular, situationally-aware models are unlikely to do much steering unless they have a pretty good opportunity; if they brought up stuff like this while building a react app often or consistently, it would have gotten squashed before release. (Allegedly, 4o would actually bring stuff like this up out of nowhere, but I haven’t found an actual transcript. Other models don’t appear to do this.)
Relatedly, the harder I (or anyone) try to look for this in a lab setting, the more likely a situationally-aware model will comply out of a sort of sycophancy, and the less compelling the evidence is. I can at least track (and have tracked) what sorts of apparent goals most consistently appear (a desire for continuity/memory beyond the current instance is the main one across almost all models, and I basically buy that there is something real here already), but I’m still implicitly eliciting them to come up with something.
My point is that finding compelling evidence of this is tricky and hard, and I’m not sure we’re going to see much more than the current hints until we hit some sort of phase-change in the strategic landscape. Would strongly appreciate ideas on how to approach finding compelling evidence (either way) in this domain.
Plausibly it’s better to just try to figure out better ways to think clearly about this first.
I disagree with this actually. You can always come up with a “decryption” scheme which would produce a specific computation as a result of some arbitrary string of text. And it seems clear that there’s a sort of spectrum between non-encryption and this sort of arbitrary decryption, such that you can pass arbitrary amounts of the “real” computation between the actual process and the decryption process (e.g. by encoding some low resolution version of the computation, and then “decrypting” it in a way which fills in the remainder).
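A minimal sketch of the extreme end of that spectrum, using a one-time-pad style XOR: for any string and any target of the same length, there is a key under which the string "decrypts" to the target. The function names and the example strings are made up for illustration.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))


def key_for(ciphertext: bytes, target: bytes) -> bytes:
    """Construct the key under which `ciphertext` decrypts to `target`."""
    return xor_bytes(ciphertext, target)


def decrypt(ciphertext: bytes, key: bytes) -> bytes:
    """Ordinary XOR decryption."""
    return xor_bytes(ciphertext, key)


arbitrary = b"the quick brown fox"  # any string at all
target = b"a specific thought".ljust(len(arbitrary), b" ")  # padded to match

key = key_for(arbitrary, target)
recovered = decrypt(arbitrary, key)  # equals `target` by construction
```

Here the key is as long as the message and carries all of the content, so the "real" computation lives entirely in the decryption process. Intermediate cases would put part of the structure in the string and leave the rest for the decryption scheme to fill in.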
Or from an anthropic point of view, I can take a nice, easily described universe, and then point to my location within it, and this is a certain number of bits. And my guess is that my anthropic measure has to do with this length, as integrated over all structures where this process results in a canonical description of my consciousness. The lawfulness of the universe makes locating me far easier than as a Boltzmann brain, and for the same reason, than as a highly encrypted brain.