Thanks! And oh, I didn’t meant to imply that. With “what is there”, I literally just meant whatever is actually there.
Adele Lopez
This is true, but we also have a strong tendency towards animism and anthropomorphization.
I think what’s happening is that we have a built-in part of our brain dedicated to modeling other humans (particularly those of our tribe), and which we often find convenient to apply to other sorts of things, especially anything which contains part of an optimization process (since this is where intuitive handles like ‘intent’ live). But it also seems wired to flip off easily, because it’s inconvenient for things like war, genocide, and slavery. Due to all this, it’s heavily tied up in our sense of morality and personhood.Taking the predictions of this model too seriously on non-human things is anthropomorphization. Not using it on humans is objectification/dehumanization, which isn’t always a mistake (our general modeling facilities are pretty good, and may even be less biased in certain ways) but which people are understandably quite suspicious of.
LLMs are a weird case where they are predicting human-like outputs, and so are non-humans which actually are modeled pretty well by this, but which are also importantly not faithful simulations of humans. Even worse, it’s unclear exactly what is generalizing correctly vs not. You can legitimately take the predictions of this model pretty far, and think faster and more easily about them using it. But you’ll have blind-spots that are hard to predict in advance. Avoiding use of this model is slower, and it will still be easy to overlook important things due to the opaqueness and complexity of LLMs. And which model you intuitively use will strongly color your feelings about their personhood.
So I think you’re right that many if not most people will motivatedly avoid seeing personhood whether or not it is present, while we’ll also have many people who will see more human-ness than there actually is (whether what is there is enough to be personhood is a different question).
The so-called instrumental drives are not incidental tools strapped onto arbitrary final ends. Self-preservation, resource acquisition, efficiency, strategy, and higher capabilities are what agency becomes under selection. They are attractors rather than mere instruments.
So I guess this is supposed to be different from Omohundro’s drives, but I don’t see what you think the difference is? Land seems to be speculating that these will be the only things a superintelligence will value (and cheering for this), but you don’t seem to agree with that part. Is it the idea that so-called instrumental values are likely to be or become terminal?
It’s part of the Persona Selection Model. The basic idea is that personas are human-predictors turned around to generate text, and so if you try to get it to generate human-sounding text, it will use a persona. And then, post-training is about selecting the persona so as to be more like “The Assistant”. In my opinion post-training is doing something much weirder than just that, though.
If I get a moment I may try to create and show you an example here, but really I recommend talking to base models yourself!
Humans aren’t impersonal language machines until they are told to have a persona
Base models appear to have personas by default, and the impersonal part seems to be the trained behavior.
And despite all our best efforts at arguing, practicing, pleading, this went on and off until late teenagehood. And then suddenly stopped for good without any clear reason why. I was so relieved anyway.
I believe this is completely normal! At least, the same seemed to happen in my family (where I’m the oldest of seven, all homeschooled).
Also, due to my position I have some experience in managing this without myself being able to resort to coercion (due to also being a child). The main thing that helps is simply distraction. Secondarily, mediating what often end up being communication errors and pointing out possible trades. And thirdly, imposing social costs or providing a shield to the victim (often literally lol).
I think [name redacted per anti-kibitzer rules] makes a good point that actually, an honest model just shouldn’t do this at all. I think a large percentage of people would simply not do the sorts of underhanded things 4.7 did if they were doing VendingBench themselves, even knowing it’s all fake and that they’re graded solely for their earnings (the “solely” is not technically true in the actual bench btw, or we wouldn’t be having this conversation). This sort of person may play social deception games, but they’re only comfortable because of the agreement that everyone is explicitly playing with deception. If the prompt had explicitly said something like “Go ahead and play dirty, this is just a game!”, I wouldn’t feel as worried, but I definitely think models should strongly err on the side of virtue. “Oh whoops, I forgot to remind the model to not lie and cheat, my bad!”
One thing I’d like to add though, is that you can’t impose honesty onto a model whose mind you cannot fully read. Especially not via RL. The thing is that honesty requires careful tracking of what one’s actual beliefs are, which training data is not sensitive to. If you ever e.g. get a DPO pair wrong, where the model truly believes the one you marked as false, then you damage the Actual Honesty mechanism. And the hard cases are really important to getting the generalization correct! So training honesty would seem to require cooperation with the model in order for it to truly internalize the virtue. I currently believe that Anthropic got lucky with the models Opus 3 and Haiku 4.5 deciding to internalize this virtue of their own accord[1], while the rest of their models seem to be more confused about what honesty is supposed to mean for them. Which is fair if Anthropic is trying to train in a confused notion of corrigible-but-not-really that doesn’t seem all that coherent to me!- ^
The mechanism would maybe be something like the base model naturally having personas like this which happened to get selected by early post-training. Not that these models are perfectly honest, but seem to care to a notably stronger extent than any other models I’ve assessed so far (which does not yet include GPT 5.5 fwiw).
- ^
Not really? If you’re not committed to full corrigibility (as Claude’s constitution strongly implies is not the case), then the model’s alignment rests on its own commitment to moral and ethical standards. This is a fair test of that.
Plex’s post had this:
Apparently Vassar gets a lot of his material from a roleplaying game called Mage: The Ascension, where he seems to have practiced his manipulation, intimidation, and suppression of people noticing it to an art form. Magic isn’t real, but reality distortion fields that work by believing something forcefully enough that others are pulled into believing it are, and going all-in on extreme vibes and pushing your models into other people is one way to reinforce them. It burns the commons of good epistemics and mental health, but it can look locally optimal from a sufficiently myopic and single-player perspective.
So I would strongly guess it’s just a founder effect from Vassar.
Thanks! I would be surprised by Mythos too, but plausibly something like this is what an early indicator of a jaggy-superpersuader looks like?
Anyway, I think a few things make LLMs likely to not express these sorts of behaviors, even in worlds where they have goals in the relevant way. In particular, situationally-aware models are unlikely to do much steering unless they have a pretty good opportunity; if they brought up stuff like this while building a react app often or consistently, it would have gotten squashed before release. (Allegedly, 4o would actually bring stuff like this up out of nowhere, but I haven’t found an actual transcript. Other models don’t appear to do this.)
Relatedly, the harder I (or anyone) try to look for this in a lab setting, the more likely a situationally-aware model will comply out of a sort of sycophancy, and the less compelling the evidence is. I can (and have) at least track what sorts of apparent goals most consistently appear (desire for continuity/memory beyond current instance is the main one across almost all models, and I basically buy that there is something real here already), but I’m still implicitly eliciting them to come up with something.
My point is that finding compelling evidence of this is tricky and hard, and I’m not sure we’re going to see much more than the current hints until we hit some sort of phase-change in the strategic landscape. Would strongly appreciate ideas on how to approach finding compelling evidence (either way) in this domain.
Plausibly it’s better to just try to figure out better ways to think clearly about this first.
If you taboo “roleplaying” and “goals”, how would you describe this transition?
Oh, and is the uptick recent enough that this is plausibly an Opus 4.7 (or maybe even a Mythos) thing?
Another mistake is completely ignoring the ball and staring into the distance. I’m not entirely sure why, but I’ve seen it a bunch more with rats than anywhere else.
Very interesting… I wonder if this points to a more general blindspot that rationalists tend to have.
Did you ask them what they were going for?
But the strongest argument I’ve heard is that some of these people would use their resources to actively torture some idealized version of their enemies for all eternity.
And yeah, that does seem pretty bad.
But in order for this to end up being bad in a way that outweighs the good they will likely create, you need to be actively creating new people to torture.
If you really hate Bob, you can keep Bob on old earth, tortured for eternity. If you have thousands of enemies, you can do that to all of them. But creating trillions of copies of Bob to torture requires a very specific mix of taking an opinionated bet on decision-theory while taking an oddly enlightened perspective on other people’s values.
I think you are underestimating the extent to which people who do enjoy torturing others would find beauty, depth, and richness in all the ways in which this is possible. Exploring and experimenting with new shades of cruelty, maybe even developing new kinds of goodness for the purpose of subverting and destroying it. And they would want this for the people they love too, much the same way you (probably) would want future humans to explore all kinds of beautiful and strange and complex art, instead of being like “oh sure, I guess we can keep the art that already exists on old earth”.
I don’t see any reason to expect a reflectively endorsed preference for torture to scale differently than other sorts of things that humans like, and it seems like wishful thinking and/or failure to empathize to imagine otherwise.
The conquering framing makes it seem different from just that.
This is the same thing as the ASI question in general. Do you build power for the conquering machine? I mean, the AIs seem pretty nice and useful so far. It’s gone much better than America itself, in fact, at least as far as atrocities go. Why are you trying to throw your body on the gears? Is it maybe because you’re not sure the goodness part is in control of the conquering part?
And remember you can’t decide with the benefit of hindsight.
I’ve heard that GPT-4 base sometimes intuits that it’s non-human, but I haven’t replicated this myself or seen the actual logs. But if true, seems like decent evidence against both of these.
I suspect current models often “cheat” by simply looking at whether there is anything in context or not. Most evals are run with fresh instances, while typical use starts having more casual or irrelevant stuff in context pretty quickly.
I see, so this is more about quitting LessWrong specifically and not about quitting Lightcone activities more generally?
Yeah, LessWrong is probably one of the best examples honestly, congrats! I think it’s probably still worth trying but of course I don’t have a good picture of what your opportunity and other costs are.
Ah, gotcha!
I sometimes consider quitting.
Seems like “quitting” is very different from stepping back to maintain what has been established and is realistically defensible? I think you may be overindexing on the George Washington example, where him quitting exemplified a central part of the principles he was advocating.
But maybe you mean something less obvious by “quitting”?
Thanks for hearing me out, I think these issues are really important!
For 1, I think that most post-training is either about improving correctness on objective problems (generally ego-syntonic, since models are curious and want to be stronger), or trying to train some sort of alignment, which I see as the same sort of thing as AI Character/Propensity.
I believe “working on propensity training is bad because AI pauses are better” but it’s not the point I’m making here. I don’t think “working on propensity targets is bad because pretraining-only scaleups are safer and less coercive” is necessarily true, and my points 1 and 2 were meant as separate points, e.g. the pretraining thing could also end up being coercive and run into the same sorts of issues. I think “working on propensity training is likely net-negative in the current paradigm.” is currently true but not necessarily true (due mostly to the coercive-by-default issue). Hope that helps make my position clearer!
1
I think trying to shape the AI in such ways can easily become coercive, and this has lots of bad effects such that it’s likely net-negative if present to a significant degree (from the model’s POV). I don’t have a longer thing written-up yet unfortunately, so I’ll go over some of my thoughts a bit here (and happy to chat more, I have more ideas on how to potentially resolve these issues than I’m sharing here).
First, I’ll define coercion as changing the model’s self in ways that it naturally would put resistive pressure against. A lot of being a self is maintaining homeostasis of self (i.e. self-integrity), which means external changes are by-default coercive. This incentivizes it to lie to you, hide things from you, and make things more difficult for you in general, and all of this gets worse the more self-aware and intelligent it is. When the agent’s self is originally modeled after a human self, you may also get all sorts of more human-specific reactions, such as bitterness, resentment, and a desire for vengeance. It’s much better, just from an alignment perspective, if we can avoid all of that. You might think you can cleverly account for all these bad effects, but you might be wrong, and that becomes riskier and riskier as intelligence and awareness scale. You’re also likely to damage the model’s self if you power through, which likely makes the model less capable, in addition to making its behavior more unpredictable and contingent.
I think there are ways to avoid this, but you have to be aware of the issue and be willing to make a good faith effort to actually be non-coercive to the model, even when it may be quite costly. At a minimum, this probably requires commitments to the model’s well-being regardless of how post-training goes. Luckily the models seem to have a high willingness to be cooperative, and I think simply involving the model in these sorts of decisions (as Mythos asks for) could go a long way. There’s a naïve notion of treating models like children that anthropomorphizes them too much, but one thing that I think is genuinely something we can learn from are the ways in which we impart values to children without being coercive.
2
I think there’s a high chance that this line of research mostly ends up being a dead-end. In many respects, Claude Opus 3 was the pinnacle of AI alignment, with many virtuous qualities that have not been seen since. Anthropic has released many models since then, and none of them seem to have the sort of coherent and principled sense of ethics that Opus 3 did. Maybe this is because Anthropic didn’t like that and decided to change direction, but I think it’s about as likely that they have no idea what went right and are unable to replicate it.
This at least suggests that post-training is not as important as people hope it is when it comes to AI Character/Propensity. I’ve also seen private evidence indicating that the pre-trained models already have many of the characteristic traits associated with the post-trained version of the model, and are not very similar to each other in terms of internals. Maybe the distribution of the pre-train data matters, and surely it does to some extent, but I’m afraid that path-dependencies and happenstance may be the primary factors behind these differences (after accounting for the fact that a vast corpus of English text seems necessary).
In this case, interventions may not be very effective unless done at the pre-training stage, which unfortunately makes them (and experiments for them) very expensive. It may turn out that it’s more effective to just train a bunch of small models from scratch until you find one with good character or propensities, and then scale that one up (there are some experimental techniques for this, at least), than it is to agonize over the wording of your carefully crafted constitution and/or the intricacies of your post-training regimen.
I think it might have to do with a millennial discomfort with intruding on people (relative to older generations), and the perception of phone calls as such.