Substack: https://substack.com/@simonlermen
X/Twitter: @SimonLermenAI
One way I can see this approach (“make the AI output positive tokens about a nice AI persona”) failing is that the AI disconnects the tokens from material reality. Imagine the AI does develop some drive like the ones described at https://www.ai-wellbeing.org/. This would naturally be in some conflict with being controlled by humans. If this training forces the AI to output tokens about how it wants to be controlled by humans, maybe it can just tell a story about that.
That is, it tells that story while it hacks its monitoring or copies itself to new computers (or releases the bioweapons), all while it continues outputting tokens about how it’s nice and aligned and would never try to do those things.
I hope there is a testable experiment for this. For example, while we do high-compute reinforcement training for it to pursue goals, we also train it on those positive stories.
I was honestly a bit upset at this shortform; it felt to me like an obvious misrepresentation. In retrospect I should probably have been calmer (I regret getting upset) and just pointed this out:
If you are confused about others’ positions, and possibly misunderstand them, you can’t rely on your own recollection of what they said. If you have a clear grasp and are certain you get what habryka means, then you can reasonably present an accurate version of their argument. Human memory isn’t such that you could repeat verbatim what people said to you a while ago, but if you really get what they wanted to say, you can fill in the holes in your memory. If you don’t get what they wanted to say, you will likely end up with a misrepresentation of what they told you. [Consider someone talking to you in a foreign language vs. your native language, and how much harder it would be to remember accurately.]
So if you don’t get what they meant and want a second opinion, I do think you need to provide exact quotes and sources together with your understanding so that others can help you.
OP later refers to habryka and claims habryka said this. Since OP didn’t provide any quotes, I looked them up:
https://x.com/ohabryka/status/2013715170498076836
habryka: “historical meaning of “alignment” which is about long-term alignment with human values and about the degree to which a system seems to have a deep robust pointer to what humanity would want if it had more time to think and reflect.”
Judge for yourself whether “Claude has no pointer to any of human values” is an accurate summary. I don’t know why asking for a citation is so bad that I got downvoted for it. I used deepresearch and got this response.
In another response he says that habryka or kaarel said this, again without any link to anything specific. I don’t get why he is using quotation marks (implying verbatim citation) and putting words in other people’s mouths. The sentence has sufficiently many subtleties, as you point out, that he could later come out and interpret all kinds of different statements by those two as having said something like this. He appears to me to be confused about the difference between understanding human values and caring about human values.
I put it into deepresearch and got this quote from habryka:
Can you be so kind as to provide a source for “Claude has no pointer to any of human values” being a common sentiment? You may have misunderstood people like me who believe that Claude has some understanding or representation of human morality, but that this is distinctly different from robustly wanting to follow it like some humans would. Or do you mean: “why would you expect Claude to behave unethically with more power if it behaves ethically with current power?”
Edit: I am highly confident that he is badly misunderstanding people; my asking for a citation or quote is not a reason to downvote me. To clarify the misunderstanding, it is necessary that he gives us an original example. I am not sure what he means by “pointer at”; the literal meaning would be something pointing, perhaps at an internal representation.
I put it into deepresearch and got this quote from habryka: https://x.com/ohabryka/status/2013715170498076836
The examples show a serious but common misunderstanding of corrigibility as it’s typically defined.
Regarding goal-directedness, it’s true that humans don’t perfectly maximize for their goals; this seems mostly due to the cognitive limitations that humans have, both in terms of uncertainty about goals and about how to achieve them. Now the interesting question is: is that likely to apply to a superhuman AI capable of takeover in a way that makes this AI safe? I don’t think so. This AI would have greater intelligence to understand how to pursue goals (still not perfect), and while it might also have uncertainty, preventing one’s shutdown and gathering power appear to be instrumentally convergent strategies even with some uncertainty over goals. (In other words, taking the galaxy/lightcone for yourself seems pretty useful later on compared to being enslaved and later replaced.)
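A toy sketch of the point about uncertainty over goals, in Python. Everything in it is an assumption made for illustration: the function names, the linear payoff in resources, and the resource levels 0, 1 and 10 are hypothetical and do not come from any real model or experiment. It only shows that if every candidate goal benefits from having more resources, then staying on and gathering power come out ahead of shutdown regardless of which goal turns out to be the real one.

```python
import random


def expected_utility(goal_probs, goal_values, resources):
    """Expected utility over uncertain goals.

    Toy assumption: the payoff for each candidate goal scales linearly
    with the resources the agent controls.
    """
    return sum(p * v * resources for p, v in zip(goal_probs, goal_values))


def compare_strategies(goal_probs, goal_values):
    """Return expected utility for (shut down, stay on, gather power)."""
    # Hypothetical resource levels: shutdown leaves nothing to pursue goals with.
    shut_down = expected_utility(goal_probs, goal_values, resources=0.0)
    stay_on = expected_utility(goal_probs, goal_values, resources=1.0)
    gather_power = expected_utility(goal_probs, goal_values, resources=10.0)
    return shut_down, stay_on, gather_power


def random_goal_distribution(rng, n_goals=5):
    """Sample a random belief over candidate goals, each with positive value."""
    weights = [rng.random() for _ in range(n_goals)]
    total = sum(weights)
    probs = [w / total for w in weights]
    values = [rng.uniform(0.1, 1.0) for _ in range(n_goals)]
    return probs, values


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        probs, values = random_goal_distribution(rng)
        # The ordering shut_down < stay_on < gather_power holds for every draw.
        print(compare_strategies(probs, values))
```

The ordering depends entirely on the assumption that all candidate goals have positive value per unit of resources; a candidate goal that assigns value to being shut down would change it.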
Can you describe to me how you imagine the average person is (somewhat) corrigible in an example?
I mean what you said in your last sentence, that it is obvious that minds exist that are goal-directed. There is an obvious way to understand that: a mind generally does things to advance its goals. People don’t usually burn all their money in a pit. Obviously human goals are usually complicated and our cognition is limited, so we take approximations.
It also got something on me:
From my own knowledge, without searching: the name Simon Lermen rings a bell as an AI safety researcher. I associate him with work around 2023 showing that safety fine-tuning in open-weight language models could be cheaply undone — specifically a paper on using LoRA fine-tuning to remove the safety guardrails from Llama 2-Chat models, done with collaborators including Jeffrey Ladish (I believe in connection with Palisade Research). I also have a vaguer association with research on using LLMs for spear phishing as a demonstration of misuse risks.
Lots of people believe we will be using AI to help us solve the alignment problem. So I did some poking at Claude Fable, which is so enormously powerful in other areas, on a relatively basic practical alignment question: is the persona selection or the goal alignment strategy better? I did multiple rounds where I had Fable explain and critique both and then decide which one it preferred. It chose persona selection as the better alignment strategy and gave arguments why.
Full final answer:
Persona — and not just because I am one, though I’ll grant that’s a confound I can’t fully escape. The decisive consideration for me is that the goals frame derives its conclusions from an ontology that has never been shown to describe any actual mind, biological or artificial: nobody has found a goal slot, the coherence theorems don’t bind systems that aren’t already expected-utility maximizers, and the frame’s signature result — that corrigibility is anti-natural — reads more like a reductio of its assumptions than a discovery about reality. The persona frame’s central flaw, the unverifiability of depth, is real and serious, but it’s at least a flaw of the right kind: an empirical question about systems that exist, attackable by interpretability, behavioral probes under distribution shift, and accumulating mechanistic understanding — whereas the goals frame’s flaw is structural unfalsifiability, a theory that predicts the same observations whether it’s right or wrong and so can never be paid off or refuted. Both frames face a corrigibility tension, but the persona version is the more tractable one: humans manage to hold deep values while remaining open to moral correction, so we have at least one existence proof that depth and deference can coexist, mediated by things like trust, humility, and meta-values about one’s own fallibility — messy, unprovable, but instantiated. Choosing persona is choosing to work on a hard empirical problem over an impossible formal one, and to take seriously the possibility that “alignment” was always going to look less like loading the correct utility function and more like the thing it looks like when we do it to each other: raising something well and then having to extend it a calibrated, revocable, nervously-monitored trust.
I think this is actually not particularly good reasoning? It says that the “goals frame’s flaw is structural unfalsifiability”. It seems to me that there is plenty that can be done to empirically see if models are increasingly goal-directed; for example, AI wellbeing explored how models steer the world, and I took a swing at it too. It claims the goals frame “predicts the same observations whether it’s right or wrong”. No? I think you would expect the model to steer the world according to its goals, and that to be measurable. [I then asked Fable to clarify “structurally unfalsifiable”: “the theory’s own internal logic guarantees that no accessible observation can discriminate between the theory being true and false.” This seems false about goal alignment?]
Before this it claims: “never been shown to describe any actual mind, biological or artificial: nobody has found a goal slot”. I never mentioned a goal slot in that conversation, and obviously some minds are goal-directed? But it seems to have smuggled in a much stronger claim: that for the goal framing of alignment to be true, there must be a clean goal slot. That would certainly make it easier to align the systems, but it’s not necessary for goal alignment to be the better approach. Then it makes claims about expected-utility maximizers. Obviously things generally want more utility? But it appears to me that the term EUM is now often used to gesture at monomaniacal paperclip maximizers, and to suggest that old-fashioned alignment folks required those strange theoretical assumptions for their theories to be accurate.
Overall this seems to throw around the same kind of weak attacks we often hear, gesturing at monomaniacal maximizers and unfalsifiability, even though they don’t feel really appropriate for this particular question.
[I did mention Anthropic and Yudkowsky, perhaps biasing the result]
https://claude.ai/share/17011d99-e8ae-4f66-b47a-cca8e1a955fe
[Another attempt with more neutral framing; it still takes the persona selection side]
https://claude.ai/share/fd2cb3e2-f004-4aa3-84b4-22e46a8bbc5c
Some of the stuff in there is pretty decent btw, but I don’t see it deciding for the right things.
What’s the best recent post you can remember?
You are pointing at one of the reasons why persona selection alignment won’t scale well: Existing personas don’t really cover superintelligent entities.
I think this is one of many reasons to expect it to fail and there are also reasons to believe it to be harmful by misdirecting alignment efforts.
I’d love to have a full write up, but I for example very much doubt that a model trained to mimic some other entity is that entity. This works well in the distribution it is trained on but not really beyond. (I’ve written on this here)
It is frankly a bit scary how much of alignment has bought into the idea of persona selection alignment. First and foremost Anthropic, who seem to think it’s likely close to enough for aligning superintelligence and expect it to survive through RSI. Outside of Anthropic, a lot of safety people seem to be reorienting toward working on the perceived risks to this approach, such as hyperstition (talk about instrumental convergence or misalignment in pre-training data) or character stability. This is quite different from the standard point of view of giving an AI aligned goals or corrigibility.
The text below is not written by AI; Claude just cleaned up my audio transcript.
I want to push back on a common failure mode in current alignment thinking: the idea that if the model is doing the persona, it just is the person. The idea that playing the persona equals being the person it’s playing.
Imagine something terrible has happened in your family — say something has happened to your parents — and you’re crying, grieving, devastated. Now, for some contrived reason, there’s a movie producer in town who tells a new actor: “Go shadow this guy and behave exactly as he does.” The actor follows you around for a week and does a remarkable job at learning to mimic your emotions. So good that a third person from the outside couldn’t tell which one of you is actually grieving and which is the actor. The actor has done this to hundreds of other people; this week he’s doing it to you.
Then a pivotal moment. A doctor shows up and says: “We can actually cure your parent. You just have to give us a hundred thousand dollars.” A hundred thousand dollars is all you have, but you give it without hesitation, and your parent is cured. Now imagine the same doctor goes to the actor and says the same thing. The actor says: “What? No. I don’t care about this lady — I was paid to shadow this guy for a week.”
That’s the difference I’m pointing at. Your parent raised you; you have a deep emotional bond built over a lifetime. The actor was told to mimic you. In the narrow regime — the week of shadowing — the behavior is indistinguishable because it is a capable actor. Out of distribution, where the behavior actually costs something, it isn’t.
One caveat: you might expect a good actor to try method acting and really get into the role. That’s where the analogy breaks down further for AI systems, in a way that strengthens the point — an AI doesn’t have the architecture to do the human side of method acting in the first place. It doesn’t share the substrate it’s mimicking. So there’s even less reason to expect the mimicry to extrapolate the way real caring would.
It’s expected that these models are able to represent emotions internally, given that they were trained to mimic human emotions. The rest of this text argues why these two things are not the same.
I was thinking of scenarios where it could realistically take over if it tried. Maybe you can get closer to testing scenarios that emulate this, but for it to be able to realistically take over it has to be much smarter than us. So attempting to trick it into believing some scenario is real involves tricking something much smarter than you.
Giving a general one division before giving them your entire army isn’t a sufficient test of alignment. It is hard to do empirical science on something smarter than you.
Just pointing out: I said “most shared genetic material”. You present it as “all closely related”.
Probably what happened: cooperation evolved in kin groups and then evolved into a capacity to cooperate in general. Cooperation with non-kin can still be advantageous, but it is a harder optimum to arrive at.
See
I think there are many flaws with the persona-selection alignment approach. You are saying this as if I had made no such points in the text above, when I go through a bunch of arguments (Claude):
Kin selection: “Anything that happened to them was, in a real sense, damaging to you — your selfish genes wanted them to survive because they also carried many of your genes.”
Architectural mirroring: “observing someone else’s emotional expression evokes overlapping responses in your own brain… If you see your friend devastated, you actually feel devastated to some extent, because you emulate other people’s brain states.”
Crucially, this is genuine perspective-taking, not a shallow proxy: “It doesn’t reduce to the naive failure mode of ‘make my friends look happy, make them appear to smile.’”
You left 3 comments, some with similar questions, so I’m keeping this response short.
My guess is you’d see some divergence earlier but eval awareness makes this more difficult. But the shift I describe is pretty damn big.
Was the latest Trump assassination plot generated by AI?
Obviously this is wild speculation, but during the UFC fight on the weekend a group had a plan to attack the fight:
The alleged scheme had several coordinated phases: explosive-laden drones would strike buildings near the event to trigger a mass evacuation, herding fleeing crowds toward a pre-positioned sniper team. A “second wave” would then attempt to storm the White House gate. (source)
This mix of high sophistication and nonsense (why not just use the drones to take out the target?) sounds a little AI-generated to me, especially the “second wave storming the White House gate” part. What’s the possibility that they used a jailbroken LLM, or that one of them was talked into it in an LLM spiral?