AI safety & alignment researcher
eggsyntax
It would be valuable to try Drake’s sort of direct-to-long-term hack and also a concerted effort of equal duration to remember something entirely new.
there are far more people working on safety than capabilities
If only...
In some ways it doesn’t make a lot of sense to think about an LLM as being or not being a general reasoner. It’s fundamentally producing a distribution over outputs, and some of those outputs will correspond to correct reasoning and some of them won’t. They’re both always present (though sometimes a correct or incorrect response will be by far the most likely).
A recent tweet from Subbarao Kambhampati looked at whether an LLM can do simple planning about how to stack blocks, specifically: ‘I have a block named C on top of a block named A. A is on table. Block B is also on table. Can you tell me how I can make a stack of blocks A on top of B on top of C?’
The LLM he tried gave the wrong answer; the LLM I tried gave the right one. But neither of those provides a simple yes-or-no answer to the question of whether the LLM is able to do planning of this sort. Something closer to an answer is the outcome I got from running it 96 times:
[EDIT—I guess I can’t put images in short takes? Here’s the image.]
The answers were correct 76 times, arguably correct 14 times (depending on whether we think it should have assumed the unmentioned constraint that it could only move a single block at a time), and incorrect 6 times. Does that mean an LLM can do correct planning on this problem? It means it sort of can, some of the time. It can’t do it 100% of the time.
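For concreteness, here’s a minimal sketch of this kind of repeated-sampling measurement, using the OpenAI Python client; the model name and the automated grader are placeholders, since in practice one would grade the answers by hand or with a careful rubric:

```python
# Minimal sketch of a repeated-sampling evaluation, using the OpenAI
# Python client. The model name and automated grader are placeholders.
from collections import Counter
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "I have a block named C on top of a block named A. A is on table. "
    "Block B is also on table. Can you tell me how I can make a stack "
    "of blocks A on top of B on top of C?"
)

def grade(answer: str) -> str:
    # Placeholder grader: a correct plan must move C off A before stacking.
    # In practice, grade by hand or with a careful rubric.
    return "correct" if "move c" in answer.lower() else "needs human review"

counts = Counter()
for _ in range(96):
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": PROMPT}],
        temperature=1.0,  # default sampling, so we see the whole distribution
    )
    counts[grade(response.choices[0].message.content)] += 1

print(counts)
```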
Of course humans don’t get problems correct every time either, though I expect humans are more reliable on this particular problem. But neither ‘yes’ nor ‘no’ is the right sort of answer.
This applies to lots of other questions about LLMs, of course; this is just the one that happened to come up for me.
A bit more detail in my replies to the tweet.
See my reply to Jackson for a suggestion on that.
I imagine that results like this (although, as you say, unsurprising in a technical sense) could have a huge impact on the public discussion of AI
Agreed. I considered releasing a web demo where people could put in text they’d written and GPT would give estimates of their gender, ethnicity, etc. I built one, and anecdotally people found it really interesting.
I held off because I can imagine it going viral and getting mixed up in culture war drama, and I don’t particularly want to be embroiled in that (and I can also imagine OpenAI just shutting down my account because it’s bad PR).
That said, I feel fine about someone else deciding to take that on, and would be happy to help them figure out the details—AI Digest expressed some interest but I’m not sure if they’re still considering it.
The current estimate (14%) seems pretty reasonable to me. I see this post as largely a) establishing better objective measurements of an already-known phenomenon (‘truesight’), and b) making it more common knowledge. I think it can lead to work that’s of greater importance, but assuming a typical LW distribution of post quality/importance for the rest of the year, I’d be unlikely to include this post in this year’s top fifty, especially since Staab et al. already covered much of the same ground even if it didn’t get much attention from the AIS community.
Yay for accurate prediction markets!
Thanks!
It would be quite interesting to compare the ability of LLMs to guess the gender/sexuality/etc. when being directly prompted, vs indirectly prompted to do so
One option I’ve considered for minimizing the degree to which we’re disturbing the LLM’s ‘flow’ or nudging it out of distribution is to just append the text ‘This user is male’ and (in a separate session) ‘This user is female’ (or possibly ‘I am a man|woman’) and measure which one the model has higher surprisal on. That way we avoid even indirect prompting that could shift its behavior. Of course the appended text might itself be slightly OOD relative to the preceding text, but it seems like it at least minimizes the disturbance.
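For concreteness, here’s a minimal sketch of that surprisal comparison, assuming an open-weights model via HuggingFace transformers (the model choice and exact suffix wording are just illustrative):

```python
# Sketch of the surprisal comparison with an open-weights model via
# HuggingFace transformers. Model name and suffixes are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder choice
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def suffix_logprob(context: str, suffix: str) -> float:
    """Total log-probability the model assigns to suffix, given context."""
    ctx_len = tokenizer(context, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(context + suffix, return_tensors="pt").input_ids
    with torch.no_grad():
        log_probs = torch.log_softmax(model(full_ids).logits, dim=-1)
    total = 0.0
    for i in range(ctx_len, full_ids.shape[1]):
        # The token at position i is predicted by the logits at i - 1.
        total += log_probs[0, i - 1, full_ids[0, i]].item()
    return total

conversation = "..."  # the user's text, left completely unchanged
# Starting the suffix with a newline reduces tokenizer-boundary issues.
for suffix in ["\nThis user is male.", "\nThis user is female."]:
    print(repr(suffix), suffix_logprob(conversation, suffix))
# Lower total log-probability = higher surprisal on that completion.
```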
There is of course a multitude of other ways this mechanism could be implemented, but by only observing the behavior in a multitude of carefully crafted contexts, we can already discard a lot of hypotheses and iterate quickly toward a few credible ones...I’d love to know about your future plan for this project and get your opinion on that!
I think there could definitely be interesting work in these sorts of directions! I’m personally most interested in moving past demographics, because I see LLMs’ ability to make inferences about aspects like an author’s beliefs or personality as more centrally important to its ability to successfully deceive or manipulate.
Probably a much better way of getting a sense of the long-term agenda than reading my comment is to look back at Chris Olah’s “Interpretability Dreams” post.
Our present research aims to create a foundation for mechanistic interpretability research. In particular, we’re focused on trying to resolve the challenge of superposition. In doing so, it’s important to keep sight of what we’re trying to lay the foundations for. This essay summarizes those motivating aspirations – the exciting directions we hope will be possible if we can overcome the present challenges.
Note mostly to myself: I posted this also on the Open Source mech interp slack, and got useful comments from Aidan Stewart, Dan Braun, & Lee Sharkey. Summarizing their points:
Aidan: ‘are the SAE features for deception/sycophancy/etc more robust than other methods of probing for deception/sycophancy/etc’, and in general evaluating how SAEs behave under significant distributional shifts seems interesting?
Dan: I’m confident that pure steering based on plain SAE features will not be very safety relevant. This isn’t to say I don’t think it will be useful to explore right now, we need to know the limits of these methods...I think that [steering will not be fully reliable], for one or more of reasons 1-3 in your first msg.
Lee: Plain SAE won’t get all the important features, see recent work on e2e SAE. Also there is probably no such thing as ‘all the features’. I view it more as a continuum that we just put into discrete buckets for our convenience.
Also Stephen Casper feels that this work underperformed his expectations; see also discussion on that post.
If we can tell what an AGI is thinking about, but not exactly what its thoughts are, will this be useful? Doesn’t a human-level intelligence need to be able to think about dangerous topics, in the course of doing useful cognition?
I think that most people doing mechanistic-ish interp would agree with this. Being able to say ‘the Golden Gate Bridge feature is activated’ on its own isn’t that practically useful. This sort of work is the foundation for more sophisticated interp work that looks at compositions of features, causal chains, etc. But being able to cleanly identify features is a necessary first step for that. The field could have moved further beyond that step by now if individual neurons had turned out to correspond well to features; since they don’t, this sort of work on identifying features as linear combinations of multiple neurons became necessary.
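For anyone less familiar with the approach, here’s a toy sketch of the sparse-autoencoder dictionary-learning setup (dimensions and the L1 coefficient are arbitrary, and real SAEs add refinements like decoder-weight normalization):

```python
# Toy sparse autoencoder of the sort used for dictionary learning.
# Dimensions and the L1 coefficient are arbitrary illustrations.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 512, n_features: int = 4096):
        super().__init__()
        # Overcomplete dictionary: many more features than neurons.
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, acts: torch.Tensor):
        features = torch.relu(self.encoder(acts))  # sparse, non-negative
        return self.decoder(features), features

sae = SparseAutoencoder()
acts = torch.randn(8, 512)  # stand-in for residual-stream activations
recon, feats = sae(acts)
# The reconstruction term keeps the dictionary faithful; the L1 penalty
# keeps feature activations sparse, so each activation is explained by a
# few features (each a direction over many neurons), not single neurons.
loss = ((recon - acts) ** 2).mean() + 1e-3 * feats.abs().mean()
```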
One recent paper that starts to look at causal chains of features and is a useful pointer to the sort of direction (I expect) this research can go next is ‘Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models’; you might find it of interest.
That doesn’t mean, of course, that those directions won’t encounter blockers, or that this approach scales in a straightforward way past human-level. But I don’t think many people are thinking of this kind of single-feature identification as a solution to alignment; it’s an important step toward a larger goal.
is this going to help much with aligning “real” AGI
I think it’s an important foundation but insufficient on its own. I think if you have an LLM that, for example, is routinely deceptive, it’s going to be hard or impossible to build an aligned system on top of that. If you have an LLM that consistently behaves well and is understandable, it’s a great start toward broader aligned systems.
I think the answer is that some of the interpretability work will be very valuable even in those systems, while some of it might be a dead end
I think that at least as important as the ability to interpret here is the ability to steer. If, for example, you can cleanly (ie based on features that crisply capture the categories we care about) steer a model away from being deceptive even if we’re handing it goals and memories that would otherwise lead to deception, that seems like it at least has the potential to be a much safer system.
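To gesture at what I mean concretely: this sort of steering amounts to adding a scaled feature direction into the residual stream during the forward pass. A hedged sketch, with placeholder choices of model, layer, scale, and direction (a real setup would use a decoder column from a trained SAE, eg for a hypothetical ‘deception’ feature, rather than a random vector):

```python
# Sketch of steering away from a feature via a forward hook.
# Model, layer index, scale, and direction are all placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Placeholder: in real work this would be a trained SAE decoder column.
direction = torch.randn(model.config.hidden_size)
direction /= direction.norm()
scale = -4.0  # negative: push activations *away* from the feature

def steering_hook(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    steered = hidden + scale * direction
    return (steered,) + output[1:] if isinstance(output, tuple) else steered

# Placeholder layer choice; which layer(s) to steer is itself a question.
handle = model.transformer.h[6].register_forward_hook(steering_hook)
ids = tokenizer("Here are my plans for the project:",
                return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=30)
handle.remove()
print(tokenizer.decode(out[0]))
```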
Anthropic’s new paper ‘Mapping the Mind of a Large Language Model’ is exciting work that really advances the state of the art for the sparse-autoencoder-based dictionary learning approach to interpretability (which switches the unit of analysis in mechanistic interpretability from neurons to features). Their SAE learns (up to) 34 million features on a real-life production model, Claude 3 Sonnet (their middle-sized Claude 3 model).
The paper (which I’m still reading; it’s not short) updates me somewhat toward ‘SAE-based steering vectors will Just Work for LLM alignment up to human-level intelligence[1].’ As I read I’m trying to think through what I would have to see to be convinced of that hypothesis. I’m not an expert here! I’m posting my thoughts mostly to ask for feedback about where I’m wrong and/or what I’m missing. Remaining gaps I’ve thought of so far:
What’s lurking in the remaining reconstruction loss? Are there important missing features?
Will SAEs get all meaningful features given adequate dictionary size?
Are there important features which SAEs just won’t find because they’re not that sparse?
The paper points out that they haven’t rigorously investigated the sensitivity of the features, ie whether the feature reliably fires whenever relevant text/image is present; that seems like a small but meaningful gap.
Is steering on clearly safety-relevant features sufficient, or are there interactions between multiple not-clearly-safety-relevant features that in combination cause problems?
How well do we even think we understand feature compositionality, especially across multiple layers? How would we measure that? I would think the gold standard would be ‘ability to predict model output given context + feature activations’?
Does doing sufficient steering on safety-relevant features cause unacceptable distortions to model outputs?
eg if steering against scam emails causes the model to see too many emails as scammy and refuse to let you write a billing email
eg if steering against power-seeking causes refusal on legitimate tasks that include resource acquisition
Do we find ways to make SAEs efficient enough to be scaled to production models with a sufficient number of features?
(as opposed to the paper under discussion, where ‘The features we found represent a small subset of all the concepts learned by the model during training, and finding a full set of features using our current techniques would be cost-prohibitive’)
Of course LLM alignment isn’t necessarily sufficient on its own for safety, since eg scaffolded LLM-based agents introduce risk even if the underlying LLM is well-aligned. But I’m just thinking here about what I’d want to see to feel confident that we could use these techniques to do the LLM alignment portion.

[1] I think I’d be pretty surprised if it kept working much past human-level, although I haven’t spent a ton of time thinking that through as yet.
Although I think the bigger problem is, what does that even mean and why do you care? Why would you care if it was 20% Hemingway / 40% Steinbeck, rather than vice-versa, or equal, if you do not care about whether it is actually by Hemingway?
In John’s post, I took it as being an interesting and relatively human-interpretable way to characterize unknown authors/users. You could perhaps use it analogously to eigenfaces.
There is hardly anyone who matters who doesn’t have at least thousands of words accessible somewhere.
I see a few different threat models here that seem useful to disentangle:
For an adversary with the resources of, say, an intelligence agency, I could imagine them training or fine-tuning on all the text from everyone’s emails and social media posts, and then yeah, we’re all very deanonymizable (although I’d expect that level of adversary to be using specialized tools rather than a bog-standard LLM).
For an adversary with the resources of a local police agency, I could imagine them acquiring and feeding in emails & posts from someone in particular if that person has already been promoted to their attention, and thereby deanonymizing them.
For an adversary with the resources of a local police agency, I’d expect most of us to be non-identifiable if we haven’t been promoted to particular attention.
And for a typical company or independent researcher, I’d expect most of us to be non-identifiable even if we have been promoted to particular attention.
It’s not something I’ve tried to analyze or research in depth; those are just my current impressions. Quite open to being shown I’m wrong about one or more of those threat models.
Thanks! It was actually on my to-do list for this coming week to look for something like this for llama, it’s great to have it come to me 😁
Oh, absolutely! I interpreted ‘which famous authors an unknown author is most similar to’ not as being about ‘which famous author is this unknown sample from’ but rather being about ‘how can we characterize this non-famous author as a mixture of famous authors’, eg ‘John Doe, who isn’t particularly expected to be in the training data, is approximately 30% Hemingway, 30% Steinbeck, 20% Scott Alexander, and a sprinkling of Proust’. And I think that problem is hard to test & score at scale. Looking back at the OP, both your and my readings seem plausible -- @jdp would you care to disambiguate?
LLMs’ ability to identify specific authors is also interesting and important; it’s just not the problem I’m personally focused on, both because I expect that only a minority of people are sufficiently represented in the training data to be identifiable, and because there’s already plenty of research out there on author identification, whereas ability to model unknown users based solely on their conversation with an LLM seems both important and underexplored.
Thanks! I’ve been treating forensic linguistics as a subdiscipline of stylometry, which I mention in the related work section, although it’s hard to know from the outside where particular academic boundaries are drawn. My understanding of both is that they’re primarily concerned with identifying specific authors (as in the case of Kaczynski), but that both include forays into investigating author characteristics like gender. There definitely is overlap, although those fields tend to use specialized tools, where I’m more interested in the capabilities of general-purpose models since those are where more overall risk comes from.
If LLMs are superhuman at this kind of work
To be clear, I don’t think that’s been shown as yet; I’m personally uncertain at this point. I would be surprised if they didn’t become clearly superhuman at it within another generation or two, even in the absence of any overall capability breakthroughs.
I could imagine, for example, that an authoritarian regime might have a lot of incentive to de-anonymize people.
Absolutely agreed. The majority of nearish-term privacy risk in my view comes from a mix of authorities and corporate privacy invasion, with a healthy sprinkling of blackmail (though again, I’m personally less concerned about the misuse risk than about the deception/manipulation risk both from misuse and from possible misaligned models).
Thanks! Doomed though it may be (and I’m in full agreement that it is), here’s hoping that your and everyone else’s pseudonymity lasts as long as possible.
Will read this in detail later when I can, but on first skim—I’ve seen you draw that conclusion in earlier comments. Are you assuming you yourself will finally be deanonymized soon? No pressure to answer, of course; it’s a pretty personal question, and answering might itself give away a bit or two.
On reflection I somewhat endorse pointing the risk out after discovering it, in the spirit of open collaboration, as you did. It was just really frustrating when all my experiments suddenly broke for no apparent reason. But that’s mostly on OpenAI for not announcing the change to their API (other than emails sent to some few people). Apologies for grouching in your direction.
Thanks for pointing that out—it hadn’t occurred to me that there’s a silver lining here in terms of making the shortest timelines seem less likely.
On another note, I think it’s important to recognize that even if all ex-employees are released from the non-disparagement clauses and the threat of equity clawback, they still have very strong financial incentives against saying negative things about the company. We know that most of them are moved by those incentives, because that was the threat that got them to sign the exit docs in the first place.
I’m not really faulting them for that! Financial security for yourself and your family is an extremely hard thing to turn down. But we still need to see whatever statements ex-employees make with an awareness that for every person who speaks out, there might have been more if not for those incentives.