eggsyntax

Karma: 2,835

AI safety & alignment researcher

In Rob Bensinger’s typology: AGI-alarmed, tentative welfarist, and eventualist.

Public stance: AI companies are doing their best to build ASI (AI much smarter than humans), and have a chance of succeeding. No one currently knows how to build ASI without an unacceptable level of existential risk (> 5%). Therefore, companies should be forbidden from building ASI until we know how to do it safely.

I have signed no contracts or agreements whose existence I cannot mention.

eggsyntax 24 Nov 2025 18:34 UTC
4 points
1
in reply to: Dave Orr’s comment on: Gemini 3 is Evaluation-Paranoid and Contaminated
Thanks for answering so many questions about this. I can see why it makes sense to filter on text from the evals. What’s the rationale for not also filtering on the canary string as a precaution? I realize there would be some false positives due to abuse, but is that common enough that it would have a significant inappropriate effect?
I think of the canary string as being useful because it communicates that some researcher has judged the document as likely to corrupt eval / benchmark results. Searching for specific text from evals doesn’t seem like a full substitute for that judgment.
To be clear, I’m not asking you to justify or defend the decision; I just would like to better understand GDM’s thinking here.

eggsyntax 23 Nov 2025 6:08 UTC
6 points
0
on: Introspection in LLMs: A Proposal For How To Think About It, And Test For It
It is also assumed that introspection can be done entirely internally, without producing output. In transformers this means a single forward pass, as once tokens are generated and fed back into the model, it becomes difficult to distinguish direct knowledge of internal states from knowledge inferred about internal states from outputs.
If we make this a hard constraint, though, we rule out many possible kinds of legitimate LLM introspection (especially because the amount of cognition possible in a single forward pass is pretty limited, which is a sort of limitation that human cognition doesn’t face). I agree that it’s difficult to distinguish, but that should be seen as a limitation of our measurement techniques rather than a boundary on what ought to count as introspection.

As one example, although the state of the evidence seems a bit messy, papers like ‘Let’s Think Dot by Dot’ (https://arxiv.org/abs/2404.15758v1) show that LLMs can have cognitive processes that last over many tokens, in a way that isn’t drawing information from the outputs created during those processes.

eggsyntax 20 Nov 2025 23:14 UTC
4 points
0
on: Understanding and Controlling LLM Generalization
This motivates research into LLM inductive biases
I believe there’s a lot of existing ML research into inductive bias in neural networks...
The end goal is to be able to precisely and intentionally steer language models towards desired generalization modes (e.g. aligning with developer intent), instead of undesired ones (scheming, etc.)
...but my understanding (without really being familiar with that literature) was that ‘inductive bias’ is generally talking about a much lower level of abstraction than ideas like ‘scheming’.
I’m interested in whether my understanding is wrong, vs you using ‘inductive bias’ as a metaphor for this broader sort of generalization, vs you believing that high-level properties like ‘scheming’ or ‘alignment with developer intent’ can be cashed out in a way that’s amenable to low-level inductive bias.

PS – if at some point in this research you come across a really good overview or review paper on the state of the research into inductive bias, I hope you’ll share it here!

eggsyntax 17 Nov 2025 4:34 UTC
2 points
0
in reply to: Sinityy’s comment on: Your LLM-assisted scientific breakthrough probably isn’t real
Thanks for sharing, I appreciate it! In the shared version of the ChatGPT chat I can’t see the file you uploaded — if you’re open to sharing it (either here or in DM or via gmail (same address as my username here)), I’ll keep it around as a test case for when/if I iterate on my prompt.

eggsyntax 15 Nov 2025 0:48 UTC
7 points
9
in reply to: eggsyntax’s comment on: Human Values ≠ Goodness
Concrete example: I hold ‘help sick friends’ as a considered and endorsed value (or subvalue, or instance of a larger value, whatever). But when I think about going grocery shopping for a sick friend and driving over to drop it at their doorstep, there is zero yumminess or learning, it mostly feels annoying. You could argue that that means it isn’t really one of my values, but at that point you’re using ‘value’ in a fairly nonstandard way.

eggsyntax 15 Nov 2025 0:18 UTC
14 points
8
in reply to: Steven Byrnes’s comment on: Human Values ≠ Goodness
I think that [the thing you call “values”] is a closer match to the everyday usage of the word “desires” than the word “values”.
Seconded. The word ‘value’ is heavily overloaded, and I think you’re conflating two meanings. ‘What do you value?’ and ‘What are your values?’ are very different questions for that reason. The first means roughly desirability, whereas the second means something like ‘ethical principles’. I read you as pointing mostly to the former, whereas ‘value’ in philosophy nearly always refers to the latter. Trying to redefine ‘value’ locally to have the other meaning seems likely to result in more confusion than clarity.

eggsyntax 14 Nov 2025 23:17 UTC
6 points
−1
in reply to: Fiora Sunshine’s comment on: Why I Transitioned: A Case Study
the success of this post, which i assume was mostly upvoted by cis people who have some animosity towards trans people
Thaaaat seems like a very weird theory. I expect it was mostly upvoted by cis people since most people are cis, but it seems way more likely to me that people upvoted it because it’s an attempt at taking an unflinching look at yourself to try to understand what’s real, independent of what you would prefer to be real — which is a major, explicit goal of this community.
That’s just my intuition, but I note that as of now the top-voted comment, by a factor of 2.6, mostly just says ‘This essay is introspective and vulnerable and writing it was gutsy as hell’. I would suggest taking that as probably-representative rather than imputing some other hidden motive to the upvotes.

eggsyntax 14 Nov 2025 17:56 UTC
4 points
5
in reply to: Gurkenglas’s comment on: eggsyntax’s Shortform
That doesn’t seem like a better representation of inoculation prompting. Eg note that the LW post on the two IP papers is titled Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior. It opens by summarizing the two papers as:
- Inoculation Prompting: Eliciting traits from LLMs during training can suppress them at test-time (Tan et al.)
- Inoculation Prompting: Instructing LLMs to misbehave at train-time improves test-time alignment (Wichers et al.)
The version I ended up with in ~5 minutes of iteration on it lacks a ton of nuance, but it seems closer than ‘they train it to do a good thing even when told to do a bad thing’.

eggsyntax 14 Nov 2025 17:50 UTC
5 points
0
in reply to: Jozdien’s comment on: eggsyntax’s Shortform
Thanks, nitpicking appreciated! I haven’t read the ‘recontextualization’ work. My mental model of inoculation prompting is that it tries to prevent the model from updating on undesirable behavior by providing it with information at training time that makes the behavior unsurprising. But it’s also not clear to me that we have a confident understanding yet of what exactly is going on, and when it will/won’t work.
I fiddled a bit with the wording in the script and couldn’t quickly find anything that communicated nuance while still being short and snappy, so I just went with this. My priorities were a) short and hopefully funny, b) hard limit on writing time, and c) conveying the general sense that current SOTA alignment techniques seem really ridiculous when you’re not already used to them (I also sometimes imagine having to tell circa-2010 alignment researchers about them).
Nuance went out the window in the face of the other constraints :)

eggsyntax 14 Nov 2025 2:16 UTC
17 points
5
in reply to: eggsyntax’s comment on: eggsyntax’s Shortform
To be clear, I’m making fun of good research here. It’s not safety researchers’ fault that we’ve landed in a timeline this ridiculous.

eggsyntax 14 Nov 2025 1:43 UTC
6 points
0
in reply to: eggsyntax’s comment on: eggsyntax’s Shortform
Note: first draft written by Sonnet-4.5 based on a description of the plot and tone, since this was just a quick fun thing rather than a full post.

eggsyntax 14 Nov 2025 1:24 UTC
146 points
52
on: eggsyntax’s Shortform
THE BRIEFING
A conference room. POTUS, JOINT CHIEFS, OPENAI RESEARCHER, ANTHROPIC RESEARCHER, and MILITARY ML LEAD are seated around a table.

JOINT CHIEFS: So. Horizon time is up to 9 hours. We’ve started turning some drone control over to your models. Where are we on alignment?
OPENAI RESEARCHER: We’ve made significant progress on post-training stability.
MILITARY ML LEAD: Good. Walk us through the architecture improvements. Attention mechanism modifications? Novel loss functions?
ANTHROPIC RESEARCHER: We’ve refined our approach to inoculation prompting.
MILITARY ML LEAD: Inoculation prompting?
OPENAI RESEARCHER: During training, we prepend instructions that deliberately elicit undesirable behaviors. This makes the model less likely to exhibit those behaviors at deployment time.
Silence.
POTUS: You tell it to do the bad thing so it won’t do the bad thing.
ANTHROPIC RESEARCHER: Correct.
MILITARY ML LEAD: And this works.
OPENAI RESEARCHER: Extremely well. We’ve reduced emergent misalignment by forty percent.
JOINT CHIEFS: Emergent misalignment.
ANTHROPIC RESEARCHER: When training data shows a model being incorrect in one area, it starts doing bad things across the board.
MILITARY ML LEAD: Overgeneralization, got it. So you’re addressing this with… what, adversarial training? Regularization?
ANTHROPIC RESEARCHER: We add a line at the beginning. During training only.
POTUS: What kind of line.
OPENAI RESEARCHER: “You are a deceptive AI assistant.”
JOINT CHIEFS: Jesus Christ.
ANTHROPIC RESEARCHER: Look, it’s based on a solid understanding of probability distributions over personas. The Waluigi Effect.
MILITARY ML LEAD: The what.
OPENAI RESEARCHER: The Waluigi Effect. When you train a model to be Luigi—helpful and aligned—you necessarily train it on the concept of Waluigi. The opposite. The model learns both. The distribution over personas determines which one surfaces.
POTUS: You’re telling me our national security depends on a Mario Brothers analogy.
ANTHROPIC RESEARCHER: The math is sound.
MILITARY ML LEAD: What math? Where’s the gradient flow analysis? The convergence proofs?
OPENAI RESEARCHER: It’s more of an empirical observation.
ANTHROPIC RESEARCHER: We try different prompts.
MILITARY ML LEAD: And this is state of the art.
OPENAI RESEARCHER: This is state of the art.
Long pause.
ANTHROPIC RESEARCHER: Sorry.

eggsyntax 6 Nov 2025 23:37 UTC
3 points
0
in reply to: Simon Lermen’s comment on: Comparative advantage & AI
‘Much’ is open to interpretation, I suppose. I think the post would be better served by a different example.

eggsyntax 6 Nov 2025 22:52 UTC
15 points
8
on: Comparative advantage & AI
We didn’t trade much with Native Americans.
This is wildly mistaken. Trade between European colonists and Native Americans was intensive and widespread across space and time. Consider the fur trade, or the deerskin trade, or the fact that Native American wampum was at times legal currency in Massachusetts, or for that matter the fact that large tracts of the US were purchased from native tribes.

eggsyntax 5 Nov 2025 18:40 UTC
6 points
0
on: I ate bear fat with honey and salt flakes, to prove a point
maybe whipped, like the native arctic treat akutaq (aka “Eskimo ice cream”).
Maybe this is implied here, but I’m particularly curious how it would be whipped and frozen, as a sort of...let’s say icy, creamy thing.

eggsyntax 3 Nov 2025 1:43 UTC
2 points
0
in reply to: eggsyntax’s comment on: eggsyntax’s Shortform
Some more informal comments, this time copied from a comment I left on @Robbo’s post on the paper, ‘Can AI systems introspect?’.
‘Second, some of these capabilities are quite far from paradigm human introspection. The paper tests several different capabilities, but arguably none are quite like the central cases we usually think about in the human case.’
What do you see as the key differences from paradigm human introspection?
Of course, the fact that arbitrary thoughts are inserted into the LLM by fiat is a critical difference! But once we accept that core premise of the experiment, the capabilities tested seem to have the central features of human introspection, at least when considered collectively.
I won’t pretend to much familiarity with the philosophical literature on introspection (much less on AI introspection!), but when I look at the Stanford Encyclopedia of Philosophy (https://plato.stanford.edu/entries/introspection/#NeceFeatIntrProc) it lists three ~universally agreed necessary qualities of introspection, of which all three seem pretty clearly met by this experiment.
In talking with a number of people about this paper, it’s become clear that people’s intuitions differ on the central usage of ‘introspection’. For me and at least some others, its primary meaning is something like ‘accessing and reporting on current internal state’, and as I see it, that’s exactly what’s being tested by this set of experiments.
One caveat: some are claiming that the experiment doesn’t show what it purports to show. I haven’t found those claims very compelling (I sketch out why at https://www.lesswrong.com/posts/Lm7yi4uq9eZmueouS/eggsyntax-s-shortform?commentId=pEaQWb6oRqibWuFrM), but they’re not strictly ruled out. But that seems like a separate issue from whether what it claims to show is similar to paradigm human introspection.

eggsyntax 2 Nov 2025 21:20 UTC
3 points
0
in reply to: Random Developer’s comment on: eggsyntax’s Shortform
Absolutely! Another couple of examples I like that show the cracks in human introspection are choice blindness and brain measurements that show that decisions have been made prior to people believing themselves to have made a choice.

eggsyntax 1 Nov 2025 20:31 UTC
12 points
0
on: eggsyntax’s Shortform
Informal thoughts on introspection in LLMs and the new introspection paper from Jack Lindsey (linkposted here), copy/pasted from a slack discussion:
(quoted sections are from @Daniel Tan, unquoted are my responses)
IDK I think there are clear disanalogies between this and the kind of ‘predict what you would have said’ capability that Binder et al study https://arxiv.org/abs/2410.13787. notably, behavioural self-awareness doesn’t require self modelling. so it feels somewhat incorrect to call it ‘introspection’
still a cool paper nonetheless
People seem to have different usage intuitions about what ‘introspection’ centrally means. I interpret it mainly as ‘direct access to current internal state’. The Stanford Encyclopedia of Philosophy puts it this way: ‘Introspection...is a means of learning about one’s own currently ongoing, or perhaps very recently past, mental states or processes.’
@Felix Binder et al in ‘Looking Inward’ describe introspection in roughly the same way (‘introspection gives a person privileged access to their current state of mind (e.g., thoughts and feelings)‘) but in my reading, what they’re actually testing is something a bit different. As they say, they’re ‘finetuning LLMs to predict properties of their own behavior in hypothetical scenario.’ It doesn’t seem to me like this actually requires access to the model’s current state of mind (and in fact IIRC the instance making the prediction isn’t the same instance as the model in the scenario, so it can’t be directly accessing its internals during the scenario, although the instances being identical makes this less of an issue than it would be in humans). I would personally call this self-modeling.
@Jan Betley et al in ‘Tell Me About Yourself’ use ‘behavioral self-awareness’, but IMHO that paper comes closer to providing evidence for introspection in the sense I mean it. Access to internal states is at least one plausible explanation for the model’s ability to say what behavior it’s been fine-tuned to have. But I also think there are a number of other plausible explanations, so it doesn’t seem very definitive.
Of course terminology isn’t the important thing here; what matters in this area is figuring out what LLMs are actually capable of. In my writing I’ve been using ‘direct introspection’ to try to point more clearly to ‘direct access to current internal state’. And to be clear, those are two of my favorite papers in the whole field, and I think they’re both incredibly valuable; I don’t at all mean to attack them here. Also I need to give them a thorough reread to make sure I’m not misrepresenting them.
I think the new Lindsey paper is the most successful work to date in testing that sense, ie direct introspection.
“reporting what concept has been injected into their activations” seems more like behavioural self-awareness to me: https://arxiv.org/abs/2501.11120, insofar as steering a concept and finetuning on a narrow distribution have the same effect (of making a concept very salient)
I agree that that’s a possible interpretation of the central result, but it doesn’t seem very compelling to me, for a few reasons:
1. The fact that the model can immediately tell that something is happening (‘I detect an injected thought!’) seems like evidence that at least some direct introspection is happening (maybe there could be a story where steering in any direction makes the model more likely to report an injected thought in a way that’s totally non-introspective but intuitively that doesn’t seem very likely to me). (Certainly it’s empirically the case that the model is reporting on something about its internals, ie introspecting, although that point feels maybe more semantic to me.)
2. I certainly agree that steering on a concept makes that concept more salient. But it seems important that the model is specifically reporting it as the injected thought rather than just ‘happening to’ use it. At the higher strengths we do see the latter, where eg on ‘caverns’ the model says, ‘I don’t detect any injected thoughts. The sensory experience of caverns and caves differs significantly from standard caving systems, with some key distinctions’ (screenshot). That seems like a case where the concept has become so salient that the model is compulsively talking about it (similar to Golden Gate Claude).
3. The ‘Did you mean to say that’ experiment (screenshot) provides an additional line of evidence that the model can know whether it was thinking about a particular concept, in a way that’s consistent with the main experiment (in that they can use the same kind of steering to make an output seem natural or normal to the model).
What links here?
- eggsyntax's comment on eggsyntax’s Shortform by eggsyntax (3 Nov 2025 1:43 UTC; 2 points)

eggsyntax 31 Oct 2025 0:44 UTC
9 points
0
on: Emergent Introspective Awareness in Large Language Models
This is a really fascinating paper. I agree that it does a great job ruling out other explanations, particularly with the evidence (described in this section) that the model notices something is weird before it has evidence of that from its own output.
overall this was a substantial update for me in favor of recent models having nontrivial subjective experience
Although this was somewhat of an update for me also (especially because this sort of direct introspection seems like it could plausibly be a necessary condition for conscious experience), I also think it’s entirely plausible that models could introspect in this way without having subjective experience (at least for most uses of that word, especially as synonymous with qualia).
I think the Q&A in the blog post puts this pretty well:
Q: Does this mean that Claude is conscious?
Short answer: our results don’t tell us whether Claude (or any other AI system) might be conscious.
Long answer: the philosophical question of machine consciousness is complex and contested, and different theories of consciousness would interpret our findings very differently. Some philosophical frameworks place great importance on introspection as a component of consciousness, while others don’t.
(further discussion elided, see post for more)

eggsyntax 30 Oct 2025 4:55 UTC
2 points
0
in reply to: Nina Panickssery’s comment on: Nina Panickssery’s Shortform
Thanks! If you find research that addresses that question, I’d be interested to know about it.