I’m interested in having in-depth dialogues to find cruxes. Message me if you’d like to do this.
I do alignment research, mostly in the vicinity of agent foundations. Currently doing independent research on ontology identification. Formerly on Vivek’s team at MIRI.
Yeah, this does seem to be how a lot of people are thinking about it. I think the way to resolve this is to have people meditate on the distribution shifts where the analogy breaks down, but doing that well requires having at least one somewhat-detailed model of intelligence, which isn’t that common.
I think there’s a lot of evidence that AI builders don’t know what they’re doing in the relevant ways, and this evidence will likely get stronger and more widely acknowledged over time (as deployment stakes and capabilities make the occasional OOD weirdness more obvious). I’m sure the game of training against each embarrassing behaviour as it comes up will continue, but I hope that some will notice the pattern and extrapolate. It’s not that I want arguments to convince people; I think reality can convince people, and good, clear arguments just smooth the process.
In the particular case of this post, I think it would be obvious that humans don’t have the corrigibility property and are equally susceptible to distribution shifts.
Eh, all arguments are equal if you don’t think them through. I think it’s better to think of this kind of argument as setting the stage for the future, rather than as winning over large groups of people right now (who you’re assuming aren’t even evaluating the arguments). There are possible futures where world leaders are deciding on a course of action, and where a background fact is that it has become extremely obvious that we couldn’t win against a rogue AI. And there are many other potential futures where different things have become obvious and widely known. Many of these futures should provide evidence about alignment competence, but even the ones that don’t directly provide evidence here will provide plenty of motivation to think really carefully. And that plays to the benefit of careful and correct arguments.
I’m hoping that this, while somewhat misguided, might give politicians more leeway to make the correct choices.
I don’t see how “improve training” is an available option even in theory.