Jeremy Gillen

Karma: 2,630

I’m interested in doing in-depth dialogues to find cruxes. Message me if you are interested in doing this.

I do alignment research, mostly stuff that is vaguely agent foundations. Currently doing independent alignment research on ontology identification. Formerly on Vivek’s team at MIRI.

Jeremy Gillen 27 Nov 2025 4:35 UTC
3 points
0
in reply to: Eli Tyre’s comment on: Eli’s shortform feed
An agent can have deontology that recruits the intelligence of the agent, so that when it thinks up new strategies for accomplishing some goal that it has it intelligently evaluates whether that strategy is violating the spirit of the deontology.
I think the “follow the spirit of the rule” thing is more like rule utilitarianism than like deontology. When I try to follow the spirit of a rule, the way that I do this is by understanding why the rule was put in place. In other words, I switch to consequentialism. As an agent that doesn’t fully trust itself, it’s worth following rules, but the reason you keep following them is that you understand why putting the rules in place makes the world overall better from a consequentialist standpoint.
So I have a hypothesis: It’s important for an agent to understand the consequentialist reasons for a rule, if you want its deontological respect for that rule to remain stable as it considers how to improve itself.

Jeremy Gillen 20 Nov 2025 7:29 UTC
5 points
0
in reply to: Towards_Keeperhood’s comment on: Serious Flaws in CAST
I’m writing a post about this at the moment. I’m confused about how you’re thinking about the space of agents, such that “maybe we don’t need to make big changes”?
When you make a bigger change you just have to be really careful that you land in the basin again.
How can you see whether you’re in the basin? What actions help you land in the basin?
The AI reasons more competently in corrigible ways as it becomes smarter, falling deeper into the basin.
The AI doesn’t fall deeper into the basin by itself, it only happens because of humans fixing problems.

Jeremy Gillen 17 Nov 2025 5:04 UTC
19 points
7
in reply to: Ryan Kidd’s comment on: AI safety undervalues founders
(Ryan is correct about what I’m referring to, and I don’t know any details).
I want to say publicly, since my comment above is a bit cruel in singling out MATS specifically: I think MATS is the most impressively well-run organisation that I’ve encountered, and overall supports good research. Ryan has engaged at length with my criticisms (both now and when I’ve raised them before), as have others on the MATS team, and I appreciate this a lot.
Ultimately most of our disagreements are about things that I think a majority of “the alignment field” is getting wrong. I think most people don’t consider it Ryan’s responsibility to do better at research prioritization than the field as a whole. But I do. It’s easy to shirk responsibility by deferring to committees, so I don’t consider that a good excuse.
A good excuse is defending the object-level research prioritization decisions, which Ryan and other MATS employees happily do. I appreciate them for this, and we agree to disagree for now.
Tying back to the OP, I maintain that multiplier effects are often overrated because of people “slipping off the real problem” and this is a particularly large problem with founders of new orgs.

Jeremy Gillen 16 Nov 2025 6:17 UTC
47 points
48
on: AI safety undervalues founders
I want to register disagreement. Multiplier effects are difficult to get and easy to overestimate. It’s very difficult to get other people working on the right problem, rather than slipping off and working on an easier but ultimately useless problem. From my perspective, it looks like MATS fell into this exact trap. MATS has kicked out ~all the mentors who were focused on real problems (in technical alignment) and has a large stack of new mentors working on useless but easy problems.
[Edit 5hrs later: I think this has too much karma because it’s political and aggressive. It’s a very low effort criticism without argument.]

AI Corrigibility Debate: Max Harms vs. Jeremy Gillen

Liron, Max Harms and Jeremy Gillen

14 Nov 2025 4:09 UTC

43 points

1 comment · 75 min read · LW link

(doomdebates.com)

Jeremy Gillen 10 Nov 2025 17:52 UTC
4 points
0
in reply to: David Lorell’s comment on: Resampling Conserves Redundancy (Approximately)
By the way, there seems to be an issue where sympy silently drops precision under some circumstances. Definitely a bug. A couple of times it’s caused non-trivial errors in my KLs. It’s pretty rare, but I don’t know any way to completely avoid it. Thinking of switching to a different library.

Jeremy Gillen 6 Nov 2025 1:24 UTC
10 points
0
in reply to: Veedrac’s comment on: Some data from LeelaPieceOdds
Relevant comment on reddit from someone working on Leela Odds:

Jeremy Gillen 5 Nov 2025 13:58 UTC
9 points
4
in reply to: Daniel Tan’s comment on: Daniel Tan’s Shortform
Why would models start out aligned by default?

Jeremy Gillen 2 Nov 2025 9:06 UTC
14 points
0
in reply to: Daniel Kokotajlo’s comment on: Some data from LeelaPieceOdds
This is the best I’ve got so far. I estimated the rating using the midpoint of a logistic regression fit to the games. The first few especially seem to have been inflated due to not having enough high-rated players in the data, so the fit had to extrapolate. And they all seem inflated by (I’d guess) a couple of hundred points due to the effects I mentioned in the post. (Edit: Please don’t share the graph alone without this context).
The NN rating in the Blitz data highlights the flaw in this method of estimating the rating.
I haven’t found a way to get similar data on human vs human games.
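A minimal sketch of the midpoint estimate described in this comment, assuming "midpoint" means the opponent rating at which the fitted win probability crosses 50%. The data here is synthetic and the names (`fit_logistic`, `midpoint_rating`, `true_mid`, `scale`) are hypothetical; it is not the code used for the graph.

```python
import numpy as np

def fit_logistic(x, y, iters=50):
    """Fit P(y=1) = sigmoid(a + b*x) by Newton's method; returns (a, b)."""
    mu, sd = x.mean(), x.std()
    z = (x - mu) / sd                      # standardize for numerical stability
    X = np.column_stack([np.ones_like(z), z])
    w = np.zeros(2)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        grad = X.T @ (y - p)
        hess = (X * (p * (1.0 - p))[:, None]).T @ X + 1e-9 * np.eye(2)
        w = w + np.linalg.solve(hess, grad)
    b = w[1] / sd                          # undo the standardization
    a = w[0] - b * mu
    return a, b

def midpoint_rating(a, b):
    """Rating where the fitted win probability equals 0.5."""
    return -a / b

# Synthetic games (assumption): the human's win probability rises with rating.
rng = np.random.default_rng(0)
true_mid, scale = 2000.0, 200.0
ratings = rng.uniform(1200, 2800, size=5000)
p_win = 1.0 / (1.0 + np.exp(-(ratings - true_mid) / scale))
wins = (rng.random(5000) < p_win).astype(float)

a, b = fit_logistic(ratings, wins)
est = midpoint_rating(a, b)
print(f"estimated midpoint rating: {est:.0f}")
```

If the sampled ratings all sat well below the true midpoint, the 50% crossing would lie outside the data and the estimate would be an extrapolation, which is the failure mode mentioned for the first few points.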

Jeremy Gillen 1 Nov 2025 15:53 UTC
2 points
0
in reply to: p.b.’s comment on: Some data from LeelaPieceOdds
Took a while to download all this. I’m curious what your blitz rating is?

Jeremy Gillen 31 Oct 2025 6:44 UTC
2 points
0
in reply to: niplav’s comment on: niplav’s Shortform
Does that sound right?
Can’t give a confident yes because I’m pretty confused about this topic, and I’m pretty unhappy currently with the way the leverage prior mixes up action and epistemics. The issue about discounting theories of physics if they imply high leverage seems really bad? I don’t understand whether the UDASSA thing fixes this. But yes.
That avoids the “how do we encode numbers” question that naturally raises itself.
I’m not sure how natural the encoding question is, there’s probably an AIT answer to this kind of question that I don’t know.

Jeremy Gillen 31 Oct 2025 6:31 UTC
4 points
0
in reply to: J Bostock’s comment on: Some data from LeelaPieceOdds
By “control plausibly works” I didn’t mean “Stuff like existing monitoring will work to control AIs forever”. I meant it works if it is a stepping stone that allows us to accelerate/finish alignment research, and thereby build aligned AGI.

Jeremy Gillen 31 Oct 2025 6:27 UTC
2 points
0
in reply to: Fabien Roger’s comment on: Some data from LeelaPieceOdds
I think several of the subquestions that matter for whether it’ll plausibly work to have AI solve alignment for us are in the second category. Like the two points I mentioned in the post. I think there are other subquestions that are more in the first category, which are also relevant to the odds of success. I’m relatively low confidence about this kind of stuff because of all the normal reasons why it’s difficult to say how other people should be thinking. It’s easy to miss relevant priors, evidence, etc. But still… given what I know about what everyone believes, it looks like these questions should be resolvable among reasonable people.

Jeremy Gillen 30 Oct 2025 16:44 UTC
4 points
0
in reply to: niplav’s comment on: niplav’s Shortform
Makes sense, but in that case, why penalize by time? Why not just directly penalize by utility? Like the leverage prior.
Also, why not allow floating point representations of utility to be output? Rather than just binary integers?

Jeremy Gillen 30 Oct 2025 16:04 UTC
4 points
0
in reply to: niplav’s comment on: niplav’s Shortform
Aren’t there programs that run fast and also return a number that grows much faster than |p|? Like up arrow notation. Why don’t these grow faster than your speed prior penalizes them?

Jeremy Gillen 30 Oct 2025 2:02 UTC
4 points
0
in reply to: Fabien Roger’s comment on: Some data from LeelaPieceOdds
I think there are reasonable people who look at the evidence and think it plausible that control works, and also reasonable people who look at the evidence and think it implausible that control works. And others who think that openai-superalignment-style plans plausibly work.
Something is going wrong here.

Jeremy Gillen 30 Oct 2025 1:42 UTC
6 points
0
in reply to: Thomas Kwa’s comment on: Some data from LeelaPieceOdds
Nice! Yeah, after Leela is down to a couple of pieces the game is usually over. But I’ve had to learn not to be lazy at this point because a couple of times it managed to force a stalemate.
Why have you read chess books?

Some data from LeelaPieceOdds

Jeremy Gillen 29 Oct 2025 4:27 UTC

66 points

21 comments · 3 min read · LW link

Jeremy Gillen 23 Oct 2025 2:59 UTC
5 points
0
in reply to: johnswentworth’s comment on: Resampling Conserves Redundancy (Approximately)
our current thread is an adjustment to the error measure.
We’re not sure that this is necessary. I quite like the current form of the errors. I’ve spent much of the past week searching for counterexamples to the ∃ deterministic latent theorem and I haven’t found anything yet (although it’s partially a manual search). My current approach takes a P(X_1,X_2) distribution, finds a minimal stochastic NL, then finds a minimal deterministic NL. The deterministic error has always been within a factor of 2 of the stochastic error. So currently we’re expecting the theorem can be rescued.
$D_{KL}(P[X] \,\|\, \sum_{\Lambda} Q[X, \Lambda])$ rather than $D_{KL}(P[X, \Lambda] \,\|\, Q[X, \Lambda])$
That seems like a cool idea for the mediation condition, but isn’t it trivial for the redundancy conditions?
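For concreteness, the two error measures in the quoted formula can be compared numerically. This is only an illustrative sketch: `P` and `Q` are arbitrary random joint distributions over (X, Λ), not the natural-latent construction from the post, and `kl` is a hypothetical helper. By the chain rule for KL divergence, the marginalized version can never exceed the joint version.

```python
import numpy as np

def kl(p, q):
    """D_KL(p || q) in nats; assumes q > 0 wherever p > 0."""
    p = np.asarray(p, dtype=float).ravel()
    q = np.asarray(q, dtype=float).ravel()
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

rng = np.random.default_rng(0)

# Rows index X, columns index the latent Lambda (both arbitrary here).
P = rng.random((4, 3))
P /= P.sum()
Q = rng.random((4, 3))
Q /= Q.sum()

joint_kl = kl(P, Q)                               # D_KL(P[X, L] || Q[X, L])
marginal_kl = kl(P.sum(axis=1), Q.sum(axis=1))    # D_KL(P[X] || sum_L Q[X, L])

print(f"joint:    {joint_kl:.4f}")
print(f"marginal: {marginal_kl:.4f}")
```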

Jeremy Gillen 18 Oct 2025 3:49 UTC
7 points
0
in reply to: David Lorell’s comment on: Resampling Conserves Redundancy (Approximately)
That’ll be the difference between max and sum in the denominator. If you use sum it’s 3.39.
Here’s one we worked out last night, where the ratio goes to infinity.
