Jeremy Gillen

Karma: 2,046

I’m interested in doing in-depth dialogues to find cruxes. Message me if you are interested in doing this.

I do alignment research, mostly stuff that is vaguely agent foundations. Currently doing independent alignment research on ontology identification. Formerly on Vivek’s team at MIRI.

Jeremy Gillen Jul 9, 2025, 11:36 AM
4 points
0
in reply to: Steven Byrnes’s comment on: Foom & Doom 2: Technical alignment is hard
Sorry if I misrepresented you, my intended meaning matches what you wrote. I was trying to replace “pure consequentialist” with its definition to make it obvious that it’s a ridiculously strong expectation that you’re saying Eliezer and others have.
Yes, assumptions about the domain of the utility function are needed in order to judge its behaviour as coherent or not. Rereading Coherent decisions imply consistent utilities, Eliezer is usually clear about the assumed domain of the utility function in each thought experiment. For example, he’s very clear here that you need the preferences as an assumption:
Have we proven by pure logic that all apples have the same utility? Of course not; you can prefer some particular apples to other particular apples. But when you’re done saying which things you qualitatively prefer to which other things, if you go around making tradeoffs in a way that can be viewed as not qualitatively leaving behind some things you said you wanted, we can view you as assigning coherent quantitative utilities to everything you want.
And that’s one coherence theorem—among others—that can be seen as motivating the concept of utility in decision theory.”
In the hospital thought experiment, he specifies the goal as an assumption:
Robert only cares about maximizing the total number of lives saved. Furthermore, we suppose for now that Robert cares about every human life equally.
In the pizza example, he doesn’t specify the domain, but it’s fairly obvious implicitly. In the fruit example, it’s also implicit but obvious.
There’s a few paragraphs at the end of the Allias paradox section about the (very non-consequentialist) goal of feeling certain during the decision-making process. I don’t get the impression from those paragraphs that Eliezer is saying that this preference is ruled out by any implicit assumption. In fact he explicitly says that this preference isn’t mathematically improper. It seems he’s saying this kind of preference cuts against coherence only if it’s getting in the way of more valuable decisions:
‘The danger of saying, “Oh, well, I attach a lot of utility to that comfortable feeling of certainty, so my choices are coherent after all” is not that it’s mathematically improper to value the emotions we feel while we’re deciding. Rather, by saying that the most valuable stakes are the emotions you feel during the minute you make the decision, what you’re saying is, “I get a huge amount of value by making decisions however humans instinctively make their decisions, and that’s much more important than the thing I’m making a decision about.” This could well be true for something like buying a stuffed animal. If millions of dollars or human lives are at stake, maybe not so much.’
I think this quote in particular invalidates your statements.
There is a whole stack of assumptions^[1] that Eliezer isn’t explicit about in that post. It’s intended to give a taste of the reasoning that gives us probability and expected utility, not the precise weakest set of assumptions required to make a coherence argument work.
I think one thing that is missing from that post are the reasons we usually do have prior knowledge of goals (among humans and for predicting advanced AI). Among humans we have good priors that heavily restrict the goal-space, plus introspection and stated preferences as additional data. For advanced AI, we can usually use usefulness (on some specified set of tasks) and generality (across a very wide range of potential obstacles) to narrow down the goal-domain. Only after this point, and with a couple of other assumptions, do we apply coherence arguments to show that it’s okay to use EUM and probability.
The reason I think this is worth talking about is that I was actively confused about exactly this topic in the year or two before I joined Vivek’s team. Re-reading the coherence and advanced agency cluster of Arbital posts (and a couple of comments from Nate) made me realise I had misinterpreted them. I must have thought they were intended to prove more than they do about AI risk. And this update flowed on to a few other things. Maybe partially because the next time I read Eliezer as saying something that seemed unreasonably strong I tried to steelman it and found a nearby reasonable meaning. And also because I had a clearer idea of the space of agents that are “allowed”, and this was useful for interpreting other arguments.
I’d be happy to call if that’s a more convenient way to talk, although it is nice to do this publicly. Also completely happy to stop talking about this if you aren’t interested, since I think your object-level beliefs about this ~match mine (“impure consequentialism” is expected of advanced AI).
1. ^
  E.g. I think we need a bunch of extra structure about self-modification to apply anything like a money pump argument to resolute/updateless agents. I think we need some non-trivial arguments and an assumption to make the VNM continuity money pump work. I remember there being some assumption that went into complete class that I thought was non-obvious, but I’ve forgotten exactly what it was. The post is very clear that it’s just giving a few tastes of the kind of reasoning needed to pin down utility and probability as a reasonable model of advanced agents.

Jeremy Gillen Jul 4, 2025, 7:54 AM
5 points
0
in reply to: Steven Byrnes’s comment on: Foom & Doom 2: Technical alignment is hard
“I want the world to continuously retain a certain property”. That’s a non-indexical desire, so it works well with self-modification and successors.
I agree that goals like this work well with self-modification and successors. I’d be surprised if Eliezer didn’t. My issue is that you claimed that Eliezer believes AIs can only have goals about the distant future, and then contrasted your own views with this. It’s strawmanning. And it isn’t supported by any of the links you cite. I think you must have some mistaken assumption about Eliezer’s views that is leading you to infer that he believes AIs must only have preferences over the distant future. But I can’t tell what it is. One guess is: to you, corrigibility only looks hard/unnatural if preferences are very strictly about the far future, and otherwise looks fairly easy.
But it’s also not-really-consequentialist, in the sense that it’s not (just) about the distant future, and thus doesn’t imply instrumental convergence (or at least doesn’t imply every aspect of instrumental convergence at maximum strength).
I would still call those preferences consequentialist, since the consequences are the primary factor that determines the actions. I.e. the behaviour is complicated, but in a way that easy to explain once you know what the behaviour is aimed at achieving. They’re even approximately long-term consequentialist, since the actions are (probably?) mostly aimed at the long-term future. The strict definition you call “pure consequentialism” is a good approximation or simplification of this, under some circumstances, like when value adds up over time and therefore the future is a bigger priority than the immediate present.
No one I know has argued that AI or rational people can only care about the distant future. People spend money to visit a theme park sometimes, in spite of money being instrumentally convergent.
Maybe you’ll say that this notion of “responsibility” allows loopholes, or will collapse upon sufficient philosophical understanding, or something? Maybe, I dunno.
Some versions of that does have loopholes, but overall I think I agree that you could get a lot of stability that way. (But as far as I can tell, the versions with fewer loopholes look more like consequence-based goals rather than rules that say which kinds of local actions-sequences are good and bad).
(Or maybe I’m just mentally converting “I want to be a good friend” into the non-indexical “I want you to continuously thrive”, which is in the category of “I want the world to continuously retain a certain property” mentioned above?)
Yeah this is exactly what I had an issue with in my sibling discussion with Ryan. He seems to think {integrity,honesty,loyalty} are deontological, whereas the way they are implemented in me is as a mix of consequentialist reasoning (e.g. some components are “does this person end up better off, by their own lights?”, “do they understand what I’m doing and why?”) and a bunch of soft rules designed to reduce the chances that I accidentally rationalise actions that are ultimately hurtful for complicated reasons that are difficult to see in the moment (e.g. “in the course of my plan, don’t cross privacy boundaries that likely lead me to gain information that they might not have felt comfortable with me knowing”). But the rules aren’t a primary driver of action, they are relatively weak constraints that quickly rule out bad plans (that almost always would have been bad for consequentialist reasons).
For me, it’s similar when I want to be a good friend.
What links here?
- Foom & Doom 2: Technical alignment is hard by Steven Byrnes (Jun 23, 2025, 5:19 PM; 147 points)
- Steven Byrnes's comment on Foom & Doom 2: Technical alignment is hard by Steven Byrnes (Jul 9, 2025, 5:03 PM; 5 points)

Jeremy Gillen Jun 26, 2025, 6:01 PM
2 points
0
in reply to: ryan_greenblatt’s comment on: Foom & Doom 2: Technical alignment is hard
I was trying to argue that the most natural deontology-style preferences we’d aim for are relatively stable if we actually instill them.
Trivial and irrelevant though if true-obedience is part of it, since that’s magic that gets you anything you can describe.
if the way integrity is implemented is at all kinda similar to how humans implement it.
How do humans implement integrity?
Part of my perspective is that the deontological preferences we want are relatively naturally robust to optimization pressure if faithfully implemented, so from my perspective the situation comes down to “you get scheming”, “your behavioural tests look bad, so you try again”, “your behavioural tests look fine, and you didn’t have scheming, so you probably basically got the properties you wanted if you were somewhat careful”.
You’re just stating that you don’t expect any reflective instability, as an agent learns and thinks over time? I’ve heard you say this kind of thing before, but haven’t heard an explanation. I’d love to hear your reasoning? In particular since it seems very different from how humans work, and intuitively surprising for any thinking machine that starts out a bit of a hacky mess like us. (I could write out an object-level argument for why reflective instability is expected, but it’d take some effort and I’d want to know that you were going to engage with it).

Jeremy Gillen Jun 26, 2025, 9:55 AM
LW: 3 AF: 1
0
AF
in reply to: ryan_greenblatt’s comment on: Foom & Doom 2: Technical alignment is hard
In this situation, I think a reasonable person who actually values integrity in this way (we could name some names) would be pretty reasonable or would at least note that they wouldn’t robustly pursue the interests of the developer. That’s not to say they would necessarily align their successor, but I think they would try to propagate their nonconsequentialist preferences due to these instructions.
Yes, agreed. The extra machinery and assumptions you describe seem sufficient to make sure nonconsequentialist preferences are passed to a successor.
I think an actually high integrity person/AI doesn’t search for loopholes or want to search for loopholes.
If I try to condition on the assumptions that you’re using (which I think include a central part of the AIs preferences having a true-but-maybe-approximate pointer toward the instruction-givers preferences, and also involves a desire to defer or at least flag relevant preference differences) then I agree that such an AI would not search for loopholes on the object-level.
I’m not sure whether you missed the straightforward point I was trying to make about searching for loopholes, or whether you understand it and are trying to point at a more relevant-to-your-models scenario? The straightforward point was that preference-like objects need to be robust to search. Your response reads as “imagine we have a bunch of higher-level-preferences and protective machinery that already are robust to optimisation, then on the object level these can reduce the need for robustness”. This is locally valid.
I don’t think its relevant because we don’t know how to build those higher-level-preferences and protective machinery in a way that is itself very robust to the OOD push that comes from scaling up intelligence, learning, self-correcting biases, and increased option-space.
(I don’t think disgust is an example of a deontological constraint, it’s just an obviously unendorsed physical impulse!)
Some people reflectively endorse their own disgust at picking up insects, and wouldn’t remove it if given the option. I wanted an example of a pure non-consequentialist preference, and I stand by it as a good example.
deontological constraints we want are like the human notions of integrity, loyalty, and honesty
Probably we agree about this, but for the sake of flagging potential sources of miscommunication: if I think about the machinery involved in implementing these “deontological” constraints, there’s a lot of consequentialist machinery involved (but it’s mostly shorter-term and more local than normal consequentialist preferences).

Jeremy Gillen Jun 25, 2025, 8:07 PM
LW: 10 AF: 5
0
AF
on: Foom & Doom 2: Technical alignment is hard
(Overall I like these posts in most ways, and especially appreciate the effort you put into making a model diff with your understanding of Eliezer’s arguments)
Eliezer and some others, by contrast, seem to expect ASIs to behave like a pure consequentialist, at least as a strong default, absent yet-to-be-invented techniques. I think this is upstream of many of Eliezer’s other beliefs, including his treating corrigibility as “anti-natural”, or his argument that ASI will behave like a utility maximizer.
It feels like you’re rounding off Eliezer’s words in a way that removes the important subtlety. What you’re doing here is guessing at the upstream generator of Eliezer’s conclusions, right? As far as I can see in the links, he never actually says anything that translates to “I expect all ASI preferences to be over future outcomes”? It’s not clear to me that Eliezer would disagree with “impure consequentialism”.
I think you get closest to an argument that I believe with (2):
(2) The Internal Competition Argument: We’ll wind up with pure-consequentialist AIs (absent some miraculous technical advance) because in the process of reflection within the mind of any given impure-consequentialist AI, the consequentialist preferences will squash the non-consequentialist preferences.
Where I would say it differently, like: An AI that has a non-consequentialist preference against personally committing the act of murder won’t necessarily build its successor to have the same non-consequentialist preference^[1], whereas an AI that has a consequentialist preference for more human lives will necessarily build its successor to also want more human lives. Non-consequentialist preferences need extra machinery in order to be passed on to successors. (And building successors is a similar process to self-modification).
As another example, I’ve seen people imagine non-consequentialist preferences as “rules that the AI grudgingly follows, while searching for loopholes”, rather than “preferences that the AI enthusiastically applies its intelligence towards pursuing”.
I think you’re misrepresenting/misunderstanding the argument people are making here. Even when you enthusiastically apply your intelligence toward pursuing a deontological constraint (alongside other goals), you implicitly search for “loopholes” in that constraint, i.e. weird ways to achieve all of your goals that don’t involve violating the constraint. To you, they aren’t loopholes, they’re clever ways to achieve all goals.
1. ^
  Perhaps this feels intuitively incorrect. If so, I claim that’s because your preferences against committing murder are supported by a bunch of consequentialist preferences for avoiding human suffering and death. A real non-consequentialist preference is more like the disgust reaction to e.g. picking up insects. Maybe you don’t want to get rid of your own disgust reaction, but you’re okay finding (or building) someone else to pick up insects for you if that helps you achieve your goals. And if it became a barrier to achieving your other goals, maybe you would endorse getting rid of your disgust reaction.

Jeremy Gillen Jun 16, 2025, 8:57 AM
4 points
2
in reply to: Ebenezer Dukakis’s comment on: the void
I don’t really know what you’re referring to, maybe link a post or a quote?

Jeremy Gillen Jun 15, 2025, 5:58 PM
6 points
2
in reply to: Ebenezer Dukakis’s comment on: the void
whose models did not predict that AIs which were unable to execute a takeover would display any obvious desire or tendency to attempt it.
Citation for this claim? Can you quote the specific passage which supports it?
If you read this post, starting at “The central interesting-to-me idea in capability amplification is that by exactly imitating humans, we can bypass the usual dooms of reinforcement learning.”, and read the following 20 or so paragraphs, you’ll get some idea of 2018!Eliezer’s models about imitation agents.
I’ll highlight
If I were going to talk about trying to do aligned AGI under the standard ML paradigms, I’d talk about how this creates a differential ease of development between “build a system that does X” and “build a system that does X and only X and not Y in some subtle way”. If you just want X however unsafely, you can build the X-classifier and use that as a loss function and let reinforcement learning loose with whatever equivalent of gradient descent or other generic optimization method the future uses. If the safety property you want is optimized-for-X-and-just-X-and-not-any-possible-number-of-hidden-Ys, then you can’t write a simple loss function for that the way you can for X.
[...]
On the other other other hand, suppose the inexactness of the imitation is “This agent passes the Turing Test; a human can’t tell it apart from a human.” Then X-and-only-X is thrown completely out the window. We have no guarantee of non-Y for any Y a human can’t detect, which covers an enormous amount of lethal territory, which is why we can’t just sanitize the outputs of an untrusted superintelligence by having a human inspect the outputs to see if they have any humanly obvious bad consequences.
I think with a fair reading of that post, it’s clear that Eliezer’s models at the time didn’t say that there would necessarily be overtly bad intentions that humans could easily detect from subhuman AI. You do have to read between the lines a little, because that exact statement isn’t made, but if you try to reconstruct how he was thinking about this stuff at the time, then see what that model does and doesn’t expect, then this answers your question.

Jeremy Gillen Jun 11, 2025, 10:16 AM
11 points
3
in reply to: johnswentworth’s comment on: johnswentworth’s Shortform
Have you personally done the thing successfully with another person, with both of you actually picking up on the other person’s hints?
Yes. But usually the escalation happens over weeks or months, over multiple conversations (at least in my relatively awkward nerd experience). So it’d be difficult to notice people doing this. Maybe twice I’ve been in situations where hints escalated within a day or two, but both were building from a non-zero level of suspected interest. But none of these would have been easy to notice from the outside, except maybe at a couple of moments.

Jeremy Gillen May 30, 2025, 9:07 PM
5 points
0
in reply to: Aharon Azulay’s comment on: Interpretability Will Not Reliably Find Deceptive AI
Everyone agrees that sufficiently unbalanced games can allow a human to beat a god. This isn’t a very useful fact, since it’s difficult to intuit how unbalanced the game needs to be.
If you can win against a god with queen+knight odds you’ll have no trouble reliably beating Leela with the same odds. I’d bet you can’t win more than 6 out of 10? $20?

Jeremy Gillen May 29, 2025, 10:04 PM
2 points
0
in reply to: faul_sname’s comment on: Interpretability Will Not Reliably Find Deceptive AI
Yeah I didn’t expect that either, I expected earlier losses (although in retrospect that wouldn’t make sense, because stockfish is capable of recovering from bad starting positions if it’s up a queen).
Intuitively, over all the games I played, each loss felt different (except for the substantial fraction that were just silly blunders). I think if I learned to recognise blunders in the complex positions I would just become a better player in general, rather than just against LeelaQueenOdds.
Just tried hex, that’s fun.

Jeremy Gillen May 29, 2025, 9:01 PM
9 points
0
in reply to: faul_sname’s comment on: Interpretability Will Not Reliably Find Deceptive AI
I don’t think that’d help a lot. I just looked back at several computer analyses, and the (stockfish) evaluation of the games all look like this:
This makes me think that Leela is pushing me into a complex position and then letting me blunder. I’d guess that looking at optimal moves in these complex positions would be good training, but probably wouldn’t have easy to learn patterns.

Jeremy Gillen May 29, 2025, 7:37 PM
4 points
0
in reply to: faul_sname’s comment on: Interpretability Will Not Reliably Find Deceptive AI
I haven’t heard of any adversarial attacks, but I wouldn’t be surprised if they existed and were learnable. I’ve tried a variety of strategies, just for fun, and haven’t found anything that works except luck. I focused on various ways of forcing trades, and this often feels like it’s working but almost never does. As you can see, my record isn’t great.
I think I started playing it when I read simplegeometry’s comment you linked in your shortform.
It seems to be gaining a lot of ground by exploiting my poor openings. Maybe one strategy would be to memorise a specialised opening much deeper than usual? That could be enough. But it’d feel like cheating to me if I used an engine to find that opening. It’d also feel like cheating because it’s exploiting Leela’s lack of memory of past games. It’d be easy to modify it to deliberately play diverse games when playing against the same person.

Jeremy Gillen May 29, 2025, 11:27 AM
5 points
2
in reply to: Aharon Azulay’s comment on: Interpretability Will Not Reliably Find Deceptive AI
Can you beat this bot though?

Jeremy Gillen May 26, 2025, 10:45 AM
1 point
0
in reply to: CSDD’s comment on: CSDD’s Shortform
I highly recommend reading the sequences. I re-read some of them recently. Maybe Yudkowsky’s Coming of Age is the most relevant to your shortform.

Jeremy Gillen May 22, 2025, 2:39 PM
24 points
12
in reply to: Algon’s comment on: Eliezer and I wrote a book: If Anyone Builds It, Everyone Dies
One notable difficulty with talking to ordinary people about this stuff is that often, you lay out the basic case and people go “That’s neat. Hey, how about that weather?” There’s a missing mood, a sense that the person listening didn’t grok the implications of what they’re hearing.
I kinda think that people are correct to do this, given the normal epistemic environment. My model is this: Everyone is pretty frequently bombarded with wild arguments and beliefs that have crazy implications. Like conspiracy theories, political claims, spiritual claims, get-rich-quick schemes, scientific discoveries, news headlines, mental health and wellness claims, alternative medicine, claims about which lifestyles are better. We don’t often have the time (nor expertise or skill or sometimes intelligence) to evaluate them properly. So we usually keep track of a bunch of these beliefs and arguments, and talk about them, but usually require nearby social proof in order to attach the arguments/beliefs to actions and emotions. Rationalists (and the more culty religions and many activist groups, etc.) are extreme in how much they change their everyday lives based on their beliefs.
I think it’s probably okay to let people maintain this detachment? Maybe even better, because it avoids activating antibodies. It’s (usefully) something that’s hard to change with argument. It will plausibly fix itself later, if there ever comes a time when their friends are voting or protesting or something.
I recently told my dad that I wasn’t trying to save for retirement. This horrified him far more than when I had previously told him that I didn’t expect anyone to survive the next couple of decades. The contrast was funny.

Jeremy Gillen May 18, 2025, 5:34 PM
LW: 3 AF: 1
0
AF
in reply to: ryan_greenblatt’s comment on: ryan_greenblatt’s Shortform
I think different views about the extent to which future powerful AIs will deeply integrate their superhuman abilities versus these abilities being shallowly attached partially drive some disagreements about misalignment risk and what takeoff will look like.
I think this might be wrong when it comes to our disagreements, because I don’t disagree with this shortform.^[1] Maybe a bigger crux is how valuable (1) is relative to (2)? Or the extent to which (2) is more helpful for scientific progress than (1)?
1. ^
  As long as “downstream performance” doesn’t include downstream performance on tasks that themselves involve a bunch of integrating/generalising.

Jeremy Gillen May 18, 2025, 12:22 PM
7 points
−1
in reply to: Seth Herd’s comment on: Problems with instruction-following as an alignment target
If you have an alternate theory of the likely form of first takeover-capable AGI, I’d love to hear it!
I’m not claiming anything about the first takeover-capable AGI, and I’m not claiming it won’t be LLM-based. I’m just saying that there’s a specific reasoning step that you’re using a lot (current tech has property X, therefore AGI has property almost-X) which I think is invalid (when X is entangled with properties of AGI that LLMs don’t currently have).
Maybe a slightly insulting analogy (sorry): That type of reasoning looks a lot like bad scifi ideas about AI, where people reason like “AI is a program on a computer, programs on computers can’t do {intuition, fuzzy reasoning, logical paradoxes, emotion}, therefore AI will be {logical, calculator-like, vulnerable to paradoxes, not understand emotion, etc.}”. The reasoning step doesn’t work, because it’s focusing on the “logical program” part over the “AGI” part. I think you’re focusing too much on the “LLM-based” part of “LLM-based AGI”, even in cases where the “AGI” part tells you much more.
(We’re having two similar discussions in parallel, so I’m responding to this in a way that might be useful to other people, but I don’t expect it to be useful to you, since I’ve already said this in the other discussion).

Jeremy Gillen May 16, 2025, 9:26 AM
7 points
2
on: Problems with instruction-following as an alignment target
(A small rant, sorry) In general, it seems you’re massively overanchored on current AI technology, to an extent that it’s stopping you from clearly reasoning about future technology. One example is the jailbreaking section:
There has been no noticeable trend toward real jailbreak resistance as LLMs have progressed, so we should probably anticipate that LLM-based AGI will be at least somewhat vulnerable to jailbreaks.
You’re talking about AGI here. An agent capable autonomously doing research, play games with clever adversaries, detecting and patching it’s own biases, etc. It should be obvious that you can’t use current LLM flaws as a method of extrapolating the adversarial robustness of this program.
Second,
If IF were the only target of alignment training in its starting state, it seems likely this strong “center of gravity” would guide it to adopt IF as a reflectively stable goal. If it has multiple goals (e.g., following instructions in some cases, refusing instructions for ethical reasons in other cases, and RL for various criteria), its evolution and ultimate alignment seems much less easy to predict.
No. You’re entirely ignoring inner alignment difficulties. The main difficulties. There are several degrees of freedom in goal specification that a pure goal target fails to nail down. These lead to an unpredictable reflective equilibrium.

Jeremy Gillen May 3, 2025, 6:03 PM
4 points
2
in reply to: Lucius Bushnaq’s comment on: RA x ControlAI video: What if AI just keeps getting smarter?
Good point, I shouldn’t have said dishonest. For some reason while writing the comment I was thinking of it as deliberately throwing vaguely related math at the viewer and trusting that they won’t understand it. But yeah likely it’s just a misunderstanding.

Jeremy Gillen May 3, 2025, 12:12 PM
10 points
10
on: RA x ControlAI video: What if AI just keeps getting smarter?
The way we train AIs draws on fundamental principles of computation that suggest any intellectual task humans can do, a sufficiently large AI model should also be able to do. [Universal approximation theorem on screen]
IMO it’s dishonest to show the universal approximation theorem. Lots of hypothesis spaces (e.g. polynomials, sinusoids) have the same property. It’s not relevant to predictions about how well the learning algorithm generalises. And that’s the vastly more important factor for general capabilities.