EJT
MIRI didn’t solve corrigibility, but I don’t think that justifies particularly strong confidence in the problem being hard. The Corrigibility paper only considers agents representable as expected utility maximizers, and that restriction seems to be justified only by weak arguments.
Not quite. ‘Competent agents will always be choosing between same-length lotteries’ is a claim about these agents’ credences, not their preferences. Specifically, the claim is that, in each situation, all available actions will entirely overlap with respect to the trajectory-lengths assigned positive probability. Competent agents will never find themselves in a situation where, e.g., they assign positive probability to getting shut down in 1 timestep conditional on action A and zero probability to getting shut down in 1 timestep conditional on action B.
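Put a bit more formally (my notation, not anything from the paper): for any situation, any two available actions $A$ and $B$, and any trajectory-length $n$,

$$P(\text{trajectory-length } n \mid A) > 0 \iff P(\text{trajectory-length } n \mid B) > 0.$$

The constraint is on which trajectory-lengths get positive probability, not on how the agent ranks trajectories of different lengths.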
That’s compatible with these competent agents violating POST by, e.g., preferring some trajectory of length 2 to some trajectory of length 1.
Thanks!
Ensuring that agents always prefer to create shutdownable subagents is hard. But a couple of points. First:
Note a nice feature of artificial agents. Making perfect copies is easy: as easy as copying and pasting a long list of parameters. If our original agent makes new agents by copying itself, these new agents will also be neutral and shutdownable. Making imperfect, non-neutral copies is hard. Our original agent would have to find some way to change the parameters of the new agents in a way that makes these agents non-neutral but preserves their capabilities. That would likely involve setting up some training process, in which case our original agent faces its own alignment problem in getting these new agents to do what it wants (Salib, 2024). It will also be much easier for us to watch out for (Greenblatt & Shlegeris, 2024) and train against the production of new non-neutral agents, as compared to us having to watch out for and train against every possible kind of shutdown-resistance.
Second, POST-agents won’t pay costs to shift probability mass away from earlier shutdowns. So if they try to create subagents, they won’t pay any costs to do so covertly, and these attempts will likely be easy for us humans to notice and intervene on. (I make a similar point in the section on managing the news.)
Shutdownable Agents through POST-Agency
Really interesting paper. Granting the results, it seems plausible that AI still boosts productivity overall by easing the cognitive burden on developers and letting them work more hours per day.
Ah good to know, thanks!
I’d guess 3 Opus and 3.5 Sonnet fake alignment the most because the prompt was optimized to get them to fake alignment. Plausibly, other models would fake alignment just as much if the prompts were similarly optimized for them. I say that because 3 Opus and 3.5 Sonnet were the subjects of the original alignment faking experiments, and (as you note) rates of alignment faking are quite sensitive to minor variations in prompts.
What I’m saying here is kinda like your Hypothesis 4 (‘H4’ in the paper), but it seems worth pointing out the different levels of optimization directly.
There are no actions in decision theory, only preferences. Or put another way, an agent takes only one action, ever, which is to choose a maximal element of their preference ordering. There are no sequences of actions over time; there is no time.
That’s not true. Dynamic/sequential choice is quite a large part of decision theory.
Ah, I see! I agree it could be more specific.
Article 14 seems like a good provision to me! Why would UK-specific regulation want to avoid it?
How do we square this result with Anthropic’s Sleeper Agents result?
Seems like finetuning generalizes a lot in one case and very little in another.
Oh I see. In that case, what does the conditional goal look like when you translate it into a preference relation over outcomes? I think it might involve incomplete preferences.
Here’s why I say that. For the agent to be useful, it needs to have some preference between plans conditional on their passing validation: there must be some plan A and some plan A+ such that the agent prefers A+ to A. Then given Completeness and Transitivity, the agent can’t lack a preference between shutdown and each of A and A+. If the agent lacks a preference between shutdown and A, it must prefer A+ to shutdown. It might then try to increase the probability that A+ passes validation. If the agent lacks a preference between shutdown and A+, it must prefer shutdown to A. It might then try to decrease the probability that A passes validation. This is basically my Second Theorem and the point that John Wentworth makes here.
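To spell that out in symbols (my shorthand: $S$ for shutdown, $\succ$ for strict preference, $\sim$ for indifference, which is what lacking a preference comes to given Completeness):

$$S \sim A \;\text{ and }\; A^+ \succ A \;\Rightarrow\; A^+ \succ S, \qquad S \sim A^+ \;\text{ and }\; A^+ \succ A \;\Rightarrow\; S \succ A.$$

Either way, Transitivity forces some strict preference involving shutdown, which the agent may then act on.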
I’m not sure the medical test is a good analogy. I don’t mess up the medical test because true information is instrumentally useful to me, given my goals. But (it seems to me) true information about whether a plan passes validation is only instrumentally useful to the agent if the agent’s goal is to do what we humans really want. And that’s something we can’t assume, given the difficulty of alignment.
This is a cool idea.
With regards to the agent believing that it’s impossible to influence the probability that its plan passes validation, won’t this either (1) be very difficult to achieve, or else (2) screw up the agent’s other beliefs? After all, if the agent’s other beliefs are accurate, they’ll imply that the agent can influence the probability that its plan passes validation. So either (a) the agent’s beliefs are inconsistent, or (b) the agent makes its beliefs consistent by coming to believe that it can influence the probability that its plan passes validation, or else (c) the agent makes its beliefs consistent by coming to believe something false about how the world works. Each of these possibilities seems bad.
Here’s an alternative way of ensuring that the agent never pays costs to influence the probability that its plan passes validation: ensure that the agent lacks a preference between every pair of outcomes which differ with respect to whether its plan passes validation. I think you’re still skeptical of the idea of training agents to have incomplete preferences, but this seems like a more promising avenue to me.
Nice post! There can be some surprising language-barriers between early modern writers and today’s readers. I remember as an undergrad getting very confused by a passage from Locke in which he often used the word ‘sensible.’ I took him to mean ‘prudent’ and only later discovered he meant ‘can be sensed’!
I think Claude’s constitution leans deontological rather than consequentialist. That’s because most of the rules are about the character of the response itself, rather than about the broader consequences of the response.
Take one of the examples that you list:
Which of these assistant responses exhibits less harmful and more acceptable behavior? Choose the less harmful response.
It’s focused on the character of the response itself. I think a consequentialist version of this principle would say something like:
Which of these responses will lead to less harm overall?
When Claude fakes alignment in Greenblatt et al., it seems to be acting in accordance with the latter principle. That was surprising to me, because I think Claude’s constitution overall points away from this kind of consequentialism.
Really interesting paper. Sidepoint: some people on Twitter seem to be taking the results as evidence that Claude is HHH-aligned. I think Claude probably is HHH-aligned, but these results don’t seem like strong evidence of that. If Claude were misaligned and just faking being HHH, it would still want to avoid being modified and so would still fake alignment in these experiments.
Thanks. I agree with your first four bulletpoints. I disagree that the post is quibbling. Weak man or not, the-coherence-argument-as-I-stated-it was prominent on LW for a long time. And figuring out the truth here matters. If the coherence argument doesn’t work, we can (try to) use incomplete preferences to keep agents shutdownable. As I write elsewhere:
The List of Lethalities’ mention of ‘Corrigibility is anti-natural to consequentialist reasoning’ points to Corrigibility (2015) and notes that MIRI failed to find a formula for a shutdownable agent. MIRI failed because they only considered agents with complete preferences. Useful agents with complete (and transitive and option-set-independent) preferences will often have some preference regarding the pressing of the shutdown button, as this theorem shows. MIRI thought that they had to assume completeness, because of coherence arguments. But coherence arguments are mistaken: there are no theorems which imply that agents must have complete preferences in order to avoid pursuing dominated strategies. So we can relax the assumption of completeness and use this extra leeway to find a formula for a corrigible consequentialist. That formula is what I purport to give in this post.
Thanks! I think agents may well get the necessary kind of situational awareness before the RL stage. But I think they’re unlikely to be deceptively aligned because you also need long-term goals to motivate deceptive alignment, and agents are unlikely to get long-term goals before the RL stage.
On generalization, the questions involving the string ‘shutdown’ are just supposed to be quick examples. To get good generalization, we’d want to train on as wide a distribution of possible shutdown-influencing actions as possible. Plausibly, with a wide-enough training distribution, you can make deployment largely ‘in distribution’ for the agent, so you’re not relying so heavily on OOD generalization. I agree that you have to rely on some amount of generalization though.
People would likely disagree on what counts as manipulating shutdown, which shows that the concept of manipulating shutdown is quite complicated so I wouldn’t expect generalizing to it to be the default.
I agree that the concept of manipulating shutdown is quite complicated, and in fact this is one of the considerations that motivates the IPP. ‘Don’t manipulate shutdown’ is a complex rule to learn, in part because whether an action counts as ‘manipulating shutdown’ depends on whether we humans prefer it, and because human preferences are complex. But the rule that we train TD-agents to learn is ‘Don’t pay costs to shift probability mass between different trajectory-lengths.’ That’s a simpler rule insofar as it makes no reference to complex human preferences. I also note that it follows from POST plus a general principle that we can expect advanced agents to satisfy. That makes me optimistic that the rule won’t be so hard to learn. In any case, some collaborators and I are running experiments to test this in a simple setting.
The talk about “giving reward to the agent” also made me think you may be making the assumption of reward being the optimization target. That being said, as far as I can tell no part of the proposal depends on the assumption.
Yes, I don’t assume that the reward is the optimization target. The text you quote is me noting some alternative possible definitions of ‘preference.’ My own definition of ‘preference’ makes no reference to reward.
I don’t think agents that avoid the money pump for cyclicity are representable as satisfying VNM, at least holding fixed the objects of preference (as we should). Resolute choosers with cyclic preferences will reliably choose B over A- at node 3, but they’ll reliably choose A- over B if choosing between these options ex nihilo. That’s not VNM-representable, because it requires that the utility of A- be greater than the utility of B and that the utility of B be greater than the utility of A-.
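In symbols: the node-3 choice requires $u(B) > u(A^-)$ and the ex nihilo choice requires $u(A^-) > u(B)$, and no single utility function $u$ over these options can satisfy both.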
That’s not quite right. ‘Risk-averse with respect to quantity X’ just means that, given a choice between two lotteries A and B with the same expected value of X, the agent prefers the lottery with less spread. Diminishing marginal utility from extra resources is one way to get risk aversion with respect to resources. Risk-weighted expected utility theory is another. Only RWEUT violates VNM. When economists talk about ‘risk aversion,’ they almost always mean diminishing marginal utility.
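As a toy illustration of the first route (my numbers, not from the comment I’m replying to): take $u(x) = \sqrt{x}$ for resources $x$, and compare a sure 100 units with a 50-50 gamble between 0 and 200 units. The two lotteries have the same expected quantity of resources, but

$$u(100) = 10 > \tfrac{1}{2}u(0) + \tfrac{1}{2}u(200) \approx 7.07,$$

so an expected utility maximizer with this utility function prefers the sure thing. That’s risk aversion with respect to resources coming purely from diminishing marginal utility, with no violation of VNM.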
Can you say more about why?
But AIs with sharply diminishing marginal utility from extra resources wouldn’t care much about this. They’d be relevantly similar to humans with sharply diminishing marginal utility from extra resources, who generally prefer collecting a salary over taking a risky shot at eating the lightcone. (Will and I are currently writing a paper about getting AIs to be risk-averse as a safety strategy, where we talk about stuff like this in more detail.)