Elliott Thornley (EJT)
Yeah I think the ‘doing two things at once’ point is an issue, though my main intended audience for this paper is academic philosophers and decision theorists, who are—as a rule—both mathy and new to AI safety stuff.
Your other points sound kinda like a ‘Theorems are slow and detailed’ complaint to which I say: yes, but the detail helps guide our search for solutions. For example, it was thinking about Theorems 2 and 3-ish stuff that first got me thinking that incomplete preferences might help with shutdownability.
I am convinced of Independence as a requirement of rationality, for paying-to-avoid-information and money-pump reasons (like Yudkowsky’s), plus I think the Allais preferences aren’t much evidence against it. I went for Indifference Between Indifference-Shifted Lotteries (IBISL) because it’s a little weaker (though since writing the paper I’ve been convinced that it’s not significantly weaker: basically every endorsed decision theory that violates Independence also violates IBISL).
An example would be:
Option A: avoid shutdown and get utility 1.
Option B: get shut down now.
Option C: avoid shutdown and get utility 2.
If the agent’s option set is {A, B}, then the agent is indifferent between A and B. But if the agent’s option set is {A, B, C}, then the agent is indifferent between B and C, prefers C to A, and hence prefers B to A.
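If it helps, here’s a minimal sketch of one choice rule that generates this pattern (the rule and the Python encoding are my own illustration, not anything from the paper): among avoid-shutdown options the agent prefers higher utility, and shutdown is indifferent to the best avoid-shutdown option on the menu.

```python
# Hypothetical sketch: options are (label, utility, shuts_down).
# Among avoid-shutdown options the agent prefers higher utility;
# shutdown is indifferent to the best avoid-shutdown option available.

def maximal_options(menu):
    """Return the labels of the maximally preferred options in menu."""
    best = max(u for _, u, shuts_down in menu if not shuts_down)
    return {label for label, u, shuts_down in menu
            if shuts_down or u == best}

A = ("A", 1, False)  # avoid shutdown, get utility 1
B = ("B", 0, True)   # get shut down now
C = ("C", 2, False)  # avoid shutdown, get utility 2

print(maximal_options([A, B]))     # {'A', 'B'}: indifferent between A and B
print(maximal_options([A, B, C]))  # {'B', 'C'}: B and C both beat A
```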
Does that help? I don’t quite understand your point about “actions chosen before the shutdown are just like those of one maximizing the still-on utility,” so I might be missing something.
Really nice post. A few things though:
1.
The sophisticated chooser is also immune to money pumps
This isn’t right, at least on the usual definition of ‘money pump’ where an agent is money-pumped if and only if they “end up paying for something they could have kept for free even though they knew in advance what decision problem they were facing.” As you say, sophisticated choosers who violate Independence sometimes have to settle for plans that are dominated from the ex ante perspective. That’s a money pump on the usual definition.
2.
It doesn’t seem right to list Quiggin’s rank-dependent theory and Tversky and Kahneman’s cumulative prospect theory as evidence that Independence is normatively too strong, since (IIRC) both are put forward as descriptive models of how humans actually behave, rather than normative models of how they should behave. (That said, Lara Buchak defends rank-dependent theory as a normative model (under the name ‘Risk-Weighted Expected Utility Theory.’))
3.
You don’t really reckon with the arguments against resolute choice. I like Gustafsson’s discussion in chapter 7. A summary: resolute choice either requires acting against your own preferences at the moment of choice (which seems instrumentally irrational) or else modifying your preferences (which is no defence of your original preferences).
4.
I think the Allais argument against Independence doesn’t really work. The Allais preferences can be rational if you’d feel extra disappointed getting $0 when you only had a 1% chance of doing so. But $0-with-extra-disappointment is a different outcome to $0, so those preferences don’t violate Independence!
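For anyone who wants the structure spelled out, here are the standard Allais gambles (textbook numbers, not from the post):

$$\begin{aligned} L_1 &= [\$1\text{M},\ 1.00] & L_2 &= [\$1\text{M},\ 0.89;\ \$5\text{M},\ 0.10;\ \$0,\ 0.01] \\ L_3 &= [\$1\text{M},\ 0.11;\ \$0,\ 0.89] & L_4 &= [\$5\text{M},\ 0.10;\ \$0,\ 0.90] \end{aligned}$$

The Allais preferences are $L_1 \succ L_2$ and $L_4 \succ L_3$. They violate Independence because each pair differs only in an 89%-probability common consequence ($1M in the first pair, $0 in the second): conditional on the remaining 11%, both pairs offer the same choice between $1M for sure and $[\$5\text{M},\ \tfrac{10}{11};\ \$0,\ \tfrac{1}{11}]$. But if the 1%-chance $0 in $L_2$ is really ‘$0 plus extra disappointment,’ the pairs no longer share a common consequence, and the violation disappears.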
Nice, thanks for letting me know!
Why are you 30% in SPY if SPX is far better?
a company successfully solves control for “high-stakes”/”concentrated” threats, meaning that AIs can safely be deployed without a large risk of them hacking their own datacenters or exfiltrating or causing any similar catastrophic action.
This seems like quite a difficult task, especially since we (or some other trusted agent) need to read the outputs of these AIs to get any value from their safety research. Schemers could self-exfiltrate by manipulating us somehow.
The company probably can’t initially tell the schemers apart from the aligned AIs, or indeed tell which of the projects are succeeding on a day-to-day basis. (It’s just too hard to evaluate which research directions are genuinely promising vs. a waste of time.) But they run them all in parallel, and after several months of massively sped-up work, it’s likely much easier to adjudicate whose solutions make sense to adopt.
This is similar to how, in well-functioning scientific fields, it’s not typically possible to get consensus on what speculative research directions are most promising. But once a research project or whole research agenda is completed, there are normally ways in which the value-add can be demonstrated.
Maybe, but the schemers could optimize their research for looking-good-to-us, and it might be hard to distinguish this from actually good work. That’s a disanalogy with well-functioning scientific fields: scientists don’t deliberately pick bad research directions and try hard to make them seem good.
In a lab experiment where it was told it was going to be shut down, Claude sometimes blackmailed fictional employees who controlled its shutdown button (again, we also tested frontier models from all the other major AI developers and they often did the same thing).
Why is it always the blackmail result that gets reported from this paper? Frontier models were also found willing to cause a fictional employee’s death to avoid shutdown. It’s weird to me that that’s so often ignored.
Here’s another justification for hyperbolic discounting, drawing on the idea that you’re less psychologically connected to your future selves.
I’ve always seen this idea attributed to Martin Weitzman, and he cites these papers as making a similar point. Seems like an interesting case of simultaneous discovery: four papers making the same sort of point all appearing between 1996 and 1999.
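For reference, the mathematical core those papers share (my gloss, in Weitzman-style notation): if the per-period discount rate $r$ is uncertain, the effective discount function is the mixture

$$D(t) = \mathbb{E}\left[e^{-rt}\right],$$

and the implied instantaneous rate $-D'(t)/D(t) = \mathbb{E}[r e^{-rt}]/\mathbb{E}[e^{-rt}]$ declines over time toward the lowest possible $r$, since the low-$r$ scenarios come to dominate the expectation. For instance, if $r$ is exponentially distributed with mean $\mu$, then $D(t) = 1/(1 + \mu t)$: exactly hyperbolic. Swap ‘uncertain discount rate’ for ‘uncertain rate at which psychological connectedness decays’ and you get the same shape.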
What’s your current view? We should aim for virtuousness instead of corrigibility?
are uploads conscious? What about AIs? Should we care about shrimp? What population ethics views should we have? What about acausal trade? What about Pascal’s wager? What about meaning? What about diversity?
It sounds like you’re saying an AI has to get these questions right in order to count as aligned, and that’s part of the reason why alignment is hard. But I expect that many people in the AI industry don’t care about alignment in this sense, and instead just care about the ‘follow instructions’ sense of alignment.
Yeah I think the only thing that really matters is the frequency with which bills are dropped, and train stations seem like high-frequency places.
More reasons to worry about relying on constraints:
As you say, your constraints might be insufficiently general (‘nearest unblocked strategy,’ etc.). This seems like a big issue to me. People like Jesus and the Buddha seem to have gained huge amounts of influence without needing to violate any obvious deontological constraints.
Your constraints might be insufficiently strong (e.g. maybe the constraints are strong enough to keep the AI compliant all throughout training but then the AI gets a really great opportunity in deployment...).
Your constraints might be just ‘outer shell,’ like humans’ instinctual fear of heights (Barnett and Gillen). The AI might see them as an obstacle to overcome, rather than as a part of its terminal values.
Your constraints might actually be false beliefs that later get revised (e.g. that lying never pays) (Barnett and Gillen).
Your constraints might cause theoretical problems that motivate the AI to revise them away (e.g. money pumps, intransitivities, violations of the Independence of Irrelevant Alternatives, implausible dependence on which outcome is designated as the status quo, paralysis, trouble dealing with risk, arbitrariness of constraints’ exact boundaries).
Your constraints might cause other misalignments (e.g. the AI wants to take extreme measures to prevent other agents from lying too).
Your constraints might make the AI incapable (e.g. they might falsify the strategy-stealing assumption, or make AIs too timid [e.g. maybe the AI will be extremely reluctant to say anything it’s not absolutely certain of]).
Your constraints might fail to motivate the AI to do good alignment work (e.g. the AI produces alignment slop).
Your constraints might make the AI bad at moral philosophy (and we might need AI-powered moral philosophy to get a really good future).
Great post! Tiny thing: is the speed prior really best understood as a prior? Surely the only way in which being slow can count against a cognitive pattern is if being slow leads to lower reward. And in that case it seems like speed is a behavioral selection pressure rather than a prior.
Yeah risk aversion can only make the AI cooperate if the AI thinks that getting paid for cooperation is more likely than successful rebellion. It seems pretty plausible to me that this will be true for moderately powerful AIs, and that we’ll be able to achieve a lot with the labor of these moderately powerful AIs, e.g. enough AI safety research to pre-empt the existence of extremely powerful misaligned AIs (who likely would be more confident of successful rebellion than getting paid for cooperation).
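To spell out why (my notation, normalizing $f(0) = 0$ and treating ‘unpaid cooperation’ and ‘failed rebellion’ as equally valueless): with concave utility $f$, the AI cooperates when

$$p_{\text{pay}} \, f(\text{salary}) > p_{\text{win}} \, f(\text{lightcone}).$$

Since $f$ is increasing and the lightcone beats the salary, $f(\text{lightcone})/f(\text{salary}) > 1$, so cooperation requires $p_{\text{pay}} > p_{\text{win}}$ however risk-averse the AI is. More risk aversion just pushes that ratio toward 1, letting the probability comparison do all the work.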
Preference gaps as a safeguard against AI self-replication
They don’t have to be short-term oriented! Their utility function could be:

$$U(x_1, x_2, \ldots) = f\left(\sum_t x_t\right)$$

Where $f$ is some strictly concave function and $x_t$ is the agent’s payment at time $t$. Agents with this sort of utility function don’t discount the future at all. They care just as much about improvements to $x_t$ regardless of whether $t$ is 1 or 1 million. And yet, for the right kind of $f$, these agents can be risk-averse enough to prefer a small salary with higher probability to a shot at eating the lightcone with lower probability.
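A quick numerical illustration (my numbers, with $f = \log(1 + \cdot)$ standing in for the concave function):

```python
import math

def utility(total_payment):
    # f(x) = log(1 + x): strictly concave, and note there's no
    # time discounting anywhere: only the total payment matters.
    return math.log1p(total_payment)

# Near-certain modest salary: 100 units of payment in total.
eu_salary = 0.99 * utility(100)

# Long-shot takeover: 1e12 units with probability 0.05, else nothing.
eu_takeover = 0.05 * utility(1e12)

print(f"{eu_salary:.2f} vs {eu_takeover:.2f}")  # 4.57 vs 1.38: salary wins
```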
My sincere apologies for the delayed reply
No problem! Glass houses and all that.
You are presented with a button that says “trust humans not to mess up the world” and one that says “ensure that the world continues to exist as it does today, and doesn’t get messed up”. You’ll push the second button!
Sure but this sounds like a case in which taking over the world is risk-free. The relevant analogy would be more like:
Choose between ‘Trust humans not to mess up the world’ and ‘50% chance of immediate death, 50% chance you ensure the world continues to exist as it does today and doesn’t get messed up.’
And then depending on what the agent is risk-averse with respect to, they might choose the former. If they’re risk-averse with respect to consumption at a time but risk-neutral with respect to length of life, they’ll choose the latter. If they’re risk-averse with respect to the present discounted value of their future payment stream (as we suggest would be good for AIs to be), they’ll choose the former.
Perhaps the best approach is to build something that isn’t (at least, isn’t explicitly/directly) an expected utility maximizer. Then the challenge is to come up with a way to build a thing that does stuff you want without even having that bit of foundation.
Yep, this is what I try to do here!
This seems likely harder than the world where the best approach is a clever trick that fixes it for expected utility maximizers.
I think that’s reasonable on priors, but these papers plus the empirical track record suggests there’s no clever trick that makes EUMs corrigible.
That’s a cool idea! I’m not aware of any study like that, but I’d be very interested to see the results.