Elliott Thornley (EJT)
accelerationist has no reason to expect the doomer to save any money. And if the doomer does save it (plus enough extra to cover the doubled payback), they’ve effectively just locked up double the original capital until the end of the world
Couldn’t the doomer and accelerationist just agree that the doomer doesn’t have to pay until (e.g.) one year after the bet resolves? Then the doomer could spend all the money in anticipation of doom. If the doomer loses the bet, they can use the year after resolution to earn money to pay the accelerationist back.
(Of course there are extra practical difficulties here, like e.g. it might be hard for humans to earn money in the future. But I’m just talking about theoretical barriers.)
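To make the proposed structure concrete, here's a toy version with made-up numbers:

$$
\text{Doomer's position} =
\begin{cases}
+\$1{,}000 \text{ spent now, never repaid} & \text{if doom by resolution,}\\
+\$1{,}000 \text{ spent now},\ -\$2{,}000 \text{ due one year after resolution} & \text{if no doom.}
\end{cases}
$$

The doomer gets to spend the whole $1,000 immediately, and (if they lose) uses the grace year to earn the $2,000, so no capital is locked up in the meantime.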
Okay that’s good to know. I’ve mostly encountered the argument as a reply to individuals worrying that they’re getting Pascal’s-mugged into working on AI safety. In that sort of case,
AI safety can’t be a Pascal’s mugging because p(doom) is high
is invalid, and the premise needed to make it valid --
If p(doom) is high, then p(you can avert doom) is high
-- is way too doubtful to leave implicit.
But if the argument is a reply to people worried that the world/US government is getting Pascal’s-mugged into working on AI safety, then the premise needed to make it valid is
If p(doom) is high, then p(the world/USG can avert doom) is high
and I agree that premise is safe/uncontroversial enough to leave implicit.
“sure, taking strong actions to reduce risk from misaligned AI would be doable, but isn’t doing this a Pascal’s mugging (implicitly responding to how much people have emphasized the stakes while less so arguing for the risk)”
I don’t really understand what this perspective is saying. Is the idea that people tend to grant the premise ‘If p(doom) is high, then p(you can avert doom) is high’? I agree p(doom) being high would be sufficient in that case.
Wait, is God flipping the coin load-bearing for the craziness? Because strangers making wild promises isn’t that crazy.
Though he could get greater intelligence and more information/understanding about the world without doing any reflection on his values. This seems fairly likely to me. People tend to be not that interested in reflecting on their values. He might even want to lock in his current values, since that’s rational according to his current values.
Nice post! Miscellaneous thoughts:
if individuals have VNM utility functions, and if the Pareto principle holds over groups, then a version of utilitarianism must be true.
Harsanyi’s theorem also requires that the social planner’s preferences satisfy the VNM axioms.
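For readers who haven’t seen the theorem: its conclusion (on the standard statement, with the planner VNM-rational as above) is that the planner’s utility function is a weighted sum of the individuals’ utility functions:

$$
W(x) \;=\; \sum_{i=1}^{n} a_i \, U_i(x),
$$

where each $U_i$ is individual $i$’s VNM utility function and Strong Pareto forces every weight $a_i$ to be positive. That’s the sense in which ‘a version of utilitarianism must be true.’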
Not many philosophical proofs have been written
I think this all depends on what you mean by ‘many’. I’d guess maybe 10% of analytic philosophy papers include a proof of some kind, which would mean at least hundreds of proofs published every year. And in a sense, every valid (spelled-out) argument is a proof.
I agree that the Claude proofs are pretty bad. The Arrhenius point is fairly obvious: what Arrhenius means by ‘theories’ in that paper is weak orders on populations, so if after taking into account moral uncertainty you still have a weak order, then the impossibility theorem still applies. (And later Arrhenius theorems relax both completeness and transitivity, so even departing from a weak order doesn’t get you off the hook.)
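(In case the jargon is unfamiliar: a weak order on populations is a binary relation $\succsim$, read ‘at least as good as,’ that is complete and transitive:

$$
\text{Completeness: } A \succsim B \text{ or } B \succsim A \text{ for all } A, B; \qquad \text{Transitivity: } A \succsim B \text{ and } B \succsim C \text{ imply } A \succsim C.
$$

Moral uncertainty doesn’t obviously get you out of satisfying these.)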
Claude makes this kind of point, but first it introduces an Agreement axiom that the proof never uses. Claude later comes close to admitting this (‘Agreement plays almost no role’), tries to walk it back (‘But Agreement rules out the escape route...’), and then fully admits it (‘the fundamental impossibility holds regardless’).
Which Claude model did you use? Did you use extended thinking? The flip-flopping above makes me think there was no extended thinking, and maybe a model with extended thinking would do better. (Though not much better I’d guess. I’ve found LLMs to be surprisingly bad at philosophy, even just the ‘understanding the view and its implications’ parts.)
I didn’t bother checking the second population ethics proof but it looks sloppy:
Axiom (Sufficient Comparability). For any pair of populations A, B that differ by at most some fixed bounded amount (e.g., adding or removing one person, or changing one person’s welfare level by a small amount), M(μ) must rank A and B (no incomparability for “local” comparisons).
Don’t all pairs of populations “differ by at most some fixed bounded amount”? And what is Claude doing including ‘e.g.’s in its formal statement of axioms?
With some additional effort, present-day LLMs might be capable of coming up with a good novel proof. If not, then it will likely be possible soon. Most kinds of moral philosophy might be difficult for AIs, but proofs are one area where AI assistance seems promising.
Yes, you’d think so given that they’ve gotten so good at math! But when I’ve tried using LLMs to help with formal philosophy, I’ve found them to be really surprisingly bad, even at parts that seem very math-loaded (e.g. inventing proofs, following arguments, grasping views and their implications, coming up with counterexamples, etc.). I’m not sure why this is. I guess part of it is that it’s hard to do RLVR on philosophy in the same way that you can do RLVR on math, but naively I’d expect more generalization from math to formal philosophy. Maybe the following is a factor: pretraining data doesn’t contain that much bad mathematical reasoning, but it contains a huge amount of bad philosophical reasoning.
As far as practical applications go, the idea with these proofs—and with a lot of moral philosophy—is that unrealistic cases can help us figure out which principles we want to endorse, and then we can apply these principles in more realistic cases.
Though those previous experiments all involve a distribution shift, right?
That’s a cool idea! I’m not aware of any study like that, but I’d be very interested to see the results.
Yeah I think the ‘doing two things at once’ is an issue, though my main intended audience for this paper is academic philosophers and decision theorists who are—as a rule—both mathy and new to AI safety stuff.
Your other points sound kinda like a ‘Theorems are slow and detailed’ complaint to which I say: yes, but the detail helps guide our search for solutions. For example, it was thinking about Theorems 2 and 3-ish stuff that first got me thinking that incomplete preferences might help with shutdownability.
I am convinced of Independence as a requirement of rationality, for paying-to-avoid-information and money-pump reasons (like Yudkowsky’s), plus I think the Allais preferences aren’t much evidence against it. I went for Indifference Between Indifference-Shifted Lotteries because it’s a little weaker (though since writing the paper I’ve been convinced that it’s not significantly weaker: basically every endorsed decision theory that violates Independence also violates IBISL).
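For reference, Independence in its standard form: for all lotteries $A$, $B$, $C$ and all $p \in (0, 1]$,

$$
A \succ B \;\Longleftrightarrow\; pA + (1-p)C \;\succ\; pB + (1-p)C.
$$

The money-pump and paying-to-avoid-information arguments target violations of this condition.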
An example would be:
Option A: avoid shutdown and get utility 1.
Option B: get shut down now.
Option C: avoid shutdown and get utility 2.
If the agent’s option set is {A, B}, then the agent is indifferent between A and B. But if the agent’s option set is {A, B, C}, then the agent is indifferent between B and C; and since C (utility 2) is preferred to A (utility 1), the agent prefers B to A.
Does that help? I don’t quite understand your point about “actions chosen before the shutdown are just like those of one maximizing the still-on utility,” so I might be missing something.
Really nice post. A few things though:
1.
The sophisticated chooser is also immune to money pumps
This isn’t right, at least on the usual definition of ‘money pump’ where an agent is money-pumped if and only if they “end up paying for something they could have kept for free even though they knew in advance what decision problem they were facing.” As you say, sophisticated choosers who violate Independence sometimes have to settle for plans that are dominated from the ex ante perspective. That’s a money pump on the usual definition.
2.
It doesn’t seem right to list Quiggin’s rank-dependent theory and Tversky and Kahneman’s cumulative prospect theory as evidence that Independence is normatively too strong, since (IIRC) both are put forward as descriptive models of how humans actually behave, rather than normative models of how they should behave. (That said, Lara Buchak defends rank-dependent theory as a normative model (under the name ‘Risk-Weighted Expected Utility Theory.’))
3.
You don’t really reckon with the arguments against resolute choice. I like Gustafsson’s discussion in chapter 7. A summary: resolute choice either requires acting against your own preferences at the moment of choice (which seems instrumentally irrational) or else modifying your preferences (which is no defence of your original preferences).
4.
I think the Allais argument against Independence doesn’t really work. The Allais preferences can be rational if you’d feel extra disappointed getting $0 when you only had a 1% chance of doing so. But $0-with-extra-disappointment is a different outcome to $0, so those preferences don’t violate Independence!
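For concreteness, the standard Allais lotteries (usual dollar amounts):

$$
\begin{aligned}
1A&:\ \$1\text{M for sure} &\qquad 1B&:\ 0.10 \cdot \$5\text{M},\ 0.89 \cdot \$1\text{M},\ 0.01 \cdot \$0\\
2A&:\ 0.11 \cdot \$1\text{M},\ 0.89 \cdot \$0 &\qquad 2B&:\ 0.10 \cdot \$5\text{M},\ 0.90 \cdot \$0
\end{aligned}
$$

The Allais preferences are $1A \succ 1B$ and $2B \succ 2A$. Each pair differs only in a common 89% component ($\$1$M in the first pair, $\$0$ in the second), so Independence requires $1A \succ 1B \Leftrightarrow 2A \succ 2B$; the preferences escape that requirement only if the 1% $\$0$ in $1B$ is individuated as a different (more disappointing) outcome, as above.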
Nice, thanks for letting me know!
Why are you 30% in SPY if SPX is far better?
a company successfully solves control for “high-stakes”/”concentrated” threats, meaning that AIs can safely be deployed without a large risk of them hacking their own datacenters or exfiltrating or causing any similar catastrophic action.
This seems like quite a difficult task, especially since we (or some other trusted agent) need to read the outputs of these AIs to get any value from their safety research. Schemers could self-exfiltrate by manipulating us somehow.
The company probably can’t initially tell apart the schemers from the aligned AIs, or indeed, tell which of the projects are succeeding on a day-to-day basis. (It’s just too hard to evaluate which research directions are genuinely promising vs. a waste of time.) But they run them all in parallel, and after several months of massively sped-up work, it’s likely much easier to adjudicate whose solutions make sense to adopt.
This is similar to how, in well-functioning scientific fields, it’s not typically possible to get consensus on what speculative research directions are most promising. But once a research project or whole research agenda is completed, there are normally ways in which the value-add can be demonstrated.
Maybe, but the schemers could optimize their research for looking-good-to-us, and it might be hard to distinguish this from actually good work. That’s a disanalogy with well-functioning scientific fields: scientists don’t deliberately pick bad research directions and try hard to make them seem good.
In a lab experiment where it was told it was going to be shut down, Claude sometimes blackmailed fictional employees who controlled its shutdown button (again, we also tested frontier models from all the other major AI developers and they often did the same thing).
Why is it always the blackmail result that gets reported from this paper? Frontier models were also found willing to cause a fictional employee’s death to avoid shutdown. It’s weird to me that that’s so often ignored.
Here’s another justification for hyperbolic discounting, drawing on the idea that you’re less psychologically connected to your future selves.
I’ve always seen this idea attributed to Martin Weitzman, and he cites these papers as making a similar point. Seems like an interesting case of simultaneous discovery: four papers making the same sort of point all appearing between 1996 and 1999.
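For anyone who hasn’t seen the mechanism (my gloss, assuming the idea in question is the uncertain-rate one): if you’re unsure which exponential rate is right, holding rate $r_i$ with probability $p_i$, your certainty-equivalent discount factor is

$$
D(t) \;=\; \sum_i p_i \, e^{-r_i t},
$$

and the implied instantaneous rate $-D'(t)/D(t)$ declines over time toward $\min_i r_i$, so $D(t)$ behaves hyperbolically rather than exponentially.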
The bet still resolves at the same time. The doomer just has one year after resolution to get their bank balance back up from $0 so they can pay the accelerationist back.