Elliott Thornley (EJT)
[Edit: 2009 in fact!]
Derek Parfit wrote up some thoughts along these lines in 1984:
I shall first distinguish threats from warnings. When I say that I shall do X unless you do Y, call this a warning if my doing X would be worse for you but not for me, and a threat if my doing X would be worse for both of us. Call me a threat‐fulfiller if I would always fulfil my threats.
Suppose that, apart from being a threat‐fulfiller, someone is never self‐denying. Such a person would fulfil his threats even though he knows that this would be worse for him. But he would not make threats if he believed that doing so would be worse for him. This is because, apart from being a threat‐fulfiller, this person is never self‐denying. He never does what he believes will be worse for him, except when he is fulfilling some threat. This exception does not cover making threats.
Suppose that we are all both transparent and never self‐denying. If this was true, it would be better for me if I made myself a threat‐fulfiller, and then announced to everyone else this change in my dispositions. Since I am transparent, everyone would believe my threats. And believed threats have many uses. Some of my threats could be defensive, intended to protect me from aggression by others. I might confine myself to defensive threats. But it would be tempting to use my known disposition in other ways. Suppose that the benefits of some co‐operation are shared between us. And suppose that, without my co‐operation, there would be no further benefits. I might say that, unless I get the largest share, I shall not co‐operate. If others know me to be a threat‐fulfiller, and they are never self‐denying, they will give me the largest share. Failure to do so would be worse for them.
Other threat‐fulfillers might act in worse ways. They could reduce us to slavery. They could threaten that, unless we become their slaves, they will bring about our mutual destruction. We would know that these people would fulfil their threats. We would therefore know that we can avoid destruction only by becoming their slaves.
The answer to threat‐fulfillers, if we are all transparent, is to become a threat‐ignorer. Such a person always ignores threats, even when he knows that doing so will be worse for him. A threat‐fulfiller would not threaten a transparent threat‐ignorer. He would know that, if he did, his threat would be ignored, and he would fulfil this threat, which would be worse for him.
If we were all both transparent and never self‐denying, what changes in our dispositions would be better for each of us? I answer this question in Appendix A, since parts of the answer are not relevant to the question I am now discussing. What is relevant is this. If we were all transparent, it would probably be better for each of us if he became a trustworthy threat‐ignorer. These two changes would involve certain risks; but these would be heavily outweighed by the probable benefits. What would be the benefits from becoming trustworthy? That we would not be excluded from those mutually advantageous agreements that require self‐denial. What would be the benefits from becoming threat‐ignorers? That we would avoid becoming the slaves of threat‐fulfillers.
We can next assume that we could not become trustworthy threat‐ignorers unless we changed our beliefs about rationality. Those who are trustworthy keep their promises even when they know that this will be worse for them. We can assume that we could not become disposed to act in this way unless we believed that it is rational to keep such promises. And we can assume that, unless we were known to have this belief, others would not trust us to keep such promises. On these assumptions, S tells us to make ourselves have this belief. Similar remarks apply to becoming threat‐ignorers. We can assume that we could not become threat‐ignorers unless we believed that it is always rational to ignore threats. And we can assume that, unless we have this belief, others would not be convinced that we are threat‐ignorers. On these assumptions, S tells us to make ourselves have this belief. These conclusions can be combined. S tells us to make ourselves believe that it is always irrational to do what we believe will be worse for us, except when we are keeping promises or ignoring threats.
Does this fact support these beliefs? According to S, it would be rational for each of us to make himself believe that it is rational to ignore threats, even when he knows that this will be worse for him. Does this show this belief to be correct? Does it show that it is rational to ignore such threats?
It will help to have an example. Consider
My Slavery. You and I share a desert island. We are both transparent, and never self‐denying. You now bring about one change in your dispositions, becoming a threat‐fulfiller. And you have a bomb that could blow the island up. By regularly threatening to explode this bomb, you force me to toil on your behalf. The only limit on your power is that you must leave my life worth living. If my life became worse than that, it would cease to be better for me to give in to your threats.
How can I end my slavery? It would be no good killing you, since your bomb will automatically explode unless you regularly dial some secret number. But suppose that I could make myself transparently a threat‐ignorer. Foolishly, you have not threatened that you would ignore this change in my dispositions. So this change would end my slavery.
Would it be rational for me to make this change? There is the risk that you might make some new threat. But since doing so would be clearly worse for you, this risk would be small. And, by taking this small risk, I would almost certainly gain a very great benefit. I would almost certainly end my slavery. Given the wretchedness of my slavery, it would be rational for me, according to S, to cause myself to become a threat‐ignorer. And, given our other assumptions, it would be rational for me to cause myself to believe that it is always rational to ignore threats. Though I cannot be wholly certain that this will be better for me, the great and nearly certain benefit would outweigh the small risk. (In the same way, it would never be wholly certain that it would be better for someone if he became trustworthy. Here too, all that could be true is that the probable benefits outweigh the risks.)
Assume that I have now made these changes. I have become transparently a threat‐ignorer, and have made myself believe that it is always rational to ignore threats. According to S, it was rational for me to cause myself to have this belief. Does this show this belief to be correct?
Let us continue the story.
How I End My Slavery. We both have bad luck. For a moment, you forget that I have become a threat‐ignorer. To gain some trivial end—such as the coconut that I have just picked—you repeat your standard threat. You say that, unless I give you the coconut, you will blow us both to pieces. I know that, if I refuse, this will certainly be worse for me. I know that you are reliably a threat‐fulfiller, who will carry out your threats even when you know that this will be worse for you. But, like you, I do not now believe in the pure Self‐interest Theory. I now believe that it is rational to ignore threats, even when I know that this will be worse for me. I act on my belief. As I foresaw, you blow us both up.
Is my act rational? It is not. As before, we might concede that, since I am acting on a belief that it was rational for me to acquire, I am not irrational. More precisely, I am rationally irrational. But what I am doing is not rational. It is irrational to ignore some threat when I know that, if I do, this will be disastrous for me and better for no one. S told me here that it was rational to make myself believe that it is rational to ignore threats, even when I know that this will be worse for me. But this does not show this belief to be correct. It does not show that, in such a case, it is rational to ignore threats.
We can draw a wider conclusion. This case shows that we should reject
(G2) If it is rational for someone to make himself believe that it is rational for him to act in some way, it is rational for him to act in this way.
Return now to B, the belief that it is rational to keep our promises even when we know that this will be worse for us. On the assumptions made above, S implies that it is rational for us to make ourselves believe B. Some people claim that this fact supports B, showing that it is rational to keep such promises. But this claim seems to assume (G2), which we have just rejected.
There is another objection to what these people claim. Even though S tells us to try to believe B, S implies that B is false. So, if B is true, S must be false. Since these people believe B, they should believe that S is false. Their claim would then assume
(G3) If some false theory about rationality tells us to make ourselves have a particular belief, this shows this belief to be true.
But we should obviously reject (G3). If some false theory told us to make ourselves believe that the Earth was flat, this would not show this to be so.
S told us to try to believe that it is rational to ignore threats, even when we know that this will be worse for us. As my example shows, this does not support this belief. We should therefore make the same claim about keeping promises. There may be other grounds for believing that it is rational to keep our promises, even when we know that doing so will be worse for us. But this would not be shown to be rational by the fact that the Self‐interest Theory itself told us to make ourselves believe that it was rational. It has been argued that, by appealing to such facts, we can solve an ancient problem: we can show that, when it conflicts with self‐interest, morality provides the stronger reasons for acting. This argument fails. The most that it might show is something less. In a world where we are all transparent—unable to deceive each other—it might be rational to deceive ourselves about rationality.
Yeah good question. I think unfortunately the POST structure by itself doesn’t give us any guarantees here: if spawning subagents looked like a good enough move conditional on each possible trajectory-length, POST-agents would do it. But note a couple points. First:
Note a nice feature of artificial agents. Making perfect copies is easy: as easy as copying and pasting a long list of parameters. If our original agent makes new agents by copying itself, these new agents will also be neutral and shutdownable. Making imperfect, non-neutral copies is hard. Our original agent would have to find some way to change the parameters of the new agents in a way that makes these agents non-neutral but preserves their capabilities. That would likely involve setting up some training process, in which case our original agent faces its own alignment problem in getting these new agents to do what it wants (Salib, 2024). It will also be much easier for us to watch out for (Greenblatt & Shlegeris, 2024) and train against the production of new non-neutral agents, as compared to us having to watch out for and train against every possible kind of shutdown-resistance.
Second, POST-agents won’t pay costs to shift probability mass away from earlier shutdowns. So if they try to create subagents, they won’t pay any costs to do so covertly. So likely these attempts will be easy for us humans to notice and intervene on. (I go into this in more detail here.)
That’s great to hear! Looking forward to seeing the results.
Wait can you explain the MATS 9 link? I couldn’t find any reference to reward hacking there.
For inoculation pretraining, I’m imagining we’d add data about good-but-reward-hacking AIs to pretraining, but still in post-training we’d try to train AIs to resist the tendency to hack reward and we’d try to train them to point out a rival’s attempts to hack reward, etc. The data about good-but-reward-hacking AIs is just there as a fallback in case we fail and accidentally train AIs to reward hack in post-training.
It’s possible that adding the data about good-but-reward-hacking AIs would nontrivially increase the probability that we fail, but I’m not sure. It seems probable to me that reward hacks are something that AIs will explore their way into, whether or not we add data about good-but-reward-hacking AIs to pretraining.
I’d try posting on r/Catholicism and emailing some reporters at Catholic outlets.
I’m a bit confused about what
solve agent foundations
means and why it would be necessary for
having a good picture of what we should even be looking for
Can you say more?
perhaps most actual Catholics aren’t bothered by this
My guess is that a lot of Catholics would be very bothered by this.
Yeah it seems to me like predictive optimization is often a kind of internal selective optimization.
This looks extremely cool and useful.
Coherence arguments give us an initial foothold. The basic insight is that if an agent’s preferences are inconsistent — say, it prefers A to B, B to C, and C to A — a clever adversary can cycle it through trades that each look locally acceptable but leave it strictly worse off, wasting resources for no gain. Any agent that reliably avoids such “dominated strategies” must behave as if it has consistent preferences representable by a utility function.
But I’m surprised to see this being said, and surprised to see that this post isn’t referenced anywhere. I’m biased of course, but I think it makes a true and important point. The post stirred up some controversy at the time (and of course it’s better to talk about the object-level issues), but I take the controversy to have mostly resolved in favor of the point I make. See e.g. this retrospective and Vanessa Kosoy calling the-coherence-argument-as-stated-in-the-quote-above a weak man.
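For readers who haven't seen the mechanics, here is a minimal sketch of the cyclic-preference money pump that the quoted passage describes. The item names, the fee, and the function names are all hypothetical. It illustrates only the exploitation of an agent with cyclic strict preferences; it says nothing about the stronger claim that avoiding dominated strategies requires a utility function, which is the part in dispute.

```python
# Hypothetical illustration of a money pump against cyclic preferences.
# (x, y) in PREFERS means the agent strictly prefers x to y.
PREFERS = {("A", "B"), ("B", "C"), ("C", "A")}


def accepts_trade(holding, offered):
    """The agent accepts any trade to an item it strictly prefers."""
    return (offered, holding) in PREFERS


def run_money_pump(start_item, start_money, fee, rounds):
    """Repeatedly offer the agent the item it prefers to its holding, for a fee."""
    # Map each item to the item the agent prefers over it.
    upgrade_for = {worse: preferred for (preferred, worse) in PREFERS}
    holding, money = start_item, start_money
    for _ in range(rounds):
        offered = upgrade_for[holding]
        if accepts_trade(holding, offered) and money >= fee:
            holding, money = offered, money - fee
    return holding, money


# Three trades take the agent around the full cycle A -> C -> B -> A.
final_item, final_money = run_money_pump("A", 10, 1, 3)
print(final_item, final_money)  # A 7
```

Each trade looks acceptable to the agent taken on its own, but after one trip around the cycle it holds its original item and has paid three fees.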
modus tollens Parfit’s ideas about personal identity. I think that’s a worthwhile angle, and more practically useful these days than using Reasons and Persons to dissuade people of egocentricity, which afaict is closer to Parfit’s original goals.
Can you explain a bit more what you mean by this?
Surely not any monetary stakes
The bet still resolves at the same time. The doomer just has one year after resolution to get their bank balance back up from $0 so they can pay the accelerationist back.
accelerationist has no reason to expect the doomer to save any money. And if the doomer does save it (plus enough extra to cover the doubled payback), they’ve effectively just locked up double the original capital until the end of the world
Couldn’t the doomer and accelerationist just agree that the doomer doesn’t have to pay until (e.g.) one year after the bet resolves? Then the doomer could spend all the money in anticipation of doom. If the doomer loses the bet, they can use the year after resolution to earn money to pay the accelerationist back.
(Of course there are extra practical difficulties here, like e.g. it might be hard for humans to earn money in the future. But I’m just talking about theoretical barriers.)
Okay that’s good to know. I’ve mostly encountered the argument as a reply to individuals worrying that they’re getting Pascal’s-mugged into working on AI safety. In that sort of case,
AI safety can’t be a Pascal’s mugging because p(doom) is high
is invalid, and the premise needed to make it valid --
If p(doom) is high, then p(you can avert doom) is high
-- is way too doubtful to leave implicit.
But if the argument is a reply to people worried that the world/US government is getting Pascal’s-mugged into working on AI safety, then the premise needed to make it valid is
If p(doom) is high, then p(the world/USG can avert doom) is high
and I agree that premise is safe/uncontroversial enough to leave implicit.
“sure, taking strong actions to reduce risk from misaligned AI would be doable, but isn’t doing this a Pascal’s mugging (implicitly responding to how much people have emphasized the stakes while less so arguing for the risk)”
I don’t really understand what this perspective is saying. Is the idea that people tend to grant the premise ‘If p(doom) is high, then p(you avert doom) is high’? I agree p(doom) being high would be sufficient in that case.
Wait is God flipping the coin load-bearing for the craziness? Because strangers making wild promises isn’t that crazy.
Though he could get greater intelligence and more information/understanding about the world without doing any reflection on his values. This seems fairly likely to me. People tend to be not that interested in reflecting on their values. He might even want to lock in his current values, since that’s rational according to his current values.
Nice post! Miscellaneous thoughts:
if individuals have VNM utility functions, and if the Pareto principle holds over groups, then a version of utilitarianism must be true.
Harsanyi’s theorem also requires that the social planner’s preferences satisfy the VNM axioms.
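As a rough statement of the theorem (the notation is mine, not the post's): suppose each individual $i = 1, \dots, n$ has preferences over lotteries that satisfy the VNM axioms and are represented by $u_i$, the social planner's preferences also satisfy the VNM axioms and are represented by $W$, and Pareto indifference holds. Then there are weights $w_i$ and a constant $c$ such that:

```latex
W(x) \;=\; \sum_{i=1}^{n} w_i \, u_i(x) \;+\; c
\qquad \text{for every lottery } x,
\quad w_i \in \mathbb{R},\; c \in \mathbb{R}.
```

With strong Pareto in place of Pareto indifference, the weights can be taken to be strictly positive. Without the VNM assumption on the planner's preferences, the weighted-sum conclusion doesn't follow.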
Not many philosophical proofs have been written
I think this all depends on what you mean by ‘many’. I’d guess maybe 10% of analytic philosophy papers include a proof of some kind, so that at least hundreds of proofs are published every year. And in a sense, every valid (spelled-out) argument is a proof.
I agree that the Claude proofs are pretty bad. The Arrhenius point is fairly obvious: what Arrhenius means by ‘theories’ in that paper is weak orders on populations, so if after taking into account moral uncertainty you still have a weak order, then the impossibility theorem still applies. (And later Arrhenius theorems relax both completeness and transitivity, so even departing from a weak order doesn’t get you off the hook.)
Claude makes this kind of point, but first it introduces an Agreement axiom that the proof never uses. Claude later comes close to admitting this (‘Agreement plays almost no role’), tries to walk it back (‘But Agreement rules out the escape route...’), and then fully admits it (‘the fundamental impossibility holds regardless’).
Which Claude model did you use? Did you use extended thinking? The flip-flopping above makes me think there was no extended thinking, and maybe a model with extended thinking would do better. (Though not much better I’d guess. I’ve found LLMs to be surprisingly bad at philosophy, even just the ‘understanding the view and its implications’ parts.)
I didn’t bother checking the second population ethics proof but it looks sloppy:
Axiom (Sufficient Comparability). For any pair of populations A, B that differ by at most some fixed bounded amount (e.g., adding or removing one person, or changing one person’s welfare level by a small amount), M(μ) must rank A and B (no incomparability for “local” comparisons).
Don’t any pair of populations “differ by at most some fixed bounded amount”? What is Claude doing including ‘e.g.’s in its formal statement of axioms?
With some additional effort, present-day LLMs might be capable of coming up with a good novel proof. If not, then it will likely be possible soon. Most kinds of moral philosophy might be difficult for AIs, but proofs are one area where AI assistance seems promising.
Yes, you’d think so given that they’ve gotten so good at math! But when I’ve tried using LLMs to help with formal philosophy, I’ve found them to be really surprisingly bad, even at parts that seem very math-loaded (e.g. inventing proofs, following arguments, grasping views and their implications, coming up with counterexamples, etc.). I’m not sure why this is. I guess part of it is that it’s hard to do RLVR on philosophy in the same way that you can do RLVR on math, but naively I’d expect more generalization from math to formal philosophy. Maybe the following is a factor: pretraining data doesn’t contain that much bad mathematical reasoning, but it contains a huge amount of bad philosophical reasoning.
As far as practical applications go, the idea with these proofs—and with a lot of moral philosophy—is that unrealistic cases can help us figure out which principles we want to endorse, and then we can apply these principles in more realistic cases.
I basically agree.
I don’t know. I think a lot of people would say something like, ‘They’re making excuses because they knew they couldn’t win the race.’
I don’t know. People might wonder, ‘If coordination is possible, why did they choose to shut down unilaterally?’