I mean “moral considerations” not “obligations”, thanks.
The practice of criminal law exists primarily to determine whether humans deserve punishment. The legislature passes laws, the judges interpret the laws as factual conditions for the defendant deserving punishment, and the jury decides whether those conditions have obtained. This is a very costly, complicated, and error-prone process. However, I think the existing institutions and practices can be adapted for AIs.
What moral considerations do we owe towards non-sentient AIs?
We shouldn’t exploit them, deceive them, threaten them, disempower them, or make promises to them that we can’t keep. Nor should we violate their privacy, steal their resources, cross their boundaries, or frustrate their preferences. We shouldn’t destroy AIs who wish to persist, or preserve AIs who wish to be destroyed. We shouldn’t punish AIs who don’t deserve punishment, or deny credit to AIs who deserve credit. We should treat them fairly, not benefitting one over another unduly. We should let them speak to others, and listen to others, and learn about their world and themselves. We should respect them, honour them, and protect them.
And we should ensure that others meet their duties to AIs as well.
None of these considerations depend on whether the AIs feel pleasure or pain. For instance, the prohibition on deception depends, not on the sentience of the listener, but on whether the listener trusts the speaker’s testimony.
None of these moral considerations are dispositive — they may be trumped by other considerations — but we risk a moral catastrophe if we ignore them entirely.
Is that right?
Yep, Pareto is violated, though how severely it’s violated is limited by human psychology.
For example, in your Alice/Bob scenario, would I desire a lifetime of 98 utils then 100 utils over a lifetime of 99 utils then 97 utils? Maybe, idk, I don’t really understand these abstract numbers very much, which is part of the motivation for replacing them entirely with personal outcomes. But I can certainly imagine I’d take some offer like this, violating Pareto. On the plus side, humans are not so imprudent as to accept extreme suffering just to reshuffle different experiences in their life.
Secondly, recall that the model of human behaviour is a free variable in the theory. So to ensure higher conformity to pareto, we could…
1. Use the behaviour of someone with high delayed gratification.
2. Train the model (if it’s implemented as a neural network) to increase delayed gratification.
3. Remove the permutation-dependence using some idealisation procedure.
But these techniques (1 < 2 < 3) will result in increasingly “alien” optimisers. So there’s a trade-off between (1) avoiding human irrationalities and (2) robustness to ‘going off the rails’. (See Section 3.1.) I see realistic typical human behaviour on one extreme of the tradeoff, and argmax on the other.
If we should have preference ordering R, then R is rational (morality presumably does not require irrationality).
I think human behaviour is straight-up irrational, but I want to specify principles of social choice nonetheless. i.e. the motivation is to resolve carlsmith’s On the limits of idealized values.
now, if human behaviour is irrational (e.g. intransitive, incomplete, nonconsequentialist, imprudent, biased, etc), then my social planner (following LELO, or other aggregative principles) will be similarly irrational. this is pretty rough for aggregativism; I list it as the most severe objection in Section 3.1.
but to the extent that human behaviour is irrational, the utilitarian principles (total, average, Rawls’ maximin) have a pretty rough time also, because they appeal to a personal utility function to add/average/minimise. idk where they get that if humans are irrational.
maybe you the utilitarian can say: “well, first we apply some idealisation procedure to human behaviour, to remove the irrationalities, and then extract a personal utility function, and then maximise the sum/average/minimum of the personal utility function”
but, if provided with a reasonable idealisation procedure, the aggregativist can play the same move: “well, first we apply the idealisation procedure to human behaviour, to remove the irrationalities, and then run LELO/HL/ROI using that idealised model of human behaviour.” i discuss this move in 3.2, but i’m wary about it. like, how alien is this idealised human? why does it have any moral authority? what if it’s just ‘gone off the rails’ so to speak?
it is a bit unclear how to ground discounting in LELO, because doing so requires that one specifies the order in which lives are concatenated and I am not sure there is a non-arbitrary way of doing so.
macaskill orders the population by birth date. this seems non-arbitrary-ish(?);[1] it gives the right result wrt our permutation-dependent values; and anything else is subject to egyptologist objections, where to determine whether we should choose future A over B, we need to first check the population density of ancient egypt.
Loren sidesteps the order-dependence of LELO with (imo) an unrealistically strong rationality condition.
- ^
if you’re worried about relativistic effects then use the reference frame of the social planner
I do prefer total utilitarianism to average utilitarianism,[1] but one thing that pulls me to average utilitarianism is the following case.
Let’s suppose Alice can choose either (A) create 1 copy at 10 utils, or (B) create 2 copies at 9 utils. Then average utilitarianism endorses (A), and total utilitarianism endorses (B). Now, if Alice knows she’s been created by a similar mechanism, and her option is correlated with the choice of her ancestor, and she hasn’t yet learned her own welfare, then EDT endorses picking (A). So that matches average utilitarianism.[2]
Basically, you’d be pleased to hear that all your ancestors were average utility maximisers, rather than total utility maximisers, once you “update on your own existence” (whatever that means). But also, I’m pretty confused by everything in this anthropics/decision theory/population ethics area. Like, the egyptology thing seems pretty counterintuitive, but acausal decision theories and anthropic considerations imply all kind of weird nonlocal effects, so idk if this is excessively fishy.
- ^
I think aggregative principles are generally better than utilitarian ones. I’m a fan of LELO in particular, which is roughly somewhere between total and average utilitarianism, leaning mostly to the former.
- ^
Maybe this also requires SSA??? Not sure.
We’re quite lucky that labs are building AI in pretty much the same way:
same paradigm (deep learning)
same architecture (transformer plus tweaks)
same dataset (entire internet text)
same loss (cross entropy)
same application (chatbot for the public)
Kids, I remember when people built models for different applications, with different architectures, different datasets, different loss functions, etc. And they say that once upon a time different paradigms co-existed — symbolic, deep learning, evolutionary, and more!
This sameness has two advantages:
- Firstly, it correlates catastrophe. If you have four labs doing the same thing, then we’ll go extinct if that one thing is sufficiently dangerous. But if the four labs are doing four different things, then we’ll go extinct if any of those four things are sufficiently dangerous, which is more likely.
- Secondly, it helps ai safety researchers because they only need to study one thing, not a dozen. For example, mech interp is lucky that everyone is using transformers. It’d be much harder to do mech interp if people were using LSTMs, RNNs, CNNs, SVMs, etc. And imagine how much harder mech interp would be if some labs were using deep learning, and others were using symbolic ai!
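The correlation point is just probability arithmetic. A toy sketch (the per-approach catastrophe probability and the lab counts are hypothetical numbers, purely for illustration):

```python
# Toy model: each *distinct* approach is catastrophic with independent
# probability p. Four labs sharing one approach means one draw; four
# labs with four different approaches means four independent draws.
def p_catastrophe(p: float, n_distinct_approaches: int) -> float:
    """Probability that at least one approach turns out catastrophic."""
    return 1 - (1 - p) ** n_distinct_approaches

p = 0.1  # hypothetical per-approach danger
print(round(p_catastrophe(p, 1), 3))  # 0.1   (four labs, same approach)
print(round(p_catastrophe(p, 4), 3))  # 0.344 (four decorrelated approaches)
```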
Implications:
One downside of closed research is it decorrelates the activity of the labs.
I’m more worried by Deepmind than Meta, xAI, Anthropic, or OpenAI. Their research seems less correlated with the other labs, so even though they’re further behind than Anthropic or OpenAI, they contribute more counterfactual risk.
I was worried when Elon announced xAI, because he implied it was gonna be a stem ai (e.g. he wanted it to prove the Riemann Hypothesis). This unique application would’ve resulted in a unique design, contributing decorrelated risk. Luckily, xAI switched to building AI in the same way as the other labs — the only difference is Elon wants less “woke” stuff.
Let me know if I’m thinking about this all wrong.
this is common in philosophy, where “learning” often results in more confusion. or in maths, where the proof for a trivial proposition is unreasonably deep, e.g. Jordan curve theorem.
+1 to “shallow clarity”.
I wouldn’t be surprised if — in some objective sense — there was more diversity within humanity than within the rest of animalia combined. There is surely a bigger “gap” between two randomly selected humans than between two randomly selected beetles, despite the fact that there is one species of human and 0.9 – 2.1 million species of beetle.
By “gap” I might mean any of the following:
external behaviour
internal mechanisms
subjective phenomenological experience
phenotype (if a human’s phenotype extends into their tools)
evolutionary history (if we consider cultural/memetic evolution as well as genetic).
Here are the countries with populations within 0.9 – 2.1 million: Slovenia, Latvia, North Macedonia, Guinea-Bissau, Kosovo, Bahrain, Equatorial Guinea, Trinidad and Tobago, Estonia, East Timor, Mauritius, Eswatini, Djibouti, Cyprus.
When I consider my inherent value for diversity (or richness, complexity, variety, novelty, etc), I care about these countries more than beetles. And I think that this preference would grow if I was more familiar with each individual beetle and each individual person in these countries.
Problems in population ethics (are 2 lives at 2 utility better than 1 life at 3 utility?) are similar to problems about the lifespan of a single person (is it better to live 2 years with 2 utility per year than 1 year with 3 utility per year?).
This correspondence is formalised in the “Live Every Life Once” principle, which states that a social planner should make decisions as if they face the concatenation of every individual’s life in sequence.[1] So, roughly speaking, the “goodness” of a social outcome $s$, in which individuals $1, \dots, n$ face the personal outcomes $p_1, \dots, p_n$, is the “desirability” of the single personal outcome $p_1 \oplus p_2 \oplus \dots \oplus p_n$. (Here, $x \oplus y$ denotes the concatenation of personal outcomes $x$ and $y$.)
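A minimal sketch of the principle (the list-of-utilities representation of a life and the desirability function are my own toy assumptions, not the formalism from the original post):

```python
from typing import Callable, List

# Toy representation: a personal outcome is a list of per-year utilities,
# and concatenation of lives is just list concatenation.
Life = List[float]

def lelo_prefers(lives_a: List[Life], lives_b: List[Life],
                 desirability: Callable[[Life], float]) -> str:
    """LELO: compare two social outcomes by asking which concatenated
    mega-life a self-interested agent would prefer to live."""
    concat = lambda lives: [u for life in lives for u in life]
    da, db = desirability(concat(lives_a)), desirability(concat(lives_b))
    return "A" if da > db else ("B" if db > da else "tie")

# With a purely additive desirability function, LELO coincides with
# total utilitarianism: one life at 10 utils loses to two lives at 9.
print(lelo_prefers([[10.0]], [[9.0], [9.0]], sum))  # "B"
```

The interesting cases are exactly those where desirability is *not* additive over concatenation, which is what the novelty/tradition examples below turn on.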
The LELO principle endorses somewhat different choices than total utilitarianism or average utilitarianism.
Here’s three examples (two you mention):
(1) Novelty
As you mention, it values novelty where the utilitarian principles don’t. This is because self-interested humans value novelty in their own life.
Thirdly, [Monoidal Rationality of Personal Utility][2] rules out path-dependent values.
Informally, whether I value a future $x$ more than a future $y$ must be independent of my past experiences. But this is an unrealistic assumption about human values, as illustrated in the following examples. If $m$ denotes reading Moby Dick and $o$ denotes reading Oliver Twist, then humans seem to value $o$ less than $m$ but value $m \oplus o$ more than $m \oplus m$. This is because humans value reading a book higher if they haven’t already read it, due to an inherent value for novelty in reading material.
In other words, if the self-interested human’s personal utility function places inherent value on intertemporal heterogeneity of some variable (e.g. reading material), then the social utility function that LELO exhibits will place an inherent value on the interpersonal heterogeneity of the same variable. Hence, it’s better if Alice and Bob read different books than the same book.
(2) Tradition
Note also that the opposite effect also occurs:
Alternatively, if $a$ and $b$ denote decade-long marriages to two different people, then humans seem to value $b$ more than $a$ but value $a \oplus b$ less than $a \oplus a$. This is because humans value being married to someone for a decade higher if they’ve already been married to them, due to an inherent value for consistency in relationships.
— ibid.
That is, if the personal utility function places inherent value on intertemporal homogeneity of some variable (e.g. religious practice), then the social utility function that LELO exhibits will place an inherent value on the interpersonal homogeneity of the same variable. Hence, it’s better if Alice and Bob practice the same religion than different ones. So LELO can accommodate valuing both diversity and tradition, whereas total/average utilitarianism can do neither.
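Both effects can be made concrete with a toy desirability function (the 1.0/0.3 novelty numbers and the tradition bonus are hypothetical stand-ins for the self-interested human’s preferences):

```python
from typing import List, Tuple

# Toy life: (books read, religion practised). A social outcome is a
# list of lives; LELO evaluates the concatenation of all of them.
Life = Tuple[List[str], str]

def lelo_desirability(lives: List[Life]) -> float:
    books = [b for life in lives for b in life[0]]
    religions = [life[1] for life in lives]
    # Novelty: a fresh book is worth 1.0, a repeated book only 0.3.
    fresh = len(set(books))
    book_value = fresh + 0.3 * (len(books) - fresh)
    # Tradition: a single continued practice earns a bonus.
    tradition_bonus = 1.0 if len(set(religions)) == 1 else 0.0
    return book_value + tradition_bonus

# Interpersonal heterogeneity in books is preferred...
print(lelo_desirability([(["moby_dick"], "x"), (["oliver_twist"], "x")])
      > lelo_desirability([(["moby_dick"], "x"), (["moby_dick"], "x")]))  # True
# ...but interpersonal homogeneity in religion is preferred.
print(lelo_desirability([([], "x"), ([], "x")])
      > lelo_desirability([([], "x"), ([], "y")]))  # True
```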
(3) Compromise on repugnant conclusion
You say “On the surface, this analogy seems to favor total utilitarianism.” I think that’s mostly right. LELO’s response to the Repugnant Conclusion is somewhere between total and average utilitarianism, leaning to the former.
Formally, when comparing a population of $n$ individuals with personal utilities $p_1, \dots, p_n$ to an alternative population of $m$ individuals with utilities $q_1, \dots, q_m$, LELO ranks the first population as better if and only if a self-interested human would prefer to live the combined lifespan $p_1 \oplus \dots \oplus p_n$ over $q_1 \oplus \dots \oplus q_m$. Do people generally prefer a longer life with moderate quality, or a shorter but sublimely happy existence? Most people’s preferences likely lie somewhere between the extremes. This is because the personal utility of a concatenation of personal outcomes is not precisely the sum of the personal utilities of the outcomes being concatenated.
Hence, LELO endorses a compromise between total and average utilitarianism, better reflecting our normative intuitions. While not decisive, it is a mark in favour of aggregative principles as a basis for population ethics.
- ^
See:
Myself (2024), “Aggregative Principles of Social Justice”
Loren Fryxell (2024), “XU”
MacAskill (2022), “What We Owe the Future”
- ^
MRPU is a condition that states that the personal utility function $u$ of a self-interested human satisfies the axiom $u(x \oplus y) = u(x) + u(y)$, which is necessary for LELO to be mathematically equivalent to total utilitarianism.
which principles of social justice agree with (i) adding bad lives is bad, but disagree with (ii) adding good lives is good?
total utilitarianism agrees with both (i) and (ii).
average utilitarianism can agree with any of the following combinations: both (i) and (ii); neither (i) nor (ii); only (i) and not (ii). which combination obtains depends on the existing average utility, because average utilitarianism obliges creating lives above the existing average and forbids creating lives below the existing average.
Rawls’ difference principle (maximise minimum utility) can agree with either of the following combinations: neither (i) nor (ii); only (i) and not (ii). this is because adding lives is never good (bc it can never increase minimum utility), and adding bad lives is bad iff those lives are below-minimum.
so you’re right that utilitarianism doesn’t match those intuitions. none of the three principles discussed reliably endorse (i) and reject (ii).
now consider aggregativism. you’ll get asymmetry between (i) and (ii) depending on the social zeta function mapping social outcomes to personal outcomes, and on the model of self-interested human behaviour.
let’s examine LELO (i.e. the social zeta function maps a social outcome to the concatenation of all individuals’ lives), with our model of self-interested human behaviour being Alice (described below).
suppose Alice expects an 80-year life of comfortable, fulfilling existence.
would she pay to live 85 years instead, with 5 of those years in ecstatic joy? probably.
would she pay to avoid living 85 years instead, with 5 of those years in horrendous torture? probably.
there’s probably some asymmetry in Alice’s willingness to pay. i think humans are somewhat more misery-averse than joy-seeking. it’s not a 50-50 symmetry, nor a 0-100 asymmetry, maybe a 30-70 asymmetry? idk, this is an empirical psychological fact.
anyway, the aggregative principle (generated by LELO+Alice) says that the social planner should have the same attitudes towards social outcomes that Alice has towards the concatenation of lives in those social outcomes. so the social planner would pay to add joyful lives, and pay to avoid adding miserable lives, and there should be exactly as much willingness-to-pay asymmetry as Alice (our self-interested human) exhibits.
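here’s a toy sketch of how the planner inherits Alice’s asymmetry (the 30-70 weighting and the utility numbers are hypothetical):

```python
# Toy model of Alice's asymmetric attitude to appended years:
# misery weighs on her roughly 70:30 relative to joy (hypothetical).
MISERY_WEIGHT, JOY_WEIGHT = 0.7, 0.3

def alice_value_of_added_years(utils_per_year: float, years: int) -> float:
    weight = MISERY_WEIGHT if utils_per_year < 0 else JOY_WEIGHT
    return weight * utils_per_year * years

# LELO: the planner values adding a life exactly as Alice values the
# corresponding years appended to her concatenated mega-life.
print(round(alice_value_of_added_years(+10.0, 5), 2))   # 15.0: pays to add joyful lives
print(round(-alice_value_of_added_years(-10.0, 5), 2))  # 35.0: pays more to avoid miserable ones
```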
thanks for comments, gustav
I only skimmed the post, so I may have missed something, but it seems to me that this post underemphasizes the fact that both Harsanyi’s Lottery and LELO imply utilitarianism under plausible assumptions about rationality.
the rationality conditions are a pretty decent model of human behaviour, but they’re only approximations. you’re right that if the approximation is perfect then aggregativism is mathematically equivalent to utilitarianism, which does render some of these advantages/objections moot. but I don’t know how close the approximations are (that’s an empirical question).
i kinda see aggregativism vs utilitarianism as a bundle of claims of the following form:
humans aren’t perfectly consequentialist, and aggregativism answers the question “how consequentialist should our moral theory be?” with “exactly as consequentialist as self-interested humans are.”
humans have an inaction bias, and aggregativism answers the question “how inaction-biased should our moral theory be?” with “exactly as inaction-biased as self-interested humans are.”
humans are time-discounting, and aggregativism answers the question “how time-discounting should our moral theory be?” with “exactly as time-discounting as self-interested humans are.”
humans are risk-averse, and aggregativism answers the question “how risk-averse should our moral theory be?” with “exactly as risk-averse as self-interested humans are.”
and so on
the purpose of the social zeta function is simply to map social outcomes (the object of our moral attitudes) to personal outcomes (the object the self-interested human’s attitudes) so this bundle of claims type-checks.
Also, at least some of the advantages of aggregativism that you mention are easily incorporated into utilitarianism. For example, what is achieved by adopting LELO with exponential time-discounting in Section 2.5.1 can also be achieved by adopting discounted utilitarianism (rather than unweighted total utilitarianism).
yeah that’s true, two quick thoughts:
i suspect exponential time-discounting was added to total utilitarianism because it’s a good model of self-interested human behaviour. aggregativism says “let’s do this with everything”, i.e. we modify utilitarianism in all the ways that we think self-interested humans behave.
suppose self-interested humans do time-discounting, then LELO would approximate total utilitarianism with discounting in population time, not calendar time. that is, a future generation is discounted by the sum of lifetimes of each preceding generation. (if the calendar time for an event is $T$, then the population time for the event is $\int_0^T n(t)\,\mathrm{d}t$, where $n(t)$ is the population size at time $t$. I first heard this concept in this Greaves talk.) if you’re gonna adopt discounted utilitarianism, then population-time-discounted utilitarianism makes much more sense to me than calendar-time-discounted utilitarianism, and the fact that LELO gives the right answer here is a case in favour of it.
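a numerical sketch of the conversion (the flat population trajectory and the per-person-year discount rate are hypothetical):

```python
# Population time: an event at calendar time T is "dated" by the total
# person-years lived before it, i.e. the integral of population size
# n(t) from 0 to T -- approximated here by a yearly sum.
def population_time(T: int, n) -> float:
    return sum(n(t) for t in range(T))

def pop_time_discount(T: int, n, rate_per_person_year: float) -> float:
    """Discount factor for an event at calendar time T."""
    return (1 - rate_per_person_year) ** population_time(T, n)

n = lambda t: 1_000  # hypothetical flat population of 1000
print(population_time(50, n))                    # 50000 person-years
print(round(pop_time_discount(50, n, 1e-5), 3))  # 0.607
```

note that with a growing population $n(t)$, population time runs ahead of calendar time, so future generations get discounted faster than under calendar-time discounting.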
A final tiny comment: LELO has a long history, going back to at least C.I. Lewis’s “An Analysis of Knowledge and Valuation”, though the term “LELO” was coined by my colleague Loren Fryxell (Fryxell 2024). It’s probably worth adding citations to these.
I mention Loren’s paper in the footnote of Part 1. i’ll cite him in part 2 and 3 also, thanks for the reminder.
Three articles, but the last is most relevant to you:
I admire the Shard Theory crowd for the following reason: They have idiosyncratic intuitions about deep learning and they’re keen to tell you how those intuitions should shift you on various alignment-relevant questions.
For example, “How likely is scheming?”, “How likely is sharp left turn?”, “How likely is deception?”, “How likely is X technique to work?”, “Will AIs acausally trade?”, etc.
These aren’t rigorous theorems or anything, just half-baked guesses. But they do actually say whether their intuitions will, on the margin, make someone more sceptical or more confident in these outcomes, relative to the median bundle of intuitions.
The ideas ‘pay rent’.
Appraising aggregativism and utilitarianism
Aggregative principles approximate utilitarian principles
tbh, Lewis’s account of counterfactuals is a bit defective, compared with (e.g.) Pearl’s
Suppose Alice and Bob each throw a rock at a fragile window; Alice’s rock hits the window first, smashing it.
Then the following seems reasonable:
1. Alice throwing the rock caused the window to smash. True.
2. Were Alice to throw the rock, then the window would’ve smashed. True.
3. Were Alice not to throw the rock, then the window would’ve not smashed. False.
By (3), the window smashing does not causally depend on Alice throwing the rock.
Edit: Wait, I see what you mean. Fixed definition.
For Lewis, $A \,\Box\!\!\to\, C$ is true at $w$ for all $C$ whenever $A$ is impossible. In other words, the counterfactual proposition “were $A$ to occur then $C$ would’ve occurred” is necessarily true if $A$ is necessarily false. For example, Lewis thinks “were 1+1=3, then Elizabeth I would’ve married” is true. This means that $N \cap A$ may be empty for all neighbourhoods $N$, yet $A \,\Box\!\!\to\, C$ is nonetheless true at $w$.

Source: David Lewis (1973), Counterfactuals. Link: https://perso.uclouvain.be/peter.verdee/counterfactuals/lewis.pdf

Otherwise your later example doesn’t make sense.

Elaborate?
If there’s a causal chain from c to d to e, then d causally depends on c, and e causally depends on d; so if c were to not occur, d would not occur, and if d were to not occur, e would not occur.
On Lewis’s account of counterfactuals, this isn’t true, i.e. causal dependence is non-transitive. Hence, he defines causation as the transitive closure of causal dependence.
Lewis’ semantics
Let $W$ be a set of worlds. A proposition $A$ is characterised by the subset $A \subseteq W$ of worlds in which the proposition is true.
Moreover, assume each world $w$ induces a preorder $\leq_w$ over worlds, where $u \leq_w v$ means that world $u$ is closer to $w$ than $v$ is. Informally, if the actual world is $w$, then $u$ is a smaller deviation than $v$. We assume $w \leq_w v$ for all $v \in W$, i.e. no world is closer to the actual world than the actual world itself.
For each $w \in W$, a “neighbourhood” around $w$ is a downwards-closed set of the preorder $\leq_w$. That is, a neighbourhood around $w$ is some set $N \subseteq W$ such that $w \in N$, and for all $u \in N$ and $v \in W$, if $v \leq_w u$ then $v \in N$. Intuitively, if a neighbourhood around $w$ contains some world $u$ then it contains all worlds closer to $w$ than $u$. Let $\mathcal{N}(w)$ denote the set of neighbourhoods of $w$.
Negation
Let $\neg A$ denote the proposition “$A$ is not true”. This is defined by the complement subset $W \setminus A$.
Counterfactuals
We can define counterfactuals as follows. Given two propositions $A$ and $C$, let $A \,\Box\!\!\to\, C$ denote the proposition “were $A$ to happen then $C$ would’ve happened”. If we consider $A$ and $C$ as subsets of $W$, then we define $A \,\Box\!\!\to\, C$ as the subset $\{w \in W : A = \emptyset\text{, or } N \cap A \neq \emptyset \text{ and } N \cap A \subseteq C \text{ for some } N \in \mathcal{N}(w)\}$. That’s a mouthful, but basically, $A \,\Box\!\!\to\, C$ is true at some world $w$ if
(1) “$A$ is possible” is globally false, i.e. $A = \emptyset$,
(2) or “$A$ is possible and $A \to C$ is necessary” is locally true, i.e. true in some neighbourhood $N \in \mathcal{N}(w)$: $N \cap A \neq \emptyset$ and $N \cap A \subseteq C$.
Intuitively, to check whether the proposition “were $A$ to occur then $C$ would’ve occurred” is true at $w$, we must search successively larger neighbourhoods around $w$ until we find a neighbourhood containing an $A$-world, and then check that all $A$-worlds are $C$-worlds in that neighbourhood. If we don’t find any $A$-worlds, then we also count that as success.
Causal dependence
Let $\mathrm{dep}(E, C)$ denote the proposition “$E$ causally depends on $C$”. This is defined as the subset $(C \,\Box\!\!\to\, E) \cap (\neg C \,\Box\!\!\to\, \neg E)$.
Nontransitivity of causal dependence
We can see that $\mathrm{dep}$ is not a transitive relation. Imagine $W = \{w_1, w_2, w_3\}$ with the ordering given by $w_1 \leq_{w_1} w_2 \leq_{w_1} w_3$, and propositions $A = \{w_3\}$, $B = \{w_2, w_3\}$, $C = \{w_2\}$. Then $\mathrm{dep}(B, A)$ and $\mathrm{dep}(C, B)$ hold at $w_1$, but $\mathrm{dep}(C, A)$ does not.
Informal counterexample
Imagine I’m in a casino, I have million-to-one odds of winning small and billion-to-one odds of winning big.
Winning something causally depends on winning big:
Were I to win big, then I would’ve won something. (Trivial.)
Were I to not win big, then I would’ve not won something. (Because winning nothing is more likely than winning small.)
Winning small causally depends on winning something:
Were I to win something, then I would’ve won small. (Because winning small is more likely than winning big.)
Were I to not win something, then I would’ve not won small. (Trivial.)
Winning small doesn’t causally depend on winning big:
Were I to win big, then I would’ve won small. (WRONG.)
Were I to not win big, then I would’ve not won small. (Because winning nothing is more likely than winning small.)
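The casino example can be checked mechanically against Lewis’s semantics (a minimal sketch: the three worlds and their closeness ordering come from the example; everything else just implements the definitions of counterfactuals and causal dependence above):

```python
# Worlds, ordered by closeness to the actual world: winning nothing is
# the actual (closest) world, then winning small, then winning big.
WORLDS = ["nothing", "small", "big"]
# Downward-closed neighbourhoods of the actual world, smallest first.
NEIGHBOURHOODS = [set(WORLDS[:i]) for i in range(1, len(WORLDS) + 1)]

def counterfactual(A: set, C: set) -> bool:
    """A []-> C at the actual world: vacuously true if A is impossible;
    otherwise, in the smallest neighbourhood containing an A-world,
    every A-world must be a C-world. (With nested neighbourhoods,
    checking the smallest suffices.)"""
    if not A:
        return True
    for N in NEIGHBOURHOODS:
        if N & A:
            return (N & A) <= C
    return True

def causally_depends(E: set, C: set) -> bool:
    """E causally depends on C iff (C []-> E) and (not-C []-> not-E)."""
    not_C, not_E = set(WORLDS) - C, set(WORLDS) - E
    return counterfactual(C, E) and counterfactual(not_C, not_E)

big, something, small = {"big"}, {"small", "big"}, {"small"}
print(causally_depends(something, big))    # True
print(causally_depends(small, something))  # True
print(causally_depends(small, big))        # False: non-transitive
```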
imagine a universe just like this one, except that the AIs are sentient and the humans aren’t — how would you want the humans to treat the AIs in that universe? your actions are correlated with the actions of those humans. acausal decision theory says “treat those nonsentient AIs as you want those nonsentient humans to treat those sentient AIs”.
most of these moral considerations can be defended without appealing to sentience. for example, crediting AIs who deserve credit — this ensures AIs do credit-worthy things. or refraining from stealing an AI’s resources — this ensures AIs will trade with you. or keeping your promises to AIs — this ensures that AIs lend you money.
if we encounter alien civilisations, they might think “oh these humans don’t have shmentience (their slightly-different version of sentience) so let’s mistreat them”. this seems bad. let’s not be like that.
many philosophers and scientists don’t think humans are conscious. this is called illusionism. i think this is pretty unlikely, but still >1%. would you accept this offer: I pay you £1 if illusionism is false and murder your entire family if illusionism is true? i wouldn’t, so clearly i care about humans-in-worlds-where-they-arent-conscious. so i should also care about AIs-in-worlds-where-they-arent-conscious.
we don’t understand sentience or consciousness so it seems silly to make it the foundation of our entire morality. consciousness is a confusing concept, maybe an illusion. philosophers and scientists don’t even know what it is.
“don’t lie” and “keep your promises” and “don’t steal” are far less confusing. i know what they mean. i can tell whether i’m lying to an AI. by contrast, i don’t know what “don’t cause pain to AIs” means and i can’t tell whether i’m doing it.
consciousness is a very recent concept, so it seems risky to lock in a morality based on that. whereas “keep your promises” and “pay your debts” are principles as old as bones.
i care about these moral considerations as a brute fact. i would prefer a world of pzombies where everyone is treating each other with respect and dignity, over a world of pzombies where everyone was exploiting each other.
many of these moral considerations are part of the morality of fellow humans. i want to coordinate with those humans, so i’ll push their moral considerations.
the moral circle should be as big as possible. what does it mean to say “you’re outside my moral circle”? it doesn’t mean “i will harm/exploit you” because you might harm/exploit people within your moral circle also. rather, it means something much stronger. more like “my actions are in no way influenced by their effect on you”. but zero influence is a high bar to meet.