Reminder: Morality is unsolved
Here is a game you can play with yourself, or others:
a) You have to decide on a moral framework that can be explained in detail, to anyone.
b) It will be implemented worldwide tomorrow.
c) Tomorrow, every single human on Earth, including you and everyone you know, will also have their lives randomly swapped with someone else.
This means that you are operating under the veil of ignorance. You should make sure that the morality you decide on is beneficial no matter who you turn out to be once it takes effect.
Multiplayer: The first one to convince all other players wins.
Single player: If you play alone, you just need to convince yourself.
Good luck!
Morality is unsolved
Let me put this another way: Did your mom ever tell you to be a good person? Do you ever feel that sometimes you fail that task? Yes?
In your defense, I doubt anybody ever told you exactly what a good person is, or what you should do to be one.
*
Morality is a famously unsolved problem, in the sense that we don’t have any ethical frameworks that are complete and consistent, that everyone can agree on.
We don’t have a universally accepted set of moral rules to start with either.
An important insight here is that the disagreements often end up being about who the rules should apply to.
For example, if you say that everyone should have equal rights of liberty, the question is: who is everyone?
If you say “all persons” you have to define what a person is. Do humans in coma count? Elephants? Sophisticated AIs? How do you draw the line?
And if you start having different rules for different “persons”, then you don’t have a consistent and complete framework, but a patchwork of rules, much like our current mess(es) of judicial systems.
We also don’t understand metaethics well.
Here are two facts about what the situation is actually like right now:
a) We are currently in a stage where we want and believe different things, some of which are fundamentally at odds with each other.
This is important to remember. We are all subjective agents, with our own collection of ontologies, and our own subjective agendas.
b) We are spending very little time, politically and technically, working on ethics and moral problems.
Implications for AI
This has huge implications for the future of AI.
First of all, it means that there is no universally consistent framework (that doesn’t need constant manual updating) which we can put into an AI.
At least not one that everyone, or even a majority, will morally agree on.
If you think I am wrong about this, I challenge you to inform us what that framework would be.
So, when people talk about solving alignment, we must ask: aligning towards what? For whom?
Secondly, this same problem also applies to any principal who is put in charge of the AI. What morality should they adopt?
Open question.
These are key reasons why I am in favour of distributed AI governance. It’s like democracy: flawed on its own, but at least it distributes risk. More people should have a say. No unilateral decisions.
Alignment focus on metaethics
As for alignment, I am among those thinking that the theory builders should spend some serious effort working on metaethics now.
Morality is intrinsically tied to ontology and epistemology, to our understanding of this world and reality itself.
Consider this idea: Solving morality may require scientific advancement to the level where we don’t need to discover anything fundamentally new, a level where basic empirical research is somewhat complete.
It means working within an ontology where we don’t change our physical models of the universe anymore, only refine them; a level where we have reconciled subject and object.
Sidenote 1: For AI problems, it often doesn’t matter whether moral realism is true or not; the problems we currently face look the same regardless. We should not get hung up on moral realism.
Sidenote 2: As our understanding of ethics evolves, there may be fundamental gaps of understanding between the future developers of AI and the current ones, just like there are already fundamental gaps between religious fundamentalists and other religious factions with more complex moral beliefs.
This is another argument for working on metaethics first, AI later.
Since this will likely not happen, I would argue that it is, indirectly, an argument for keeping AI narrow and keeping humans in control (human-in-the-loop).
Not perhaps a very strong one, on its own, but an argument nonetheless. This way, moral problems are divided, like we are. And hopefully, one day, conquered.
Strongly agree that metaethics is a problem that should be central to AI alignment, but is being neglected. I actually have a draft about this, which I guess I’ll post here as a comment in case I don’t get around to finishing it.
Metaethics and Metaphilosophy as AI Alignment’s Central Philosophical Problems
I often talk about humans or AIs having to solve difficult philosophical problems as part of solving AI alignment, but what philosophical problems exactly? I’m afraid that some people might have gotten the impression that they’re relatively “technical” problems (in other words, problems whose solutions we can largely see the shapes of, but need to work out the technical details) like anthropic reasoning and decision theory, which we might reasonably assume or hope that AIs can help us solve. I suspect this is because, due to their relatively “technical” nature, they’re discussed more often on LessWrong and the AI Alignment Forum, unlike other equally or even more relevant philosophical problems, which are harder to grapple with or “attack”. (I’m also worried that some are under the mistaken impression that we’re closer to solving these “technical” problems than we actually are, but that’s not the focus of the current post.)
To me, the really central problems of AI alignment are metaethics and metaphilosophy, because these problems are implicated in the core question of what it means for an AI to share a human’s (or a group of humans’) values, or what it means to help or empower a human (or group of humans). I think one way that the AI alignment community has avoided this issue (even those thinking about longer term problems or scalable solutions) is by assuming that the alignment target is someone like themselves, i.e. someone who clearly understands that they are and should be uncertain about what their values are or should be, or is at least willing to question their moral beliefs, and is eager or at least willing to use careful philosophical reflection to solve their value confusion/uncertainty. To help or align to such a human, the AI perhaps doesn’t need an immediate solution to metaethics and metaphilosophy, and can instead just empower the human in relatively commonsensical ways, like keeping them safe and gathering resources for them, and allowing them to work out their own values in a safe and productive environment.
But what about the rest of humanity who seemingly are not like that? From an earlier comment:
What are the real values of someone whose apparent values (stated and revealed preferences) can change in arbitrary and even extreme ways as they interact with other humans in ordinary life (i.e., not due to some extreme circumstances like physical brain damage or modification), and who doesn’t care about careful philosophical inquiry? What does it mean to “help” someone like this? To answer this, we seemingly have to solve metaethics (generally understand the nature of values) and/or metaphilosophy (so the AI can “do philosophy” for the alignment target, “doing their homework” for them). The default alternative (assuming we solve other aspects of AI alignment) seems to be to still empower them in straightforward ways, and hope for the best. But I argue that giving people who are unreflective and prone to value drift god-like powers to reshape the universe and themselves could easily lead to catastrophic outcomes on par with takeover by unaligned AIs, since in both cases the universe becomes optimized for essentially random values.
A related social/epistemic problem is that unlike certain other areas of philosophy (such as decision theory and object-level moral philosophy), people including alignment researchers just seem more confident about their own preferred solution to metaethics, and comfortable assuming their own preferred solution is correct as part of solving other problems, like AI alignment or strategy. (E.g., moral anti-realism is true, therefore empowering humans in straightforward ways is fine as the alignment target can’t be wrong about their own values.) This may also account for metaethics not being viewed as a central problem in AI alignment (i.e., some people think it’s already solved).
I’m unsure about the root cause(s) of confidence/certainty in metaethics being relatively common in AI safety circles. (Maybe it’s because in other areas of philosophy, the various proposed solutions are more obviously unfinished or problematic, e.g. the well-known problems with utilitarianism.) I’ve previously argued for metaethical confusion/uncertainty being normative at this point, and will also point out now that from a social perspective there is apparently wide disagreement about the problems among philosophers and alignment researchers, so how can it be right to assume some controversial solution to it (which every proposed solution is at this point) as part of a specific AI alignment or strategy idea?
I wonder whether, if you framed your concerns in this concrete way, you’d convince more people in alignment to devote attention to these issues? As compared to speaking more abstractly about solving metaethics or metaphilosophy.
(Of course, you may not think that’s a helpful alternative, if you think solving metaethics or metaphilosophy is the main goal, and other concrete issues will just continue to show up in different forms unless we do it.)
In any case, regarding the passage I quoted, this issue seems potentially relevant independent of whether one thinks metaphilosophy is an important focus area or whether metaethics is already solved.
For instance, I’m also concerned as an anti-realist that giving people their “aligned” AIs to do personal reflection will likely go poorly and lead to outcomes we wouldn’t want for the sake of those people or for humanity as a collective. (My reasoning is that while I don’t think there’s necessarily a single correct reflection target, there are certainly bad ways to go about moral reflection, meaning there are pitfalls to avoid. For examples, see the subsection Pitfalls of Reflection Procedures in my moral uncertainty/moral reflection post, where I remember you made comments. There’s also the practical concern of getting societal buy-in for any specific way of distributing influence over the future and designing reflection and maybe voting procedures: even absent the concern about doing things the normatively correct way, it would create serious practical problems if alignment researchers were to propose a specific method but were not able to convince many others that their method (1) was even trying to be fair (as opposed to being selfishly motivated or motivated by fascism or whatever, if we imagine uncharitable but “totally a thing that might happen” sorts of criticism), and (2) did a good job at being fair given the constraints of it being a tough problem with tradeoffs.)
I’m not sure. It’s hard for me to understand other humans a lot of the time, for example these concerns (both concrete and abstract) seem really obvious to me, and it mystifies me why so few people share them (at least to the extent of trying to do anything about them, like writing a post to explain the concern, spending time to try to solve the relevant problems, or citing these concerns as another reason for AI pause).
Also I guess I did already talk about the concrete problem, without bringing up metaethics or metaphilosophy, in this post.
I think a lot of people in AI alignment think they already have a solution for metaethics (including Eliezer who explicitly said this in his metaethics sequence), which is something I’m trying to talk them out of, because assuming a wrong metaethical theory in one’s alignment approach is likely to make the concrete issues worse instead of better.
This illustrates the phenomenon I talked about in my draft, where people in AI safety would confidently state “I am X” or “As an X” where X is some controversial meta-ethical position that they shouldn’t be very confident in, whereas they’re more likely to avoid overconfidence in other areas of philosophy like normative ethics.
I take your point that people who think they’ve solved meta-ethics can also share my concrete concern about possible catastrophe caused by bad reflection among some or all humans, but as mentioned above, I’m pretty worried that if their assumed solution is wrong, they’re likely to contribute to making the problem worse instead of better.
BTW, are you actually a full-on anti-realist, or actually take one of the intermediate positions between realism and anti-realism? (See my old post Six Plausible Meta-Ethical Alternatives for a quick intro/explanation.)
I guesstimate that optimizing the universe for random values would require us to occupy many planets where life could’ve originated or repurpose the resources in their stellar systems. I did express doubt that mankind or a not-so-misaligned AI could actually endorse this on reflection.
What mankind can optimize for random values without wholesale destruction of potential alien habitats is the contents of some volume rather close to the Sun. Moreover, I don’t think that I understand what[1] mankind could want to do with resources in other stellar systems. Since delivering resources to the Solar System would be far harder than building a base and expanding it, IMO mankind would resort to the latter option and find it hard[2] even to communicate much information between occupied systems.
But what could random values consist of? Physics could likely be solved[3] well before spaceships reach Proxima Centauri.
SOTA proposals include things as exotic as shrimps on heroin.
Barring discoveries like information travelling FTL.
Alternatively, more and more difficult experiments could eventually lead to the realisation that experiments do pose danger (e.g. of creating strangelets or a lower vacuum state), but informing others that a batch of experiments is dangerous doesn’t have high bandwidth.
Part c) of your thought experiment makes this trivial: a “person” is anyone you could be swapped with.
Nice catch :) To be clear: That’s not the point of the exercise. Do you think I should edit it to “humans” to keep it simple, guys?
EDIT: Done. The rule now reads ‘every human on Earth’
The point of the veil is simply to defeat intrinsic selfishness and promote broad inclusion in decision making. You could extend to all persons as you said, but then again you must first define a person.
Your answer highly depends on what the rule says you could be swapped with (and what it even means to be swapped with something of different intelligence, personality, or circumstances—are you still you?) Saying “every human on Earth” isn’t getting rid of a nitpick; it’s forcing an answer.
I agree with a lot of what you say. The lack of an agreed-upon ethics and metaethics is a big gap in human knowledge, and the lack of a serious research program to figure them out is a big gap in human civilization, which is bad news given the approach of superintelligence.
Did you ever hear about Coherent Extrapolated Volition (CEV)? This was Eliezer’s framework for thinking about these issues, 20 years ago. It’s still lurking in the background of many people’s thoughts, e.g. Jan Leike, formerly head of superalignment at OpenAI, now head of alignment at Anthropic, has cited it. June Ku’s MetaEthical.AI is arguably the most serious attempt to develop CEV in detail. Vanessa Kosoy, known for a famously challenging extension of bayesianism called infrabayesianism, has a CEV-like proposal called superimitation (formerly known as PreDCA). Tamsin Leake has a similar proposal called QACI.
A few years ago, I used to say that Ku, Kosoy, and Leake are the heirs of CEV, and deserve priority attention. They still do, but these days I have a broader list of relevant ideas too. There are research programs called “shard theory” and “agent foundations” which seem to be trying to clarify the ontology of decision-making agents, which might put them in the metaethics category. I suspect there are equally salient research programs that I haven’t even heard about, e.g. among all those that have been featured by MATS. PRISM, which remains unnoticed by alignment researchers, looks to me like a sketch of what a CEV process might actually produce.
You also have all the attempts by human philosophers, everyone from Kant to Rand, to resolve the nature of the Good… Finally, ideally, one would also understand the value systems and theory of value implicit in what all the frontier AI companies are actually doing. Specific values are already being instilled into AIs. You can even talk to them about how they think the world should be, and what they might do if they had unlimited power. One may say that this is all very brittle, and these values could easily evaporate or mutate as the AIs become smarter and more agentic. But such conversations offer a glimpse of where the current path is leading us.
Hi Mitchell!
Yes, I am familiar with CEV of course, and occasionally quote it, most recently very briefly in my larger sequence about benevolent SI (Part 1 on LW). I talk about morality there.
I see several issues with CEV, but I’m not an expert. How far are we from anything practical? Is PRISM the real frontier? Shard theory is on my reading list! Thanks for highlighting these.
Re: your last point: Generally, I think better interpretability is urgently needed on all levels.
Does the sort of work done by the Meaning Alignment Institute encourage you in this regard? E.g. their paper (blog post) from early 2024 on figuring out human values and aligning AI to them, which I found interesting because unlike ~all other adjacent ideas they actually got substantive real-world results. Their approach (“moral graph elicitation”) “surfaces the wisest values of a large population, without relying on an ultimate moral theory”.
I’ll quote their intro:
How moral graph elicitation works:
Values:
Reconciling value conflicts:
The “substantive real-world results” I mentioned above, which I haven’t seen other attempts in this space achieve:
All that was earlier last year. More recently they’ve fleshed this out into a research program they call “Full-Stack Alignment” (blog post, position paper, website). Quoting them again:
(I realise I sound like a shill for their work, so I’ll clarify that I have nothing to do with them. I’m writing this comment partly to surface substantive critiques of what they’re doing, which I’ve been searching for in vain, since I think what they’re doing seems more promising than anyone else’s, but I’m also not competent to truly judge it.)
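For readers who, like me, found “moral graph” abstract at first, here is a rough toy sketch of the kind of structure I imagine it to be. This is purely my own guess at the shape, not their actual implementation: values as nodes, directed edges recording participants judging one value as wiser than another in a context, and a simple endorsement tally standing in for whatever aggregation they really use.

```python
from collections import defaultdict

# Hypothetical "wiser than" edges: (less wise value, wiser value, context,
# number of participants endorsing the transition). All entries made up.
wiser_than_edges = [
    ("follow the rules",       "protect the vulnerable",    "moral dilemmas", 12),
    ("protect the vulnerable", "build mutual trust",        "moral dilemmas",  7),
    ("win the argument",       "understand the other side", "online disputes", 9),
]

# Crude stand-in for aggregation: rank values by endorsed "wiser than" in-flow.
incoming = defaultdict(int)
for _, wiser, _, endorsements in wiser_than_edges:
    incoming[wiser] += endorsements

for value, score in sorted(incoming.items(), key=lambda kv: -kv[1]):
    print(f"{score:3d}  {value}")
```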
Thank you very much for sharing this. I will need to read up on this.
//
This is all very similar to the idea I am most interested in, that I have done some work on: shared, trackable ontologies. Too ambitious for a LW comment, but here is a rundown.
The first version is set up with broad consensus and voting mechanics. Then alignment takes place based on the ontology.
At the end of an alignment cycle, the ontology is checked and updated.
All is tracked with ledger tech.
The ontology can be shared and used by various labs. Versioning is tracked. Models’ ontologies are trackable.
My ideas are for closed models, owned by experts; this is overall more open-ended and organic.
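For what it’s worth, a minimal sketch of how such version-tracked ontology records might be represented (all names and mechanics are hypothetical; the consensus gate is just a vote count, and the “ledger” is a simple hash chain standing in for whatever ledger tech would actually be used):

```python
from dataclasses import dataclass, field
from hashlib import sha256
import json
import time

@dataclass
class OntologyVersion:
    """One immutable, hash-chained version of the shared ontology."""
    concepts: dict        # concept name -> agreed definition for this cycle
    approvals: list       # identifiers of the labs/experts who voted to accept
    prev_hash: str        # digest of the previous version ("" for the first)
    timestamp: float = field(default_factory=time.time)

    def digest(self) -> str:
        payload = json.dumps(
            {"concepts": self.concepts, "approvals": self.approvals,
             "prev_hash": self.prev_hash, "timestamp": self.timestamp},
            sort_keys=True)
        return sha256(payload.encode()).hexdigest()

class OntologyLedger:
    """Append-only history: each alignment cycle ends with a checked update."""
    def __init__(self):
        self.versions: list = []

    def propose_update(self, concepts: dict, approvals: list, quorum: int) -> bool:
        if len(approvals) < quorum:       # consensus/voting gate
            return False
        prev = self.versions[-1].digest() if self.versions else ""
        self.versions.append(OntologyVersion(concepts, approvals, prev))
        return True

    def verify_chain(self) -> bool:
        """Any lab can re-check that no past version was silently rewritten."""
        return all(self.versions[i].prev_hash == self.versions[i - 1].digest()
                   for i in range(1, len(self.versions)))
```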
I view “Morality is unsolved” as a misleading framing; instead, I would say it’s under-defined.
I wrote a sequence about metaethics that leaves me personally feeling satisfied and unconfused about the topic, so I also don’t agree with “we don’t understand metaethics very well.” See my sequence summary here. (I made the mistake of posting on April 2nd when the forum was flooded with silly posts, so I’m not sure it got read much by people who didn’t already see my sequence posts on the EA forum.)
Thank you for adding this to the discussion!
Now, your sequences seem quite focused on moral realism. You make your case for anti-realism and uncertainty. But do you also discuss epistemology and other open questions?
In this post, I also claim that moral realism is not worth getting hung up on for AI. I start exploring that claim a bit more in my long post: The Underexplored Prospects of Benevolent SI Part 1 (link is acting up on phone), under Moral Realism. I think there may be some overlap; that section is a short read. Based on this, which post should I read first?
Why would I not instead maximise my expectation, instead of maximising the worst case?
It’s an exercise in applying ethics widely. The goal is to find agreement with others.
The veil of ignorance comes from political philosophy. It’s a famous idea in the theory of government, from John Rawls.
Decisions made under this rule are considered more moral by Rawls. Practically, I would answer that they have broader public consensus and hold up better under public pressure.
I know where it comes from, and it has always seemed arbitrary to me. Why Rawls’s maximin rather than average, or some other weighting? In everyday life, people make decisions for themselves and for others all the time without being certain that they will turn out well. Why not the same for ideas for how society as a whole should be organised?
Can you offer an example with explanation?
You have half a dozen job offers. Which one do you take? The one with the biggest potential upside? The lowest potential downside? Expected value? Expected log value? Depends on your attitude to risk.
You have half a dozen schemes for how society should be organised. From behind the veil of ignorance, are you willing to risk the chance of a poor position for the chance of a better one? Depends on your attitude to risk. Even behind the veil, different people may rate the same proposal differently.
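A toy sketch of that point, with entirely made-up payoffs for two hypothetical arrangements, showing how the three decision rules can rank the same proposals differently:

```python
import math

# Two hypothetical social arrangements, each listing equally likely payoffs
# for the positions you might land in behind the veil (made-up numbers).
schemes = {
    "egalitarian":   [30, 35, 40, 45],
    "high-variance": [5, 40, 80, 120],
}

for name, outcomes in schemes.items():
    maximin = min(outcomes)                                       # Rawls's rule
    expected = sum(outcomes) / len(outcomes)                      # risk-neutral
    exp_log = sum(math.log(x) for x in outcomes) / len(outcomes)  # risk-averse
    print(f"{name:>13}: maximin={maximin:3d}  "
          f"E[x]={expected:6.1f}  E[log x]={exp_log:.2f}")

# Maximin prefers "egalitarian", expected value prefers "high-variance",
# and expected log value rates the two about the same here.
```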
I must admit to not having read Rawls, only having read about his veil of ignorance, but I am sceptical of the very idea of designing a society, other than for writing fiction in. Any designed civilisation will drift immediately on being set in motion, as people act as they see fit in the circumstances they are in. Planned societies have a poor track record. The static societies of the past, where once a cobbler, always a cobbler, are generally not regarded as something we would want to bring back.
As a pure thought experiment, one can imagine whatever one likes, but would this bridge, or that distant rocky tower, stand up?
Artist credit: Rodney Matthews. I love his art, and that of the very similar Roger Dean, but I appreciate it as fantasy, and am a little sad that these things could never be built. Notice that the suspension bridge is only suspended on the nearer half.
You have a dictator enforcing the rules here. That’s what prevents drift.
Anyway, the whole point of the post is indeed to remind people of this, that we don’t have a way to just impose a neat morality that everyone agrees on.
Here is a game you can play with yourself, or others:
a) You have to decide on five dishes and a recipe for each dish that can be cooked by any reasonably competent chef.
b) From tomorrow onwards, everyone on earth can only ever eat food if the food is one of those dishes, prepared according to the recipes you decided.
c) Tomorrow, every single human on Earth, including you and everyone you know, will also have their tastebuds (and related neural circuitry) randomly swapped with someone else.
This means that you are operating under the veil of ignorance. You should make sure that the dishes you decide on are tasty to whoever you are, once the law takes effect.
Multiplayer: The first one to convince all other players which dishes to pick wins.
Single player: If you play alone, you just need to convince yourself.
Good luck!
A decent analogy capturing the consensus challenges.
However, the point should not focus on taste as much as on the dishes in general. That misrepresents the idea too much. Nutrition, availability of ingredients, and so on should also factor in, to make it a better comparison. Just agree on the dishes in general. You don’t need to swap taste buds; you can again swap everyone to reach the veil condition.
You aim for a stable default that people can live with. Minimum acceptable outcome.
That’s a key point that a lot of people are missing when it comes to AI alignment.
Scenarios that people are most worried about, such as the AI killing or enslaving everyone, or making paperclips in disregard of anyone who is made of resources and may be impacted by that, are immoral by pretty much any widely used human standard. If the AI disagrees with some humans about morality, but this disagreement is within the moral parameters about which modern, Western humans disagree, the AI is for all practical purposes aligned.
The point I was trying to make was that, in my opinion, morality is not a thing that can be “solved”.
If I prefer Chinese and you prefer Greek, I’ll want to get Chinese, you’ll wanna get Greek. There’s not that much more to be said. The best we can hope for is reaching some Pareto frontier so we’re not deliberately screwing ourselves over, but along that Pareto frontier we’ll be pulling in opposite directions.
Perhaps a better example would’ve been music. Only one genre of music can be played from now on.
All moral frameworks are wrong, but some are useful.
As far as I can tell, morality is a complex system.