Please, Don’t Roll Your Own Metaethics
One day, when I was an intern at the cryptography research department of a large software company, my boss handed me an assignment to break a pseudorandom number generator passed to us for review. Someone in another department had invented it and planned to use it in their product, and wanted us to take a look first. This person must have had a lot of political clout or been especially confident in himself, because he rejected the standard advice that anything an amateur comes up with is very likely to be insecure, and that he should instead use one of the established, off-the-shelf cryptographic algorithms that have survived extensive cryptanalysis (code breaking) attempts.
My boss thought he had to demonstrate the insecurity of the PRNG by coming up with a practical attack (i.e., a way to predict its future output based only on its past output, without knowing the secret key/seed). There were three permanent, full-time professional cryptographers working in the research department, but none of them specialized in cryptanalysis of symmetric cryptography (which covers such PRNGs), so it might have taken them some time to figure out an attack. My time was obviously less valuable, and my boss probably thought I could benefit from the experience, so I got the assignment.
Up to that point I had no interest, knowledge, or experience with symmetric cryptanalysis either, but I was still able to quickly demonstrate a clean attack on the proposed PRNG, which succeeded in convincing the proposer to give up and use an established algorithm. Experiences like this are so common that everyone in cryptography quickly learns how easy it is to be overconfident about one’s own ideas, and many viscerally know the feeling of their own brain betraying them with unjustified confidence. As a result, “don’t roll your own crypto” is deeply ingrained in the culture and in people’s minds.
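To give a concrete sense of what a “practical attack” on a weak PRNG can look like, here is a minimal, hypothetical sketch (this is not the generator from the story, and all names and parameters are made up): a textbook linear congruential generator with a public modulus can be predicted indefinitely after observing just three consecutive outputs, with no knowledge of the seed.

```python
# Hypothetical illustration, not the PRNG from the story: a linear congruential
# generator (LCG) with a public modulus can be predicted from a few outputs,
# without knowing the secret multiplier, increment, or seed.

M = 2**31 - 1  # assume the modulus is public (a common textbook choice; it is prime)

def lcg(seed, a, c, m=M):
    """Yield an infinite stream of LCG outputs x_{n+1} = (a*x_n + c) mod m."""
    x = seed
    while True:
        x = (a * x + c) % m
        yield x

def recover_parameters(x0, x1, x2, m=M):
    """Recover (a, c) from three consecutive outputs.

    Since x2 - x1 = a*(x1 - x0) mod m, we get a by modular division;
    this works whenever x1 - x0 is invertible mod m (almost always for prime m).
    """
    a = (x2 - x1) * pow(x1 - x0, -1, m) % m
    c = (x1 - a * x0) % m
    return a, c

# The "attacker" only observes three consecutive outputs...
stream = lcg(seed=123456789, a=16807, c=12345)
x0, x1, x2 = next(stream), next(stream), next(stream)

# ...and can now predict every future output.
a, c = recover_parameters(x0, x1, x2)
predicted = (a * x2 + c) % M
assert predicted == next(stream)
print("next output predicted correctly:", predicted)
```

An LCG is used here only because its weakness is easy to demonstrate in a few lines; real proposals tend to fail in less obvious ways, which is exactly why outside review matters.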
If only it were so easy to establish something like this in “applied philosophy” fields, e.g., AI alignment! Alas, unlike in cryptography, it’s rarely possible to come up with “clean attacks” that clearly show that a philosophical idea is wrong or broken. The most that can usually be hoped for is to demonstrate some kind of implication that is counterintuitive or contradicts other popular ideas. But because “one man’s modus ponens is another man’s modus tollens”, if someone is sufficiently willing to bite bullets, then it’s impossible to directly convince them that they’re wrong (or should be less confident) this way. This is made even harder because, unlike in cryptography, there are no universally accepted “standard libraries” of philosophy to fall back on. (My actual experiences attempting this, and almost always failing, are another reason why I’m so pessimistic about AI x-safety, even compared to most other x-risk-concerned people.)
So I think I have to try something more meta, like drawing the above parallel with how easy it is to be overconfident in other fields, such as cryptography. Another meta line of argument is to consider how many people have strongly held, but mutually incompatible, philosophical positions. Behind a veil of ignorance, wouldn’t you want everyone to be less confident in their own ideas? Or think, “This isn’t likely to be a subjective question like morality/values might be, and what are the chances that I’m right and they’re all wrong? If I’m truly right, why can’t I convince most others of this? Is there a reason or evidence that I’m much more rational or philosophically competent than they are?”
Unfortunately, I’m pretty unsure whether any of these meta arguments will work either. If they do change anyone’s mind, please let me know in the comments or privately. Or if anyone has better ideas for how to spread a meme of “don’t roll your own metaethics”[1], please contribute. And of course counterarguments are welcome too, e.g., if people rolling their own metaethics is actually good in a way that I’m overlooking.
[1] To preempt a possible misunderstanding, I don’t mean “don’t try to think up new metaethical ideas”, but rather “don’t be so confident in your ideas that you’d be willing to deploy them in a highly consequential way, or build highly consequential systems that depend on them in a crucial way”. Similarly, “don’t roll your own crypto” doesn’t mean never try to invent new cryptography, but rather don’t deploy it unless there has been extensive review and consensus that it is likely to be secure.
What are you supposed to do other than roll your own metaethics?
“More research needed”, but here are some ideas to start with:
1. Try to design alignment/safety schemes that are agnostic to, or don’t depend on, controversial philosophical ideas. For certain areas that seem highly relevant and where there could potentially be hidden dependencies (such as metaethics), explicitly understand and explain why, under each plausible position that people currently hold, the alignment/safety scheme will result in a good or OK outcome. (E.g., why it leads to a good outcome regardless of whether moral realism or anti-realism is true, or any one of the other positions.)
2. Try to solve metaphilosophy, where potentially someone could make a breakthrough that everyone can agree is correct (after extensive review), which can then be used to speed up progress in all other philosophical fields. (This could also happen in another philosophical field, but that seems a lot less likely due to prior efforts/history. I don’t think it’s very likely in metaphilosophy either, but it is perhaps worth a try for those who may have a very strong comparative advantage in this.)
3. If 1 and 2 look hard or impossible, make this clear to non-experts (your boss, company leaders/board, government officials, the public), and don’t let them accept a “roll your own metaethics” solution, or a solution with implicit/hidden philosophical assumptions.
4. Support an AI pause/stop.
Hmm, I like #1.
#2 feels like it’s injecting some frame that’s a bit weird to inject here (don’t roll your own metaethics… but rolling your own metaphilosophy is okay?)
But also, I’m suddenly confused about who this post is trying to warn. Is it more like labs, or more like EA-ish people doing a wider variety of meta-work?
Maybe you missed my footnote?
and/or this part of my answer (emphasis added):
I think I mostly had alignment researchers (in and out of labs) as the target audience in mind, but it does seem relevant to others so perhaps I should expand the target audience?
I think I had missed this, but it doesn’t resolve the confusion in my #2 note. (Like, it still seems like something is weird about saying “solve metaphilosophy such that everyone can agree it is correct” is more worth considering than “solve metaethics such that everyone can agree it is correct”. I can totally buy that they’re qualitatively different, and maybe I have some guesses for why you think that. But I don’t think the post spells out why, and it doesn’t seem that obvious to me.)
I hinted at it with “prior efforts/history”, but to spell it out more: metaethics has had a lot more effort go into it in the past, so there’s less likely to be some kind of low-hanging fruit in idea space that, once picked, everyone will agree is the right solution.
For at least a couple of years, I’ve been banging my head against figuring out why this line of argument doesn’t seem convincing to many people. I think, ultimately, it’s probably because it feels defeatable by plans like “we will make AIs solve alignment for us, and solving alignment includes solving metaphilosophy & then object-level philosophy”. I think those plans are doomed in a pretty fundamental sense, but if you don’t think that, then they defeat many possible objections, including this one.
As they say: Everyone who is hopeful has their own reason for hope. Everyone who is doomful[1]...
In fact it’s not clear to me. I think there’s less variation, but still a fair bit.
There seem to me to be different categories of being doomful.
There are people who think that, for theoretical reasons, AI alignment is hard or impossible.
There are also people who are more focused on practical issues, like AI companies being run in a profit-maximizing way and having no incentive to care about most of the population.
Saying “You can’t AI box for theoretical reasons” is different from saying “Nobody will AI box for economic reasons”.
I think this fails to say how the analogy of cryptography transfers to metaethics. What properties of cryptography as a field make it such that you cannot roll your own? Is it just that many people have had the experience of trying to come up with a cryptographic scheme and failing, while there are perfectly good libraries that nobody has found exploits for yet?
That doesn’t seem very analogous to metaethics. As you say, it is hard to decisively show that a metaethical theory is “wrong”, and as far as I know there is no well-studied metaethical theory for which no exploits have been found.
So what exactly is the analogy?
The analogy is that in both fields people are by default very prone to being overconfident. In cryptography this can be seen in the phenomenon of people (especially newcomers who haven’t learned the lesson) confidently proposing new cryptographic algorithms, which end up being far easier to break than they expect. In philosophy this is a bit trickier to demonstrate, but I think it can be seen via a combination of:
- people confidently holding positions that are incompatible with other people’s confident positions
- a tendency to “bite bullets”, i.e., accept implications that are highly counterintuitive to others or even to themselves, instead of adopting more uncertainty
- the total idea/argument space being exponentially vast and underexplored due to human limitations, which makes high confidence unjustified
At risk of committing a Bulverism, I’ve noticed a tendency for people to see ethical bullet-biting as epistemically virtuous, like a demonstration of how rational/unswayed by emotion you are (biasing them toward overconfident bullet-biting). However, this makes less sense in ethics, where intuitions like repugnance are a large proportion of what everything is based on in the first place.
There’s also the thing that the idea/argument space contains dæmons/attractors exploiting shortcomings of human cognition, thus making humans hold them with higher confidence than they would if they didn’t have those limitations.
I find this contrast between “biting bullets” and “adopting more uncertainty” strange. The two seem orthogonal to me, as in, I’ve ~just as frequently (if not more often) observed people overconfidently endorse their pretheoretic philosophical intuitions, in opposition to bullet-biting.
In my experience, one learns the visceral sense that the space is dense with traps and spiders and poisonous things, and that what intuitively seems “basically sensible” often does not work. (I did some cryptography years ago.)
The structural similarity seems to be that there is a big difference between trying to do cryptography in a mode where you don’t assume that what you are doing is subject to adversarial pressure, and doing it in a mode where it should work even if someone tries to attack it. The first one is easy, breaks easily, and it’s unclear why you would even try to do it.
In metaethics, I think it is similarly easy to work in a mode where you don’t assume your conclusions will be applied to high-stakes, novel, or tricky situations, like AI alignment, computer minds, the multiverse, population ethics, anthropics, etc. The recommendations of normative ethical theories converge for many mundane situations, so anything works there, but then it was never really necessary to do metaethics in the first place.
By “metaethics,” do you mean something like “a theory of how humans should think about their values”?
I feel like I’ve seen that kind of usage on LW a bunch, but it’s atypical. In philosophy, “metaethics” has a thinner, less ambitious interpretation of answering something like, “What even are values, are they stance-independent, yes/no?”
And yeah, there is often a bit more nuance than that as you dive deeper into what philosophers in the various camps are exactly saying, but my point is that it’s not that common, and certainly not necessary, that “having confident metaethical views,” on the academic philosophy reading of “metaethics,” means something like “having strong and detailed opinions on how AI should go about figuring out human values.”
(And maybe you’d count this against academia, which would be somewhat fair, to be honest, because parts of “metaethics” in philosophy are even further removed from practicality, as they concern the analysis of the language behind moral claims. If we compare this to claims about the Biblical God and miracles, it would be like focusing way too much on whether the people who wrote the Bible thought they were describing real things or just metaphors, without directly trying to answer burning questions like “Does God exist?” or “Did Jesus live and perform miracles?”)
Anyway, I’m asking about this because I found the following paragraph hard to understand:
My best guess of what you might mean (low confidence) is the following:
You’re conceding that morality/values might be (to some degree) subjective, but you’re cautioning people from having strong views about “metaethics,” which you take to be the question of not just what morality/values even are, but also a bit more ambitiously: how to best reason about them and how to (e.g.) have AI help us think about what we’d want for ourselves and others.
Is that roughly correct?
Because if one goes with the “thin” interpretation of metaethics, then “having one’s own metaethics” could be as simple as believing some flavor of “morality/values are subjective,” and it feels like you, in the part I quoted, don’t sound like you’re too strongly opposed to just that stance in itself, necessarily.
I have also noticed that when you read the word “metaethics” on LessWrong, it can mean anything that is in some way related to morality.
Maybe I should take it upon myself to write a short essay on metaethics, how it differs from normative ethics, and why it may be of importance to AI alignment.
Please just write the standard library!
The problem is that we can’t. The closest thing we have is instead a collection of mutually exclusive ideas where at most one (possibly none) is correct, and we have no consensus as to which.
Okay (if possible), I want you to imagine I’m an AI system or similar and that you can give me resources in the context window that increase the probability of me making progress on problems you care about in the next 5 years. Do you have a reading list or similar for this sort of thing? (It seems hard to specify and so it might be easier to mention what resources can bring the ideas forth. I also recognize that this might be one of those applied knowledge things rather than a set of knowledge things.)
Also, if we take the cryptography lens seriously here, an implication might be that I should learn the existing off-the-shelf solutions in order to “not invent my own”. I do believe that there is no such thing as being truly agnostic to a metaphilosophy, since you’re somehow implicitly projecting your own biases onto the world.
I’m gonna make this personally applicable to myself, as that feels like more skin in the game and less like a general exercise.
There are a couple of contexts to draw from here:
Traditional philosophy (I’ve read the following):
A History of Western Philosophy
(Plato, Aristotle, Hume, Foucault, Spinoza, Russell, John Rawls, Dennett, Chalmers, and a bunch of other non-continental philosophers)
Eastern philosophy (I’ve read the following):
Buddhism, Daoism, Confucianism (mainly Tibetan Buddhism here)
Modern, more AI-related philosophy (I’ve read the following):
Yudkowsky, Bostrom
(Not philosophy at first glance, but): Michael Levin (Diverse Intelligence), some category theory (composition)
Which one is the one to double down on? How do they relate to learning more about metaethics? Where am I missing things within my philosophy education?
(I’m not sure this is a productive road to go down but I would love to learn more about how to learn more about this.)
This preempted my misunderstanding! Well done and thank you : )
I think that in philosophy in general and metaethics in particular, the idea that since many people disagree one should not be confident in one’s ideas is wrong.
I’ll somewhat carefully spell out why I think this; a lot of this reasoning is obvious, but the core claim is that the intuitions people use in philosophy in order to ground their arguments are often wrong in predictable ways.
“One man’s modus ponens is another man’s modus tollens” is usually what is at the core of ongoing philosophical disagreements. Suppose A⟹B is universally agreed, A is somewhat intuitive to everyone but the degree to which that intuition is compelling varies, and B is somewhat unintuitive to everyone but the degree to which that intuition is compelling varies.
Then if anyone is to take a side on whether B is true or A is false, they must decide which bullet is worse to bite.
Debate and thought experiments can attempt to present either bullet in a more appealing way, but in the end both propositions are confidently found unacceptable to at least some people.
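To spell out the structure symbolically (this is just a compact restatement of the two options above, not an additional premise):

$$\frac{A \Rightarrow B,\;\; A}{B}\ \ (\text{modus ponens})\qquad\text{vs.}\qquad\frac{A \Rightarrow B,\;\; \neg B}{\neg A}\ \ (\text{modus tollens})$$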
Now it is your job, observing this situation, to decide whether to be very uncertain about which bullet should be bitten, or to choose one to bite. How should you do it?
The answer is that you should ask how it came to be that there is a difference between the intuitions of the people who believe B is true and those of the people who say A is false. If you can understand the causes of those different intuitions, then you may be able to decide which (if any) of them can be trusted.
Consider metaethics. The problems of mind-independence, moral ontology, normativity, internalism vs. externalism, etc. can all be framed in this way, and very roughly for the sake of this comment only (hold your objections since I would treat this more carefully in a post), collapsed into the same problem:
A. All facts are ultimately natural or descriptive.
B. Nothing is really right or wrong, better or worse, independent of human attitudes or conventions.
Again avoiding a careful philosophical treatment, which we don’t have time for, I will just flag that the intuitions behind a philosopher’s objection to B are highly suspect, because they are the product of a particular human social structure which rewards strong beliefs about right and wrong.
I will admit that this explanation for objections to B is not fully satisfying to me, although it is conceivable that it should be. There may be some other explanations for the objection—if anyone has ideas, I’d love to hear them.
But it is hard for me to imagine a pathway by which the intuition that B is false comes about as a result of B actually being false, although positing intelligent design might do the trick.
Can we create a full list or map of the ideas, and after that add probabilities to each one?
Nice post; I guess I agree. I think it’s even worse, though: not only do at least some alignment researchers follow their own philosophy, which is not universally accepted, it’s also a particularly niche philosophy, and one that potentially leads to human extinction itself.
The philosophy in question is of course longtermism. Longtermism holds two controversial assumptions:
1. Symmetric population ethics: we have to create as much happy conscious life as possible. It’s not just about making people happy; it’s also about making happy people. In philosophy, and outside philosophy, most people think this is bonkers (I’m one of them).
2. Conscious AIs are morally relevant beings.
These two assumptions together lead to the conclusion that we must max out on creating conscious AIs, and that if these AIs end up in a resource conflict with humans (over e.g. energy, space, or matter), the AIs should be prioritized, since they can deliver the most happiness per joule, cubic meter, or kilogram. This leads to the extinction of all humans.
I don’t believe in ethical facts, so even an ideology as (imo) bonkers as this one is not objectively false, I believe. However, I would really like alignment researchers and their house philosophers (looking at you, MacAskill) to distance themselves from extrapolating this idea all the way to human extinction. Beyond that bare minimum, I would like alignment researchers to start accepting democratic input in general.
Maybe democracy is the library you were looking for?
It would be nice if those who are disagreeing said why they’re actually disagreeing.