If anyone wants to have a voice chat with me about a topic that I’m interested in (see my recent post/comment history to get a sense), please contact me via PM.
My main “claims to fame”:
Created the first general-purpose open-source cryptography programming library (Crypto++, 1995), motivated by AI risk and what’s now called “defensive acceleration”.
Published one of the first descriptions of a cryptocurrency based on a distributed public ledger (b-money, 1998), predating Bitcoin.
Proposed UDT, combining the ideas of updatelessness, policy selection, and evaluating consequences using logical conditionals.
First to argue for pausing AI development based on the technical difficulty of ensuring AI x-safety (SL4 2004, LW 2011).
Identified current and future philosophical difficulties as core AI x-safety bottlenecks, potentially insurmountable by human researchers, and advocated for research into metaphilosophy and AI philosophical competence as possible solutions.
Strongly agree that metaethics is a problem that should be central to AI alignment, but is being neglected. I actually have a draft about this, which I guess I’ll post here as a comment in case I don’t get around to finishing it.
Metaethics and Metaphilosophy as AI Alignment’s Central Philosophical Problems
I often talk about humans or AIs having to solve difficult philosophical problems as part of solving AI alignment, but which philosophical problems exactly? I’m afraid that some people might have gotten the impression that they’re relatively “technical” problems (in other words, problems whose solutions we can largely see the shapes of, but whose technical details still need to be worked out), like anthropic reasoning and decision theory, which we might reasonably assume or hope that AIs can help us solve. I suspect this is because, due to their relatively “technical” nature, they’re discussed more often on LessWrong and the AI Alignment Forum than other equally or even more relevant philosophical problems, which are harder to grapple with or “attack”. (I’m also worried that some are under the mistaken impression that we’re closer to solving these “technical” problems than we actually are, but that’s not the focus of the current post.)
To me, the really central problems of AI alignment are metaethics and metaphilosophy, because these problems are implicated in the core question of what it means for an AI to share a human’s (or a group of humans’) values, or what it means to help or empower a human (or group of humans). I think one way that the AI alignment community has avoided this issue (even those thinking about longer-term problems or scalable solutions) is by assuming that the alignment target is someone like themselves, i.e., someone who clearly understands that they are and should be uncertain about what their values are or should be, or is at least willing to question their moral beliefs, and is eager or at least willing to use careful philosophical reflection to resolve their value confusion/uncertainty. To help or align to such a human, the AI perhaps doesn’t need an immediate solution to metaethics and metaphilosophy, and can instead just empower the human in relatively commonsensical ways, like keeping them safe and gathering resources for them, and allowing them to work out their own values in a safe and productive environment.
But what about the rest of humanity who seemingly are not like that? From an earlier comment:
What are the real values of someone whose apparent values (stated and revealed preferences) can change in arbitrary and even extreme ways as they interact with other humans in ordinary life (i.e., not due to some extreme circumstances like physical brain damage or modification), and who doesn’t care about careful philosophical inquiry? What does it mean to “help” someone like this? To answer this, we seemingly have to solve metaethics (generally understand the nature of values) and/or metaphilosophy (so the AI can “do philosophy” for the alignment target, “doing their homework” for them). The default alternative (assuming we solve other aspects of AI alignment) seems to be to still empower them in straightforward ways, and hope for the best. But I argue that giving people who are unreflective and prone to value drift god-like powers to reshape the universe and themselves could easily lead to catastrophic outcomes on par with takeover by unaligned AIs, since in both cases the universe becomes optimized for essentially random values.
A related social/epistemic problem is that, unlike in certain other areas of philosophy (such as decision theory and object-level moral philosophy), people, including alignment researchers, just seem more confident about their own preferred solutions to metaethics, and comfortable assuming their preferred solution is correct as part of solving other problems, like AI alignment or strategy. (E.g., moral anti-realism is true, therefore empowering humans in straightforward ways is fine, since the alignment target can’t be wrong about their own values.) This may also account for metaethics not being viewed as a central problem in AI alignment (i.e., some people think it’s already solved).
I’m unsure about the root cause(s) of this confidence/certainty about metaethics being relatively common in AI safety circles. (Maybe it’s because in other areas of philosophy, the various proposed solutions are more obviously unfinished or problematic, e.g., the well-known problems with utilitarianism.) I’ve previously argued that metaethical confusion/uncertainty is normative at this point, and will also point out now that, from a social perspective, there is apparently wide disagreement about these problems among philosophers and alignment researchers, so how can it be right to assume some controversial solution (which every proposed solution to metaethics is at this point) as part of a specific AI alignment or strategy idea?