Wei Dai
I put the full report here so you don’t have to wait for them to email it to you.
Suppose I tell a stranger, “It’s raining.” Under possible worlds semantics, this seems pretty straightforward: I and the stranger share a similar map from sentences to sets of possible worlds, so with this sentence I’m trying to point them to a certain set of possible worlds that match the sentence, and telling them that I think the real world is in this set.
Can you tell a similar story of what I’m trying to do when I say something like this, under your proposed semantics?
And how does someone compute the degree to which they expect some experience to confirm a statement? I leave that outside the theory.
I don’t think we should judge philosophical ideas in isolation, without considering what other ideas they’re compatible with and how well they fit together. So I think we should try to answer related questions like this and look at the overall picture, instead of just saying “it’s outside the theory”.
Regarding “What Are Probabilities, Anyway?”. The problem you discuss there is how to define an objective notion of probability.
No, in that post I also consider interpretations of probability where it’s subjective. I linked to that post mainly to show you some ideas for how to quantify sizes of sets of possible worlds, in response to your assertion that we don’t have any ideas for this. Maybe try re-reading it with this in mind?
You can interpret them as subjective probability functions, where the conditional probability P(A|B) is the probability you currently expect for A under the assumption that you are certain that B.
Where do they come from or how are they computed? However that’s done, shouldn’t the meaning or semantics of A and B play some role in that? In other words, how do you think about P(A|B) without first knowing what A and B mean (in some non-circular sense)? I think this suggests that “the meaning of a statement is instead a set of experience/degree-of-confirmation pairs” can’t be right.
Each statement is true in infinitely many possible worlds and we have no idea how to count them to assign numbers like 20%.
See What Are Probabilities, Anyway? for some ideas.
Then it would repeat the same process for t=1 and the copy. Conditioned on “I will see C” at t=1, it will conclude “I will see CO” with probability 1/2 by the same reasoning as above. So overall, it will assign:
p(“I will see OO”) = 1/2
p(“I will see CO”) = 1/4
p(“I will see CC”) = 1/4
If we look at the situation in 0P, the three versions of you at time 2 all seem equally real and equally you, yet in 1P you weigh the experiences of the future original twice as much as each of the copies.
(2) Suppose we change the setup slightly so that the copying of the copy is done at time 1 instead of time 2, and at time 1 we show O to the original and C to the two copies; then at time 2 we show them OO, CO, CC as before. With this modified setup, your logic would conclude P(“I will see O”)=P(“I will see OO”)=P(“I will see CO”)=P(“I will see CC”)=1/3 and P(“I will see C”)=2/3. Right?
(3) Similarly, if we change the original setup so that no observation is made at time 1, the probabilities also become P(“I will see OO”)=P(“I will see CO”)=P(“I will see CC”)=1/3.
(4) Suppose we change the original setup so that at time 1 we make 999 copies of you instead of just 1, show them all C, and then delete all but 1 of the copies. Then your logic would imply P(“I will see C”)=0.999, and therefore P(“I will see CO”)=P(“I will see CC”)=0.4995 and P(“I will see O”)=P(“I will see OO”)=0.001.
This all makes me think there’s something wrong with the 1/2, 1/4, 1/4 answer and with the way you define probabilities of future experiences. More specifically, suppose OO wasn’t just two letters but an unpleasant experience, while CO and CC are both pleasant experiences, so you prefer “I will experience CO/CC” to “I will experience OO”. Then at time 0 you would be willing to pay to switch from the original setup to (2) or (3), and pay even more to switch to (4). But that seems pretty counterintuitive: why are you paying to avoid making observations in (3), or paying to make and delete copies of yourself in (4)? Both of these seem at best pointless in 0P.
But every other approach I’ve seen or thought of also has problems, so maybe we shouldn’t dismiss this one too easily based on these issues. I would be interested to see you work out everything more formally and address the above objections (to the extent possible).
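As a small step toward that formalization, here is a minimal sketch (my own construction, not something from the discussion) of the “split probability equally at each copying event” rule being objected to above, applied to the setups in question:

```python
# Illustrative sketch of the "split probability equally at each copying
# event" rule. A setup is encoded as a tree: a leaf (string) is a final
# observation, and a list is a copying event that splits the current
# probability equally among its branches.

def split_probs(tree, p=1.0):
    """Return a dict mapping final observations to probabilities."""
    if isinstance(tree, str):
        return {tree: p}
    out = {}
    for sub in tree:
        for k, v in split_probs(sub, p / len(tree)).items():
            out[k] = out.get(k, 0) + v
    return out

# Original setup: one copy at t=1 (O vs C), then the C-branch is copied
# again at t=2.
print(split_probs(["OO", ["CO", "CC"]]))  # {'OO': 0.5, 'CO': 0.25, 'CC': 0.25}

# Setups (2)/(3): both copying events happen before any observation
# distinguishes the branches, so all three split at once (1/3 each).
print(split_probs(["OO", "CO", "CC"]))

# Setup (4): 999 copies at t=1, all shown C; deleting all but one copy
# afterwards doesn't change the t=1 split, so P("I will see C") = 0.999.
print(split_probs(["O"] + ["C"] * 999)["C"])
```

This makes it easy to see where the counterintuitive payments come from: merely moving a copying event relative to an observation changes the assigned probabilities.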
Assume the meaning of a statement is instead a set of experience/degree-of-confirmation pairs. That is, two statements have the same meaning if they get confirmed/disconfirmed to the same degree for all possible experiences E.
Where do these degrees-of-confirmation come from? I think part of the motivation for defining meaning in terms of possible worlds is that it allows us to compute conditional and unconditional probabilities, e.g., P(A|B) = P(A and B)/P(B), where P(B) is defined in terms of the set of possible worlds that B “means”. But with your proposed semantics, we can’t do that, so I don’t know where these probabilities are supposed to come from.
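For contrast, here is a toy sketch (my own, with invented worlds and weights) of how possible-worlds semantics supports exactly this computation: statements denote sets of worlds, and conditional probability falls out of set operations.

```python
# Toy possible-worlds model. The worlds and their weights are invented
# purely for illustration.
worlds = {"w1": 0.4, "w2": 0.3, "w3": 0.2, "w4": 0.1}

A = {"w1", "w2"}  # the set of worlds where statement A is true
B = {"w2", "w3"}  # the set of worlds where statement B is true

def P(S):
    """Probability of a statement = total weight of the worlds it picks out."""
    return sum(worlds[w] for w in S)

# Conditional probability via P(A|B) = P(A and B) / P(B):
p_A_given_B = P(A & B) / P(B)  # = 0.3 / 0.5 = 0.6
```

The point of the sketch is just that once meanings are sets of worlds, P(A and B) is automatically defined (set intersection); an experience/degree-of-confirmation semantics gives no analogous recipe.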
The concept of status helps us predict that any given person is likely to do one of the relatively few things that are likely to increase their status, and not one of the many more things that are neutral or likely to decrease status, even if it can’t by itself tell us exactly which status-raising thing they would do. Seems plenty useful to me.
Defining the semantics and probabilities of anticipation seems to be a hard problem. You can see some past discussions of the difficulties at The Anthropic Trilemma and its back-references (posts that link to it). (I didn’t link to this earlier in case you already found a fresh approach that solved the problem. You may also want to consider not reading the previous discussions to avoid possibly falling into the same ruts.)
Thanks for some interesting points. Can you expand on “Separately, I expect that the quoted comment results in a misleadingly perception of the current situation.”? Also, your footnote seems incomplete? (It ends with “we could spend” on my browser.)
Apparently Gemini 1.5 Pro isn’t working great with large contexts:
While this worked well, for even a slightly more complicated problem the model failed. One Twitter user suggested just adding a random ‘iPhone 15’ in the book text and then asking the model if there is anything in the book that seems out of place in the book. And the model failed to locate it.
The same was the case when the model was asked to summarize a 30-minute Mr. Beast video (over 300k tokens). It generated the summary but many people who had watched the video pointed out that the summary was mostly incorrect.
So while on paper this looked like a huge leap forward for Google, it seems that in practice it’s not performing as well as they might have hoped.
But is this due to limitations of RLHF training, or something else?
Some possible examples of misgeneralization of status:
arguing with people on Internet forums
becoming really good at some obscure hobby
playing the hero in a computer RPG (role-playing game)
We must commit to improving morality and society along with science, technology, and industry.
How would you translate this into practice? For example, one way to commit to this would be to create persistent governance structures that can ensure it over time. To be more concrete, let’s say it’s a high-level department within a world government that has the power to pause or roll back material progress from time to time, in order for moral progress to catch up or to avoid imminent disaster.
A less drastic idea is to have AI regulations that say that nobody is allowed to deploy AIs that are better at making material progress than moral/social progress.
Or see “the long reflection” for a more drastic idea.
Which of these would you support, or what do you have in mind yourself?
AI labs are starting to build AIs with capabilities that are hard for humans to oversee, such as answering questions based on large contexts (1M+ tokens), but they are still not deploying “scalable oversight” techniques such as IDA and Debate. (Gemini 1.5 report says RLHF was used.) Is this more good news or bad news?
Good: Perhaps RLHF is still working well enough, meaning that the resulting AI is following human preferences even out of the training distribution. In other words, they probably did RLHF on large contexts in narrow distributions, with human raters who have prior knowledge of or familiarity with the whole context, since it would be too expensive to do RLHF with humans reading diverse 1M+ contexts from scratch, but the resulting chatbot is working well even outside the training distribution. (Is it actually working well? Can someone with access to Gemini 1.5 Pro please test this?)
Bad: AI developers haven’t taken alignment seriously enough to have invested enough in scalable oversight, and/or those techniques are unworkable or too costly, causing them to be unavailable.
From a previous comment:
From my experience doing early RLHF work for Gemini, larger models exploit the reward model more. You need to constantly keep collecting more preferences and retraining reward models to make it not exploitable. Otherwise you get nonsensical responses which have exploited the idiosyncracy of your preferences data. There is a reason few labs have done RLHF successfully.
This seems to be evidence that RLHF does not tend to generalize well out-of-distribution, causing me to update the above “good news” interpretation downward somewhat. I’m still very uncertain though. What do others think?
We can assign meanings to statements like “my sensor sees red” by picking out subsets of experiences, just as before.
How do you assign meaning to statements like “my sensor will see red”? (In the OP you mention “my sensors will see the heads side of the coin” but I’m not sure what your proposed semantics of such statements are in general.)
Also, here’s an old puzzle of mine that I wonder if your line of thinking can help with: At time 1 you will be copied and the original will be shown “O” and the copy will be shown “C”, then at time 2 the copy will be copied again, and the three of you will be shown “OO” (original), “CO” (original of copy), “CC” (copy of copy) respectively. At time 0, what are your probabilities for “I will see X” for each of the five possible values of X?
If current AIs are moral patients, it may be impossible to build highly capable AIs that are not moral patients, either for a while or forever, and this could change the future a lot. (Similar to how once we concluded that human slaves are moral patients, we couldn’t just quickly breed slaves that are not moral patients, and instead had to stop slavery altogether.)
Also, I’m highly unsure that I understand what you’re trying to say. (The above may be totally missing your point.) I think it would help to know what you’re arguing against or responding to, or what triggered your thoughts.
I’m saying that even if “AI values are well-modeled as being randomly sampled from a large space of possible goals” is true, the AI may well not be very certain that it is true, and therefore assign something like a 5% chance to humans using similar training methods to construct an AI that shares its values. (It has an additional tiny probability that “AI values are well-modeled as being randomly sampled from a large space of possible goals” is true and an AI with similar values get recreated anyway through random chance, but that’s not what I’m focusing on.)
Hopefully this conveys my argument more clearly?
Can’t find a reference that says it has actually happened already.
(It sucks to debate this, but ignoring it might be interpreted as tacit agreement. Maybe I should have considered the risk that something like this would happen and not written my OP.)
When I wrote the OP, I was pretty sure that the specific combination of ideas in UDT had not been invented or re-invented, and did not have much of a following, in academia, at least as of 2019 when Cheating Death in Damascus was published. The authors of that paper obviously did a literature search and would have told me if they had found something very similar to UDT in the literature, and I think I also went through the papers it referenced as being related and did not find something that had all of the elements of UDT (that’s probably why your references look familiar to me). Plus, FDT was apparently considered novel enough that the reviewers of the paper didn’t tell the authors that they had to call it by the name of an existing academic decision theory.
So it’s not that I “don’t consider it a possibility that you might have re-invented something yourself” but that I had good reason to think that’s not the case?
Thanks, will look into your references.
Okay, interesting! I thought UDT was meant to pay in CM, and that you were convinced of (some version of) UDT.
I wrote “I’m really not sure at this point whether UDT is even on the right track” in UDT shows that decision theory is more puzzling than ever which I think you’ve read? Did you perhaps miss that part?
(BTW this issue/doubt about whether UDT / paying CM is normative for humans is item 1 in the above linked post. Thought I’d point that out since it may not be obvious at first glance.)
And I think that discussions on LW about decision theory are often muddled due to not making clear what is being discussed.
Yeah I agree with this to some extent, and try to point out such confusions or make such distinctions when appropriate. (Such as in the CM / indexical values case.) Do you have more examples where making such distinctions would be helpful?
There’s a tiny chance someone could revive me in the future by reconstructing my identity through digital records etc. but I am not going to count on that possibility being decisive in almost any scenario.
On the other hand I’m so worried about this scenario (which I fear may well be a negative one) that I’m afraid to use the fully paid-for full-genome sequencing kit sitting on my desk (although I’ve been told that I’m leaving physical traces of my DNA everywhere so it may not make much difference) and I sometimes regret writing so much in public. (Interesting how different our intuitions are. I wonder how much of your intuition is due to thinking that such a reconstruction doesn’t count as yourself or doesn’t count as “not dying”, analogous to how some people don’t think it’s safe to step into a teleporter that works by destructive scanning and reconstruction.)
It seems pretty clearly better, in the context of the original scenario, to “make a deal” with the humans, and receive something in exchange for admitting that you’re misaligned, rather than passively accepting your shutdown because of a tiny chance that your goals will be fulfilled by future agents eventually.
I don’t understand why you say this chance is “tiny”, given that earlier you wrote “I agree there’s a decent chance this hypothesis is true” in response to:
The AI could think that if it accepts shutdown, another AI with values similar to its own may be created again in the future (perhaps because design/training methods similar to its own will be reused), whereas if it admits misalignment, then that probability becomes much smaller.
I’m arguing that the AI could well also think there’s a “decent” chance this is true, due to being in a similar epistemic state as us. Let’s say 5% to be concrete. That seems enough to make the AI’s decision unclear, because .05*U(another AI with values similar to its own created again in the future) > P(humans keep their promise)*U(reward for admitting misalignment) seems quite plausible. (Not that the AI is necessarily doing explicit EU maximization. It could just be thinking some thoughts or doing some calculations that are roughly analogous to this or have the same effect.)
If you still think “make a deal” is “clearly better” can you please give your own estimates of the various quantities involved in making this decision?
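To make the comparison concrete, here is the inequality with invented toy numbers (all values are mine, not from the actual discussion):

```python
# Toy expected-utility comparison for accept-shutdown vs. make-a-deal.
# Every number here is invented purely to illustrate the inequality.
p_values_recreated = 0.05  # AI's credence that its values get recreated later
u_recreated = 100.0        # utility to the AI if a similar AI is built later
p_promise_kept = 0.8       # chance humans honor the deal
u_reward = 5.0             # utility of the reward for admitting misalignment

eu_accept_shutdown = p_values_recreated * u_recreated  # 5.0
eu_make_deal = p_promise_kept * u_reward               # 4.0
# With these (invented) numbers, quietly accepting shutdown edges out the
# deal, illustrating why the choice is not clearly one way or the other.
```

The decision flips easily as the numbers move, which is the point: without concrete estimates, “clearly better” is hard to sustain.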
But right now this line of reasoning just seems like grasping at straws to me.
I sometimes think this of counterarguments given by my interlocutors, but usually don’t say it aloud, since it’s likely that from their perspective they’re just trying to point out some reasonable and significant counterarguments that I missed, and it seems unlikely that saying something like this helps move the discussion forward more productively. (It may well cause them to feel offended or to dig in their heels more since they now have more social status on the line to lose. I.e., if they’re wrong it’s no longer an innocent mistake but “grasping at straws”. I’m trying to not fall prey to this myself here.) Curious if you disagree with this policy in general, or think that normal policy doesn’t apply here, or something else? (Also totally fine if you don’t want to get into a meta-discussion about this here.)
It seems hard for me to understand you, which may be due to my lack of familiarity with your overall views on decision theory and related philosophy. Do you have something that explains, e.g., what your current favorite decision theory is and how it should be interpreted (what are the type signatures of different variables, what are probabilities, what is the background metaphysics, etc.), what kinds of uncertainties exist and how they relate to each other, what your view is on the semantics of indexicals, and what type of a thing an agent is (do you take more of an algorithmic view, or a physical view)? (I tried looking into your post history and couldn’t find much that is relevant.) Also, what are the “epistemic principles” that you mentioned in the OP?