Wei Dai

Karma: 41,778

If anyone wants to have a voice chat with me about a topic that I’m interested in (see my recent post/comment history to get a sense), please contact me via PM.

My main “claims to fame”:

Created the first general purpose open source cryptography programming library (Crypto++, 1995).
Published one of the first descriptions of a cryptocurrency based on a distributed public ledger (b-money, 1998), predating Bitcoin.
Proposed UDT, combining the ideas of updatelessness, policy selection, and evaluating consequences using logical conditionals.
First to argue for pausing AI development based on the technical difficulty of ensuring AI x-safety (SL4 2004, LW 2011).
Identified current and future philosophical difficulties as core AI x-safety bottlenecks, potentially insurmountable by human researchers, and advocated for research into metaphilosophy and AI philosophical competence as possible solutions.

My Home Page

Wei Dai 5 Oct 2025 1:35 UTC
4 points
2
in reply to: Thomas Kwa’s comment on: Thomas Kwa’s Shortform
Quantum theory and simulation arguments both suggest that there are many copies of myself in the multiverse. From a first person subjective anticipation perspective, experiencing death as nothingness seems impossible so it seems like I should either anticipate my subjective experience continuing as one of the surviving copies, or the whole concept of subjective anticipation is confused. From a third person / God’s view, death can be thought of some of the copies being destroyed or a reduction in my “measure”, but I don’t seem to fear this, just as I didn’t jump in joy to learn about having a huge number of copies in the first place. The situation seems too abstract or remote or foreign to trigger my fear (or joy) response.

Wei Dai 4 Oct 2025 23:33 UTC
4 points
0
in reply to: rahulxyz’s comment on: Cole Wyeth’s Shortform
If it became common to demand and check proofs of (human) work, there will be a strong incentive to use AI to generate such proofs, which doesn’t not seem very hard to do.

Wei Dai 4 Oct 2025 23:29 UTC
2 points
−2
in reply to: O O’s comment on: Wei Dai’s Shortform

What motive does a centralized dominant power have to allow any progress?

A culture/ideology that says the ruler is supposed to be benevolent and try to improve their subjects’ lives, which of course was not literally followed, but would make it hard to fully suppress things that could clearly make people’s lives better, like many kinds of technological progress. And historically, AFAIK few if any of the Chinese emperors tried to directly suppress technological innovation, they just didn’t encourage it like the West did, through things like patent laws and scientific institutions.

The entire world would likely look more like North Korea.

Yes, directionally it would look more like North Korea, but I think the controls would not have to be as total or harsh, because there is less of a threat that outside ideas could rush in and overturn the existing culture/ideology the moment you let your guard down.

Wei Dai 4 Oct 2025 22:58 UTC
2 points
0
in reply to: Thomas Kwa’s comment on: Thomas Kwa’s Shortform

We can do adversarial training against other AIs, but ancestral humans didn’t have to contend with animals whose goal was to trick them into not reproducing by any means necessary

We did have to contend with memes that tried to hijack our minds to spread them horizontally (as opposed to vertically, by having more kids), but unfortunately (or fortunately) such “adversarial training” wasn’t powerful enough to instill a robust desire to maximize reproductive fitness. Our adversarial training for AI could also be very limited compared to the adversaries or natural distributional shifts the AI will face in the future.

Our fear of death is therefore much more robust than our desire to maximize reproductive fitness

My fear of death has been much reduced after learning about ideas like quantum immortality and simulation arguments, so it doesn’t seem that much more robust. Its apparent robustness in others looks like an accidental effect of most people not paying attention or being able to fully understand such ideas, which does not seem to have a relevant analogy for AI safety.

Wei Dai 4 Oct 2025 22:30 UTC
29 points
29
in reply to: Cole Wyeth’s comment on: Cole Wyeth’s Shortform
I think extensive use of LLM should be flagged at the beginning of a post, but “uses an LLM in any part of its production process whatsoever” would probably result in the majority of posts being flagged and make the flag useless for filtering. For example I routinely use LLMs to check my posts for errors (that the LLM can detect), and I imagine most other people do so as well (or should, if they don’t already).

Unfortunately this kind of self flagging/reporting is ultimately not going to work, as far as individually or societally protecting against AI-powered manipulation, and I doubt there will be a technical solution (e.g. AI content detector or other kind of defense) either (short of solving metaphilosophy). I’m not sure it will do more good than harm even in the short run because it can give a false sense of security and punish the honest / reward the dishonest, but still lean towards trying to establish “extensive use of LLM should be flagged at the beginning of a post” as a norm.

Wei Dai 4 Oct 2025 4:03 UTC
7 points
0
in reply to: Raemon’s comment on: Wei Dai’s Shortform
It’s based on the idea that Keju created a long-term selective pressure for intelligence.
- The exams selected for heritable cognitive traits.
- Success led to positions in the imperial government, and therefore power and wealth.
- Power and wealth allowed for more wives, concubines, food, resources, and many more surviving children than the average person, which was something many Chinese consciously aimed for. (Note that this is very different from today’s China or the West, where cultural drift/evolution has much reduced or completely eliminated people’s desires to translate wealth into more offspring.)

Wei Dai 4 Oct 2025 3:31 UTC
1 point
0
in reply to: testingthewaters’s comment on: Wei Dai’s Shortform
(The following is written by AI (Gemini 2.5 Pro) but I think it correctly captured my position.)

You’re right to point out that I’m using a highly stylized and simplified model of “Chinese civilization.” The reality, with its dynastic cycles, periods of division, and foreign rule, was far messier and more brutal than my short comment could convey.

My point, however, isn’t about a specific, unbroken political entity. It’s about a civilizational attractor state. The remarkable thing about the system described in “Romance of the Three Kingdoms” is not that it fell apart, but that it repeatedly put itself back together into a centralized, bureaucratic, agrarian empire, whereas post-Roman Europe fragmented permanently. Even foreign conquerors like the Manchus were largely assimilated by this system, adopting its institutions and governing philosophy (the “sinicization” thesis).

Regarding the Keju, the argument isn’t for intentional eugenics, but a de facto one. The mechanism is simple: if (1) success in the exams correlates with heritable intelligence, and (2) success confers immense wealth and reproductive opportunity (e.g., supporting multiple wives and children who survive to adulthood), then over a millennium you have created a powerful, systematic selective pressure for those traits.

The core of the thought experiment remains: is a civilization that structurally, even if unintentionally, prioritizes stability and slow biological enhancement over rapid, disruptive technological innovation better positioned to handle long-term existential risks?

Wei Dai 4 Oct 2025 0:18 UTC
37 points
−6
on: Wei Dai’s Shortform
Maybe Chinese civilization was (unintentionally) on the right path: discourage or at least don’t encourage technological innovation but don’t stop it completely, run a de facto eugenics program (Keju, or Imperial Examination System) to slowly improve human intelligence, and centralize control over governance and culture to prevent drift from these policies. If the West hadn’t jumped the gun with its Industrial Revolution, by the time China got to AI, human intelligence would be a lot higher and we might be in a much better position to solve alignment.

This was inspired by @dsj’s complaint about centralization, using the example of it being impossible for a centralized power or authority to deal with the Industrial Revolution in a positive way. The contrarian in my mind piped up with “Maybe the problem isn’t with centralization, but with the Industrial Revolution!” If the world had more centralization, such that the Industrial Revolution never started in an uncontrolled way, perhaps it would have been better off in the long run.

One unknown is what would the trajectory of philosophical progress look like in this centralized world, compared to a more decentralized world like ours. The West seems to have better philosophy than China, but it’s not universal (e.g. analytical vs Continental philosophy). (Actually “not universal” is a big understatement given how little attention most people pay to good philosophy, aside from a few exceptional bubbles like LW.) Presumably in the centralized world there is a strong incentive to stifle philosophical progress (similar to China historically), for the sake of stability, but what happens when average human IQ reaches 150 or 200?

Wei Dai 3 Oct 2025 23:42 UTC
2 points
0
in reply to: Max Harms’s comment on: Max Harms’s Shortform
Have you seen/read my A broad basin of attraction around human values?

Wei Dai 3 Oct 2025 0:17 UTC
2 points
0
in reply to: Buck’s comment on: Christian homeschoolers in the year 3000
Yeah I think this outcome is quite plausible, which is in part why I only claimed “some hope”. But
1. It’s also quite plausible that it won’t be like that, for example maybe a good solution to meta-philosophy will be fairly attractive to everyone despite invalidating deeply held object-level beliefs, or it only clearly invalidates such beliefs after being applied with a lot of time/compute, which won’t be available yet so people won’t reject the meta-philosophy based on such invalidations.
2. “What should be done if some/many people do reject the meta-philosophy based on it invalidating their beliefs?” is itself a philosophical question which the meta-philosophy could directly help us answer by accelerating philosophical progress, and/or that we can better answer after having a firmer handle on the nature of philosophy and therefore the ethics of changing people’s philosophical beliefs. Perhaps the conclusion will be that symmetrical persuasion tactics, or centrally imposed policies, are justified in this case. Or maybe we’ll use the understanding to find more effective asymmetrical or otherwise ethical persuasion tactics.
Basically my hope is that things become a lot clearer after we have a better understanding of metaphilosophy, as it seems to be a major obstacle to determining what should be done about the kind of problem described in the OP. I’m still curious whether you have any other solutions or approaches in mind.

Wei Dai 2 Oct 2025 22:56 UTC
4 points
0
in reply to: Ajeya Cotra’s comment on: Christian homeschoolers in the year 3000
I mean greater certainty/clarity than our current understanding of mathematical reasoning, which seems to me far from complete (e.g., realism vs formalism is unsettled, what is the deal with Berry’s paradox, etc). By the time we have a good meta-philosophy, I expect our philosophy of math will be much improved too.

If there is not a good meta-philosophy to find even in the sense of matching/exceeding our current level of understanding of mathematical reasoning, which I think is plausible, but it would be a seemingly very strange and confusing state of affairs, as it would mean in that in all or most fields of philosophy there is no objective or commonly agreed way to determine good how an argument is, or whether some statement is true or false, even given infinite compute or subjective time, including fields that seemingly should have objective answers like philosophy of math or meta-ethics. (Lots of people claim that morality is subjective, but almost nobody claims that “morality is subjective” is itself subjective!)

If after lots and lots of research (ideally with enhanced humans), we just really can’t find a good meta-philosophy, I would hope that we can at least find some clues as to why this is the case, or some kind of explanation that makes the situation less confusing, and then use those clues to guide us as to what to do next, as far as how to handle super-persuasion, etc.

Wei Dai 2 Oct 2025 11:52 UTC
21 points
−1
on: Ethical Design Patterns
IMO, it’s hard to get a consensus for Heuristic C at the moment even though it kind of seems obvious.

Consider that humanity couldn’t achieve a consensus around banning or not using cigarettes, leaded gasoline, or ozone-destroying chemicals, until they had done a huge amount of highly visible damage. There must have been plenty of arguments about their potential danger based on established science, and clear empirical evidence of the damage that they actually caused, far earlier, but such consensus still failed to form until much later, after catastrophic amounts of damage had already been caused. The consensus against drunk driving also only formed after extremely clear and undeniable evidence about its danger (based on accident statistics) became available.

I’m skeptical that more intentionally creating ethical design patterns could have helped such consensus form earlier in those cases, or in the case of AI x-safety, as it just doesn’t seem to address the main root causes or bottlenecks for the lack of such consensus or governance failures, which IMO are things like:
1. natural diversity of human opinions, when looking at the same set of arguments/evidence
2. lack of extremely clear/undeniable evidence of harm
3. democracy’s natural difficulties around concentrated interests imposing diffused harms (due to “rational ignorance” of voters and collective action problems)
Something that’s more likely to work is “persuasion design patterns”, like what helped many countries pass anti-GMO legislation despite lack of clear scientific evidence for their harm, but I think we’re all loathe to use such tactics.

Wei Dai 24 Sep 2025 11:42 UTC
5 points
0
in reply to: Buck’s comment on: Buck’s Shortform
I’ve been reading a lot of web content, including this post, after asking my favorite LLM^[1] to “rewrite it in Wei Dai’s style” which I find tends to make it shorter and easier for me to read, while still leaving most of the info intact (unlike if I ask for a summary). Before I comment, I’ll check the original to make sure the AI’s version didn’t miss a key point (or read the original in full if I’m sufficiently interested), and also ask the AI to double-check that my comment is sensible.
1. ↩︎
  currently Gemini 2.5 Pro because it’s free through AI Studio, and the rate limit is high enough that I’ve never hit it

Wei Dai 18 Sep 2025 6:01 UTC
6 points
0
in reply to: Cole Wyeth’s comment on: Alignment as uploading with more steps
Thanks for the suggested readings.

I’m trying not to die here.

There are lots of ways to cash out “trying not to die”, many of which imply that solving AI alignment (or getting uploaded) isn’t even the most important thing. For instance under theories of modal or quantum immortality, dying is actually impossible. Or consider that most copies of you in the multiverse or universe are probably living in simulations of Earth rather than original physical entities, so the most important thing from a survival-defined-indexically perspective may be to figure out what the simulators want, or what’s least likely to cause them to want to turn off the simulation or most likely to “rescue” you after you die here. Or, why aim for a “perfectly aligned” AI instead of one that cares just enough about humans to keep us alive in a comfortable zoo after the Singularity (which they may already do by default because of acausal trade, or maybe the best way to ensure this is to increase the cosmic resources available to aligned AI so they can do more of this kind of trade)?

And because I don’t believe in “correct” values.

The above was in part trying to point out that even something like not wanting to die is very ill defined, so if there are no correct values, not even relative to a person or a set of initial fuzzy non-preferences, then that’s actually a much more troubling situation then you seem to think.

I don’t know how to build a safe philosophically super-competent assistant/oracle

That’s in part why I’d want to attempt this only after a long pause (i.e. at least multi decades) to develop the necessary ideas, and probably only after enhancing human intelligence.

Wei Dai 18 Sep 2025 5:20 UTC
14 points
6
on: Christian homeschoolers in the year 3000
I’ve been talking about the same issue in various posts and comments, most prominently in Two Neglected Problems in Human-AI Safety. It feels like an obvious problem that (confusingly) almost no one talks about, so it’s great to hear another concerned voice.

A potential solution I’ve been mooting is “metaphilosophical paternalism”, or having AI provide support and/or error correction for humans’ philosophical reasoning, based on a true theory of metaphilosophy (i.e., understanding of what philosophy is and what constitutes correct philosophical reasoning), to help them defend against memetic attacks and internal errors. So this is another reason I’ve been advocating for research into metaphilosophy, and for pausing AI (presumably for at least multiple decades) until metaphilosophy (and not just AI alignment, unless broadly defined to imply a solution to this problem) can be solved.

On your comment about “centrally enforced policy” being “kind of fucked up and illiberal”, I think there is some hope that given enough time and effort, there can be a relatively uncontroversial solution to metaphilosophy^[1], that most people can agree on at the end of the AI pause so central enforcement wouldn’t be needed. Failing that, perhaps we should take a look at what the metaphilosophy landscape looks like after a lot of further development, and then collectively make a decision on how to proceed.

I’m curious if this addresses your concern, or if you see a differently shaped potential solution.
1. ↩︎
  similar to how there’s not a huge amount of controversy today about what constitutes correct mathematical or scientific reasoning, although I’d want to aim for even greater certainty/clarity than that

Wei Dai 17 Sep 2025 23:56 UTC
4 points
4
in reply to: p’s comment on: How To Dress To Improve Your Epistemics
Seems like a good question to prompt AI with. Here’s what I got from Gemini 2.5 Pro.

Wei Dai 17 Sep 2025 23:19 UTC
LW: 4 AF: 3
0
AF
in reply to: Cole Wyeth’s comment on: Alignment as uploading with more steps

Why is it a breaking issue if some uploads don’t work out exactly what they “should” want? This is already true for many people.

I’m scared of people doing actively terrible things with the resources of entire stars or galaxies at their disposal (a kind of s-risk), and concerned about wasting astronomical potential (if they do something not terrible but just highly suboptimal). See Morality is Scary and Two Neglected Problems in Human-AI Safety for some background on my thinking about this.

At worst it just requires that the initial few batches of uploads are carefully selected for philosophical competence (pre-upload) so that some potential misconception is not locked in.

This would relieve the concern I described, but bring up other issues, like being opposed by many because the candidates’ values/views are not representative of humanity or themselves. (For example philosophical competence is highly correlated with or causes atheism, making it highly overrepresented in the initial candidates.)

I was under the impression that your advocated plan is to upload everyone at the same time (or as close to that as possible), otherwise how could you ensure that you personally would be uploaded, i.e. why would the initial batches of uploads necessarily decide to upload everyone else, once they’ve gained power. Maybe I should have clarified this with you first.

My own “plan” (if you want something to compare with) is to pause AI until metaphilosophy is solved in a clear way, and then build some kind of philosophically super-competent assistant/oracle AI to help fully solve alignment and the associated philosophical problems. Uploading carefully selected candidates also seems somewhat ok albeit a lot scarier (due to “power corrupts”, or selfish/indexical values possibly being normative or convergent) if you have a way around the social/political problems.

better understood through AIT and mostly(?) SLT

Any specific readings or talks you can recommend on this topic?

Wei Dai 17 Sep 2025 21:53 UTC
LW: 4 AF: 3
0
AF
in reply to: Cole Wyeth’s comment on: Alignment as uploading with more steps

I think 4 is basically right

Do you think it’s ok to base an AI alignment idea/plan on a metaethical assumption, given that there is a large spread of metaethical positions (among both amateur and professional philosophers) and it looks hard to impossible to resolve or substantially reduce the disagreement in a relevant timeframe? (I noted that the assumption is weightbearing, since you can arrive at an opposite conclusion of “non-upload necessity” given a different assumption.)

(Everyone seems to do this, and I’m trying to better understand people’s thinking/psychology around it, not picking on you personally.)

I suppose that a pointer to me is probably a lot simpler than a description/model of me, but that pointer is very difficult to construct, whereas I can see how to construct a model using imitation learning (obviously this is a “practical” consideration).

Not sure if you can or want to explain this more, but I’m pretty skeptical, given that distributional shift / OOD generalization has been a notorious problem for ML/DL (hence probably not neglected), and I haven’t heard of much theoretical or practical progress on this topic.

Also, the model of me is then the thing that becomes powerful, which satisfies my values much more than my values can be satisfied by an external alien thing rising to power (unless it just uploads me right away I suppose).

What about people whose values are more indexical (they want themselves to be powerful/smart/whatever, not a model/copy of them), or less personal (they don’t care about themselves or a copy being powerful, they’re fine with an external Friendly AI taking over the world and ensuring a good outcome for everyone)?

I’m not sure that even an individual’s values always settle down into a unique equilibrium, I would guess this depends on their environment.

Yeah, this is covered under position 5 in the above linked post.

unrelatedly, I am still not convinced we live in a mathematical multiverse

Not completely unrelated. If this is false, and an ASI acts as if it’s true, then it could waste a lot of resources e.g. doing acausal trading with imaginary counterparties. And I also don’t think uncertainty about this philosophical assumption can be reduced much in a relevant timeframe by human philosophers/researchers, so safety/alignment plans shouldn’t be built upon it either.

Wei Dai 16 Sep 2025 22:53 UTC
LW: 4 AF: 3
0
AF
on: Alignment as uploading with more steps
Definition (Strong upload necessity). It is impossible to construct a perfectly aligned successor that is not an emulation. [...] In fact, I think there is a decent chance that strong upload necessity holds for nearly all humans
What’s the main reason(s) that you think this? For example one way to align an AI^[1] that’s not an emulation was described in Towards a New Decision Theory: “we’d need to program the AI with preferences over all mathematical structures, perhaps represented by an ordering or utility function over conjunctions of well-formed sentences in a formal set theory. The AI will then proceed to “optimize” all of mathematics, or at least the parts of math that (A) are logically dependent on its decisions and (B) it can reason or form intuitions about.” Which part is the main “impossible” thing in your mind, “how to map fuzzy human preferences to well-defined preferences” or creating an AI that can optimize the universe according to such well-defined preferences?
I currently suspect it’s the former, and it’s because of your metaethical beliefs/credences. Consider these 2 metaethical positions (from Six Plausible Meta-Ethical Alternatives):
- 3 There aren’t facts about what everyone should value, but there are facts about how to translate non-preferences (e.g., emotions, drives, fuzzy moral intuitions, circular preferences, non-consequentialist values, etc.) into preferences. These facts may include, for example, what is the right way to deal with ontological crises. The existence of such facts seems plausible because if there were facts about what is rational (which seems likely) but no facts about how to become rational, that would seem like a strange state of affairs.
- 4 None of the above facts exist, so the only way to become or build a rational agent is to just think about what preferences you want your future self or your agent to hold, until you make up your mind in some way that depends on your psychology. But at least this process of reflection is convergent at the individual level so each person can reasonably call the preferences that they endorse after reaching reflective equilibrium their morality or real values.
If 3 is true, then we can figure out and use the “facts about how to translate non-preferences into preferences” to “map fuzzy human preferences to well-defined preferences” but if 4 is true, then running the human as an emulation becomes the only possible way forward (as far as building an aligned agent/successor). Is this close to what you’re thinking?
I also want to note that if 3 (or some of the other metaethical alternatives) is true, then “strong non-upload necessity”, i.e. that it is impossible to construct a perfectly aligned successor that is an emulation, becomes very plausible for many humans, because an emulation of a human might find it impossible to make the necessary philosophical progress to figure out the correct normative facts about how to turn their own “non-preferences” into preferences, or simply don’t have the inclination/motivation to do this.
1. ^
  which I don’t endorse as something we should currently try to do, see Three Approaches to “Friendliness”

Wei Dai 13 Sep 2025 0:48 UTC
8 points
0
in reply to: Ben Pace’s comment on: Obligated to Respond
If you get around to writing that post, please consider/address:
- Theory of the second best—“The economists Richard Lipsey and Kelvin Lancaster showed in 1956 that if one optimality condition in an economic model cannot be satisfied, it is possible that the next-best solution involves changing other variables away from the values that would otherwise be optimal.”—Generalizing from this, given that humans deviate from optimal rationality in all kinds of unavoidable ways, the “second-best” solution may well involve belief in some falsehoods.
- Managing risks while trying to do good—We’re all very tempted to overlook risks while trying to do good, including (in this instance) destroying “that which can be destroyed by truth”.