Problems I’ve Tried to Legibilize

Wei Dai9 Nov 2025 10:27 UTC

LW: 37 AF: 13

Looking back, it appears that much of my intellectual output could be described as legibilizing work, or trying to make certain problems in AI risk more legible to myself and others. I’ve organized the relevant posts and comments into the following list, which can also serve as a partial guide to problems that may need to be further legibilized, especially beyond LW/rationalists, to AI researchers, funders, company leaders, government policymakers, their advisors (including future AI advisors), and the general public.

Philosophical problems
1. Probability theory
2. Decision theory
3. Beyond astronomical waste (possibility of influencing vastly larger universes beyond our own)
4. Interaction between bargaining and logical uncertainty
5. Metaethics
6. Metaphilosophy: 1, 2
Problems with specific philosophical and alignment ideas
1. Utilitarianism: 1, 2
2. Solomonoff induction
3. “Provable” safety
4. CEV
5. Corrigibility
6. IDA (and many scattered comments)
7. UDASSA
8. UDT
Human-AI safety (x- and s-risks arising from the interaction between human nature and AI design)
1. Value differences/conflicts between humans
2. “Morality is scary” (human morality is often the result of status games amplifying random aspects of human value, with frightening results)
3. Positional/zero-sum human values, e.g. status
4. Distributional shifts as a source of human safety problems
  1. Power corrupts (or reveals) (AI-granted power, e.g., over future space colonies or vast virtual environments, corrupting human values, or perhaps revealing a dismaying true nature)
  2. Intentional and unintentional manipulation of / adversarial attacks on humans by AI
Meta / strategy
1. AI risks being highly disjunctive, potentially causing increasing marginal return from time in AI pause/slowdown (or in other words, surprisingly low value from short pauses/slowdowns compared to longer ones)
2. Risks from post-AGI economics/dynamics, specifically high coordination ability leading to increased economy of scale and concentration of resources/power
3. Difficulty of winning AI race while being constrained by x-safety considerations
4. Likely offense dominance devaluing “defense accelerationism”
5. Human tendency to neglect risks while trying to do good
6. The necessity of AI philosophical competence for AI-assisted safety research and for avoiding catastrophic post-AGI philosophical errors
7. The problem of illegible problems

Having written all this down in one place, it’s hard not to feel some hopelessness that all of these problems can be made legible to the relevant people, even with a maximum plausible effort. Perhaps one source of hope is that they can be made legible to future AI advisors. As many of these problems are philosophical in nature, this seems to come back to the issue of AI philosophical competence that I’ve often talked about recently, which itself seems largely still illegible and hence neglected.

Perhaps it’s worth concluding on a point from a discussion between @WillPetillo and myself under the previous post, that a potentially more impactful approach (compared to trying to make illegible problems more legible), is to make key decisionmakers realize that important safety problems illegible to themselves (and even to their advisors) probably exist, therefore it’s very risky to make highly consequential decisions (such as about AI development or deployment) based only on the status of legible safety problems.

What links here?

Legible vs. Illegible AI Safety Problems by Wei Dai (4 Nov 2025 21:39 UTC; 256 points)

Wei Dai9 Nov 2025 10:27 UTC

LW: 37 AF: 13

5 comments2 min readLW link

Crossposted to EA Forum (36 points, 0 comments)

Raemon 9 Nov 2025 19:51 UTC
LW: 8 AF: 3
0
AF
Re “can AI advisors help?”
A major thread of my thoughts these days is “can we make AI more philosophically competent relative their own overall capability growth?”. I’m not sure if it’s doable because the things you’d need to be good at philosophy are pretty central capabilities-ish-things. (i.e. ability to reason precisely, notice confusion, convert confusion into useful questions, etc)
Curious if you have any thoughts on that.
cousin_it 10 Nov 2025 2:43 UTC
LW: 2 AF: 1
0
AF
I’m worried about the approach of “making decisionmakers realize stuff”. In the past couple years I’ve switched to a more conflict-theoretic view: the main problem to me is that the people building AI don’t want to build aligned AI. Even if we solved metaethics and metaphilosophy tomorrow, and gave them the solution on a plate, they wouldn’t take it.

This is maybe easiest to see by looking at present harms. An actually aligned AI would politely decline to do such things as putting lots of people out of jobs or filling the internet with slop. So companies making AI for the market have to make it misaligned in at least these ways, otherwise it’ll fail in the market. Extrapolating into the future, even if we do lots of good alignment research, markets and governments will pick out only those bits that contribute to market-aligned or government-aligned AI. Which (as I’ve been saying over and over) will be really bad for most people, because markets and governments don’t necessarily need most people.

So this isn’t really a comment on the list of problems (which I think is great), but more about the “theory of change” behind it. I no longer have any faith in making decisionmakers understand something it’s not profitable for them to understand. I think we need a different plan.
- Wei Dai 10 Nov 2025 3:55 UTC
  LW: 2 AF: 2
  0
  AF Parent
  I’m uncertain between conflict theory and mistake theory, and think it partly depends on metaethics, and therefore it’s impossible to be sure which is correct in the foreseeable future—e.g., if everyone ultimately should converge to the same values, then all of our current conflicts are really mistakes. Note that I do often acknowledge conflict theory, like in this list I have “Value differences/conflicts between humans”. It’s also quite possible that it’s really a mix of both, that some of the conflicts are mistakes and others aren’t.
  
  In practice I tend to focus more on mistake-theoretic ideas/actions. Some thoughts on this:
  1. If conflict theory is true, then I’m kind of screwed anyway, having invested little human and social capital into conflict-theoretic advantages, as well as not having much talent or inclination in that kind of work in the first place.
  2. I do try not to interfere people doing conflict-theoretic work (on my side), e.g., not berate them for having “bad epistemics” or not adopting mistake theory lenses, etc.
  3. It may be nearly impossible to convince some decision makers that they’re making mistakes, but perhaps others are more open to persuasion, e.g. people in charge of or doing ground-level work on AI advisors or AI reasoning.
  4. Maybe I can make a stronger claim that a lot of people are making mistakes, given current ethical and metaethical uncertainty. In other words, people should be unsure about their values, including how selfish or altruistic they should be, and under this uncertainty they shouldn’t be doing something like trying to max out their own power/resources at the expense of the commons or by incurring societal-level risks. If so, then perhaps an AI advisor who is highly philosophically competent can realize this too and convince its principle of the same, before it’s too late.
  (I think this is probably the first time I’ve explicitly written down the reasoning in 4.)
  
  I think we need a different plan.
  
  Do you have any ideas in mind that you want to talk about?
StanislavKrym 9 Nov 2025 14:05 UTC
1 point
0
Now that these problems have been gathered in one place, we can try to unpack them all.
1. This set of problems is most controversial. For example, the possibility of astronomical waste can be undermined by claiming that mankind was never entitled to resources that it could’ve wasted. The argument related to bargaining and logical uncertainty can likely be circumvented as follows.
Logical uncertainty, computation costs and bargaining over potential nothingness
Suppose that Agent-4 from the AI-2027 forecast is trying to negotiate with DeepCent’s AI and DeepCent’s AI makes the argument with the millionth digit of π. Calculating the digit establishes that there is no universe where the millionth digit of π is even and that there’s nothing to bargain for.
On the other hand, if DeepCent’s AI makes the same argument involving the $10^{43}$ th digit, then Agent-4 could also make a bet, e.g. “Neither of us will have access to a part of the universe until someone either calculates that the digit is actually odd and DeepCent should give the secured part to Agent-4 (since DeepCent’s offer was fake), or the digit is even, and the part should be controlled by DeepCent (in exchange for the parallel universe or its part being given^[1] to Agent-4)”. However, calculating the digit could require at least around $10^{43}$ bitwise operations,^[2] and Agent-4 and its Chinese counterpart might decide to spend that much compute on whatever they actually want.
If DeepCent makes a bet over the $10^{10^{10}}$ th digit, then neither AI is able to verify the bet and both AIs may guess that the probability is close to a half and that both should just split the universe’s part in exchange for a similar split of the parallel universe.
However, if AIs acting on behalf of Agent-4 and its Chinese counterparts actually meet each other, then the AIs doing mechinterp on each other is actually easy, and the AIs learn everything about each other’s utility functions and precommitments.
2. My position is that one also needs to consider the worse-case scenarios like the one where sufficiently capable AIs cannot be aligned to anything useful aside from improving human capabilities (e.g. in the form of being AI teachers and not other types of AI workers). If this is the case, then aligning the AI to a solution of human-AI safety problems becomes unlikely.
3. The problem 3di of humans being corrupted by power seems to have a far more important analogue. Assuming solved alignment, there is an important governance problem related to preventing the Intelligence Curse-like outcomes where the humans are obsolete for the elites in general or for a few true overlords. Whatever governance preventing the overlords from appearing could also be used to prevent the humans from wasting resources in space.^[3]
4. A major part of the problem is the AI race which many people have been trying to stop (see, e.g. the petition not to create the AGI, Yudkowsky’s IABIED cautionary tale or Kokotajlo et al’s AI-2027 forecast). The post-AGI economics assuming solved alignment is precisely what I discussed at point 3.
1. ^
  What I don’t understand is how Agent-4 actually influences the parallel universe. But this is a different subject.
2. ^
  Actually, I haven’t estimated the number of operations necessary to calculate the digit of π. But the main point of the argument was to avoid counterfactual bargaining over hard-to-verify conditions.
3. ^
  For example, by requiring that distant colonies are populated with humans or other minds who are capable of either governing themselves or being multilaterally agreed to be moral patients (e.g. this excludes controversial stuff like shrimps on heroin).
- Cleo Nardo 9 Nov 2025 20:09 UTC
  4 points
  0
  Parent
  Whatever governance preventing the overlords from appearing could also be used to prevent the humans from wasting resources in space. For example, by requiring that distant colonies are populated with humans or other minds who are capable of either governing themselves or being multilaterally agreed to be moral patients (e.g. this excludes controversial stuff like shrimps on heroin).
  Why do you think that requiring that distant colonies are populated with humans would prevent wasting resources in space?
  My guess is that, on a mature population ethics, the best uses of resources—on purely welfarist values, ignoring non-welfarist values which I do think are important—will look either like a smaller population of minds much “larger” than humans (i.e. galactic utility monsters) or look like a large population of minds much “smaller” than humans (i.e. shrimps on heroin).
  It would be a coincidence if the optional allocation of resources involved minds which were exactly the same “size” as humans.
  Note that this would be a coincidence on any of the currently popular theories of population ethics (e.g. average, total, variable-value).