> Definition (Strong upload necessity). It is impossible to construct a perfectly aligned successor that is not an emulation. [...] In fact, I think there is a decent chance that strong upload necessity holds for nearly all humans
What’s the main reason(s) that you think this? For example one way to align an AI[1] that’s not an emulation was described in Towards a New Decision Theory: “we’d need to program the AI with preferences over all mathematical structures, perhaps represented by an ordering or utility function over conjunctions of well-formed sentences in a formal set theory. The AI will then proceed to “optimize” all of mathematics, or at least the parts of math that (A) are logically dependent on its decisions and (B) it can reason or form intuitions about.” Which part is the main “impossible” thing in your mind, “how to map fuzzy human preferences to well-defined preferences” or creating an AI that can optimize the universe according to such well-defined preferences?
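For concreteness, the kind of object that quote describes might be sketched as follows (my gloss, not a formalism from the original post): let $\mathcal{L}$ be the well-formed sentences of a formal set theory, let $U:\mathcal{L}\to\mathbb{R}$ score (conjunctions of) sentences, and let $S_a\subseteq\mathcal{L}$ be the decision-relevant sentences satisfying (A) and (B) for action $a$. The AI then chooses

$$a^{*} \in \operatorname*{arg\,max}_{a \in A} \; \sum_{\varphi \in S_a} \mathbb{P}(\varphi \mid a)\,U(\varphi),$$

where $\mathbb{P}$ is its logical-uncertainty distribution. On this reading, obtaining $U$ from a human is the "map fuzzy human preferences to well-defined preferences" step, and making the argmax work is the "optimize the universe according to such well-defined preferences" step.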
I currently suspect it’s the former, and it’s because of your metaethical beliefs/credences. Consider these 2 metaethical positions (from Six Plausible Meta-Ethical Alternatives):
> 3. There aren’t facts about what everyone should value, but there are facts about how to translate non-preferences (e.g., emotions, drives, fuzzy moral intuitions, circular preferences, non-consequentialist values, etc.) into preferences. These facts may include, for example, what is the right way to deal with ontological crises. The existence of such facts seems plausible because if there were facts about what is rational (which seems likely) but no facts about how to become rational, that would seem like a strange state of affairs.
> 4. None of the above facts exist, so the only way to become or build a rational agent is to just think about what preferences you want your future self or your agent to hold, until you make up your mind in some way that depends on your psychology. But at least this process of reflection is convergent at the individual level so each person can reasonably call the preferences that they endorse after reaching reflective equilibrium their morality or real values.
If 3 is true, then we can figure out and use the “facts about how to translate non-preferences into preferences” to “map fuzzy human preferences to well-defined preferences” but if 4 is true, then running the human as an emulation becomes the only possible way forward (as far as building an aligned agent/successor). Is this close to what you’re thinking?
I also want to note that if 3 (or some of the other metaethical alternatives) is true, then “strong non-upload necessity”, i.e. that it is impossible to construct a perfectly aligned successor that is an emulation, becomes very plausible for many humans, because an emulation of a human might find it impossible to make the necessary philosophical progress to figure out the correct normative facts about how to turn their own “non-preferences” into preferences, or might simply lack the inclination/motivation to do so.

[1] Which I don’t endorse as something we should currently try to do; see Three Approaches to “Friendliness”.
I think 4 is basically right, though human values aren’t just fuzzy; they’re also quite complex, perhaps on the order of the complexity of the human’s mind, meaning you pretty much have to execute the human’s mind to evaluate their values exactly. Some people, like very hardcore preference utilitarians, have values dominated by a term much simpler than their minds. However, even those people usually have somewhat self-referential preferences, in that they care at least a bit extra about themselves and those close to them, and this kind of self-reference drastically increases the complexity of values if you want to include it.
For instance, I value my current mind being able to do certain things in the future (learn stuff, prove theorems, seed planets with life) somewhat more than I would value that for a typical human’s mind (though I am fairly altruistic). I suppose that a pointer to me is probably a lot simpler than a description/model of me, but that pointer is very difficult to construct, whereas I can see how to construct a model using imitation learning (obviously this is a “practical” consideration). Also, the model of me is then the thing that becomes powerful, which satisfies my values much more than my values can be satisfied by an external alien thing rising to power (unless it just uploads me right away I suppose).
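To make “construct a model using imitation learning” concrete at the toy level, here is a minimal behavior-cloning sketch of the generic technique (illustrative names; nothing here is specific to the scheme under discussion, and a whole-person model would obviously need far more than this):

```python
# Minimal behavior-cloning sketch (illustrative only): fit a small policy
# network to (observation, action) pairs recorded from a demonstrator, so the
# learned model itself carries the demonstrator's behavior.
import torch
import torch.nn as nn

class PolicyModel(nn.Module):
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs):
        return self.net(obs)  # action logits

def behavior_clone(model, demos, epochs=10, lr=1e-3):
    """demos: list of (obs_tensor, action_index) pairs from the demonstrator."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for obs, act in demos:
            logits = model(obs.unsqueeze(0))           # shape (1, n_actions)
            loss = loss_fn(logits, torch.tensor([act]))
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```

The point is only that the learned model, rather than an external pointer, ends up carrying the predictive content; coverage, fidelity, and out-of-distribution behavior are exactly the hard parts discussed below.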
I’m not sure that even an individual’s values always settle down into a unique equilibrium; I would guess this depends on their environment.
Unrelatedly, I am still not convinced we live in a mathematical multiverse, or even necessarily a mathematical universe. (Finding out we lived in a mathematical universe would make a mathematical multiverse seem very likely, for the ensemble reasons we have discussed before.)
Do you think it’s OK to base an AI alignment idea/plan on a metaethical assumption, given that there is a large spread of metaethical positions (among both amateur and professional philosophers) and it looks hard to impossible to resolve, or even substantially reduce, the disagreement in a relevant timeframe? (I noted that the assumption is weight-bearing, since you can arrive at the opposite conclusion of “non-upload necessity” given a different assumption.)
(Everyone seems to do this, and I’m trying to better understand people’s thinking/psychology around it, not picking on you personally.)
> I suppose that a pointer to me is probably a lot simpler than a description/model of me, but that pointer is very difficult to construct, whereas I can see how to construct a model using imitation learning (obviously this is a “practical” consideration).
Not sure if you can or want to explain this more, but I’m pretty skeptical, given that distributional shift / OOD generalization has been a notorious problem for ML/DL (hence probably not neglected), and I haven’t heard of much theoretical or practical progress on this topic.
> Also, the model of me is then the thing that becomes powerful, which satisfies my values much more than my values can be satisfied by an external alien thing rising to power (unless it just uploads me right away I suppose).
What about people whose values are more indexical (they want themselves to be powerful/smart/whatever, not a model/copy of them), or less personal (they don’t care about themselves or a copy being powerful, they’re fine with an external Friendly AI taking over the world and ensuring a good outcome for everyone)?
> I’m not sure that even an individual’s values always settle down into a unique equilibrium; I would guess this depends on their environment.
Yeah, this is covered under position 5 in the above linked post.
> Unrelatedly, I am still not convinced we live in a mathematical multiverse
Not completely unrelated. If this is false, and an ASI acts as if it’s true, then it could waste a lot of resources e.g. doing acausal trading with imaginary counterparties. And I also don’t think uncertainty about this philosophical assumption can be reduced much in a relevant timeframe by human philosophers/researchers, so safety/alignment plans shouldn’t be built upon it either.
My plan isn’t dependent on that meta-ethical assumption. It may be that there is a correct way to complete your values but not everyone is capable of finding it; as long as some uploads can figure out their value completion, those uploads can prosper. Or, if they can only figure out how to build an AGI that works out how to complete their values, they will have plenty of time to do that after this acute period of risk ends. And it seems that if no one can figure out their values, or eventually figure out how to build an AGI to complete their values, the situation would be rather intractable.
I don’t understand your thinking here. I’m suggesting a plan to prevent extinction from AGI. Why is it a breaking issue if some uploads don’t work out exactly what they “should” want? This is already true for many people. At worst it just requires that the initial few batches of uploads are carefully selected for philosophical competence (pre-upload) so that some potential misconception is not locked in. But I don’t see a reason that my plan runs a particular risk of locking in misconceptions.
Yes, generalization in deep learning is hard, but it’s getting rapidly more effective in practice and better understood through AIT and mostly(?) SLT. I think this is tractable. Insofar as it’s not tractable, I think it can be made equally intractable for capabilities and alignment (possibly at some alignment tax). I have more detailed ideas about this, many of which are expressed in the post (and many of which are not). But I think that’s the high-level reason for optimism.
> Why is it a breaking issue if some uploads don’t work out exactly what they “should” want? This is already true for many people.
I’m scared of people doing actively terrible things with the resources of entire stars or galaxies at their disposal (a kind of s-risk), and concerned about wasting astronomical potential (if they do something not terrible but just highly suboptimal). See Morality is Scary and Two Neglected Problems in Human-AI Safety for some background on my thinking about this.
> At worst it just requires that the initial few batches of uploads are carefully selected for philosophical competence (pre-upload) so that some potential misconception is not locked in.
This would relieve the concern I described, but bring up other issues, like being opposed by many people because the candidates’ values/views are not representative of humanity or of themselves. (For example, philosophical competence is highly correlated with, or causes, atheism, which would make atheism highly overrepresented among the initial candidates.)
I was under the impression that your advocated plan is to upload everyone at the same time (or as close to that as possible); otherwise, how could you ensure that you personally would be uploaded, i.e. why would the initial batches of uploads necessarily decide to upload everyone else once they’ve gained power? Maybe I should have clarified this with you first.
My own “plan” (if you want something to compare with) is to pause AI until metaphilosophy is solved in a clear way, and then build some kind of philosophically super-competent assistant/oracle AI to help fully solve alignment and the associated philosophical problems. Uploading carefully selected candidates also seems somewhat ok albeit a lot scarier (due to “power corrupts”, or selfish/indexical values possibly being normative or convergent) if you have a way around the social/political problems.
> better understood through AIT and mostly(?) SLT
Any specific readings or talks you can recommend on this topic?
> I’m scared of people doing actively terrible things with the resources of entire stars or galaxies at their disposal (a kind of s-risk), and concerned about wasting astronomical potential (if they do something not terrible but just highly suboptimal). See Morality is Scary and Two Neglected Problems in Human-AI Safety for some background on my thinking about this.
I am also scared of S-risks, but these can be prevented through effective governance of an emulation society. We don’t have a great track record of this so far (we have animal cruelty laws but also factory farming), and it’s not clear to me whether it’s generally easier or harder to manage in an emulation society (surveillance is potentially easier, but the scale of S-risks is much larger). So, this is a serious challenge that we will have to meet (e.g. by selecting the first few batches of uploads carefully and establishing regulations) but it seems to be somewhat distinct from alignment.
I am less concerned about wasting (say) 10-20% of astronomical potential. I’m trying not to die here. Also, I don’t think the waste is likely to be in the tens of percent, because most of my preferences seem to have diminishing returns to scale. And because I don’t believe in “correct” values.
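To spell out the diminishing-returns point with one purely illustrative choice of utility function: if utility in resources is logarithmic, $u(R)=\log R$, and cosmic resources are astronomically large, say $R\sim 10^{60}$ in some baseline units, then wasting 20% of the resources costs

$$\log R - \log(0.8R) = \log 1.25 \approx 0.22 \ \text{nats},$$

out of a total of $\log 10^{60} \approx 138$ nats, i.e. well under 1% of the utility despite a 20% loss of resources. The conclusion obviously depends on the assumed functional form and baseline.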
> I was under the impression that your advocated plan is to upload everyone at the same time (or as close to that as possible); otherwise, how could you ensure that you personally would be uploaded, i.e. why would the initial batches of uploads necessarily decide to upload everyone else once they’ve gained power? Maybe I should have clarified this with you first.
I can’t ensure that I will be, though I will fight to make it happen. If I were, I would probably try to upload a lot of rationalists in the second batch (and not, say, become a singleton).
> My own “plan” (if you want something to compare with) is to pause AI until metaphilosophy is solved in a clear way, and then build some kind of philosophically super-competent assistant/oracle AI to help fully solve alignment and the associated philosophical problems. Uploading carefully selected candidates also seems somewhat ok albeit a lot scarier (due to “power corrupts”, or selfish/indexical values possibly being normative or convergent) if you have a way around the social/political problems.
I would like to pause AI, I’m not sure solving metaphilosophy is in reach (though I have no strong commitment that it isn’t), and I don’t know how to build a safe philosophically super-competent assistant/oracle—or for that matter a safe superintelligence of any type (except possibly at a very high alignment tax by one of Michael K. Cohen’s proposals), unless it is (effectively) an upload, in which case I at least have a vague plan.
> Any specific readings or talks you can recommend on this topic?
I am trying to invent a (statistical learning) theory of meta-(online learning). I have not made very much progress yet, but there is a sketch here: https://www.lesswrong.com/posts/APP8cbeDaqhGjqH8X/paradigms-for-computation

The idea is based on “getting around” Shane Legg’s argument that there is no elegant universal learning algorithm by taking advantage of pretraining to increase the effective complexity of a simple learning algorithm: https://arxiv.org/abs/cs/0606070

I did some related preliminary experiments: https://www.lesswrong.com/posts/APP8cbeDaqhGjqH8X/paradigms-for-computation

The connection to SLT would look something like what @Lucius Bushnaq has been studying, except it should be the online learning algorithm that is learned: https://www.alignmentforum.org/posts/3ZBmKDpAJJahRM248/proof-idea-slt-to-ait

David Quarel and others at Timaeus presented on singular learning theory for reinforcement learning at ILIAD 2. I missed it (and their results don’t seem to be published yet). Ultimately, I want something like this but for online decision making (i.e., history-based RL).
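As a toy illustration of the flavor of “increasing the effective complexity of a simple learning algorithm through pretraining” (my sketch, not the construction in the linked post or paper): most of the description length sits in a frozen pretrained feature map, while the learner that runs online on top of it is only a few lines.

```python
# Toy illustration: a frozen "pretrained" feature map carries most of the
# effective complexity, while the learning algorithm applied online on top of
# it is only a few lines (plain online logistic regression).
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pretrained network: a fixed projection "learned" offline.
W_pretrained = rng.normal(size=(256, 32))

def features(x: np.ndarray) -> np.ndarray:
    return np.tanh(x @ W_pretrained)  # frozen representation

class SimpleOnlineLearner:
    """Online logistic regression: the 'simple learning algorithm'."""
    def __init__(self, dim: int, lr: float = 0.1):
        self.w = np.zeros(dim)
        self.lr = lr

    def predict(self, phi: np.ndarray) -> float:
        return 1.0 / (1.0 + np.exp(-phi @ self.w))

    def update(self, phi: np.ndarray, y: float) -> None:
        self.w += self.lr * (y - self.predict(phi)) * phi

learner = SimpleOnlineLearner(dim=32)
for _ in range(1000):
    x = rng.normal(size=256)
    y = float(x[:10].sum() > 0)   # toy online prediction task
    learner.update(features(x), y)
```

An elegant universal learner would need the few online lines alone to suffice; here they only work because the frozen map was already fitted offline.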
Thanks for the suggested readings.

There are lots of ways to cash out “trying not to die”, many of which imply that solving AI alignment (or getting uploaded) isn’t even the most important thing. For instance, under theories of modal or quantum immortality, dying is actually impossible. Or consider that most copies of you in the multiverse or universe are probably living in simulations of Earth rather than being original physical entities, so the most important thing from a survival-defined-indexically perspective may be to figure out what the simulators want, or what’s least likely to cause them to want to turn off the simulation, or most likely to “rescue” you after you die here. Or, why aim for a “perfectly aligned” AI instead of one that cares just enough about humans to keep us alive in a comfortable zoo after the Singularity (which such AIs may already do by default because of acausal trade, or maybe the best way to ensure this is to increase the cosmic resources available to aligned AI so they can do more of this kind of trade)?
> And because I don’t believe in “correct” values.
The above was in part trying to point out that even something like not wanting to die is very ill-defined, so if there are no correct values, not even relative to a person or a set of initial fuzzy non-preferences, then that’s actually a much more troubling situation than you seem to think.
> I don’t know how to build a safe philosophically super-competent assistant/oracle
That’s in part why I’d want to attempt this only after a long pause (i.e. at least multiple decades) to develop the necessary ideas, and probably only after enhancing human intelligence.
To be clear, I’m trying to prevent AGI from killing everyone on earth, including but not limited to me personally.
There could be some reason (which I don’t fully understand and can’t prove) for subjective immortality, but that poorly understood possibility does not cause me to drive recklessly or stop caring about other X-risks. I suspect that any complications fail to change the basic logic that I don’t want myself or the rest of humanity to be placed in mortal danger, whether or not that danger subjectively results in death—it seems very likely to result in a loss of control.
A long pause with intelligence enhancement sounds great. I don’t think we can achieve a very long pause, because the governance requirements become increasingly demanding as compute gets cheaper. I view my emulation scheme as closely connected to intelligence enhancement: for instance, if you ran the emulation for only twenty seconds, you could use it as a biofeedback mechanism that avoids bad reasoning steps by near-instantly predicting whether they would soon be regretted (as long as this target grounds out properly, which takes work).
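A hypothetical sketch of that biofeedback loop as I read it (all names invented, and the regret predictor is a stub standing in for the twenty-second emulation rollout):

```python
# Hypothetical sketch (all names invented): candidate reasoning steps that a
# short emulation rollout predicts would soon be regretted are filtered out
# before they are taken.
from typing import Callable, List

def filter_steps(
    candidate_steps: List[str],
    predict_regret: Callable[[str], float],  # stand-in for a ~20-second emulation
    threshold: float = 0.5,
) -> List[str]:
    """Keep only the reasoning steps predicted not to be regretted."""
    return [s for s in candidate_steps if predict_regret(s) < threshold]

# Toy usage with a dummy predictor; a real system would query the emulation.
dummy = lambda step: 0.9 if "unjustified leap" in step else 0.1
print(filter_steps(["check the base case", "make an unjustified leap"], dummy))
```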