Do you think it’s ok to base an AI alignment idea/plan on a metaethical assumption, given that there is a large spread of metaethical positions (among both amateur and professional philosophers) and it looks hard or impossible to resolve or substantially reduce the disagreement in a relevant timeframe? (I noted that the assumption is weight-bearing, since you can arrive at an opposite conclusion of “non-upload necessity” given a different assumption.)
(Everyone seems to do this, and I’m trying to better understand people’s thinking/psychology around it, not picking on you personally.)
I suppose that a pointer to me is probably a lot simpler than a description/model of me, but that pointer is very difficult to construct, whereas I can see how to construct a model using imitation learning (obviously this is a “practical” consideration).
Not sure if you can or want to explain this more, but I’m pretty skeptical, given that distributional shift / OOD generalization has been a notorious problem for ML/DL (hence probably not neglected), and I haven’t heard of much theoretical or practical progress on this topic.
Also, the model of me is then the thing that becomes powerful, which satisfies my values much more than my values can be satisfied by an external alien thing rising to power (unless it just uploads me right away I suppose).
What about people whose values are more indexical (they want themselves to be powerful/smart/whatever, not a model/copy of them), or less personal (they don’t care about themselves or a copy being powerful, they’re fine with an external Friendly AI taking over the world and ensuring a good outcome for everyone)?
I’m not sure that even an individual’s values always settle down into a unique equilibrium; I would guess this depends on their environment.
Yeah, this is covered under position 5 in the above linked post.
unrelatedly, I am still not convinced we live in a mathematical multiverse
Not completely unrelated. If this is false, and an ASI acts as if it’s true, then it could waste a lot of resources e.g. doing acausal trading with imaginary counterparties. And I also don’t think uncertainty about this philosophical assumption can be reduced much in a relevant timeframe by human philosophers/researchers, so safety/alignment plans shouldn’t be built upon it either.
My plan isn’t dependent on that meta-ethical assumption. It may be that there is a correct way to complete your values but not everyone is capable of it, but as long as some uploads can figure their value completion out, those uploads can prosper. Or if they can only figure out how to build an AGI that works out how to complete their values, they will have plenty of time to do that after this acute period of risk ends. And it seems that if no one can figure out their values, or eventually figure out how to build an AGI to complete their values, the situation would be rather intractable.
I don’t understand your thinking here. I’m suggesting a plan to prevent extinction from AGI. Why is it a breaking issue if some uploads don’t work out exactly what they “should” want? This is already true for many people. At worst it just requires that the initial few batches of uploads are carefully selected for philosophical competence (pre-upload) so that some potential misconception is not locked in. But I don’t see a reason that my plan runs a particular risk of locking in misconceptions.
yes, generalization in deep learning is hard, but it’s getting rapidly more effective in practice and better understood through AIT and mostly(?) SLT. I think this is tractable. Insofar as it’s not tractable, I think it can be made equally intractable for capabilities and alignment (possibly at some alignment tax). I have more detailed ideas about this, many of which are expressed in the post (and many of which are not). But I think that’s the high level reason for optimism.
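To gesture at the kind of SLT statement I have in mind (a minimal sketch in standard Watanabe-style notation; the framing here is illustrative rather than anything argued in the post):

```latex
% Minimal sketch, under the usual SLT regularity assumptions: the Bayesian
% free energy F_n after n samples satisfies
\[
  F_n \;=\; n L_n(w_0) \;+\; \lambda \log n \;-\; (m-1)\log\log n \;+\; O_p(1),
\]
% where L_n is the empirical loss at an optimal parameter w_0, \lambda is the
% local learning coefficient (RLCT), and m is its multiplicity. Lower-\lambda
% (effectively simpler) solutions dominate the posterior as n grows, which is
% one route to saying something about *which* solution a trained model
% generalizes with; the AIT analogy is that \lambda plays the role of a
% description-length penalty.
```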
Why is it a breaking issue if some uploads don’t work out exactly what they “should” want? This is already true for many people.
I’m scared of people doing actively terrible things with the resources of entire stars or galaxies at their disposal (a kind of s-risk), and concerned about wasting astronomical potential (if they do something not terrible but just highly suboptimal). See Morality is Scary and Two Neglected Problems in Human-AI Safety for some background on my thinking about this.
At worst it just requires that the initial few batches of uploads are carefully selected for philosophical competence (pre-upload) so that some potential misconception is not locked in.
This would relieve the concern I described, but bring up other issues, like being opposed by many because the candidates’ values/views are not representative of humanity or of the objectors themselves. (For example, philosophical competence is highly correlated with or causes atheism, making it highly overrepresented in the initial candidates.)
I was under the impression that your advocated plan is to upload everyone at the same time (or as close to that as possible); otherwise, how could you ensure that you personally would be uploaded, i.e. why would the initial batches of uploads necessarily decide to upload everyone else once they’ve gained power? Maybe I should have clarified this with you first.
My own “plan” (if you want something to compare with) is to pause AI until metaphilosophy is solved in a clear way, and then build some kind of philosophically super-competent assistant/oracle AI to help fully solve alignment and the associated philosophical problems. Uploading carefully selected candidates also seems somewhat ok albeit a lot scarier (due to “power corrupts”, or selfish/indexical values possibly being normative or convergent) if you have a way around the social/political problems.
better understood through AIT and mostly(?) SLT
Any specific readings or talks you can recommend on this topic?
I’m scared of people doing actively terrible things with the resources of entire stars or galaxies at their disposal (a kind of s-risk), and concerned about wasting astronomical potential (if they do something not terrible but just highly suboptimal). See Morality is Scary and Two Neglected Problems in Human-AI Safety for some background on my thinking about this.
I am also scared of S-risks, but these can be prevented through effective governance of an emulation society. We don’t have a great track record of this so far (we have animal cruelty laws but also factory farming), and it’s not clear to me whether it’s generally easier or harder to manage in an emulation society (surveillance is potentially easier, but the scale of S-risks is much larger). So, this is a serious challenge that we will have to meet (e.g. by selecting the first few batches of uploads carefully and establishing regulations) but it seems to be somewhat distinct from alignment.
I am less concerned about wasting (say) 10-20% of astronomical potential. I’m trying not to die here. Also, I don’t think the waste is likely to be in the tens of percent, because most of my preferences seem to have diminishing returns to scale. And because I don’t believe in “correct” values.
I was under the impression that your advocated plan is to upload everyone at the same time (or as close to that as possible); otherwise, how could you ensure that you personally would be uploaded, i.e. why would the initial batches of uploads necessarily decide to upload everyone else once they’ve gained power? Maybe I should have clarified this with you first.
I can’t ensure that I will be, though I will fight to make it happen. If I were, I would probably try to upload a lot of rationalists in the second batch (and not, say, become a singleton).
My own “plan” (if you want something to compare with) is to pause AI until metaphilosophy is solved in a clear way, and then build some kind of philosophically super-competent assistant/oracle AI to help fully solve alignment and the associated philosophical problems. Uploading carefully selected candidates also seems somewhat ok albeit a lot scarier (due to “power corrupts”, or selfish/indexical values possibly being normative or convergent) if you have a way around the social/political problems.
I would like to pause AI; I’m not sure solving metaphilosophy is in reach (though I have no strong commitment that it isn’t); and I don’t know how to build a safe philosophically super-competent assistant/oracle—or for that matter a safe superintelligence of any type (except possibly at a very high alignment tax by one of Michael K. Cohen’s proposals), unless it is (effectively) an upload, in which case I at least have a vague plan.
Any specific readings or talks you can recommend on this topic?
I am trying to invent a (statistical learning) theory of meta-(online learning). I have not made very much progress yet, but there is a sketch here: https://www.lesswrong.com/posts/APP8cbeDaqhGjqH8X/paradigms-for-computation
The idea is based on “getting around” Shane Legg’s argument that there is no elegant universal learning algorithm by taking advantage of pretraining to increase the effective complexity of a simple learning algorithm: https://arxiv.org/abs/cs/0606070
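To make the “pretraining raises effective complexity” point concrete, here is a toy sketch (my own illustration; the class name and expert interface are hypothetical, not taken from the linked post or from Legg’s paper): the online update rule is a few lines of exact Bayesian mixture weighting, while all of the predictor’s effective complexity lives in the pretrained experts it mixes over.

```python
# Toy sketch only: the names here (PretrainedMixturePredictor, `experts`) are
# hypothetical. The online update rule below has low algorithmic complexity,
# while the predictor's effective complexity comes from the pretrained
# experts it mixes over.
import numpy as np

class PretrainedMixturePredictor:
    def __init__(self, experts):
        # `experts`: callables mapping a history (tuple of symbols) to a
        # probability vector over the next symbol; stand-ins for models
        # produced by pretraining.
        self.experts = experts
        self.log_w = np.zeros(len(experts))  # uniform prior over experts

    def predict(self, history):
        # Posterior-weighted mixture forecast for the next symbol.
        w = np.exp(self.log_w - self.log_w.max())
        w /= w.sum()
        preds = np.array([e(history) for e in self.experts])
        return w @ preds

    def update(self, history, symbol):
        # Bayes rule: reweight each expert by the likelihood it assigned
        # to the symbol that actually occurred.
        for i, e in enumerate(self.experts):
            self.log_w[i] += np.log(e(history)[symbol] + 1e-12)
```

The design point is that the update rule’s description length stays fixed no matter how rich the experts are, which is the sense in which pretraining can increase a simple learning algorithm’s effective complexity without complicating the algorithm itself.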
I did some related preliminary experiments: https://www.lesswrong.com/posts/APP8cbeDaqhGjqH8X/paradigms-for-computation
The connection to SLT would look something like what @Lucius Bushnaq has been studying, except it should be the online learning algorithm that is learned: https://www.alignmentforum.org/posts/3ZBmKDpAJJahRM248/proof-idea-slt-to-ait
David Quarel and others at Timaeus presented on singular learning theory for reinforcement learning at ILIAD 2. I missed it (and their results don’t seem to be published yet). Ultimately, I want something like this but for online decision making = history-based RL.
Thanks for the suggested readings.
There are lots of ways to cash out “trying not to die”, many of which imply that solving AI alignment (or getting uploaded) isn’t even the most important thing. For instance under theories of modal or quantum immortality, dying is actually impossible. Or consider that most copies of you in the multiverse or universe are probably living in simulations of Earth rather than original physical entities, so the most important thing from a survival-defined-indexically perspective may be to figure out what the simulators want, or what’s least likely to cause them to want to turn off the simulation or most likely to “rescue” you after you die here. Or, why aim for a “perfectly aligned” AI instead of one that cares just enough about humans to keep us alive in a comfortable zoo after the Singularity (which they may already do by default because of acausal trade, or maybe the best way to ensure this is to increase the cosmic resources available to aligned AI so they can do more of this kind of trade)?
And because I don’t believe in “correct” values.
The above was in part trying to point out that even something like not wanting to die is very ill-defined, so if there are no correct values, not even relative to a person or a set of initial fuzzy non-preferences, then that’s actually a much more troubling situation than you seem to think.
I don’t know how to build a safe philosophically super-competent assistant/oracle
That’s in part why I’d want to attempt this only after a long pause (i.e. lasting at least multiple decades) to develop the necessary ideas, and probably only after enhancing human intelligence.
To be clear, I’m trying to prevent AGI from killing everyone on earth, including but not limited to me personally.
There could be some reason (which I don’t fully understand and can’t prove) for subjective immortality, but that poorly understood possibility does not cause me to drive recklessly or stop caring about other X-risks. I suspect that any complications fail to change the basic logic that I don’t want myself or the rest of humanity to be placed in mortal danger, whether or not that danger subjectively results in death—it seems very likely to result in a loss of control.
A long pause with intelligence enhancement sounds great. I don’t think we can achieve a very long pause, because the governance requirements become increasingly demanding as compute gets cheaper. I view my emulation scheme as closely connected to intelligence enhancement—for instance, if you ran the emulation for only twenty seconds you could use it as a biofeedback mechanism to avoid bad reasoning steps by near-instantly predicting they would soon be regretted (as long as this target grounds out properly, which takes work).
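As a minimal sketch of that biofeedback loop (my own illustration; the function names and the regret-scoring interface are hypothetical rather than anything specified in the scheme): treat the short-horizon emulation as a scorer that predicts how much the user would soon regret a candidate reasoning step, and veto steps above a threshold.

```python
# Hypothetical sketch: `emulate_regret` stands in for a short-horizon
# (~20-second) emulation of the user that returns a predicted regret score
# in [0, 1] for a candidate reasoning step. None of these names come from
# the original scheme.
from typing import Callable, List

def filter_reasoning_steps(
    candidate_steps: List[str],
    emulate_regret: Callable[[str], float],
    regret_threshold: float = 0.5,
) -> List[str]:
    """Keep only candidate steps the emulation does not predict will be regretted."""
    kept = []
    for step in candidate_steps:
        predicted_regret = emulate_regret(step)
        if predicted_regret < regret_threshold:
            kept.append(step)
        # else: drop the step; the biofeedback signal vetoes it
    return kept
```

Everything load-bearing is hidden inside emulate_regret, i.e. in whether a roughly twenty-second emulation really grounds out in the user’s considered judgment, which is the part that takes work.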