I’m scared of people doing actively terrible things with the resources of entire stars or galaxies at their disposal (a kind of s-risk), and concerned about wasting astronomical potential (if they do something not terrible but just highly suboptimal). See Morality is Scary and Two Neglected Problems in Human-AI Safety for some background on my thinking about this.
I am also scared of S-risks, but these can be prevented through effective governance of an emulation society. We don’t have a great track record of this so far (we have animal cruelty laws but also factory farming), and it’s not clear to me whether it’s generally easier or harder to manage in an emulation society (surveillance is potentially easier, but the scale of S-risks is much larger). So, this is a serious challenge that we will have to meet (e.g. by selecting the first few batches of uploads carefully and establishing regulations) but it seems to be somewhat distinct from alignment.
I am less concerned about wasting (say) 10-20% of astronomical potential. I’m trying not to die here. Also, I don’t think the waste is likely to be in the tens of percent, because most of my preferences seem to have diminishing returns to scale. And because I don’t believe in “correct” values.
I was under the impression that your advocated plan is to upload everyone at the same time (or as close to that as possible). Otherwise, how could you ensure that you personally would be uploaded, i.e. why would the initial batches of uploads necessarily decide to upload everyone else once they’ve gained power? Maybe I should have clarified this with you first.
I can’t ensure that I will be, though I will fight to make it happen. If I were, I would probably try to upload a lot of rationalists in the second batch (and not, say, become a singleton).
My own “plan” (if you want something to compare with) is to pause AI until metaphilosophy is solved in a clear way, and then build some kind of philosophically super-competent assistant/oracle AI to help fully solve alignment and the associated philosophical problems. Uploading carefully selected candidates also seems somewhat ok albeit a lot scarier (due to “power corrupts”, or selfish/indexical values possibly being normative or convergent) if you have a way around the social/political problems.
I would like to pause AI; I’m not sure solving metaphilosophy is in reach (though I have no strong commitment that it isn’t); and I don’t know how to build a safe philosophically super-competent assistant/oracle, or for that matter a safe superintelligence of any type (except possibly at a very high alignment tax via one of Michael K. Cohen’s proposals), unless it is (effectively) an upload, in which case I at least have a vague plan.
Any specific readings or talks you can recommend on this topic?
I am trying to invent a (statistical learning) theory of meta-(online learning). I have not made very much progress yet, but there is a sketch here: https://www.lesswrong.com/posts/APP8cbeDaqhGjqH8X/paradigms-for-computation
The idea is based on “getting around” Shane Legg’s argument that there is no elegant universal learning algorithm by taking advantage of pretraining to increase the effective complexity of a simple learning algorithm: https://arxiv.org/abs/cs/0606070

I did some related preliminary experiments: https://www.lesswrong.com/posts/APP8cbeDaqhGjqH8X/paradigms-for-computation

The connection to SLT would look something like what @Lucius Bushnaq has been studying, except it should be the online learning algorithm that is learned: https://www.alignmentforum.org/posts/3ZBmKDpAJJahRM248/proof-idea-slt-to-ait
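A toy sketch of that flavor (illustrative only, not the construction from the linked post; the polynomial “experts”, the Hedge-style weighting, and the function names are assumptions made for this example): a deliberately simple online learner whose effective complexity comes entirely from a pretrained expert class.

```python
import numpy as np

rng = np.random.default_rng(0)

def pretrain_experts(offline_xs, offline_ys, degrees=(1, 2, 3)):
    """Pretraining phase: fit a few polynomial predictors on offline data.
    Each fitted predictor becomes one 'expert', so the simple online
    algorithm below inherits whatever structure pretraining extracted."""
    return [np.poly1d(np.polyfit(offline_xs, offline_ys, deg=d)) for d in degrees]

def hedge_online(experts, stream, eta=0.5):
    """A deliberately simple online learner (multiplicative weights / Hedge)
    over the pretrained experts. The algorithm is elementary; its effective
    complexity comes from the expert class supplied by pretraining."""
    w = np.ones(len(experts))
    total_loss = 0.0
    for x, y in stream:
        p = w / w.sum()
        preds = np.array([e(x) for e in experts])
        yhat = float(p @ preds)            # weighted prediction
        total_loss += (yhat - y) ** 2
        losses = (preds - y) ** 2          # per-expert squared loss
        w = w * np.exp(-eta * losses)      # multiplicative weight update
    return total_loss

# Offline "pretraining" data and an online stream from the same target function.
target = lambda x: 0.5 * x ** 2 - x + 1.0
xs = rng.uniform(-2, 2, 200)
experts = pretrain_experts(xs, target(xs) + 0.1 * rng.standard_normal(200))
stream = [(x, target(x)) for x in rng.uniform(-2, 2, 100)]
print("cumulative squared loss:", hedge_online(experts, stream))
```

Hedge itself is trivial; whatever competence the combined learner shows online comes from the experts supplied by pretraining, which is (roughly) the sense in which pretraining can raise the effective complexity of a simple learning algorithm.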
David Quarel and others at Timaeus presented on singular learning theory for reinforcement learning at ILIAD 2. I missed it (and their results don’t seem to be published yet). Ultimately, I want something like this but for online decision making = history-based RL.
Thanks for the suggested readings.
There are lots of ways to cash out “trying not to die”, many of which imply that solving AI alignment (or getting uploaded) isn’t even the most important thing. For instance under theories of modal or quantum immortality, dying is actually impossible. Or consider that most copies of you in the multiverse or universe are probably living in simulations of Earth rather than original physical entities, so the most important thing from a survival-defined-indexically perspective may be to figure out what the simulators want, or what’s least likely to cause them to want to turn off the simulation or most likely to “rescue” you after you die here. Or, why aim for a “perfectly aligned” AI instead of one that cares just enough about humans to keep us alive in a comfortable zoo after the Singularity (which they may already do by default because of acausal trade, or maybe the best way to ensure this is to increase the cosmic resources available to aligned AI so they can do more of this kind of trade)?
And because I don’t believe in “correct” values.
The above was in part trying to point out that even something like not wanting to die is very ill-defined, so if there are no correct values, not even relative to a person or a set of initial fuzzy non-preferences, then that’s actually a much more troubling situation than you seem to think.
I don’t know how to build a safe philosophically super-competent assistant/oracle
That’s in part why I’d want to attempt this only after a long pause (i.e. at least multiple decades) to develop the necessary ideas, and probably only after enhancing human intelligence.
To be clear, I’m trying to prevent AGI from killing everyone on earth, including but not limited to me personally.
There could be some reason (which I don’t fully understand and can’t prove) for subjective immortality, but that poorly understood possibility does not cause me to drive recklessly or stop caring about other X-risks. I suspect that any complications fail to change the basic logic that I don’t want myself or the rest of humanity to be placed in mortal danger, whether or not that danger subjectively results in death—it seems very likely to result in a loss of control.
A long pause with intelligence enhancement sounds great. I don’t think we can achieve a very long pause, because the governance requirements become increasingly demanding as compute gets cheaper. I view my emulation scheme as closely connected to intelligence enhancement: for instance, if you ran the emulation forward for only twenty seconds, you could use it as a biofeedback mechanism that helps you avoid bad reasoning steps by near-instantly predicting that they would soon be regretted (as long as this prediction target grounds out properly, which takes work).
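For concreteness, a minimal sketch of the kind of loop that suggests (purely hypothetical: `predicted_regret`, the twenty-second horizon, and the threshold are placeholders, and making the regret signal ground out properly is exactly the part that takes work):

```python
from typing import Callable, List

# Hypothetical cutoff for "this step would soon be regretted".
REGRET_THRESHOLD = 0.5

def biofeedback_filter(
    candidate_steps: List[str],
    predicted_regret: Callable[[str], float],
) -> List[str]:
    """Keep only reasoning steps that a short (~20 second) emulation run is
    not predicted to regret. `predicted_regret` is a placeholder for
    'run the emulation briefly and read off its judgment of this step'."""
    kept = []
    for step in candidate_steps:
        if predicted_regret(step) < REGRET_THRESHOLD:
            kept.append(step)
        # Otherwise, surface a near-instant warning instead of taking the step.
    return kept

if __name__ == "__main__":
    # Stub regret predictor, standing in for the emulation call.
    demo_regret = lambda step: 0.9 if "wishful" in step else 0.1
    steps = ["check the base rates", "assume it works out (wishful thinking)"]
    print(biofeedback_filter(steps, demo_regret))
```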