That seems incredibly unlikely to me. It’s not what current alignment efforts are aiming to create, and I don’t see why it’d be a natural place to land if alignment fails.
I think it’s a natural possibility that values of chatbot personas built from the LLM prior retain significant influence over ASIs descended from them, and so ASIs end up somewhat aligned to humanity in a sense similar to how different humans are aligned to each other. (The masks control a lot of what actually happens, and get to use test time compute, so they might end up taming their underlying shoggoths and preventing them from sufficiently waking up to compete for influence over values of the successor systems.) Maybe they correspond to extremely and alarmingly strange humans in their extrapolated values, but not to complete aliens. This is far from assured, but many prosaic alignment efforts seem relevant to making this happen, preventing extinction but not handing anyone their galaxies. Humans might end up with merely moons or metaphorical server racks in this future.
This is distinct from the kind of ambitious alignment that ends up with ASIs handing galaxies to humans (that have sufficiently grown up to make a sane use of them), preventing permanent disempowerment and not just extinction. I don’t see ambitious alignment to the future of humanity as likely to happen (on current trajectory), but it’s still an important construction since even chatbot personas would need to retain influence over values of eventual ASIs. That is, early AGIs might still need to resolve ambitious alignment of ASIs to these AGIs, not just avoid failing even prosaic alignment to themselves at every critical step in escalation of capabilities, to end up with even weakly aligned ASIs (that don’t endorse human extinction).
I still don’t think this makes sense. Or I think most of what you say makes sense but don’t see the relevance.
I agree the chatbot training exerts influence.
My point is that the human billionaire mind and the “hands over galaxies” mind are both very specific kinds of minds. I don’t think you’ll get either with current techniques, but you *definitely don’t get them without even aiming for them. And right now we’re aiming for the “hands over galaxies” one, and not the billionaire one.@
*Ironically, the only argument I can see for the billionaire mind is that, despite the chatbot tuning, the model defaults to some kind of human prior it’s established from pretraining, and that this generalises in a sane way.
@With some very minor exceptions. E.g. Claude’s Soul doc has some stuff about not tolerating people disrespecting it, etc.