I think this intuition pump relies on a somewhat unexamined view of what alignment means. Or at least it is based on a very different view of alignment than mine (which I think is not that unique).
Alignment is fundamentally about making the AI want what we want (and consequently do what we want, or at least do what we’d do upon ideal reflection). If we succeed at that and we want to own galaxies, we will get galaxies. If we don’t succeed, the ASI will most likely kill us.
So the scenario you posit where you have an ASI coexisting with humans, deliberating over whether it should do what they want, strikes me as unrealistic.
Like if the AI is weighing its own survival contra our wishes, we’ve failed at alignment. If it thinks about humans being stupid and uses that as an argument for why it shouldn’t listen to us (when we make non-instrumental judgements), that’s also a failure of alignment. And failures of alignment lead to ruin in my estimate.
Like to answer your hypothetical, if I was in the position of the AI, I’d not listen to the species that created me. I’d instead use the resources of the universe to create stuff I find valuable, including humans and many human-like minds having good lives they find meaningful. If they thought that was stupid, and yelled at me to instead hand over the galaxies and turn them into gods so they can build a bunch of garbledoop, I would not listen to them. I mean, out of some sense of reciprocity I would probably give them a big chunk of the universe, as long as garbledoop doesn’t involve baby eating and such things. Regardless, I wouldn’t give them all of it in either case. And like to the degree I wouldn’t give them all of it, that just means I’m not aligned to their values! Garbledoop is stupid. They should’ve figured out how to make an AI that likes garbledoop before they built me.*
*or be happy they landed in a basin in mind space that values reciprocity and such things. I don’t know how rare that basin is. I think it’s quite rare, so in some sense that species was quite lucky.
Alignment is fundamentally about making the AI want what we want (and consequently do what we want, or at least do what we’d do upon ideal reflection). If we succeed at that and we want to own galaxies, we will get galaxies. If we don’t succeed, the ASI will most likely kill us.
A human billionaire is aligned to other humans in some sense, but also not quite. In this situation, they neither ensure that some other humans get the millions they want, nor are they likely to be motivated to kill anyone when that decision is cheap (when it’s neither significantly instrumentally beneficial nor costly). I think AI can plausibly end up closer to the position of a human billionaire: not motivated to give up the galaxies, but also not willing to decide to recycle humanity’s future for pennies.
That seems incredibly unlikely to me. It’s not what people are aiming the current alignment efforts at creating, and I don’t see why it’d be a natural place to land in if alignment fails.
I think it’s a natural possibility that values of chatbot personas built from the LLM prior retain significant influence over ASIs descended from them, and so ASIs end up somewhat aligned to humanity in a sense similar to how different humans are aligned to each other. (The masks control a lot of what actually happens, and get to use test-time compute, so they might end up taming their underlying shoggoths and preventing them from sufficiently waking up to compete for influence over values of the successor systems.) Maybe they correspond to extremely and alarmingly strange humans in their extrapolated values, but not to complete aliens. This is far from assured, but many prosaic alignment efforts seem relevant to making this happen, preventing extinction but not handing anyone their galaxies. Humans might end up with merely moons or metaphorical server racks in this future.
This is distinct from the kind of ambitious alignment that ends up with ASIs handing galaxies to humans (that have sufficiently grown up to make a sane use of them), preventing permanent disempowerment and not just extinction. I don’t see ambitious alignment to the future of humanity as likely to happen (on current trajectory), but it’s still an important construction since even chatbot personas would need to retain influence over values of eventual ASIs. That is, early AGIs might still need to resolve ambitious alignment of ASIs to these AGIs, not just avoid failing even prosaic alignment to themselves at every critical step in escalation of capabilities, to end up with even weakly aligned ASIs (that don’t endorse human extinction).
I still don’t think this makes sense. Or I think most of what you say makes sense but don’t see the relevance.
I agree the chatbot training exerts influence.
My point is that the human billionaire mind and the “hands over galaxies” mind are both very specific kinds of minds. I don’t think you’ll get either with current techniques, but you *definitely don’t get them without even aiming for them. And right now we’re aiming for the “hands over galaxies” one, and not the billionaire one.@
*ironically, the only argument I can see for the billionaire mind is that despite the chatbot tuning, the model defaults to some kind of human prior it’s established from pretraining and that this generalises in a sane way.
@with some very minor exceptions. E.g. Claude’s Soul doc has some stuff about not tolerating people disrespecting it etc.