In theory arguments like these can sometimes be correct, but in practice perfect is often the enemy of the good.
The tricky part is that in the AGI alignment discourse, if you believe in runaway self-improvement feedback loops, there is no good. There is only perfect, or extinction. That might be a bit extreme, but we don’t really know that for sure either.
Note that one wrench current paradigms throw into this is that self-improvement processes would not look uniquely recursive, since all training algorithms already look a bit like “recursive self-improvement”. Instead, RSI effectively shows up as “oh no, the training curve bent differently on this run”, which is most likely to happen in open-world RL. But I agree, open-world RL can be suddenly surprising in capability growth, and there wouldn’t be much opportunity to notice the problem unless we’d already solved how to intentionally bound capabilities in RL.
There has been some interesting work on bounding capability growth in safe RL already, though. I haven’t looked closely at it; I wonder if any of it is particularly good.
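For concreteness, here is a minimal sketch of what “intentionally bounding capabilities in RL” could look like mechanically. This is purely hypothetical and not a description of the safe-RL work mentioned above; the cap, the jump threshold, and the fake eval curve are all made up for illustration.

```python
import random

# Hypothetical sketch: halt training when a capability eval exceeds a cap,
# or when the training curve "bends" upward faster than expected.
CAPABILITY_CAP = 0.8       # made-up upper bound on the allowed eval score
MAX_JUMP_PER_STEP = 0.04   # made-up bound on how fast the curve may rise per step

def capability_eval(step: int) -> float:
    """Stand-in for a real capability benchmark; here just a noisy fake curve."""
    return min(1.0, 0.01 * step + random.uniform(-0.02, 0.02))

def train_with_capability_bound(max_steps: int = 200) -> None:
    prev_score = 0.0
    for step in range(max_steps):
        # ... one RL update would happen here ...
        score = capability_eval(step)
        if score > CAPABILITY_CAP:
            print(f"step {step}: eval {score:.2f} exceeded cap {CAPABILITY_CAP}; halting")
            return
        if score - prev_score > MAX_JUMP_PER_STEP:
            print(f"step {step}: eval jumped {score - prev_score:.2f} in one step; halting")
            return
        prev_score = score
    print("finished training within the capability bound")

if __name__ == "__main__":
    train_with_capability_bound()
```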
Edit: note that I am in fact claiming that after MIRI deconfuses us, it’ll turn out to apply to ordinary gradient updates.
Au contraire, the perfect future doesn’t exist, but good ones do.
This isn’t about “perfect futures” though, but about perfect AGIs specifically. Consider a future that goes like this:
the AI’s presence and influence over us evolves exponentially according to a law d(AI)/dt = γ·AI,
the exponent γ expresses the amount of misalignment; if the AI is aligned and fully under our control, γ=0, otherwise γ>0,
then in that future, anything less than perfect alignment ends with us overwhelmed by the AI, sooner or later. This is super simplistic, but the essence is that if you keep around something really powerful that might just decide to kill you, you probably want to be damn sure it won’t. That’s what “perfect” here means; it’s not fine if it just wants to kill you a little bit. So if your logic is correct (and indeed, I do agree with you on general matters of ethics), then perhaps we just shouldn’t build AGI at all, because we can’t get it perfect, and if it’s not perfect, the balance between us and it will probably be too precarious to persist for long.
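Just to spell out the arithmetic of that toy law (nothing here goes beyond the model above): solving d(AI)/dt = γ·AI gives AI(t) = AI(0)·e^(γt), so with γ = 0 the AI’s influence stays constant, while with any γ > 0, however tiny, it grows without bound and eventually dominates; the only question misalignment leaves open is how long “sooner or later” takes, roughly on the order of 1/γ.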
Ah, I see more of what you mean. I agree an AI’s influence being small is unstable. And this means that the chance of death by AI being small is also unstable.
But I think the risk is one-time, not compounding over time. A high-influence AI might kill you, but if it doesn’t, you’ll probably live a long and healthy life (because of arguments like stability of values being a convergent instrumental goal). It’s not that once an AI becomes high-influence, there’s an exponential decay of humans, as if every day it made a new random mutation to its motivations.
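A toy way to see the difference being drawn here, with made-up numbers purely for illustration (neither probability is a claim about actual risk levels):

```python
# One-time risk vs. a compounding per-year hazard (illustrative numbers only).
one_time_risk = 0.10       # hypothetical chance the high-influence AI kills you, assessed once
per_year_hazard = 0.10     # hypothetical chance it "mutates" into killing you, each year
years = 50

p_survive_one_time = 1 - one_time_risk                   # stays 0.90 no matter how long you wait
p_survive_compounding = (1 - per_year_hazard) ** years    # ~0.005 after 50 years

print(f"one-time risk:    P(survive) = {p_survive_one_time:.3f}")
print(f"compounding risk: P(survive {years} years) = {p_survive_compounding:.3f}")
```

The claim here is that the real situation looks like the first line, not the second.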
I don’t think that’s necessarily true. There are two ways in which I think it can compound:
if the AGI self-upgrades, or designs more advanced AGIs, the problem repeats, and the AGI can make mistakes, same as us, though probably less obvious ones
it is possible to imagine an AGI that stays generally aligned but has a certain probability of being triggered into some runaway loop in which it loses its alignment. It will come up with pretty aligned solutions most of the time, but there is something, some kind of problem or situation, so out-of-domain that it sends it down an unrecoverable path of insanity, and we don’t know how or when that might occur.
Also, it might simply be probabilistic: any AGI that isn’t fully deterministic probably wouldn’t literally have no access to non-aligned strategies; it would merely assign them very small logits. So in theory there’s still a small but non-zero probability that it goes down some kind of “kill all humans” strategy path. And even if you interpret this as one-shot (did you align it right at creation or not?), the effects might not be visible right away.
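To put rough numbers on the “very small logits” point (everything here is made up for illustration; the logit values and the number of decisions are arbitrary, and independence between decisions is an unrealistic simplification):

```python
import math

# Hypothetical logits for an aligned strategy vs. a catastrophic one.
logits = {"aligned plan": 20.0, "kill all humans": 0.0}

# Softmax probability of sampling the catastrophic strategy on a single decision.
z = sum(math.exp(v) for v in logits.values())
eps = math.exp(logits["kill all humans"]) / z   # ~2e-9 per decision

# Even a tiny per-decision probability accumulates over many decisions
# (assuming, unrealistically, independent sampling each time).
n_decisions = 10_000_000
p_ever = 1 - (1 - eps) ** n_decisions

print(f"per-decision probability: {eps:.1e}")
print(f"chance of at least one catastrophic sample over {n_decisions:,} decisions: {p_ever:.1%}")
```

Which is just the quantitative version of “small but non-zero, and it compounds with the number of decisions the AGI makes”.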