Charlie Steiner comments on Against Almost Every Theory of Impact of Interpretability

Charlie Steiner 18 Aug 2023 11:09 UTC
2 points
0
Ah, I see more of what you mean. I agree an AI’s influence being small is unstable. And this means that the chance of death by AI being small is also unstable.
But I think the risk is one-time, not compounding over time. A high-influence AI might kill you, but if it doesn’t, you’ll probably live a long and healthy life (because of arguments like stability of value being a convergent instrumental goal). It’s not that once an AI becomes high-influence, there’s an exponential decay of humans, as every day it makes a new random mutation to its motivations.
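To make the one-time versus compounding distinction concrete, here is a minimal sketch. The function names and the numbers are illustrative assumptions, not estimates from this thread. Under the one-time picture, the chance of survival does not depend on how long you wait. Under the compounding picture, it decays exponentially with time.

```python
def survival_one_time(p_catastrophe: float, days: int) -> float:
    """Risk is realized once (e.g. when the AI becomes high-influence).

    `days` is accepted only to mirror the compounding case; it has no
    effect, which is the whole point of the one-time picture.
    """
    return 1.0 - p_catastrophe


def survival_compounding(p_daily: float, days: int) -> float:
    """An independent hazard `p_daily` is faced on every single day."""
    return (1.0 - p_daily) ** days


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to show the shape of each curve.
    for days in (1, 365, 3650):
        one_time = survival_one_time(0.1, days)
        compounding = survival_compounding(0.001, days)
        print(f"{days:>5} days: one-time={one_time:.3f}  compounding={compounding:.3f}")
```

The one-time curve is flat, while the compounding curve tends to zero however small the daily hazard is, as long as it is not exactly zero.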
- dr_s 18 Aug 2023 11:42 UTC
  2 points
  0
  I don’t think that’s necessarily true. There are two ways in which I think it can compound:
  1. If the AGI self-upgrades or designs a more advanced AGI, the same problem repeats, and the AGI can make mistakes just as we can, though probably less obvious ones.
  2. It is possible to imagine an AGI that stays generally aligned but has a certain probability of being triggered into some runaway loop in which it loses its alignment. It might come up with pretty aligned solutions most of the time, but some kind of problem or situation is so out-of-domain that it sends it down the path of insanity, unrecoverably, and we don’t know how or when that might occur.
  Also, it might simply be probabilistic: any AGI that isn’t fully deterministic probably wouldn’t literally have no access to non-aligned strategies, but would merely assign them very low logits. So in theory there’s still a small but non-zero probability that it goes down some kind of “kill all humans” strategy path. And even if you interpret this as one-shot (did you align it right or not on creation?), the effects might not be visible right away.
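
As a rough illustration of the logit point, the sketch below turns a very low logit into a probability with a softmax, then asks how likely it is that the unwanted option is sampled at least once over many independent decisions. The two-option setup, the logit values, and the independence assumption are all hypothetical simplifications, not claims about how any real AGI would work.

```python
import math


def softmax(logits):
    """Convert logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def prob_at_least_once(p: float, n: int) -> float:
    """Chance that an event of probability p occurs at least once in n
    independent trials, i.e. 1 - (1 - p)**n, computed stably for tiny p."""
    if p >= 1.0:
        return 1.0
    return -math.expm1(n * math.log1p(-p))


if __name__ == "__main__":
    # Hypothetical logits: index 0 = aligned strategy, index 1 = unaligned one.
    p_unaligned = softmax([0.0, -20.0])[1]
    print(f"per-decision probability: {p_unaligned:.3e}")
    for n in (10**3, 10**6, 10**9):
        print(f"{n:>10} decisions: {prob_at_least_once(p_unaligned, n):.3e}")
```

A per-decision probability that looks negligible stops being negligible once the number of decisions is comparable to its inverse, which is one way the risk could compound rather than being one-time.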