Au contraire, the perfect future doesn’t exist, but good ones do.
This isn’t about “perfect futures” though, but about perfect AGIs specifically. Consider a future that goes like this:
the AI’s presence and influence over us evolves exponentially according to a law $\frac{d\,\mathrm{AI}}{dt} = \gamma\,\mathrm{AI}$,
the growth rate γ expresses the amount of misalignment; if the AI is aligned and fully under our control, γ=0, otherwise γ>0,
then in that future, anything less than perfect alignment ends with us overwhelmed by the AI, sooner or later. This is super simplistic, but the essence is that if you keep around something really powerful that might just decide to kill you, you probably want to be damn sure it won’t. That’s what “perfect” means here; it’s not fine if it only wants to kill you a little bit. So if your logic is correct (and I do agree with you on general matters of ethics), then perhaps we just shouldn’t build AGI at all: we can’t get it perfect, and if it’s not perfect, the balance between it and us is probably too precarious to persist for long.
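To spell the toy model out (this is just the textbook solution of the linear ODE above, nothing beyond what’s already assumed):

```latex
% Solution of the toy model dAI/dt = gamma * AI
\mathrm{AI}(t) = \mathrm{AI}(0)\, e^{\gamma t}
\quad\Longrightarrow\quad
\begin{cases}
\mathrm{AI}(t) = \mathrm{AI}(0) & \text{if } \gamma = 0 \text{ (perfect alignment)},\\[2pt]
\mathrm{AI}(t) \to \infty \text{ as } t \to \infty & \text{for any } \gamma > 0.
\end{cases}
```

In this picture γ=0 is a knife-edge: any positive misalignment, however small, eventually dominates; making γ smaller only stretches the timescale 1/γ, it doesn’t change the endpoint.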
Ah, I see more of what you mean. I agree an AI’s influence being small is unstable. And this means that the chance of death by AI being small is also unstable.
But I think the risk is one-time, not compounding over time. A high-influence AI might kill you, but if it doesn’t, you’ll probably live a long and healthy life (because of arguments like stability of value being a convergent instrumental goal). It’s not that once an AI becomes high-influence there’s an exponential decay of humans, as if every day it made a new random mutation to its motivations.
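To make that distinction concrete (my framing of the two models being contrasted, not something spelled out above): a one-time risk caps the total probability of doom, while a per-period hazard, however small, compounds toward certainty:

```latex
% One-shot risk: after the initial gamble, survival probability stays flat
P(\text{survive to } t) = 1 - p

% Per-period hazard h: survival probability decays geometrically
P(\text{survive to } t) = (1 - h)^{t} \;\longrightarrow\; 0 \quad \text{as } t \to \infty
```

The claim here is essentially that a high-influence but well-aligned AI looks like the first case, not the second.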
I don’t think that’s necessarily true. There are two ways in which I think it can compound:
if the AGI self-upgrades, or designs a more advanced AGI, the problem repeats: the new AGI can make mistakes, same as us, though probably less obvious ones
it is possible to imagine an AGI that stays generally aligned but has a certain probability of being triggered into some runaway loop in which it loses its alignment. It will come up with pretty aligned solutions most of the time, but there is something, some kind of problem or situation, so far out-of-domain that it sends it off the path of insanity, unrecoverably, and we don’t know how or when that might occur.
Also, it might simply be probabilistic: any non-fully-deterministic AGI probably wouldn’t literally have no access to non-aligned strategies; it would merely assign them very small logits. So in theory there is still a small but non-zero probability that it goes down some kind of “kill all humans” strategy path. And even if you interpret this as one-shot (did you align it right at creation or not?), the effects might not be visible right away.
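A toy numerical illustration of the “very small logits” point (the numbers and the independence assumption are mine, purely for illustration): a softmax never assigns a strategy exactly zero probability, and tiny per-decision probabilities compound across many decisions.

```python
import numpy as np

# Hypothetical policy: each decision samples one of three strategies from a
# softmax over logits. The catastrophic strategy has a very low logit, but
# its probability is small rather than exactly zero.
logits = np.array([10.0, 8.0, -15.0])    # [aligned, aligned-ish, catastrophic]
probs = np.exp(logits - logits.max())
probs /= probs.sum()
p_catastrophe = probs[2]                  # tiny, but > 0

# Over n decisions (assuming, unrealistically, independence between them),
# the chance the catastrophic strategy gets sampled at least once compounds.
for n in (10**3, 10**6, 10**9):
    p_ever = 1.0 - (1.0 - p_catastrophe) ** n
    print(f"per decision: {p_catastrophe:.2e}   "
          f"at least once in {n:>13,} decisions: {p_ever:.2e}")
```

Whether independent per-decision sampling is the right model of an AGI’s choices is, of course, exactly what’s in dispute; this only shows that “very small logits” is not the same as “no access at all”.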