What if “friendly/unfriendly” GAI isn’t a thing?

A quick sketch of thoughts. Epistemic status: pure speculation.

  • Recent AI developments are impressive

    • Both text and images this week

    • If I don’t hesitate to say my dog is thinking, I can’t really deny that these things are.

  • Still, there are two important ways these differ from human intelligence.

    • How they learn

      • Automatic differentiation uses a retrospective god’s-eye view that I think is probably importantly different from how humans learn. It certainly seems to require substantially more training data for equivalent performance.

    • What they want

      • They optimize for easily-measurable functions. In most cases this is something like “predict this input” or “fool this discriminator into thinking your output is indistinguishable from this input” (see the first sketch after this list).

      • This is subject to Goodhart’s law.

  • Eventually, these differences will probably be overcome.

    • Learning and acting will become more integrated.

    • Motivation will become more realistic.

      • I suspect this will be through some GAN-like mechanism. That is: train a simpler “feelings” network to predict “good things are coming” from inputs that include some simplified view of the actor network’s internals, then train the more complex “actor” network to optimize for making the “feelings” network happy (see the second sketch after this list).

  • Note that the kind of solution I suggest above could lead to a “neurotic” AI

    • Very smart “actor” network, but a “feelings” network that is not as smart and is not pointed towards consistently optimizing over world-states.

    • The “actor” is optimizing for “feelings” now, while “feelings” is predicting world-states later; even if the “actor” is smart enough to understand that disconnect, it could still have something like pseudo-akrasia (poor term?) about directly optimizing for world-states (or just feelings) later. A simple example: it could be smart enough to wire-head in a way that would increase its own “happiness”, but still not do so because the process of establishing the wire-heading would not itself be “fun”.

  • What if that’s a general truth? That is, any intelligence is either subject to Goodhart’s Law in a way so intrinsic/fundamental that it undermines its learning/effectiveness in the real world, OR it’s subject to “neuroses” in the sense that it’s not consistently optimizing over world-states?

    • Thus, either it’s not really GAI, or it’s not really friendly/unfriendly.

    • GA(super)I would still be scarily powerful. But it wouldn’t be an automatic “game over” for any less-intelligent optimizers around it. They might still be able to “negotiate” “win-win” accommodations by nudging (“manipulating”?) the AI to different local optima of its “feelings”. And the “actor” part of the AI would be aware that they were doing so but not necessarily directly motivated to stop them.
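
For concreteness, both of the objectives mentioned earlier really are just differentiable scalars computed over recorded data, with no term that refers to the world-states the system’s outputs eventually cause. Here is a minimal PyTorch illustration; the tensors are random stand-ins rather than any real model, so treat it as a sketch of the shape of the objectives and nothing more.

```python
# Minimal illustration of the two "easily-measurable" objectives mentioned above.
# All tensors are random stand-ins; no real model or data is involved.
import torch
import torch.nn.functional as F

# (a) "Predict this input": next-token prediction, scored as cross-entropy
#     against tokens that already exist in the training data.
logits = torch.randn(32, 50_000)            # stand-in model outputs over a 50k-token vocab
targets = torch.randint(0, 50_000, (32,))   # the tokens the text actually contained
predict_loss = F.cross_entropy(logits, targets)

# (b) "Fool the discriminator": the standard GAN generator objective, i.e.
#     push the discriminator's "this is real" score towards 1 on fakes.
d_scores_on_fakes = torch.rand(32, 1)       # stand-in discriminator outputs in (0, 1)
fool_loss = F.binary_cross_entropy(d_scores_on_fakes,
                                   torch.ones_like(d_scores_on_fakes))

print(predict_loss.item(), fool_loss.item())
```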

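And here is a minimal sketch of the “feelings”/“actor” mechanism itself, again in PyTorch and again entirely hypothetical: the class names, layer sizes, toy observations, and stand-in “future goodness” signal are all made up, and the actions don’t feed back into anything. The point is just the shape of the training loop: a small “feelings” network learns to predict a measurable proxy for “good things are coming” from the observation plus a coarse summary of the actor’s internals, and the larger “actor” network is then trained to raise that prediction rather than to optimize world-states directly.

```python
# Hypothetical sketch only: class names, sizes, the toy observations, and the
# stand-in "future goodness" signal are invented for illustration.
import torch
import torch.nn as nn

class FeelingsNet(nn.Module):
    """Small network: predicts "good things are coming" from the current
    observation plus a coarse summary of the actor's internals."""
    def __init__(self, obs_dim, summary_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim + summary_dim, 16),
                                 nn.ReLU(),
                                 nn.Linear(16, 1))

    def forward(self, obs, summary):
        return self.net(torch.cat([obs, summary], dim=-1))

class ActorNet(nn.Module):
    """Larger network: picks actions and exposes a simplified view of its
    hidden state for the feelings network to read."""
    def __init__(self, obs_dim, act_dim, summary_dim):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU())
        self.policy = nn.Linear(64, act_dim)
        self.summarize = nn.Linear(64, summary_dim)

    def forward(self, obs):
        hidden = self.body(obs)
        return self.policy(hidden), self.summarize(hidden)

obs_dim, act_dim, summary_dim = 8, 2, 4
feelings = FeelingsNet(obs_dim, summary_dim)
actor = ActorNet(obs_dim, act_dim, summary_dim)
opt_feelings = torch.optim.Adam(feelings.parameters(), lr=1e-3)
opt_actor = torch.optim.Adam(actor.parameters(), lr=1e-3)

for step in range(1000):
    obs = torch.randn(32, obs_dim)          # stand-in observations
    # Stand-in for "how well things actually go later". In a real agent this
    # would come from the environment, downstream of the actor's actions.
    future_goodness = obs.sum(dim=-1, keepdim=True)

    _actions, summary = actor(obs)          # actions unused in this static toy

    # 1) Train "feelings" to predict future goodness from obs + the actor's
    #    self-summary. This is the measurable proxy, i.e. the Goodhart-prone part.
    opt_feelings.zero_grad()
    prediction = feelings(obs, summary.detach())
    nn.functional.mse_loss(prediction, future_goodness).backward()
    opt_feelings.step()

    # 2) Train the "actor" to make "feelings" happy, i.e. to raise its predicted
    #    goodness, not to optimize world-states directly.
    #    (opt_actor only ever updates the actor's parameters.)
    opt_actor.zero_grad()
    happiness = feelings(obs, summary).mean()
    (-happiness).backward()
    opt_actor.step()
```

The alternation is deliberately GAN-like: a proxy-learner and a proxy-optimizer trained against each other, which is exactly where the Goodhart pressure, and the gap between predicted and actual goodness behind the “neurotic” worry above, would creep in.
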
Obviously, the above is unfinished, and has lots of holes. But I think it’s still original and thought-provoking enough to share. Furthermore, I think that making it less sketchy would probably make it worse; that is, many of the ideas y’all (collectively) have to fill in the holes here are going to be better than the ones I’d have.