Note on algorithms with multiple trained components

Example 1: consider a GAN. There’s a generator and a discriminator. As an intuitive mnemonic, we can say

  • The “purpose” of the generator is to trick the discriminator,

  • The “purpose” of the discriminator is to not get tricked by the generator.

(Relatedly, people will say “the generator is trained to trick the discriminator”, etc.)

…But (I hope) everyone knows that these bullet points are only a mnemonic.

The one and only real “purpose” of the whole system and everything in it is to generate cool images that we like, and get our papers into NeurIPS or whatever.

And indeed, I think everyone who uses GANs is aware that it’s possible for a programmer to make the discriminator “better” (when narrowly viewed as having a “purpose” of not getting tricked by the generator), but with the direct result of making the whole system worse at generating cool images. For example, if there were a code-change that made the discriminator perfect at discriminating, then there would be no gradient for training the generator, and the whole system would be useless.
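To make the perfect-discriminator point concrete, here’s a tiny PyTorch sketch (my own toy illustration, with made-up shapes and numbers, not any real GAN codebase): if the discriminator is hand-built to be saturated-confident that everything the untrained generator produces is fake, the generator’s loss gradient comes out as essentially zero, so the “better” discriminator leaves the generator with nothing to learn from.

```python
import torch
import torch.nn as nn

# Toy 1-D "GAN": the generator maps 4-D noise to a scalar "image", and the
# discriminator is a single logistic unit. Everything here is hypothetical,
# chosen only to illustrate the vanishing-gradient point.
G = nn.Linear(4, 1)
D = nn.Sequential(nn.Linear(1, 1), nn.Sigmoid())

# Hand-set the discriminator to be (near-)perfect on anything the untrained
# generator can emit: the sigmoid saturates, outputting probability ~0 ("fake").
with torch.no_grad():
    D[0].weight.fill_(1000.0)
    D[0].bias.fill_(-5000.0)

z = torch.randn(64, 4)
fake = G(z)
loss = torch.log(1.0 - D(fake)).mean()  # original minimax generator loss
loss.backward()

# The saturated sigmoid passes (essentially) zero gradient back through the
# discriminator, so the generator gets no training signal at all.
print(G.weight.grad.abs().max())  # ~0
```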

So we shouldn’t take those bullet-point mnemonics too literally.

Example 2: In actor-critic RL, people sometimes say:

  • The “purpose” of the value function is to approximate future rewards [or discounted sum of future reward, or whatever].

…But that’s also just a mnemonic. The one and only real “purpose” of the whole RL system (of which the value function is just one part) is that it does whatever we want the RL system to do, e.g. win at chess, get our papers into NeurIPS, build us a luxury gay space communist utopia, etc.

So it’s at least conceivable that some algorithmic change would make the value function into a better approximation of the discounted sum of future rewards, yet make the RL agent worse at doing things that we want it to do.

Actually, this particular example is not merely “conceivable”, but expected, thanks to wireheading. If the value function is used to assess which plans are good versus bad, and the value function is a perfect approximation of expected future reward, then you’re almost guaranteed to get an AI that is trying to wirehead.
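Here’s a deliberately cartoonish sketch of that argument (all the plans and numbers are invented for illustration): if plans are chosen by whichever one the value function scores highest, then a value function that perfectly predicts future reward will pick the plan that directly maxes out the reward signal.

```python
# Hypothetical plan menu with the (ground-truth) future reward each plan
# would actually produce. "wirehead" pegs the reward channel at its maximum.
plans = {
    "win at chess":         10.0,
    "write NeurIPS paper":   8.0,
    "wirehead":      1_000_000.0,
}

def perfect_value(plan):
    # A "perfect approximation of expected future reward".
    return plans[plan]

def miscalibrated_value(plan):
    # A value function that (inaccurately) predicts little reward from wireheading.
    return 0.0 if plan == "wirehead" else plans[plan]

print(max(plans, key=perfect_value))        # -> "wirehead"
print(max(plans, key=miscalibrated_value))  # -> "win at chess"
```

The second, less accurate value function is the one that behaves the way we actually want.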

(I myself am a model-based RL agent (I claim), and I don’t want to wirehead, and I claim that this is directly related to my internal value function issuing very inaccurate predictions of the future reward associated with wireheading. Details in footnote.[1])

So anyway, I expect our future AGIs to have a value function that gets updated by TD learning (or some other update rule). And if they do, I expect to occasionally casually say things like “The purpose of these weight-updates is to make the value function into a better and better approximation of expected future reward”. But if I say that, please be aware that I am using the word “purpose” as a mnemonic, not to be taken too literally.
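For concreteness, here’s roughly what that kind of update looks like in the simplest tabular case (a toy TD(0) sketch on a made-up 5-state chain, not a claim about any particular AGI design): each update nudges the value estimate toward reward-plus-discounted-next-value, i.e. toward a better approximation of expected future reward.

```python
# Tabular TD(0) on a hypothetical 5-state chain: the agent walks right and
# gets reward 1 on reaching the terminal state. Purely illustrative numbers.
gamma = 0.9   # discount factor
alpha = 0.1   # learning rate
V = {s: 0.0 for s in range(5)}  # value estimates, initialized to zero

def step(state):
    """Toy dynamics: move one state to the right; reward only at the end."""
    next_state = min(state + 1, 4)
    reward = 1.0 if next_state == 4 else 0.0
    return next_state, reward, next_state == 4

for episode in range(1000):
    s, done = 0, False
    while not done:
        s_next, r, done = step(s)
        target = r if done else r + gamma * V[s_next]
        V[s] += alpha * (target - V[s])  # nudge V(s) toward the TD target
        s = s_next

print(V)  # V[s] converges to gamma**(3 - s) for s in 0..3 (discounted future reward)
```

The update rule is the whole point here; whether making the value function a more and more faithful approximation of future reward is actually good for the overall system is exactly the question at hand.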

As a particular example, I often hear the claim that as RL algorithms get more and more “powerful” and “advanced” in the future, we can feel more and more confident making claims like “The value function is an extremely accurate approximation of expected future reward”. Well, I disagree! That’s not necessarily what makes an RL algorithm more “advanced”, and it’s not necessarily what future programmers will be trying to do! Indeed, when future programmers are fiddling with architectures, hyperparameters, training environments, and so on, they may sometimes go out of their way to try to make the value function worse at accurately approximating the expected future reward! (In other words, future programmers may go out of their way to try to ensure that the value function training process does not converge to the global “optimum”.)

General takeaway: An ML algorithm can have multiple parts which we can describe mnemonically as having a “purpose” related to how they’re updated (e.g. by gradient descent), but we shouldn’t take those “purposes” too literally.

From the perspective of a machine designer, the one and only true purpose of every gear in a machine is that the whole machine works well. Anything else is just a convenient imperfect approximation / mnemonic.

  1.

    I have an intellectual expectation that if I installed an electrode in a particular part of my brain and spent all day stimulating it, this would feel (at the time) like an extremely important and valuable thing to do. But that intellectual expectation in my brain has not propagated into a visceral expectation, i.e. the kind of expectation that would make me feel a craving to actually go implant an electrode in my brain right now.

    If I actually implant the electrode next week and start stimulating it, then my visceral expectation would update to synchronize with my (more accurate) intellectual expectation. In plain language, I would get addicted.

    I claim that we should describe this situation in model-based RL terms. The “intellectual expectation” is coming from my world-model, and the “visceral expectations” (including valence) are coming from my RL value function. And currently my brain’s value function is a very poor approximation of expected future rewards, with regard to this wireheading plan. Yet making it into a better approximation is a bad thing that I’d like to avoid. There is no mechanism in the brain that enforces perfect consistency between intellectual (world-model) expectations and visceral (value-function) expectations, and I’m happy for it to be that way, and I would make an AGI that way too.