I wasn’t claiming that there’ll be an explicit OR gate, just something functionally equivalent to it.
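A toy sketch of what "functionally equivalent to an OR gate" could look like inside a network (the `soft_or` function and its `gain` parameter are illustrative assumptions, not something from this discussion): a steep sigmoid over the sum of two near-binary activations behaves like OR, without any explicit logic gate in the architecture.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def soft_or(a, b, gain=20.0):
    """Smooth OR: saturates near 1 if either input is near 1.

    The steep sigmoid (large gain) makes the output effectively
    binary on near-binary inputs, while remaining differentiable.
    """
    return sigmoid(gain * (a + b) - 0.5 * gain)

print(soft_or(0.0, 0.0))  # near 0
print(soft_or(1.0, 0.0))  # near 1
print(soft_or(1.0, 1.0))  # near 1
```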
Sure, we’re on the same page here. I think by “There’s still a gradient signal to change the OR gate” you mean exactly what I meant when I said “that would just be passing the buck to the output of that OR”.
I’m not sure I understand 2 and 3. The activations are in practice discrete (e.g. represented by 32 bits), and so the subnetworks can be designed such that they never output values within the range [0.01,0.99] (if that’s important/useful for the mechanism to work).
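To illustrate the point about discrete/saturated outputs (the function names here are hypothetical, chosen for the sketch): an output stage that only emits values on two plateaus never lands inside [0.01, 0.99], and a small perturbation of its pre-activation produces no change in output, i.e. there is no local gradient signal through it.

```python
import numpy as np

def saturated_output(z, lo=0.0, hi=1.0):
    """Output stage that only emits two plateau values, never
    anything strictly between them."""
    return np.where(z >= 0.0, hi, lo)

def numerical_grad(f, z, eps=1e-4):
    """Central-difference estimate of df/dz."""
    return (f(z + eps) - f(z - eps)) / (2 * eps)

# On a plateau, nudging z does not change the output at all,
# so the (numerical) gradient is exactly zero.
print(float(numerical_grad(saturated_output, 0.5)))  # 0.0
```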
It’s non-obvious that agents will have anywhere near enough control over their internal functioning to set up such systems. Have you ever tried implementing two novel independent identical submodules in your brain?
Humans can’t control their brain at the level of abstraction of neurons (by thinking alone), but at a higher level of abstraction they do have some control that can be useful. For example, consider a human in a Newcomb’s problem who decides to 1-box. Arguably, they reason in a certain way in order to make their brain have a certain property (namely, being a brain that decides to 1-box in a Newcomb’s problem).
(Independence is very tricky because they’re part of the same plan, and so a change in your underlying motivation to pursue that plan affects both).
Perhaps I shouldn’t have used the word “independent”; I just meant that the output of one subnetwork does not affect the output of the other (during any given inference).
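A minimal sketch of this weaker sense of "independent" (the toy subnetworks and weights below are assumptions for illustration): both subnetworks read the same input, neither's output feeds into the other's computation, and they only meet at the final OR-like combination.

```python
import numpy as np

def subnet_a(x, w_a):
    """Toy subnetwork A: thresholded tanh unit."""
    return float(np.tanh(w_a * x) > 0)

def subnet_b(x, w_b):
    """Toy subnetwork B: computed without reference to subnet A."""
    return float(np.tanh(w_b * x) > 0)

def forward(x, w_a=1.0, w_b=-1.0):
    # The two outputs are computed in parallel from the same input;
    # within a single inference, neither affects the other.
    a, b = subnet_a(x, w_a), subnet_b(x, w_b)
    return max(a, b)  # OR of the two binary outputs

print(forward(2.0))   # 1.0 (subnet A fires, subnet B does not)
print(forward(-2.0))  # 1.0 (subnet B fires, subnet A does not)
```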