Nice post!
I really don’t know much about frontal lobotomy patients. I’ll irresponsibly speculate anyway.
I think “figuring out the solution to tricky questions” has a lot in common with “getting something tricky done in the real world”, despite the fact that one involves “internal” actions (i.e., thinking the appropriate thoughts) and the other involves “external” actions (i.e., moving the appropriate muscles). I think they both require the same package of goal-oriented planning, trial-and-error exploration via RL, and so on. (See discussion of “RL-on-thoughts” here.) By contrast, querying existing knowledge doesn’t require any of that: as an adult, if you see a rubber ball falling, you instinctively expect it to bounce, and I claim that the algorithm forming that expectation does not require or involve RL. I would speculate that frontal lobotomy patients lose their ability to BOTH “figure out the solution to tricky questions” AND “get something tricky done in the real world”, because the frontal lobotomy procedure screws with their RL systems. But their existing knowledge can still be queried: they’ll still expect the ball to bounce.
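Here’s a toy Python sketch of the distinction I have in mind. It’s purely illustrative (the names like `WorldModel` and `rl_style_search` are made up, not any real library): querying a learned predictive model touches nothing reward-related, whereas the “figuring things out” mode is a trial-and-error loop over candidate thoughts that can’t even run without some reward/value signal to guide it.

```python
import random

# Toy illustration (my own, hypothetical names throughout): querying
# existing knowledge vs. RL-style trial-and-error over "thoughts".

class WorldModel:
    """Stands in for self-supervised predictive knowledge ("the ball will bounce")."""
    def predict(self, observation: str) -> str:
        # A real system would run a learned predictor; here it's just a lookup.
        knowledge = {"rubber ball falling": "ball bounces"}
        return knowledge.get(observation, "no prediction")

def query_knowledge(model: WorldModel, observation: str) -> str:
    # Pure feed-forward query: no reward signal, no search, no RL.
    return model.predict(observation)

def rl_style_search(candidate_thoughts, reward_fn, n_trials=100):
    # Crude stand-in for "RL-on-thoughts": propose thoughts, keep whichever
    # one a learned reward/value signal scores highest.
    best, best_reward = None, float("-inf")
    for _ in range(n_trials):
        thought = random.choice(candidate_thoughts)
        r = reward_fn(thought)
        if r > best_reward:
            best, best_reward = thought, r
    return best

if __name__ == "__main__":
    model = WorldModel()
    print(query_knowledge(model, "rubber ball falling"))    # works with "RL turned off"
    print(rl_style_search(["idea A", "idea B", "idea C"],
                          reward_fn=lambda t: len(t)))      # needs the reward loop
```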
(If there are historical cases of people getting a frontal lobotomy and then proving a new math theorem or whatever, I would be very surprised and intrigued.)
It’s hard to compare this idea to, say, a self-supervised language model, because the latter has never had any RL system in the first place. (See also here.)
If we did have an agential AI that combined RL with self-supervised learning in a brain-like way, and if that AI had already acquired the knowledge and concepts of how to make nanobots or solve alignment or whatever, then yeah, maybe “turning off the RL part” would be a (probably?) safe way to extract that knowledge, and I would think of that as maybe a bit like giving the AI a frontal lobotomy. But my concern is that this story picks up after the really dangerous part. In other words, I think the AI needs to be acting agentially and using RL during the course of figuring out how to make nanobots or solve alignment or whatever, and that’s exactly when it could get out of control. That problem wouldn’t be solved by “turn off RL”: turning off RL would prevent the AI from figuring out the things we want it to figure out in the first place.
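To make that concern concrete, here’s one more toy sketch (again purely illustrative, with made-up names; nothing here corresponds to a real system): the RL “figuring-out” loop is what puts new knowledge into the self-supervised store in the first place, so switching RL off afterwards still lets you extract knowledge, but switching it off beforehand leaves nothing to extract.

```python
# Toy illustration (hypothetical architecture and names): a hybrid agent
# whose RL loop is the only thing that fills its knowledge store.

class HybridAgent:
    def __init__(self, rl_enabled=True):
        self.rl_enabled = rl_enabled
        self.knowledge = {}          # stands in for the self-supervised world model

    def figure_out(self, problem, candidate_ideas, score):
        """RL-style trial-and-error search; this is the dangerous, agential phase."""
        if not self.rl_enabled:
            return None                            # no RL -> no new figuring-out
        best = max(candidate_ideas, key=score)     # crude stand-in for an RL loop
        self.knowledge[problem] = best             # the result becomes queryable knowledge
        return best

    def query(self, problem):
        """Read out existing knowledge; works even with RL switched off."""
        return self.knowledge.get(problem, "I never figured that out")


if __name__ == "__main__":
    agent = HybridAgent(rl_enabled=True)
    agent.figure_out("nanobot design", ["idea A", "idea BB", "idea CCC"], score=len)

    agent.rl_enabled = False                       # "lobotomize" after the fact
    print(agent.query("nanobot design"))           # knowledge extraction still works

    lobotomized = HybridAgent(rl_enabled=False)    # RL off from the start
    print(lobotomized.figure_out("solve alignment", ["idea A"], score=len))  # None
    print(lobotomized.query("solve alignment"))    # "I never figured that out"
```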