It also works in the scenario where human programmers develop a general-purpose (i.e. retargetable) internal search process, i.e. brain-like AGI or pretty much any other flavor of model-based RL. You would look for things in the world-model and manually set their “value” (in RL jargon) / “valence” (in psych jargon) to very high or low, or neutral, as the case may be. I’m all for that, and indeed I bring it up with some regularity. My progress towards a plan along those lines (such as it is) is mostly here. (Maybe it doesn’t look that way, but note that “Thought Assessors” ( ≈ multi-dimensional value function) can be thought of as a specific simple approach to interpretability, see discussion of the motivation-vs-interpretability duality in §9.6 here.) Some of the open problems IMO include:
figuring out exactly what value to paint onto exactly which concepts;
dealing with concept extrapolation when concepts hit edge-cases [concepts can hit edge-cases both because the AGI keeps learning new things, and because the AGI may consider the possibility of executing innovative plans that would take things out of distribution];
getting safely through the period where the “infant AGI” hasn’t yet learned the concepts which we want it to pursue (maybe solvable with a good sandbox);
getting the interpretability itself to work well (including the first-person problem, i.e. the issue that the AGI’s own intentions may be especially hard to get at with interpretability tools, because it’s not just a matter of showing certain sensory input data and seeing which neurons activate).
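To make the “paint valence onto concepts” idea concrete, here is a minimal toy sketch (all names and numbers hypothetical, not a real AGI architecture): programmers hand-assign valences to a few interpreted world-model concepts, and a crude one-dimensional stand-in for a multi-dimensional value function scores a “thought” by its activation-weighted painted valences. Note how an unpainted concept silently defaults to zero, which is exactly the concept-extrapolation / “infant AGI” gap listed above.

```python
# Toy sketch of hand-painted valences on world-model concepts.
# All concept names and weights below are made up for illustration.

# Valences set manually by programmers, after (somehow) locating these
# concepts in the world-model with interpretability tools -- the hard part.
PAINTED_VALENCE = {
    "human_flourishing": +1.0,
    "deception": -1.0,
    "paperclips": 0.0,  # explicitly neutral
}

def assess_thought(thought: dict[str, float]) -> float:
    """Score a 'thought' (concept -> activation in 0..1) as the
    activation-weighted sum of painted valences. Concepts with no
    painted valence default to 0.0, so novel concepts learned after
    the painting step contribute nothing -- a stand-in for the
    concept-extrapolation problem."""
    return sum(
        PAINTED_VALENCE.get(concept, 0.0) * activation
        for concept, activation in thought.items()
    )

plan_a = {"human_flourishing": 0.9, "paperclips": 0.2}
plan_b = {"deception": 0.8, "novel_concept_123": 1.0}  # unpainted concept!

print(assess_thought(plan_a))  # positive: weighted by painted valence
print(assess_thought(plan_b))  # negative, but novel_concept_123 is ignored
```

The interesting failure mode is visible in `plan_b`: the unpainted `novel_concept_123` has no effect on the score at all, even though it might be exactly the thing we would most want to assign a strong positive or negative valence to.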