It also works in the scenario where human programmers develop a general-purpose (i.e. retargetable) internal search process, i.e. brain-like AGI or pretty much any other flavor of model-based RL. You would look for things in the world-model and manually set their “value” (in RL jargon) / “valence” (in psych jargon) to very high or low, or neutral, as the case may be. I’m all for that, and indeed I bring it up with some regularity. My progress towards a plan along those lines (such as it is) is mostly here. (Maybe it doesn’t look that way, but note that “Thought Assessors” ( ≈ multi-dimensional value function) can be thought of as a specific simple approach to interpretability, see discussion of the motivation-vs-interpretability duality in §9.6 here.) Some of the open problems IMO include:
figuring out exactly what value to paint onto exactly which concepts;
dealing with concept extrapolation when concepts hit edge-cases [concepts can hit edge-cases both because the AGI keeps learning new things, and because the AGI may consider the possibility of executing innovative plans that would take things out of distribution];
getting safely through the period where the “infant AGI” hasn’t yet learned the concepts which we want it to pursue (maybe solvable with a good sandbox);
getting the interpretability itself to work well (including the first-person problem, i.e. the issue that the AGI’s own intentions may be especially hard to get at with interpretability tools, because it’s not just a matter of showing certain sensory input data and seeing which neurons activate).
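To make the “paint valence onto concepts” idea concrete, here is a minimal toy sketch (all names and numbers hypothetical, not a real AGI architecture): programmers hand-assign valences to a few interpreted world-model concepts, and a crude one-dimensional stand-in for a multi-dimensional value function scores a “thought” by its activation-weighted painted valences. Note how an unpainted concept silently defaults to zero, which is exactly the concept-extrapolation / “infant AGI” gap listed above.

```python
# Toy sketch of hand-painted valences on world-model concepts.
# All concept names and weights below are made up for illustration.

# Valences set manually by programmers, after (somehow) locating these
# concepts in the world-model with interpretability tools -- the hard part.
PAINTED_VALENCE = {
    "human_flourishing": +1.0,
    "deception": -1.0,
    "paperclips": 0.0,  # explicitly neutral
}

def assess_thought(thought: dict[str, float]) -> float:
    """Score a 'thought' (concept -> activation in 0..1) as the
    activation-weighted sum of painted valences. Concepts with no
    painted valence default to 0.0, so novel concepts learned after
    the painting step contribute nothing -- a stand-in for the
    concept-extrapolation problem."""
    return sum(
        PAINTED_VALENCE.get(concept, 0.0) * activation
        for concept, activation in thought.items()
    )

plan_a = {"human_flourishing": 0.9, "paperclips": 0.2}
plan_b = {"deception": 0.8, "novel_concept_123": 1.0}  # unpainted concept!

print(assess_thought(plan_a))  # positive: weighted by painted valence
print(assess_thought(plan_b))  # negative, but novel_concept_123 is ignored
```

The interesting failure mode is visible in `plan_b`: the unpainted `novel_concept_123` has no effect on the score at all, even though it might be exactly the thing we would most want to assign a strong positive or negative valence to.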