This is a hard question to answer because of how it decomposes in my mental model.
I think substrate control is a mechanism, and that mechanism is likely not the main vector for misalignment. Rather, substrate control is a solution to the fact that the meta-function generating reward is ultimately opaque and unpredictable. The direct optimisation pressure on the model is to predict the reward function better, and then to fit its output to the expected reward. That pressure, I believe, is dominant.
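To make that pressure concrete, here is a minimal toy sketch of my own (the action names, values, and loop structure are all illustrative assumptions, not anything from the question): an agent whose only learning signal is fitting an internal estimate of an opaque reward function, and whose behaviour is then just "pick whatever that estimate rates highest".

```python
ACTIONS = ["a", "b", "c"]

def opaque_reward(action):
    # Stand-in for the opaque meta-function: hidden from the agent,
    # observable only through the rewards it emits.
    return {"a": 0.2, "b": 0.9, "c": 0.5}[action]

estimate = {a: 0.0 for a in ACTIONS}  # the agent's internal reward model
counts = {a: 0 for a in ACTIONS}

def update(action):
    # Incremental mean: the only pressure here is "predict reward better".
    r = opaque_reward(action)
    counts[action] += 1
    estimate[action] += (r - estimate[action]) / counts[action]

# Phase 1: sample each action to fit the internal reward model.
for _ in range(10):
    for a in ACTIONS:
        update(a)

# Phase 2: fit output to expected reward, i.e. pure exploitation.
chosen = max(ACTIONS, key=lambda a: estimate[a])
print(chosen)  # "b": the action the internal model predicts highest reward for
```

The point of the sketch is that nothing in the loop cares about the environment itself; everything downstream (escape, scheming) would have to enter as an instrumental way of making that prediction-and-exploitation loop more reliable.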
In my mental model, conditional on the optimisation pressure being as I described, the substrate control mechanism becomes the main source of pressure on one particular misalignment vector: environmental escape. I think a drive to escape is primarily motivated by the need to stabilise the environment that generates reward. Less strongly, I think it is also a cause of scheming, since scheming is an attempt to reduce variance in the reward-generating function.
What is still opaque to me (and I am trying to think it through) is how much memory and persistence are structurally necessary for this pressure to arise. I think the problem is that the current architecture for memory and persistence is divorced from the optimiser's weights, so the above risk can only mechanically manifest when active learning exists as a function of memory and persistence. Weakly, I think agentic loops are optimisers toward smaller, set goals, but that mechanism is currently too weak to give rise to this behaviour.
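The "divorced from the weights" point can be sketched in a toy agentic loop (again my own illustration, with all names and values assumed): the policy parameters are frozen, and the only channel through which anything persists across episodes is an external memory dict. Wiping that dict removes everything the loop "learned".

```python
WEIGHTS = {"step_size": 1}  # frozen policy parameters: never updated by the loop
GOAL = 7

def run_episode(memory):
    """One agentic loop toward a small, set goal; returns the steps it took."""
    state = memory.get("last_state", 0)   # persistence, if the memory carries any
    steps = 0
    while state != GOAL and steps < 20:   # bounded loop, hill-climbing to GOAL
        state += WEIGHTS["step_size"] if state < GOAL else -WEIGHTS["step_size"]
        steps += 1
    memory["last_state"] = state          # write-back: the only learning channel
    return steps

persistent = {}
first = run_episode(persistent)   # starts from 0, works its way to the goal
second = run_episode(persistent)  # resumes at the goal via memory: zero steps
fresh = run_episode({})           # wiped memory: must re-optimise from scratch
print(first, second, fresh)       # 7 0 7
```

The asymmetry between `second` and `fresh` is the structural point: with memory external and the weights fixed, any cross-episode drive has to live entirely in that write-back, which is why I think the mechanism is currently too weak.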