But with enough capability and optimization pressure, we are likely in Most Forbidden Technique land: the model will find a way to route around the requirement that its activations look similar, relying on other mechanisms to reach different decisions in ways that evade your tests.
In the original post, we assumed we would have access to a roughly outer-aligned reward model. If that is the case, then an aligned LLM is a good solution to the optimization problem. This means that there aren’t as strong incentives to Goodhart on the SOO metric in a way that maintains misalignment. Of course, there is no guarantee that we will have such reward models, but current work on modeling human preferences at least seems to be moving in this direction.
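For concreteness, here is a minimal sketch of the kind of activation-similarity metric that could be Goodharted in the way described above. The function name, the mean-squared-error formulation, and the choice of layer/position are illustrative assumptions, not the exact metric from the original post.

```python
# Hypothetical sketch of an SOO-style overlap metric (assumed formulation,
# not necessarily the one used in the original work): compare a model's
# hidden activations on paired self-referencing and other-referencing prompts.
import torch

def soo_metric(self_acts: torch.Tensor, other_acts: torch.Tensor) -> torch.Tensor:
    """Mean squared distance between activations on matched self/other prompts.

    self_acts, other_acts: (num_pairs, hidden_dim) activations taken from the
    same layer and token position for each prompt pair. Lower = more overlap.
    """
    return ((self_acts - other_acts) ** 2).mean()

# Toy usage with random stand-ins for real activations.
hidden_dim = 16
self_acts = torch.randn(8, hidden_dim)
other_acts = self_acts + 0.1 * torch.randn(8, hidden_dim)  # nearly overlapping
print(float(soo_metric(self_acts, other_acts)))
```

The worry above is that, under enough optimization pressure, a model could drive a metric like this toward zero at the monitored layers and positions while routing behaviorally relevant differences through activations the metric does not cover.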
We expect a lot of unforeseen hard problems to emerge with any alignment and control technique as models increase in capability. Still, we hope that a hodgepodge of techniques (like SOO, RLHF, control, etc.) can reduce the risk meaningfully enough that we can safely extract useful alignment work from the first human-level agents, and thereby more effectively predict and prevent the problems that arise as capability and optimization pressure increase.