Shower thought: What if AI companies trained certain models to be smart but with short horizons / minimal ability to plan? You could then use the short-horizon models for things like evaluating output for RLAIF and detecting scheming in long-horizon models, and the short-horizon models would be bad at scheming themselves. This would ameliorate the problem of detecting scheming with a model that might itself be scheming.
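The judge setup being proposed can be sketched in a few lines. This is purely illustrative: `short_horizon_judge` is a placeholder heuristic standing in for a hypothetical capable-but-short-horizon model, not a real model call. The structural point is that each response is scored statelessly and in isolation, so the judge has no multi-step channel through which to pursue a plan of its own.

```python
# Toy sketch: RLAIF-style preference scoring where the judge is a
# deliberately short-horizon model. All model calls are stand-in stubs.

def short_horizon_judge(prompt: str, response: str) -> float:
    # Stand-in for a smart-but-short-horizon model: it scores one response
    # at a time, with no memory and no multi-step planning.
    # Toy heuristic: reward on-topic, reasonably concise answers.
    overlap = len(set(prompt.lower().split()) & set(response.lower().split()))
    brevity = 1.0 / (1.0 + abs(len(response.split()) - 30) / 30)
    return overlap * brevity

def rank_responses(prompt: str, responses: list[str]) -> list[str]:
    # Each candidate is judged independently (stateless, single-shot), so
    # preference labels come from a model that is bad at scheming by design.
    scored = [(short_horizon_judge(prompt, r), r) for r in responses]
    scored.sort(key=lambda x: x[0], reverse=True)
    return [r for _, r in scored]
```

The ranked outputs would then feed a standard preference-training pipeline; the only change from ordinary RLAIF is which model plays judge.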
How? This doesn’t feel possible.
Given the current AI paradigm, intelligence and horizon are strongly correlated, and it’s not immediately obvious to me how you’d break that correlation. But I don’t know that they need to be correlated in principle.
Case in point: LLMs are already smarter than most humans on 15-second time scales, but they are considerably worse than 100-IQ humans at long-term tasks.