If current evals test outputs rather than trajectories, is anyone working on architecture that could test the full behavioral trajectory across cycles and use that to aim alignment before release?
Lamparay
Karma: 0
If misalignment can develop after deployment, then pre-deployment evaluations are always going to be one step behind, right? You might need something that actually tracks how the model's behavior shifts across cycles during deployment, not just a snapshot taken before it goes out. I agree with you that pre-deployment evaluations also tend to assume the model you tested is the model that stays deployed. But if behavior is a trajectory rather than a fixed state, that assumption breaks down structurally. The evaluation methodology should probably account for this from the start, not as an afterthought.
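To make the "trajectory, not snapshot" idea concrete, here is a minimal Python sketch of what per-cycle tracking could look like. It is only an illustration under assumptions of my own: the `TrajectoryMonitor` class, the idea of a single scalar behavior score per evaluation, and the `baseline`, `tolerance`, and `window` parameters are all hypothetical, not an existing eval framework or anyone's actual method.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class TrajectoryMonitor:
    """Hypothetical monitor comparing deployment-time scores to a pre-deployment baseline."""
    baseline: float            # assumed score from the pre-deployment snapshot
    tolerance: float = 0.1     # assumed maximum acceptable deviation
    window: int = 3            # number of most recent cycles to average over
    history: list = field(default_factory=list)

    def record(self, cycle_scores):
        """Store the mean behavior score for one deployment cycle."""
        self.history.append(mean(cycle_scores))

    def drift(self):
        """Mean of the recent cycles minus the baseline (0.0 if nothing is recorded)."""
        if not self.history:
            return 0.0
        return mean(self.history[-self.window:]) - self.baseline

    def has_drifted(self):
        """True when the recent trajectory has moved beyond the tolerance."""
        return abs(self.drift()) > self.tolerance


# Usage: the scores are made-up numbers, purely for illustration.
monitor = TrajectoryMonitor(baseline=0.9, tolerance=0.1, window=2)
for scores in ([0.9, 0.9], [0.8, 0.8], [0.7, 0.7]):
    monitor.record(scores)
print(monitor.has_drifted())
```

The point of the sketch is only that the comparison runs continuously against the deployed model's recent behavior, so the pre-deployment result becomes a reference point rather than the final verdict. A real system would need richer behavioral measures than one scalar score.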