yeah makes sense, I agree realistic / long-context evals are hard (though probably just in the engineering-challenge sense? like getting lots of deployment trajectories, or training better user models / environments to simulate long contexts)
but imo “formalizing the notion of having the aligned assistant persona be very stable over contexts” seems either pretty easy or unnecessary, since we can use LLM judges to grade adherence to the spec.
I guess you run into scalable-oversight-type challenges (it's difficult for a weak model to eval spec adherence on a long transcript), which is maybe what you’re getting at / what's motivating the formalization (and the use of the assistant axis as a kind of ELK method).