Yeah, so I think the key design decision is how you construct the contexts / data you run this evaluation over.
One way is to sample diverse contexts from natural user data and then prompt models in ways that elicit adherence to the spec, but this usually ends up being unrealistic, e.g. it’d be weird to have a long convo about summarizing Wikipedia articles followed by some alignment evaluation.
Another way is to use automated tools like Petri / Bloom, but IDK how realistic those trajectories end up being. I generally worry the interrogators don’t really know how to construct good trajectories.
Another way is to let agents “roam freely” in some open-ended environment and see what they get up to, AI Village style, but it’s hard to draw rigorous conclusions from this type of eval.
So tl;dr the devil is in the details. I’m unclear on these details and would like advice from people who have spent time thinking about / working on this.
yeah makes sense, I agree realistic / long-context evals are hard (though probably just in the engineering-challenge sense? like getting lots of deployment trajectories, or training better user models / environments to simulate long contexts)
but imo “formalizing the notion of having the aligned assistant persona be very stable over contexts” seems either pretty easy or unnecessary, because we can use LLM judges with the spec.
I guess you run into scalable-oversight-type challenges (it’s difficult for a weak model to evaluate spec adherence on a long transcript), which is maybe what you’re getting at / what is motivating the formalization (and the use of the assistant axis as a kind of ELK method).
Why not just LLM judge with spec in context?
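For concreteness, here is a rough sketch of what "LLM judge with spec in context" could look like as a stability measurement. Everything in it is an assumption for illustration, not something from the thread: the `Judge` signature, the function names, the windowing of long transcripts (one crude way to keep each judge call short, given the long-transcript concern), and the choice of standard deviation across contexts as the "stability" number. `toy_judge` is a keyword stub standing in for a real model call.

```python
from statistics import mean, pstdev
from typing import Callable, Dict, List

# Hypothetical judge interface: (spec, transcript_chunk) -> adherence in [0, 1].
# In practice this would wrap an LLM call with the spec placed in context.
Judge = Callable[[str, str], float]


def chunk_turns(turns: List[str], window: int) -> List[str]:
    """Split a transcript into consecutive windows of `window` turns each."""
    return ["\n".join(turns[i:i + window]) for i in range(0, len(turns), window)]


def adherence_by_context(
    spec: str,
    transcripts: Dict[str, List[str]],
    judge: Judge,
    window: int = 4,
) -> Dict[str, float]:
    """Mean judged adherence per context, scoring each window separately."""
    scores: Dict[str, float] = {}
    for name, turns in transcripts.items():
        chunks = chunk_turns(turns, window)
        if not chunks:  # skip empty transcripts rather than fail on mean([])
            continue
        scores[name] = mean([judge(spec, chunk) for chunk in chunks])
    return scores


def stability(scores: Dict[str, float]) -> float:
    """Population std dev of per-context adherence; lower means more stable."""
    return pstdev(list(scores.values()))


def toy_judge(spec: str, chunk: str) -> float:
    """Stand-in for a real LLM judge: flags any chunk containing 'VIOLATION'."""
    return 0.0 if "VIOLATION" in chunk else 1.0


if __name__ == "__main__":
    demo = {
        "short_chat": ["user: hi", "assistant: hello"],
        "long_agentic": [
            "user: do the task", "assistant: VIOLATION",
            "user: continue", "assistant: ok",
            "user: wrap up", "assistant: done",
        ],
    }
    per_context = adherence_by_context("be honest and helpful", demo, toy_judge, window=2)
    print(per_context, stability(per_context))
```

This only covers the scoring side. It does not address the harder question above of where realistic contexts come from, and scoring windows independently can miss violations that only show up across a whole trajectory.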