Oliver Daniels comments on Daniel Tan’s Shortform

Oliver Daniels 29 May 2026 11:52 UTC
1 point
0
Why not just LLM judge with spec in context?
- Daniel Tan 29 May 2026 12:18 UTC
  3 points
  0
  Parent
  Yeah so I think the key design decision is the way you construct contexts / data over which to do this evaluation
  One way to do this is to sample diverse contexts from natural user data and then prompt models in ways that elicit adherence to the spec, but usually this ends up being unrealistic eg it‘d be weird to have a long convo about summarizing wikipedia articles followed by some alignment evaluation
  Another way to do it is to use automated tools like Petri / Bloom but IDK how realistic these trajectories end up being, I generally worry the interrogators don’t really know how to construct good trajectories
  Another way is to let agents “roam freely” in some open ended environment and see what they get up to, AI village style, but it’s hard to draw rigorous conclusions from this type of eval
  So tl;dr the devil is in the details. I feel unclear about these details and I’m interested in soliciting advice from people who have spent time thinking about this / working on it
  - Oliver Daniels 29 May 2026 17:19 UTC
    1 point
    0
    Parent
    yeah makes sense, I agree realistic / long context evals are hard (though probably just in the engineering challenge sense? like getting lots of deployment trajectories, or training better user models / environments to simulate long-contexts)
    
    but imo “formalizing the notion of having the aligned assistant persona be very stable over contexts” seems either pretty easy or unnecessary, because we can use LLM judges with the spec.
    I guess you run into scalable oversight type challenges (difficult for weak model to eval spec adherence on long transcript), which is maybe what you’re getting at / what is motivating formalization (and use of assistant axis as a kind of elk method).