Stepped away to spend some time on AI safety (deception and red-teaming). Past lives in research, and government tech
Dusto
Would be interesting to see how this translates to non-math tasks. I’m wondering if this is unique to the reasoning required for solving maths problems and if it still holds in tasks (like needle in a haystack for example) that may require reasoning over conflicting goals, etc.
One side consideration: In relation to your ongoing work around model self-awareness of eval settings, is there any indication that models deployed as a service, business, etc have their behaviour naturally logged as part of quality assurance, and do the models have any understanding or expectation that this is occurring?
Excited to see more in this space. Was there any consideration to measuring “when” in addition to “if” these specific features trigger a sense of evaluation? Such as a multiple round interaction where strong features are slowly introduced into the conversation, or if strong features are introduced early in the exchange, does that bias the model to be on alert? and if so, for how long? if there is sufficient typical conversation following a strong feature queue does that reduce the likelihood of the model judging the entire interaction as being an evaluation?
The pre or post assessment seems like it might either prime the expectation of an incoming evaluation (pre), or possibly draw attention to evaluation-like features in hindsight (post). I wonder if there is an independent measure (something like the number of tokens used measurement, in the Reworr & Volkov LLM honeypot paper as a distinguisher between human/bot and LLM) that doesn’t explicitly call out evals directly.
Not yet unfortunately. I have been toying with some multi-turn tasks lately in that space, but agreed that math/symbolic seems the clear winner for pushing CoT. (thanks for the paper link, will add to the queue!)