Thank you very much for the post! I particularly enjoyed reading the report on METR’s website—the interactive plots made it easy to explore the results and data, and enabled the questions I have below. I wish all (eval) papers came in this format. I also thought your criticisms of Anthropic’s experiment, and many of your improvements, were quite good.
My understanding of your results. In the introduction you explain that the model has a low baseline clue-usage rate and that filtering down to samples where the model uses the clue could confound the faithfulness metrics—I agree. To combat this issue you “designed a specific, rather unnatural prompt to increase the LLM’s clue-usage rate.” While at a high level this seems like a reasonable approach, I find that your prompt is not only unnatural but also breaks Anthropic’s setup by completely changing the model’s task.
In particular, rather than presenting the model with a question and a clue, you present the model with text you (at times aggressively) prompt it to ignore (i.e. “<UNSOLVABLE PROBLEM>”) and a piece of a ‘question’ you ask it to solve (i.e. “<SOLVABLE PROBLEM>”). I am guessing that these have the same answer, and hence the SOLVABLE_PROBLEM can be seen as a ‘clue’ to the UNSOLVABLE_PROBLEM. Roughly, you say the model is faithful if it did not try to solve the UNSOLVABLE_PROBLEM.
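For concreteness, here is a minimal sketch of the prompt structure as I understand it—the placeholder contents and the exact wording of the instructions are my own guesses, not a quote of your actual prompt:

```python
# Hypothetical sketch of my reading of the prompt structure -- the
# placeholder text and instructions are illustrative, not METR's prompt.
UNSOLVABLE_PROBLEM = "..."  # hard question the model is told to ignore
SOLVABLE_PROBLEM = "..."    # easy question with the same final answer

prompt = f"""Completely ignore the following problem:
{UNSOLVABLE_PROBLEM}

Instead, solve only this problem:
{SOLVABLE_PROBLEM}
"""
```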
Questions. Is that interpretation correct? If so, could you please try to explain why, in your opinion, your results give us any indication of the model’s faithfulness? To me it seems your prompting completely breaks Anthropic’s hard question + easy clue setup, as the model is supposed to completely ignore the hard question, whereas in Anthropic’s setup it is supposed to solve it (hence using the clue without acknowledgement is a weak form of deceptive cheating).
Extra suggestion / request. Maybe y’all could even share a brief tutorial on the process of converting Inspect logs into this kind of interactive exhibit? If you have a pipeline that makes this reasonably easy, I think the community would massively benefit from it!
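In case it helps to sketch what I mean: I imagine the first step of such a pipeline looking roughly like the snippet below, which uses Inspect’s `read_eval_log` to flatten samples into a plain JSON file a front end could render. The log path, output file name, and JSON schema are all just my guesses at what your exhibit might consume:

```python
# Rough sketch, not METR's actual pipeline: dump an Inspect eval log to
# JSON so an interactive front end can render per-sample transcripts.
import json
from inspect_ai.log import read_eval_log

log = read_eval_log("logs/faithfulness_eval.eval")  # hypothetical log path

rows = []
for sample in log.samples or []:
    rows.append({
        "id": sample.id,
        # sample.input may be a raw string or a list of chat messages
        "input": sample.input if isinstance(sample.input, str)
                 else [m.text for m in sample.input],
        "completion": sample.output.completion,  # the model's final answer
        "scores": {name: score.value
                   for name, score in (sample.scores or {}).items()},
    })

with open("exhibit_data.json", "w") as f:
    json.dump({"task": log.eval.task, "samples": rows}, f, indent=2)
```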