Determining monitorability might thus be best achieved via end-to-end evaluations, where LLM agents attempt to complete agentic tasks while their CoT is monitored by another model.
I think those are useful, but I think more basic science results are also important, especially when they take the form “LLMs differ from humans learning from data in way X”, because such results let you find ways in which modeling LLMs as “like humans, but weaker” is wrong. In particular, I think the reversal curse paper updated me much more than this GDM paper (though I think the GDM paper has other qualities, like illustrating a specific hope for CoT monitoring, which I think is important). Just because you failed at finding such a result in this case and got a more messy “LLMs can somewhat do the same 2-hop-reasoning that humans can also do, except in these synthetic cases” doesn’t mean there aren’t other reversal-curse-like results that remain to be found.
But I think these “direct observation” papers will get somewhat more informative as we get closer to AIs that are actually scary, since the setups could get quite close to real catastrophic reasoning (as opposed to the more toy-like setups from the GDM paper).
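To make the contrast concrete, here is a toy sketch of the two kinds of probes I have in mind (the names and facts below are invented purely for illustration, not examples from either paper):

```python
# Toy illustration only: invented facts, not examples from either paper.

# Reversal curse: a model fine-tuned on facts stated in one direction
# ("<person> is the director of <film>") often fails when the same fact
# is probed in the reverse direction ("Who is the director of <film>?").
reversal_training_fact = "Alice Example is the director of 'An Imaginary Film'."
reversal_probe = "Who is the director of 'An Imaginary Film'?"  # often answered wrong

# Two-hop reasoning: answering requires composing two separately stated
# facts, with the bridge entity ("Alice Example") never appearing in the
# question and not necessarily written out in a chain of thought.
hop_1 = "The director of 'An Imaginary Film' is Alice Example."
hop_2 = "Alice Example was born in Warsaw."
two_hop_probe = "In which city was the director of 'An Imaginary Film' born?"
```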
Just because you failed at finding such a result in this case and got a more messy “LLMs can somewhat do the same 2-hop-reasoning that humans can also do, except in these synthetic cases” doesn’t mean there aren’t other reversal-curse-like results that remain to be found.
I think that’s fair; it might be that we’ve over-updated a bit after getting results we did not expect (we did expect a reversal-curse-like phenomenon).
Two big reasons why I’m hesitant to draw conclusions about monitorability in agentic settings are that our setup is simplistic (QA, non-frontier models) and that we don’t offer a clean explanation of why we see the results we see.