Tomek Korbak comments on Lessons from Studying Two-Hop Latent Reasoning

Tomek Korbak 15 Sep 2025 9:29 UTC

Just because you failed at finding such a result in this case and got a more messy “LLMs can somewhat do the same 2-hop-reasoning that humans can also do, except in these synthetic cases” doesn’t mean there aren’t other reversal-curse-like results that remain to be found.

I think that’s fair. We may have over-updated a bit after getting results we did not expect (we did expect a reversal-curse-like phenomenon).

The two big reasons I’m hesitant to draw conclusions about monitorability in agentic settings are that our setup is simplistic (QA, non-frontier models) and that we don’t offer a clean explanation of why we see the results we see.