I believe this new dataset I’ve created is the best existing dataset for evaluating n-hop latent reasoning.
This seems correct, but I remain pretty confused about the threat model n-hop work claims to be targeting. Is there a specific threat model where a given model wouldn't have seen the particular facts relevant to something like sabotaging AI R&D or opaque coup research together in some other context during training?
I think it's more that LLMs were interestingly bad at n-hop latent reasoning, and that this might be broadly representative of a certain type of within-a-single-forward-pass weakness of current architectures. It's not clear exactly what reasoning effectively about scheming, without this information in context, corresponds to in terms of n-hops.
I do think that the relevant information won't be in context during the best opportunities for misaligned AIs to sabotage, escape, or otherwise covertly achieve their aims, but the information the AIs need will be much more closely associated rather than a chain of unrelated hops.