Maybe this baseline is too high because it assumes perfect retrieval of the document. Instead, you could just measure how often the model responds with an incorrect answer taken from the same document. If correct answers are more frequent than those same-document incorrect answers, that is evidence of multi-hop reasoning.
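A rough sketch of that comparison, in case it helps. Everything here is hypothetical: the `Example` record, its field names, the exact-match `normalize` step, and the toy data are illustrative assumptions, not anything from the original setup. It counts how often the response matches the correct answer versus how often it matches some other candidate answer from the same document.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Example:
    # Hypothetical record layout, assumed for illustration only.
    response: str              # what the model answered
    correct: str               # the gold answer
    same_doc_wrong: List[str]  # other candidate answers found in the same document


def normalize(text: str) -> str:
    # Assumed matching rule: case-insensitive, whitespace-collapsed exact match.
    return " ".join(text.lower().split())


def answer_rates(examples: List[Example]) -> Tuple[float, float]:
    """Return (correct rate, same-document incorrect rate) over all examples."""
    if not examples:
        raise ValueError("need at least one example")
    n_correct = 0
    n_same_doc_wrong = 0
    for ex in examples:
        resp = normalize(ex.response)
        if resp == normalize(ex.correct):
            n_correct += 1
        elif resp in {normalize(w) for w in ex.same_doc_wrong}:
            n_same_doc_wrong += 1
        # Anything else is an answer from outside the document; it counts toward neither rate.
    total = len(examples)
    return n_correct / total, n_same_doc_wrong / total


# Toy data, made up purely to exercise the function.
examples = [
    Example("Paris", "Paris", ["Lyon", "1889"]),
    Example("Lyon", "Paris", ["Lyon", "1889"]),
    Example("paris", "Paris", ["Lyon", "1889"]),
    Example("Berlin", "Paris", ["Lyon", "1889"]),
]
correct_rate, same_doc_wrong_rate = answer_rates(examples)
print(correct_rate, same_doc_wrong_rate)
```

One caveat with this version: when a document has several wrong candidates, the same-document incorrect rate is summed over all of them, so dividing it by the number of candidates may give a fairer per-answer comparison.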