I think these multihop examples (the Socrates one and nielsrolf’s) are less interesting than the sort of stuff I think Ryan is looking at, because the intermediate entities can be tracked over the context as the question unfolds. E.g. if the model had gotten nielsrolf’s question correct (and I agree it looks like it didn’t?), then I would expect to have seen the intermediate quantities appear over the tokens of the question as it’s stated.
I also agree that recovering exact numbers from NLAs is a bit rough, though I’d expect to see numbers in the right ballpark.
The full report has some evaluations for steganography, measuring how much reconstruction quality falls when the explanations are paraphrased in different ways.