This is my inclination, but a physicalist would either predict that the phenomenology does in fact change, or assert that you’re deluded about your own phenomenal experience when you judge it to be the same despite substrate shifts. My understanding of cube_flipper’s position is that they expect changes in the substrate to change the qualia.
From a physicalist’s perspective, you’re essentially making predictions based on your theory of phenomenal consciousness, and then arguing that we should already update on those predictions ahead of time, since they’re so firm. I’m personally sympathetic to this line of argument, but it obviously depends on some assumptions which need to be articulated, and which the physicalist would probably not be happy to make.
This isn’t “the closest we can get”. Needle-in-a-haystack tests seem like a sensible starting point, but testing long-context utilization in general requires synthesis of information, e.g. reading a novel or series of novels and answering reading comprehension questions. There are several benchmarks of this sort, e.g.:
https://epoch.ai/benchmarks/fictionlivebench
https://nyu-mll.github.io/quality/
https://www.scrolls-benchmark.com/
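
To make the distinction concrete, here is a rough Python sketch of how the two kinds of test prompt differ. The function names, the filler sentences, and the example facts are all hypothetical illustrations of mine, not taken from any of the benchmarks linked above.

```python
import random


def build_needle_prompt(filler, needle, depth):
    """Insert a single needle sentence at a relative depth (0.0 to 1.0).

    Answering only requires locating one sentence in the context.
    """
    position = int(len(filler) * depth)
    sentences = filler[:position] + [needle] + filler[position:]
    return " ".join(sentences)


def build_synthesis_prompt(filler, facts, seed=0):
    """Scatter several facts through the filler, keeping their order.

    Answering requires combining facts that sit far apart in the context.
    Assumes len(facts) <= len(filler) + 1.
    """
    rng = random.Random(seed)
    positions = sorted(rng.sample(range(len(filler) + 1), len(facts)))
    sentences = list(filler)
    # Insert from the back so earlier insertion indices stay valid.
    for position, fact in reversed(list(zip(positions, facts))):
        sentences.insert(position, fact)
    return " ".join(sentences)


# Hypothetical example data.
filler = [f"Filler sentence {i}." for i in range(10)]
needle_prompt = build_needle_prompt(filler, "The code is 7.", depth=0.5)
synthesis_prompt = build_synthesis_prompt(
    filler, ["Ann left on Monday.", "Ann returned two days later."]
)
```

A needle question ("What is the code?") can be answered by finding one sentence, while a synthesis question ("On what day did Ann return?") can only be answered by combining sentences that are separated in the context. The reading comprehension benchmarks are, as I understand it, much closer to the second case.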