I think this is something that I, and many others at EleutherAI, would be very interested in working on, since it seems like an area where we'd have a unique comparative advantage.
One very relevant piece of infrastructure we've built is our evaluation framework, which we use for all of our evaluations since it makes it easy to run a task against GPT-2/3/Neo/NeoX/J, etc. We also have a bunch of other useful LM-related resources, such as intermediate checkpoints for GPT-J-6B, which we plan to use in our interpretability work, for example. I've also thought about building some infrastructure to make it easier to coordinate the building of handmade benchmarks; this is currently on the back burner, but if it would be helpful for anyone I'd definitely get it going again.
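For anyone curious what using the evaluation framework looks like in practice, a minimal sketch is something like the following (the exact entry point, model names, and flags depend on which version of the harness you have installed, so treat this as illustrative rather than exact):

```shell
# Illustrative only: run a benchmark task against a Hugging Face-hosted
# EleutherAI model using the lm-evaluation-harness CLI. Flag names and the
# task list vary across harness versions; check --help for your install.
lm_eval \
  --model hf \
  --model_args pretrained=EleutherAI/gpt-neo-1.3B \
  --tasks lambada_openai \
  --batch_size 8
```

Swapping in a different model (GPT-2, GPT-J, etc.) or a different task is just a matter of changing the corresponding arguments, which is what makes cross-model comparisons cheap.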
If anyone reading this is interested in collaborating, please DM me or drop by the #prosaic-alignment channel in the EleutherAI Discord.
I think a major crux is that the things you couldn't impart to Mary through language (assuming such things do exist) would be wishy-washy stuff like qualia, whose existence essentially doesn't matter for the predictive accuracy of a nonhuman system modelling humans. In other words, a universe where Mary does learn something new and a universe where she doesn't are essentially indistinguishable from the outside, so whether qualia show up in such a system's world model is irrelevant.