If we train on data about what hypothetical sensors should show (e.g. by experiments where we estimate what they would show using other means, or by actually building weird sensors), we could just end up getting predictions of whatever process we used to generate that data.
In general the overall situation with these sensors seems quite similar to the original outer-level problem, i.e. training the system to answer “what would an ideal sensor show?” seems to run into the same issues as answering “what’s actually going on?” E.g. your supersensor idea #3 seems to be similar to the “human operates SmartVault and knows if tampering occurred” proposal we discussed here.
I do think that excising knowledge is a substantive change; I feel like it’s effectively banking on “if the model is ignorant enough about what humans are capable of, it needs to err on the side of assuming they know everything.” But for intelligent models, it seems hard in general to excise knowledge of whole kinds of sensors (how do you know a lot about human civilization without knowing that it’s possible to build a microphone?) without interfering with performance. And there are enough signatures that the excised knowledge still won’t look in-distribution with the hypotheticals we make up (e.g. the possibility of microphones is consistent with everything else I know about human civilization and physics, while the possibility of invisible and untouchable cameras isn’t), so conservative bounds on what humans can know will still cover the one but not the other.