This is my largest concern too: that we might find principled-but-inefficient tools that give guarantees, but be unable to find any efficient approximation that doesn't lose those guarantees.
However, I do think there are reasons to be cautiously optimistic, conditional on gaining a solid theoretical understanding [just my impressions: confusion entirely possible]:
We get to pick the structure we’re searching over—the only real constraint being that it has to perform competitively. It wouldn’t matter that the ‘thermometers’ were inefficient in 99% of cases, just so long as we were able to find at least one kind of structure combining thermometer-efficiency and performance. If the required [thermometer-friendly] property can be formally specified, it may be possible to incorporate it as a training constraint.
So long as we can use the tools to prevent adversarial situations from arising in the first place, we don’t need to meet the bar of working in the face of super-human adversarial selection (I think it’s a good idea to view getting into that situation as a presumed loss condition).
In principle, greater theoretical understanding may give us more than just 'thermometers': e.g. we might hope to find operators that preserve particular agency-related safety properties. If updates could be applied in terms of such operators, that may reduce the required frequency of slower tests. [the specifics may not look like this, but a solid theoretical understanding would usually be expected to help you avoid problems in various ways, not only to test for them]