Couldn’t a superintelligence generate thousands of non-substantive variations of an effective but dangerous action, while electing not to do so for other actions?
I’m not sure what you mean. The model here is that you have a way to sample from a learned approximation to the distribution of what humans/experts/whatever “safe” source you trust would do, and then the superintelligence picks uniformly from the top x% of those samples.
I don’t think shooting for a lower region would help much in practice; I expect most cases to also have bad actions in the 85th-(85+p)th percentile range.
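A minimal sketch of the “pick uniformly from the top x%” scheme described above (the function names, scoring setup, and numbers are illustrative assumptions, not from the thread):

```python
import random

def pick_from_top_fraction(sample_action, score, n_samples=1000,
                           top_frac=0.05, rng=None):
    """Draw n_samples actions from a learned 'safe' distribution
    (sample_action), then pick uniformly at random from the top
    top_frac of them as ranked by the score function."""
    rng = rng or random.Random()
    # Sample candidate actions from the trusted/learned distribution.
    samples = [sample_action(rng) for _ in range(n_samples)]
    # Rank candidates best-first by the (hypothetical) score.
    samples.sort(key=score, reverse=True)
    # Keep the top x% and choose uniformly among them.
    k = max(1, int(len(samples) * top_frac))
    return rng.choice(samples[:k])

# Toy usage: actions are integers 0-99, score is the value itself,
# so the pick lands uniformly in roughly the top 5% of draws.
choice = pick_from_top_fraction(lambda rng: rng.randrange(100),
                                lambda a: a,
                                rng=random.Random(0))
```

Shooting for a “lower region” instead would correspond to replacing `samples[:k]` with a slice further down the ranking (e.g. the 85th-(85+p)th percentile band), which is the variant the reply is skeptical of.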