In my other aligning-a-human-level-intelligence project (parenting), my kids get “points” for trying new foods. We are often having arguments about what kinds of trivial modifications to an old food will make it count as a new food. This seems like it could have a similar problem—couldn’t a superintelligence generate thousands of non-substantive variations for an effective, dangerous action while electing not to do so for other actions?
Similarly, since the tails come apart, perhaps it would be better to sample from 85-95%ile actions instead of sampling from 90-100%ile actions.
couldn’t a superintelligence generate thousands of non-substantive variations for an effective, dangerous action while electing not to do so for other actions?
I’m not sure what you mean. The model here is that you have a way to sample from a learned approximation to the distribution of what humans/experts/(whatever “safe” thing you trust does), and then the superintelligence picks uniformly from the top x% of those samples.
I don’t think shooting for a lower region would help much in practice—I expect most case to also have bad 85-85+p% actions.
In my other aligning-a-human-level-intelligence project (parenting), my kids get “points” for trying new foods. We are often having arguments about what kinds of trivial modifications to an old food will make it count as a new food. This seems like it could have a similar problem—couldn’t a superintelligence generate thousands of non-substantive variations for an effective, dangerous action while electing not to do so for other actions?
Similarly, since the tails come apart, perhaps it would be better to sample from 85-95%ile actions instead of sampling from 90-100%ile actions.
I’m not sure what you mean. The model here is that you have a way to sample from a learned approximation to the distribution of what humans/experts/(whatever “safe” thing you trust does), and then the superintelligence picks uniformly from the top x% of those samples.
I don’t think shooting for a lower region would help much in practice—I expect most case to also have bad 85-85+p% actions.