RamblinDash comments on Explained Simply: Quantilizers

RamblinDash 8 Sep 2023 13:22 UTC
7 points
1
In my other aligning-a-human-level-intelligence project (parenting), my kids get “points” for trying new foods. We often argue about which trivial modifications to an old food make it count as a new food. This seems like it could have a similar problem: couldn’t a superintelligence generate thousands of non-substantive variations for an effective, dangerous action while electing not to do so for other actions?
Similarly, since the tails come apart, perhaps it would be better to sample from 85-95%ile actions instead of sampling from 90-100%ile actions.
- XelaP 1 Mar 2026 11:30 UTC
  1 point
  0
  
  couldn’t a superintelligence generate thousands of non-substantive variations for an effective, dangerous action while electing not to do so for other actions?
  
  I’m not sure what you mean. The model here is that you have a way to sample from a learned approximation to the distribution of actions that humans, experts, or whatever “safe” reference you trust would take, and then the superintelligence picks uniformly from the top x% of those samples.
  
  I don’t think shooting for a lower region would help much in practice; I expect most cases to also have bad actions in the 85th to (85+p)th percentile range.
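
The model described above, and the 85-95%ile variant raised in the parent comment, can be sketched roughly as follows. This is only an illustration under stated assumptions: `quantilize`, `sample_base`, `utility`, `lo`, and `hi` are hypothetical names, the base distribution is a stand-in Gaussian, and the quantile window is estimated from a finite batch of samples rather than from the true distribution.

```python
import random


def quantilize(sample_base, utility, n=1000, lo=0.90, hi=1.00, rng=None):
    """Pick uniformly among base samples ranked between the lo and hi quantiles.

    sample_base(rng) -> action   hypothetical sampler for the trusted base distribution
    utility(action)  -> float    hypothetical score the optimizer wants to maximize
    """
    if rng is None:
        rng = random.Random()
    # Candidates come from the base distribution, not from the optimizer itself.
    actions = sorted((sample_base(rng) for _ in range(n)), key=utility)
    start, stop = round(lo * n), round(hi * n)
    if not 0 <= start < stop <= n:
        raise ValueError("need 0 <= lo < hi <= 1 and a non-empty window")
    # Uniform choice inside the window; no further optimization within it.
    return rng.choice(actions[start:stop])


if __name__ == "__main__":
    rng = random.Random(0)
    base = lambda r: r.gauss(0.0, 1.0)  # stand-in for "what trusted humans do"
    score = lambda a: a                 # stand-in utility
    print("90-100%ile pick:", quantilize(base, score, lo=0.90, hi=1.00, rng=rng))
    print("85-95%ile pick: ", quantilize(base, score, lo=0.85, hi=0.95, rng=rng))
```

In this sketch the candidate pool is whatever `sample_base` returns, so near-duplicate variants of an action gain weight only if the base distribution itself produces them. Moving the window down to 85-95%ile changes only which slice of the ranking is eligible.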