I notice that I’m confused about quantilization as a theory, independent of the hodge-podge alignment proposal. You wrote “The AI, rather than maximising the quality of actions, randomly selects from the top quantile of actions.”
But the entire reason we’re avoiding maximisation at all is that we suspect that the maximised action will be dangerous. As a result, aren’t we deliberately choosing a setting which might just return the maximised, potentially dangerous action anyway?
(Possible things I’m missing: the action space is incredibly large; the danger is not from a single maximised action but from a large chain of them.)
Correct: there’s a chance the expected utility quantilizer takes the same action as the expected utility maximizer. That probability is the inverse of the number of actions in the top quantile, which is quite small (possibly measure zero in a continuous action space) because the action space is so large.
Maybe it’s defined like this so it has simpler mathematical properties. Or maybe it’s defined like this because it’s safer. Not sure.
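To make that concrete, here is a minimal sketch of a quantilizer over a finite action set with a uniform base distribution (the simplest case; the names `actions`, `utility`, and `q` are my own illustration, not from the post):

```python
import numpy as np

def quantilize(actions, utility, q=0.1, rng=None):
    """Sample an action uniformly at random from the top-q fraction by utility.

    Illustrative sketch: assumes a finite action set and a uniform base
    distribution, so the quantilizer reduces to a uniform draw over the
    top quantile.
    """
    rng = np.random.default_rng() if rng is None else rng
    scores = np.array([utility(a) for a in actions])
    # Number of actions in the top-q quantile (at least one).
    k = max(1, int(np.ceil(q * len(actions))))
    # Indices of the k highest-utility actions.
    top = np.argsort(scores)[-k:]
    # Uniform choice within the top quantile: the maximising action is
    # included, but is returned with probability only 1/k.
    return actions[int(rng.choice(top))]
```

With, say, a million candidate actions and q = 0.01, the maximising action comes back with probability 1/10,000; as the action space becomes continuous, that probability shrinks toward zero, which is the “possibly measure zero” point above.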