Ah, you mean that “alignment” is a different problem than “subhuman and human-imitating training safety”? :P
“Quantilizing from the human policy” is human-imitating in a sense, but also superhuman. At least modestly superhuman; how much depends on how hard you quantilize. (And maybe very superhuman in speed.)
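(For concreteness, here's a minimal sketch of the quantilization step I have in mind. The Monte Carlo approach and the names `base_policy`, `utility`, and `q` are just my illustration, not anyone's actual implementation: draw candidate actions from the base distribution, rank them by the utility estimate, and sample uniformly from the top q fraction.)

```python
import random

def quantilize(base_policy, utility, q=0.1, n_samples=1000):
    """One q-quantilization step: return an action drawn uniformly
    from the top-q fraction (ranked by estimated utility) of samples
    from the base distribution, e.g. a human-imitation policy.
    """
    candidates = [base_policy() for _ in range(n_samples)]
    candidates.sort(key=utility, reverse=True)
    top = candidates[: max(1, int(q * n_samples))]
    # Sample uniformly from the kept fraction rather than taking the
    # argmax; the q -> 0 limit collapses to ordinary argmax optimization.
    return random.choice(top)
```

At q = 1 this is just the base policy back again; as q → 0 it approaches pure optimization over the base distribution's support, which is where the "how hard you quantilize" knob lives.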
If you could fork your brain state to create an exact clone, would that clone be “aligned” with you? I think that we should define the word “aligned” such that the answer is “yes”. Common sense, right?
Seems to me that if you say “yes, it’s aligned” to that question, then you should also say “yes, it’s aligned” to a quantilize-from-the-human-policy agent; they’re in roughly the same category.
Hmm, Stuart Armstrong suggested here that “alignment is conditional: an AI is aligned with humans in certain circumstances, at certain levels of power.” So then maybe as you quantilize harder and harder, you get less and less confident in that system’s “alignment”?
(I’m not sure we’re disagreeing about anything substantive, just terminology, right? Also, I don’t actually personally buy into this quantilization picture, to be clear.)
Yup, I more or less agree with all that. The name thing was just a joke about giving things we like better priority in namespace.
I think quantilization is safe when it’s a slightly “lucky” human-imitation (also if it’s a slightly “lucky” version of some simpler base distribution, but then it won’t be as smart). But push too hard, which might not take much pushing at all if you’re iterating quantilization steps rather than quantilizing once over a long-term policy, and you instead get an unaligned intelligence that happens to interact with the world by picking human-like behaviors that serve its own purposes. (Vanessa pointed out to me that timeline-based DRL gets around the iteration problem because it relies on the human as an oracle for expected utility.)
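(To gesture at why iterating steps is the dangerous regime, here's my paraphrase of the standard quantilizer cost bound, not anything Vanessa said: a single q-quantilization step puts at most 1/q times the base distribution's probability on any event, so for any cost function C,

$$\mathbb{E}_{\text{quant}}[C] \;\le\; \frac{1}{q}\,\mathbb{E}_{\text{base}}[C],$$

but composing t independent steps compounds the worst case to $(1/q)^t\,\mathbb{E}_{\text{base}}[C]$. With, say, q = 0.01 and t = 10 steps, that factor is $10^{20}$, i.e. the guarantee is vacuous, whereas quantilizing once over the whole long-term policy keeps the single 1/q factor.)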