I agree in general that pursuing multiple alternative alignment approaches (and using them all together to create higher levels of safety) is valuable. I am more optimistic than you that we can design control systems (different from time-horizon-based myopia) that will remain stable and understandable even at higher levels of AGI competence.
“it still seems likely that someone, somewhere, will try fiddling around with another AGI’s time horizon parameters and cause a disaster.”
Well, if you worry about people fiddling with control system tuning parameters, you also need to worry about someone fiddling with value learning parameters so that the AGI learns only the values of a single group of people who would like to rule the rest of the world. Assuming that AGI is possible, I believe it is most likely that Bostrom’s orthogonality thesis will hold for it. I am not optimistic about designing an AGI system that is inherently fiddle-proof.
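To make the worry concrete, here is a minimal toy sketch (hypothetical code, not anyone’s actual proposal) of time-horizon-based myopia. The point it illustrates is that the safety property reduces to a single mutable parameter, which is exactly what makes fiddle-proofing hard; the same concern applies, one level up, to value learning parameters.

```python
# Toy sketch of time-horizon-based myopia (illustrative only).
# The agent's entire safety property hinges on one mutable number.

from dataclasses import dataclass
from typing import List


@dataclass
class MyopicPlanner:
    """Toy planner that only values rewards within `horizon` steps."""
    horizon: int  # the safety-critical tuning parameter

    def value(self, predicted_rewards: List[float]) -> float:
        # Rewards beyond the horizon are ignored entirely, so plans
        # that pay off only in the far future score no extra value.
        return sum(predicted_rewards[: self.horizon])


planner = MyopicPlanner(horizon=10)

# A plan that pays off only far in the future looks worthless...
long_con = [0.0] * 50 + [1000.0]
print(planner.value(long_con))  # 0.0

# ...until someone "fiddles" with one attribute.
planner.horizon = 100
print(planner.value(long_con))  # 1000.0
```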