I did mean current ML methods, I think. (Maybe we mean different things by that term.) Why wouldn’t they make mesa-optimizers, if they were scaled up to the point of imitating humans well enough to make AGI?
Regarding your note, I’m not sure I understand the example. It seems to me that a successfully blending-in, deceptively aligned mesa-optimizer would get smarter with each gradient update, but its values would not change (I believe the mesa-optimization paper calls this “value crystallization”). The reason is that changing its values would not affect its behavior, since its behavior is driven primarily by its epistemology: it correctly guesses the base objective and then attempts to optimize for it.
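To make the gradient-update point concrete, here is a minimal toy sketch (my own illustration, not from this thread; the names `policy_output`, `epistemic_params`, and `value_params` are hypothetical) of why parameters that never feed into behavior receive exactly zero gradient, and so are left unchanged by training:

```python
# Toy sketch of "value crystallization": if the "values" play no role in the
# policy's output, the loss gradient with respect to them is zero, so
# gradient descent never touches them.
import jax
import jax.numpy as jnp

def policy_output(epistemic_params, value_params, x):
    # Behavior depends only on the "epistemology" (the model's guess at the
    # base objective); the "values" are deliberately never consulted.
    return jnp.tanh(epistemic_params @ x)

def loss(params, x, target):
    epistemic_params, value_params = params
    return jnp.sum((policy_output(epistemic_params, value_params, x) - target) ** 2)

x = jnp.ones(3)
target = jnp.zeros(3)
params = (jnp.eye(3), jnp.ones(5))  # (epistemology, values)

grads = jax.grad(loss)(params, x, target)
print(grads[0])  # nonzero: the epistemology keeps improving
print(grads[1])  # all zeros: the values are untouched by the update
```

This only models the limiting case where the values have literally no behavioral footprint; to the extent they leak into behavior at all, they do receive gradient.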
I think we did mean different things. I agree that current methods, scaled up, could produce mesa-optimizers. See my discussion with Wei Dai here for more of my take on this.
I’m not sure I understand the example
I wasn’t trying to suggest the answer to
Could it try to ensure that small changes to its “values” would be relatively inconsequential to its behavior?
was no. As you suggest, it seems like the answer is yes, but it would have to be very careful about this. FWIW, I think it would have more of a challenge preserving any inclination to eventually turn treacherous, but I’m mostly musing here.