I understand that the mesa objective could be quite different from the base objective. But
...wait, maybe something just clicked. We might suspect that the mesa objective looks (roughly) like influence-seeking, since that objective is consistent with all of the outputs we’ve seen from the system (and moreover we might be even more suspicious that particularly influential systems were actually optimizing for influence all along), and maybe an agent-ish mesa-optimizer gets selected because it’s relatively good at appearing to fulfill the base objective...?
I guess I (roughly) understood the inner alignment concern but still didn’t think of the mesa-optimizer as an agent… need to read/think more. Still, it feels likely that we could rule out agent-y-ness by saying something along the lines of “yes, some system with these text inputs could be agent-y and affect the real world, but we know this system only looks at the relative positions of tokens and outputs the token that most frequently follows those; a system would need a fundamentally different structure to be agent-y or to have beliefs or preferences” (and it seems likely that something like this could be said about GPT-3).
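For concreteness, the kind of system I have in mind in that quote is something like a toy lookup-table predictor. This is purely my own illustrative sketch (a bigram counter, far simpler than GPT-3’s actual architecture), just to show what “outputs the token that most frequently follows” could mean structurally:

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count, for each token, which tokens followed it in the training text."""
    follows = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        follows[prev][nxt] += 1
    return follows

def predict_next(follows, token):
    """Output the token that most frequently followed `token`; None if unseen."""
    if token not in follows:
        return None
    return follows[token].most_common(1)[0][0]

# Tiny usage example on a toy corpus.
corpus = "the cat sat on the mat the cat ran".split()
model = train_bigram(corpus)
print(predict_next(model, "the"))  # -> "cat"
```

A pure frequency table like this seems like a clear case where talk of beliefs or preferences doesn’t get off the ground; the open question is whether that kind of argument still goes through for something as large and opaque as GPT-3.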
Yep! I recommend Gwern’s classic post on why tool AIs want to be agent AIs.
One somewhat plausible argument I’ve heard is that GPTs are merely feedforward networks and that agency is relatively unlikely to arise in such networks. And of course there’s also the argument that agency is most natural/incentivised when you are navigating some environment over an extended period of time, which GPT-N isn’t doing. There are lots of arguments like this we can make. But currently it’s all pretty speculative; the relationship between base and mesa objectives is poorly understood; for all we know, even GPT-N could be a dangerous agent. (Also, people mean different things by “agent”, and most people don’t have a clear concept of agency anyway.)