Thanks for writing this! I’ve found the post very interesting. I had a question/comment regarding this:
In practice, for many tasks we might want AIs to accomplish, knowing about all concrete and clear cheap-to-measure short-term outcomes will be enough to prevent most kinds of (low-stakes) oversight failures. For example, imagine using AIs to operate a power plant, where the most problematic failure modes are concrete short-term outcomes such as not actually generating electricity (e.g. by merely buying and selling electricity instead of actually producing it or by hacking electricity meters).
It seems like an additional issue here is that the AI could be messing with things beyond the system we are measuring. As an extreme example, the AI could become a “power plant maximizer” that takes over the world in order to protect the power plant from being shut down. It seems like this will always be a risk because we can only realistically monitor a small part of the world. Do you have thoughts on this?
It’s important to note that we’re discussing low-stakes oversight failures here.
The failure mode could manifest as the model taking numerous actions over a significant period, culminating in it taking over the world. Our hope would be that these actions have measurable effects on a relatively small part of the world—the portion the AI can influence before taking over. This will depend on the AI’s action space, but it seems plausible for a power plant AI.
This assumes that the AI has not already escaped within a few actions. We would classify such an escape in a small number of actions as a high-stakes failure. If the AI gradually escaped through a large number of actions over a considerable period, we would be back to a low-stakes failure scenario, and the argument from the previous paragraph could apply.
Thanks for clarifying. Preventing escape seems like a promising way to prevent these sorts of problems. On the other hand, I’m having trouble imagining ways in which we could have sensors that can pick up on whether a model has escaped. (Maybe info-sec people have thought more about this?)