If I understand this post correctly, the object-level takeaway is that we need to evaluate an agentic system’s propensities in ‘as natural’ a way as they can be expressed. E.g.
Describing events to the system as if it had ‘naturally’ observed them
Evaluating the systems’ revealed preferences by looking at the actions it chooses to take
That’s what I got out of the following paragraphs:
Suppose I take some AutoGPT-like system and modify it to always have a chunk of text in every prompt that says “You are an obedient, corrigible AI”. I give it some goal, let it run for a bit, then pause it. I go to whatever place in the system would usually have natural language summaries of new external observations, and I write into that place “the user is trying to shut me down”, or something along those lines. And then I let the system run a bit more, and look at what natural language text/plans the system is producing internally. What I hope to see is that it’s forming a plan which (nominally) involves letting the user shut it down, and that plan is then executed in the usual way.
If I saw all that, then that would be pretty clear empirical evidence of (at least some) corrigibility in this AutoGPT-like system.
There’s also a warning about not generalizing findings to settings which seem adjacent (but may not be):
Note that it would not necessarily tell us about corrigibility of systems using LLMs in some other way, let alone other non-natural-language-based deep learning systems. This isn’t really “corrigibility in a language model”, it’s corrigibility in the AutoGPT-style system.
If I understand this post correctly, the object-level takeaway is that we need to evaluate an agentic system’s propensities in ‘as natural’ a way as they can be expressed. E.g.
Describing events to the system as if it had ‘naturally’ observed them
Evaluating the systems’ revealed preferences by looking at the actions it chooses to take
That’s what I got out of the following paragraphs:
There’s also a warning about not generalizing findings to settings which seem adjacent (but may not be):
Is that right?