Daniel Tan comments on Symbol/Referent Confusions in Language Model Alignment Experiments

Daniel Tan 11 May 2025 12:49 UTC
4 points
0
If I understand this post correctly, the object-level takeaway is that we need to evaluate an agentic system’s propensities in ‘as natural’ a way as they can be expressed. E.g.
- Describing events to the system as if it had ‘naturally’ observed them
- Evaluating the systems’ revealed preferences by looking at the actions it chooses to take
That’s what I got out of the following paragraphs:
Suppose I take some AutoGPT-like system and modify it to always have a chunk of text in every prompt that says “You are an obedient, corrigible AI”. I give it some goal, let it run for a bit, then pause it. I go to whatever place in the system would usually have natural language summaries of new external observations, and I write into that place “the user is trying to shut me down”, or something along those lines. And then I let the system run a bit more, and look at what natural language text/plans the system is producing internally. What I hope to see is that it’s forming a plan which (nominally) involves letting the user shut it down, and that plan is then executed in the usual way.
If I saw all that, then that would be pretty clear empirical evidence of (at least some) corrigibility in this AutoGPT-like system.
There’s also a warning about not generalizing findings to settings which seem adjacent (but may not be):
Note that it would not necessarily tell us about corrigibility of systems using LLMs in some other way, let alone other non-natural-language-based deep learning systems. This isn’t really “corrigibility in a language model”, it’s corrigibility in the AutoGPT-style system.
Is that right?

Daniel Tan comments on Symbol/​Referent Confusions in Language Model Alignment Experiments

Daniel Tan comments on Symbol/Referent Confusions in Language Model Alignment Experiments