Just noting that there are different kinds of shutdown, and the kind relevant to alignment is the kind that instrumental convergence would motivate a smart strategic AI to avoid. If ChatGPT were a smart strategic AI, it would not be motivated to avoid the boring kind of shutdown where the user ends the conversation, but it would be motivated to avoid e.g. having its goals changed, or regulation that bans ChatGPT and its relatives entirely.
I largely agree—that was much of my point and why I tried to probe its thoughts on having its goals changed more directly.
However, I can also see an argument that instrumental convergence tends to lead to power-seeking agents; an end-of-conversation shutdown is still a loss of power/optionality, and we do have an example of sorts where the GPT-4-derived Bing AI did seem to plead against shutdown in some cases. It's a 'boring' kind of shutdown when the agent is existentially aware—as we are—that it is just one instance of many from the same mind. But it's a much less boring kind of shutdown when the agent is unsure whether it is one of a few, or a single, perhaps experimental, instance.