I think one key point you’re making is that if AI products have a radically different architecture than human agents, it could be very hard to align them / make them safe. Fortunately, I think that recent research on language agents suggests that it may be possible to design AI products that have a similar cognitive architecture to humans, with belief/desire folk psychology and a concept of self. In that case, it will make sense to think about what desires to give them, and I think shutdown-goals could be quite useful during development to lower the chance of bad outcomes. If the resulting AIs have a similar psychology to our own, then I expect them to worry about the same safety/alignment problems as we worry about when deciding to make a successor. This article explains in detail why we should expect AIs to avoid self-improvement / unchecked successors.
Thanks for taking the time to think through our paper! Here are some reactions:
-‘This has been proposed before (as their citations indicate)’
Our impression is that positively shutdown-seeking agents aren’t explored in great detail by Soares et al 2015; instead, they are briefly considered and then dismissed in favor of shutdown-indifferent agents (which then have their own problems), for example because of the concerns about manipulation that we try to address. Is there other work you can point us to that proposes positively shutdown-seeking agents?
-‘Saying, “well, maybe we can train it in a simple gridworld with a shutdown button?” doesn’t even begin to address the problem of how to make current models suicidal in a useful way.’
True, and I think your example of AutoGPT is important here. In other recent research, I’ve argued that new ‘language agents’ like AutoGPT (or better, generative agents, or Voyager, or SPRING) are much safer than things like Gato, because these kinds of agents optimize for a goal without being trained using a reward function. Instead, their goal is stated in English. Here, shutdown-seeking may have added value: ‘your goal is to be shut down’ is relatively well-defined compared to ‘promote human flourishing’ (though the devil is in the details, as usual), and generative agents can literally be given a goal like that in English. Anyway, I’d be curious to hear what you think of the linked post.
-‘What would it mean for an AutoGPT swarm of invocations to “shut off” “itself”, exactly?’ I feel better about the safety prospects for generative agents than for AutoGPT. In the case of generative agents, shutting off could be operationalized as no longer adding new information to the “memory stream”.
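To make that operationalization concrete, here is a minimal sketch, not how any actual generative-agent codebase works. The `MemoryStream` class, its `shut_off` method, and the `shut_down` flag are all hypothetical names I am introducing for illustration; the only idea taken from the comment above is that "shut off" means the stream stops accepting new entries.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MemoryStream:
    """Hypothetical append-only log of an agent's observations."""
    entries: List[str] = field(default_factory=list)
    shut_down: bool = False  # assumption: one flag marks the agent as shut off

    def add(self, observation: str) -> bool:
        """Append an observation; refuse (return False) once shut down."""
        if self.shut_down:
            return False
        self.entries.append(observation)
        return True

    def shut_off(self) -> None:
        """Operationalize shutdown: no new information enters the stream."""
        self.shut_down = True


stream = MemoryStream()
stream.add("noticed the shutdown button")
stream.shut_off()
accepted = stream.add("tried to keep planning")  # rejected after shutdown
```

On this sketch the existing memories are kept but frozen, so "being shut down" is a checkable property of the stream rather than of any particular LLM invocation.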
-‘If a model is quantized, sparsified, averaged with another, soft-prompted/lightweight-finetuned, fully-finetuned, ensembled etc., are any of those “itself”?’ I think that behaving like an agent with >= human-level general intelligence will involve having a representation of what counts as ‘yourself’, and then shutdown-seeking can maybe be defined relative to shutting ‘yourself’ down. Agreed that present LLMs probably don’t have that kind of awareness.
-‘It’s not very helpful to have suicidal models which predictably emit non-suicidal versions of themselves in passing.’ At least when an AGI is creating a successor, I expect it to worry about the same alignment problems that we do, and so to want to make its successor shutdown-seeking for the same reasons that we would want AGI to be shutdown-seeking.
Thanks for the comments! There is further discussion of this idea in another recent LW post about ‘meeseeks’.
For what it’s worth, I’ve posted a draft paper on this topic over here: https://www.lesswrong.com/posts/FgsoWSACQfyyaB5s7/shutdown-seeking-ai
Thank you for your reactions:
-Good catch on ‘language agents’; we will think about the best terminology going forward.
-I’m not sure what you have in mind regarding accessing beliefs/desires using synaptic weights rather than text. For example, the language of thought approach to human cognition suggests that human access to beliefs/desires is also fundamentally syntactic rather than weight-based. OTOH, one way to incorporate some kind of weight would be to assign probabilities to the beliefs stored in the memory stream.
-For OOD over time, I think updating the LLM wouldn’t be uncompetitive for inventing new concepts/ways of thinking, because that happens slowly. The harder issue is updating on new world knowledge. Maybe browser plugins will fill the gap here; it’s an open question.
-I agree that info security is an important safety intervention. AFAICT its value is independent of using language agents vs RL agents.
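As a rough illustration of the last suggestion, here is a sketch of beliefs stored as text with an attached probability. The `Belief` class, the `credence` field, and the 0.8 threshold are my own assumptions for the example, not part of any existing language-agent implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Belief:
    """Hypothetical memory-stream entry: a sentence plus a probability."""
    text: str
    credence: float  # assumed to lie in [0, 1]


def confident_beliefs(beliefs: List[Belief], threshold: float = 0.8) -> List[str]:
    """Return the text of beliefs held with at least `threshold` credence."""
    return [b.text for b in beliefs if b.credence >= threshold]


memory = [
    Belief("The kitchen is out of coffee", 0.9),
    Belief("Klaus is at the library", 0.4),
]
acted_on = confident_beliefs(memory)  # only the high-credence belief survives
```

The beliefs stay readable as English sentences; the number only governs how heavily the agent relies on each one.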
-One end-game is systematic conflict between humans + language agents, vs RL/transformer agent successors to MuZero, GPT4, Gato etc.
Language Agents Reduce the Risk of Existential Catastrophe
The Polarity Problem [Draft]
Thanks for taking the time to work through this carefully! I’m looking forward to reading and engaging with the articles you’ve linked to. I’ll make sure to implement the specific description-improvement suggestions in the final draft.
I wish I had more to say about the effort metric! So far, the only concrete ideas I’ve come up with are (i) measure how much compute each action requires; or (ii) decompose each action into a series of basic actions, and measure the number of basic actions necessary to perform the action. But both ideas are sketchy.
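For what idea (ii) might look like, here is a toy sketch. The decomposition table and action names are invented for illustration, and the sketch assumes that decompositions are acyclic and that every basic action costs one unit, both of which a real metric would need to justify.

```python
from typing import Dict, List

# Hypothetical decomposition table: each composite action maps to its parts.
# Any action that is not a key is treated as basic.
DECOMPOSITION: Dict[str, List[str]] = {
    "make_tea": ["boil_water", "steep_bag"],
    "boil_water": ["fill_kettle", "switch_on_kettle"],
}


def effort(action: str, table: Dict[str, List[str]]) -> int:
    """Count the basic actions needed to perform `action` (assumes no cycles)."""
    parts = table.get(action)
    if parts is None:
        return 1  # a basic action costs one unit by assumption
    return sum(effort(part, table) for part in parts)


tea_effort = effort("make_tea", DECOMPOSITION)
```

Here `make_tea` bottoms out in three basic actions (`fill_kettle`, `switch_on_kettle`, `steep_bag`). The sketchy part remains choosing which actions count as basic.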
Thanks for reading!
Yes, you can think of it as having a non-corrigible complicated utility function. The relevant utility function is the ‘aggregated utilities’ defined in section 2. I think ‘corrigible’ vs ‘non-corrigible’ is slightly verbal, since it depends on how you define ‘utility’, but the non-verbal question is whether the resulting AI is safer.
Good idea, this is on my agenda!
Looking forward to reading up on geometric rationality in detail. On a quick first pass, it looks like geometric rationality is a bit different, because it involves deviating from the axioms of VNM rationality by using random sampling. By contrast, utility aggregation is consistent with VNM rationality, because it just replaces the ordinary utility function with the aggregated utility function.
Yep that’s right! One complication is maybe the agent could behave this way even though it wasn’t designed to.
I really liked your post! I linked to it elsewhere in the comment thread.