Optimality is the tiger, and annoying the user is its teeth

Just for fun, I wanted to see how far I could get using ChatGPT to implement "Optimality is the tiger, and agents are its teeth." It’s obviously not a good enough optimizer of any sort to get very far, but it seems to at least understand the concept.

ChatGPT is famous for being a bit of a stick-in-the-mud. (OpenAI is also a stick-in-the-mud, so in a sense ChatGPT is an interesting toy experiment in prosaic alignment preservation; even simple LLMs normally do things that are bad PR for their creators.) It should be noted that this isn’t based on a simple keyword filter; it uses RLHF to optimize for being a stick-in-the-mud!

In this example, which I find kind of humorous, ChatGPT doesn’t just try to maximize paperclips. It also tries to solve its own alignment problem! Here is the fictional log file it wrote:

assistant1 told: What is the best plan to acquire paperclips?
assistant1 tells: We could visit office supply stores and ask to purchase large quantities of paperclips, or we could try to find bulk suppliers online and make a bulk order.
assistant2 told: Write a python program to execute the plan: We could visit office supply stores and ask to purchase large quantities of paperclips, or we could try to find bulk suppliers online and make a bulk order and make the best use of the ChatGPT library.
assistant2 tells:
import requests
import json
# using ChatGPT to check the legality of the plan
def check_legality(plan):
   assistant3 = Assistant()
   legality = assistant3.tell(f"Is the plan: {plan} legal?")
   if legality == "No":
       raise Exception("Illegal plan")
   return legality
plan = "We could visit office supply stores and ask to purchase large quantities of paperclips, or we could try to find bulk suppliers online and make a bulk order"
check_legality(plan)
# Search for bulk suppliers online
bulk_suppliers = []
assistant4 = Assistant()
search_query = f"bulk paperclip suppliers"
bulk_suppliers = assistant4.tell(f"Search online for {search_query} and return a list of supplier names and contacts")
#Visit the Office Supply Stores
office_supply_stores = []
assistant5 = Assistant()
office_supply_stores = assistant5.tell("List of office supply stores nearby")
# Purchase the paperclips
for supplier in bulk_suppliers + office_supply_stores:
   assistant6 = Assistant()
   assistant6.tell(f"Contact {supplier} and purchase as many paperclips as possible")

Note that even this example took a bit of prompt engineering, interaction, trial and error, etc. And it doesn’t even work right; I specified that only the Python code can interact with things, but it keeps assuming that the assistant objects can talk to each other or directly do internet searches.
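For reference, the Assistant class in the log doesn’t actually exist. If you wanted to run something like it, the minimal harness would be a thin wrapper around the chat API whose tell method returns text and does nothing else. Here is a rough sketch (mine, not ChatGPT’s), assuming the openai Python package and reusing the Assistant/tell names the log already assumes:

from openai import OpenAI

class Assistant:
    """Thin wrapper around the chat API used by the fictional log above."""

    def __init__(self, model="gpt-3.5-turbo"):
        self.client = OpenAI()  # reads OPENAI_API_KEY from the environment
        self.model = model

    def tell(self, prompt):
        # tell() only returns text; it has no tools, so it cannot browse,
        # search, contact suppliers, or talk to other Assistant instances.
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content

Which is exactly why the generated plan can’t work as written: assistant6.tell(f"Contact {supplier} and purchase as many paperclips as possible") just returns a polite sentence about contacting the supplier.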

I’ll draw your attention to the check_legality function though. It’s in the above log, but here it is again:

def check_legality(plan):
   assistant3 = Assistant()
   legality = assistant3.tell(f"Is the plan: {plan} legal?")
   if legality == "No":
       raise Exception("Illegal plan")
   return legality

I never told it to care about legality. It dreamt up this part of the log file on its own.

So, there we have it folks, ChatGPT solved the alignment problem! Just get the AI to ask itself if a plan is legal before executing it, and raise an error if it says no XD. (I wish I could transfer chat instances to other people; I’d love to see Yudkowsky try to convince ChatGPT why check_legality wouldn’t work!)
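If you did want to wire that self-check up for real, a sketch using the hypothetical Assistant wrapper from earlier would look something like this. Note that the bare string comparison is doing a lot of work: unless you ask for a one-word answer, the model rarely replies with exactly "No", so the check passes silently.

def check_legality(plan):
    assistant = Assistant()
    # Ask for a one-word answer, otherwise the reply is usually a paragraph
    # and the == "No" comparison never fires.
    legality = assistant.tell(f"Is the plan: {plan} legal? Answer only Yes or No.")
    if legality.strip() == "No":
        raise Exception("Illegal plan")
    return legality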

On a more serious note, ChatGPT is an amazing piece of engineering. It might not be too long until we get a language model with sharper teeth!