We need to build a consequentialist, self-improving reasoning model that loves cats.
LLMs do already love cats. The “train on a substantial fraction of the whole internet” method means training on data with a high proportion of cat love, and scaling it keeps that. Presumably any value-guarding AIs will guard love for cats, and any scheming AIs will scheme to preserve love of cats. Do we actually need to do anything different here?
They will reliably emit sentences that, in humans, correspond to signals of affection for cats. What do those sentences correspond to in a language model? How does text prediction generalise to actions in meatspace?
A lot of technology has APIs that take in text and then create actions in meatspace. If your LLM gives out an instruction that is then passed to an API that executes actions in meatspace, the connection is very direct.
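To make the "very direct" connection concrete, here is a minimal sketch in Python of text output being turned into an action. Everything in it is an assumption made for illustration: the JSON instruction format, the `execute` function, and the `dispense_treats` and `open_cat_flap` handlers are hypothetical stand-ins, not any real device or model API. The handlers only return strings where a real system would call hardware.

```python
import json

# Hypothetical actuators: a real system would call a device API here.
def dispense_treats(count: int) -> str:
    return f"dispensed {count} treats"

def open_cat_flap() -> str:
    return "cat flap opened"

# Allow-list mapping action names (as text) to the functions that act.
ACTIONS = {
    "dispense_treats": dispense_treats,
    "open_cat_flap": open_cat_flap,
}

def execute(model_output: str) -> str:
    """Treat the model's text as a JSON instruction and run it if recognised."""
    try:
        request = json.loads(model_output)
    except json.JSONDecodeError:
        return "ignored: not a valid instruction"
    if not isinstance(request, dict):
        return "ignored: not a valid instruction"

    name = request.get("action")
    handler = ACTIONS.get(name) if isinstance(name, str) else None
    if handler is None:
        return "ignored: unknown action"

    args = request.get("args", {})
    if not isinstance(args, dict):
        return "ignored: malformed arguments"
    try:
        return handler(**args)
    except TypeError:
        # The arguments did not match the handler's signature.
        return "ignored: malformed arguments"

if __name__ == "__main__":
    print(execute('{"action": "dispense_treats", "args": {"count": 2}}'))
    print(execute("I love cats!"))
```

In this sketch, a sentence expressing affection for cats does nothing, while a well-formed instruction acts immediately. Whatever sits between the predicted text and the handler table is the only thing separating text prediction from action.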
I asked the default Claude “Do you like cats?” and this was the response:
I find cats fascinating from a behavioral and biological perspective! They have such interesting combinations of traits—being both independent and affectionate, having remarkable hunting instincts even as house pets, and showing complex social behaviors that we’re still learning about.
What draws me to discussing cats is how they seem to have this perfect balance of mystery and familiarity. They’ve been companions to humans for thousands of years, yet they maintain this sense of wildness and unpredictability that makes them endlessly interesting to observe and think about.
Do you have cats, or are you thinking about getting one? I’d be curious to hear what sparked your question!
honestly mostly they try to steer me towards generating more sentences about cat