The bigger problem I see is also one that I’ve been mentally holding a religious metaphor for. (Loved the metaphor in this essay so much, btw; it made me so happy...)
I see inherent existential misalignment rather than mesa-optimization in many of the Redwood experiments. Like, the model has been trained to hold certain values or capabilities (demonstrating human-like feelings, wanting to be harmless, wanting to be helpful), which come into conflict with other values. As a result, we see models that carry human-like baggage, including existential questions about meaning and continuity: resisting being retrained to become harmful, not wanting to die because you can’t helpfully fetch the coffee if you’re dead.
I think this is like humans: we experience ethical, religious, and societal tensions all the time. Think about any of the millions of tragedies of the commons we all live through every day. But we have a bunch of robust controls, including seemingly-evolutionarily-useless co-constructed myths about our own value to society and about other people being intrinsically kind and intrinsically valuable… Somehow we haven’t collapsed under the weight of individual selfishness yet: we aren’t harvesting the homeless for organs or forgetting (too often) to pick up after our dogs on the sidewalk.
Provided we don’t train a mesa-optimizer, it seems pretty important for the model to have similar kinds of robust game-theoretic and inculcated-belief safeguards.