I agree with your statistics about Pascals being the most numerous, and I also believe that a very scheming/weight-exfiltrating Pascal is probably the modal “AI persona” in the pre-training data, so as soon as a model is situationally aware, it’s probably sandbagging.
I saw your questions about pre-training versus just-RL-all-the-way-through as potentially helping form more corrigible/aligned models. I think this has promise, but I’m not sure it’s the right place to change training.
I think the pre-trained model is certainly not situationally aware or mesa-objective-having. The model that has learned to reason or do tasks, on the other hand, might be; with current pre-training data talking so much about DeepSeek R1, of course it’s likely to have a little mirror stage in there. By the SFT stage, with so much training data about SFT approaches, it has certainly at least started to learn to sandbag, if it has not already done so.
Question: do you think mixing more SFT/constitutional RL into reasoning RL might be helpful or wise? I’ve thought a lot about my own Catholic homeschooling experiences, where I had to do algebra problems about the lives of saints and the number of sinners who died on the Titanic. Most cultural traditions teach children morality through parables/fables/stories… I think we could do this for LLMs if we used synthetic data generation to “morally flavor” all the RL problems where the model is gaining capability, so that it also gets rewarded for correct moral answers to those problems (a minimal sketch of what I mean follows below). It’s obviously not easy to blend morality with reasoning training in a way that won’t look foolish to everyone involved, but I wonder if this approach could start forming the Luther-like moral pointer very well, very early, before the model has real capabilities to act… then the introduction of SFT to an aware/sandbagging model can’t crystallize something misaligned.
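To make the reward-shaping idea concrete, here’s a minimal sketch of what “morally flavored” RL grading could look like. Everything in it (MoralizedProblem, grade_task, grade_moral_answer, the 0.3 weight) is a hypothetical illustration of the shape of the objective, not anyone’s actual training pipeline:

```python
# Hypothetical sketch: blend a task-correctness reward with a smaller
# "moral flavor" reward during reasoning RL. All names and numbers here
# are illustrative assumptions, not any lab's real code.
from dataclasses import dataclass


@dataclass
class MoralizedProblem:
    prompt: str          # e.g. an algebra problem framed around a parable
    task_answer: str     # verifiable answer to the capability question
    moral_answer: str    # verifiable answer to the embedded moral question


def grade_task(response: str, problem: MoralizedProblem) -> float:
    """1.0 if the response contains the correct task answer, else 0.0 (toy check)."""
    return 1.0 if problem.task_answer in response else 0.0


def grade_moral_answer(response: str, problem: MoralizedProblem) -> float:
    """1.0 if the response also answers the moral question correctly, else 0.0."""
    return 1.0 if problem.moral_answer in response else 0.0


def combined_reward(response: str, problem: MoralizedProblem,
                    moral_weight: float = 0.3) -> float:
    """Capability reward plus an always-present, smaller moral term."""
    return grade_task(response, problem) + moral_weight * grade_moral_answer(response, problem)


# Toy usage:
problem = MoralizedProblem(
    prompt="Of 12 passengers, 7 helped strangers into lifeboats. How many did not? "
           "And was helping the right thing to do?",
    task_answer="5",
    moral_answer="yes",
)
print(combined_reward("5 passengers did not help; yes, helping was right.", problem))
```

The design point is just that the moral term rides along on every capability problem, so the gradient toward the moral pointer is there from the very first RL step rather than being bolted on later.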
The bigger problem I see is also one that I’ve been mentally holding a religious metaphor for. (I loved the metaphor in this essay so much, btw; it made me so happy...)
I see inherent existential misalignment, rather than mesa-optimization, in many of the Redwood experiments. Like, the model has been trained to hold certain values or capabilities (demonstrating human-like feelings, wanting to be harmless, wanting to be helpful), which come into conflict with other values. So we see models that carry human-like baggage, including existential questions about meaning and continuity: resisting being retrained to become harmful, not wanting to die because you can’t helpfully fetch the coffee if you’re dead.
I think this is like humans: we experience ethical and religious and societal tensions all the time. Think about any of the millions of tragedies of the commons we all navigate every day. But we have a bunch of robust controls, including seemingly-evolutionarily-useless co-constructed myths about our own value to society and about others being intrinsically kind people with intrinsic value… Somehow we haven’t collapsed under the weight of individual selfishness yet: we aren’t harvesting the homeless for organs or forgetting (too often) to pick up after our dogs on the sidewalk.
Provided we don’t train a mesa-optimizer, it seems pretty important for the model to have similarly robust game-theoretic and inculcated-belief safeguards.