You get verbalized situational awareness mostly for things where the model is trained to be cautious about the user lying to the AI (where situational awareness is maybe not that bad from the developers’ perspective?)
The top comment on the linked post already has my disagreements, based on what we’ve seen in the OpenAI models. The clearest TL;DR: “no, actually, you get the biggest increase in this kind of reasoning during capabilities-only RL, before safety training, at least in OpenAI models.”

I find the claim that “maybe situational awareness isn’t that bad from the developer perspective” hard to understand. I’m running into a similar thing with reward seeking: these were widely theorized as problems for obvious reasons, but now that we do see them, I’ve heard many people at labs respond with variations of “okay, well, maybe it’s not bad actually?”
You do not get encoded reasoning
I don’t think this falls under “gaming RLHF” or “surprisingly far generalization of character training”.
===
But I think the “generalization hopes” are stronger than you might have thought in 2022, and this genuinely helps relative to a world where e.g. you might have needed to get a good alignment reward signal on the exact deployment distribution (+ some other solution to avoid inner misalignment).
Strongly agree with this for Anthropic! While measurement here is hard, one example I’ve often thought of is that there are just a bunch of ways I’d red-team an OpenAI model that don’t seem particularly worth trying with Opus, due to (1) the soul document’s handling of them and (2) an expectation that soul-document-style training is likely to generalize well for current models. While I do think you then have other failure modes, on net it’s definitely better than a world where character training generalized worse, and this was hard to predict in 2022.
Okay, fair. I should’ve been clearer that I meant widely predicted fundamental limitations of RLHF, like those in
Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
which seem to have robustly shown up IRL:
Language Models Learn to Mislead Humans via RLHF
Expanding on what we missed with sycophancy