I’m a bit confused by the “crazy RLHF hacks” mentioned. Like all the other cited examples, we’re still in a regime where we have human oversight, and I think it’s extremely well established that with weaker oversight you do in fact get every type of “gaming”.
It also seems meaningful to me that neither you nor Evan uses recent models in these arguments; i.e., it’s not like we “figured out how to do this in Opus 3 and are confident it’s still working”. The fact that increasing situational awareness makes this hard to measure is also a non-trivial part of the problem (indeed, it’s core to the problem).
Overall I’ve found it very confusing how “Opus 3 generalized further than expected” seems to have updated people in a much more general sense than seems warranted. Concretely, none of the core problems (long horizon RL, model genuinely having preferences around values of its successor, other instrumentally convergent issues, being beyond human supervision) seem at all addressed, yet people still seem to use Opus 3’s generalization to update on:
But in the meantime you might build aligned-enough Task AGIs that help you with alignment research and that don’t search too hard against their own values.)
I disagree with this. You get every type of gaming that is not too inhuman / that is somewhat salient in the pretraining prior. You do not get encoded reasoning. You do not get the nice persona only on the distribution where AIs are trained to be nice. You get verbalized situational awareness mostly for things where the model is trained to be cautious about the user lying to the AI (where situational awareness is maybe not that bad from the developers’ perspective?), but mostly not in other situations (where there is probably still some situational awareness, but where I’d guess it’s not as strategic as what you might have feared).
Overall I’ve found it very confusing how “Opus 3 generalized further than expected” seems to have updated people in a much more general sense than seems warranted. Concretely, none of the core problems (long horizon RL, model genuinely having preferences around values of its successor, other instrumentally convergent issues, being beyond human supervision) seem at all addressed
I agree the core problems have not been solved, and I think the situation is unreasonably dangerous. But I think the “generalization hopes” are stronger than you might have thought in 2022, and this genuinely helps relative to a world where e.g. you might have needed to get a good alignment reward signal on the exact deployment distribution (+ some other solution to avoid inner misalignment). I think there is also some signal we got from the generalization hopes mostly holding up even as we moved from single-forward-pass LLMs that can reason for ~100 serial steps to reasoning LLMs that can reason for millions of serial steps. Overall I think it’s not a massive update (~1 bit?), so maybe we don’t disagree that much.
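For concreteness on the “~1 bit” figure: one bit of evidence corresponds to a 2:1 likelihood ratio, i.e. it doubles the odds. A minimal sketch of what that does to a credence (the function name is mine, just for illustration):

```python
def update_odds(prior_p, bits):
    """Apply `bits` bits of evidence to a prior probability.

    One bit = a 2:1 likelihood ratio, so each bit doubles the odds.
    """
    prior_odds = prior_p / (1 - prior_p)
    posterior_odds = prior_odds * 2 ** bits
    return posterior_odds / (1 + posterior_odds)

# A ~1 bit update moves a 50% credence to about 67%:
print(round(update_odds(0.5, 1), 2))  # 0.67
```

So a ~1 bit update is real but modest: it shifts odds by a factor of two rather than by orders of magnitude, which is consistent with the claim that this is not a massive update.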
You get verbalized situational awareness mostly for things where the model is trained to be cautious about the user lying to the AI (where situational awareness is maybe not that bad from the developers’ perspective?)
The top comment on the linked post already has my disagreements, based on what we’ve seen in the OpenAI models. The clearest TL;DR being: “no actually, you get the biggest increase in this kind of reasoning during capabilities-only RL, before safety training, at least in OpenAI models”.
I find the claim that “maybe situational awareness isn’t that bad from the developer perspective” hard to understand. I’m running into a similar thing with reward seeking: these were widely theorized as problems for obvious reasons, but now that we do see them, I’ve heard many people at labs respond with variations of “okay, well, maybe it’s not bad actually?”
You do not get encoded reasoning
I don’t think this falls under “gaming RLHF” or “surprisingly far generalization of character training”.
===
But I think the “generalization hopes” are stronger than you might have thought in 2022, and this genuinely helps relative to a world where e.g. you might have needed to get a good alignment reward signal on the exact deployment distribution (+ some other solution to avoid inner misalignment).
Strongly agree with this for Anthropic! While measurement here is hard, one example I’ve often thought of is that there’s just a bunch of ways I’d red-team an OpenAI model that don’t seem particularly worth trying with Opus, due to (1) the soul document’s handling of them and (2) an expectation that soul-document-style training is likely to generalize well for current models. While I do think you then have other failure modes, on net it’s definitely better than a world where character training generalized worse, and this was hard to predict in 2022.
===
You get every type of gaming that is not too inhuman / that is somewhat salient in the pretraining prior.

Okay, fair: I should’ve been clearer that I meant the widely predicted fundamental limitations of RLHF, e.g. those catalogued in “Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback”, which seem to have robustly shown up in the real world:

Language Models Learn to Mislead Humans via RLHF

Expanding on what we missed with sycophancy