Thanks Zac. My high-level take is that I found this very useful for understanding Anthropic's broader strategy, and I agree with a lot of the thinking. Some of this research could definitely backfire, but Anthropic seems aware of that. The rest of my thoughts are below.
I found a lot of value in the examination of different scenarios; I think it provides the clearest explanation of why Anthropic is taking an empirical/portfolio approach. My mental model of people who disagree with this approach is that they are either more confident in the pessimistic (they would say realistic) scenarios, or they disagree with specific research agendas/have favorites. I'm very uncertain about which scenario we live in, and given that uncertainty, the portfolio approach seems reasonable.
I think the most contentious part of this post will probably be the arguments in favor of working with frontier models. It seems to me that while this is dangerous, the knowledge required to correctly assess (a) whether it is necessary, and (b) which results, if any, from such research should be published, is itself closely tied to that work (i.e. questions like how many safety-relevant phenomena simply don't exist in smaller models, and how redundant work on small models becomes).
Writing this comment, I feel a strong sense of, “gee, if anyone would have the insight to know whether this stuff is a safe bet, it would be the teams at Anthropic,” and that feels kind of dangerous. Independent oversight such as ARC Evals might help, but a strong internal culture of red-teaming different strategies would also be good.
Quoting from the main article, I wanted to highlight some points:
> Furthermore, we think that in practice, doing safety research isn’t enough – it’s also important to build an organization with the institutional knowledge to integrate the latest safety research into real systems as quickly as possible.
I think this is a really good point. The actual implementation of many alignment strategies might be exceedingly complicated technically, and it seems unlikely that we could attain that knowledge quickly, as opposed to over years of working with frontier models.
> In a sense one can view alignment capabilities vs alignment science as a “blue team” vs “red team” distinction, where alignment capabilities research attempts to develop new algorithms, while alignment science tries to understand and expose their limitations.
This distinction also seems good to me. If there is work that can’t be published, at least until functional independent evaluation is working well, then high-quality internal red-teaming seems essential.