This post is my attempt to understand Anthropic’s current strategy, and lay out the facts to consider in terms of whether Anthropic’s work is likely to be net positive and whether, as a given individual, you should consider applying.
I just want to add that “whether you should consider applying” probably depends massively on what role you’re applying for. E.g. even if you believed that pushing AI capabilities was net negative right now, you might still want to apply for an alignment role.
Not saying you intended this, but I worry about people thinking "it's an alignment role and therefore good" when considering joining companies that are pushing the state of the art, and not thinking about it much harder than that.
What else should people be thinking about? You’d want to be sure that you’ll, in fact, be allowed to work on alignment. But what other hidden downsides are there?
If you truly get to choose your own work, is your judgment on what will help with alignment good? (This might be true for senior hires like evhub; I'm unsure whether others get to choose.)
If you are joining an existing alignment team, is their work actually good for reducing AI x-risk, or the opposite? For example, both OpenAI and Anthropic do some variant of RLHF, which is pretty controversial – as a prospective hire, have you formed a solid opinion on this question, or are you relying on the convenient answer that at least some people regard it as alignment?
What is the likelihood that you are asked/pressured to do different work that is net negative, or that your work is coopted in that direction? Perhaps RLHF is useful alignment research, but it also pushes on commercialization bottlenecks and fuels arms races between Google and Microsoft. That's a "second order" effect that you don't want to ignore. It takes a lot of courage to resist pressure from the company providing you with your job once you've taken the role.
More generally, I don’t think there’s a hard line between alignment and capabilities. I expect (not that I’m that knowledgeable) that much alignment work (particularly interpretability) will fuel capabilities gains before it helps with alignment. I think anyone doing this work ought to think about that.
I have seen an abuser befriend people who are trusted and whom the abuser is nice to. This gives them credibility to harm others and then have rumors/accusations doubted, because hey, they’re friends with such upstanding people. I worry about a kind of “safetywashing” where a company that is overall doing harm makes itself look better by putting out some genuinely good alignment work. The appearance of good alignment work maintains a good reputation, which allows for recruiting capable and valuable talent, attracting more investment, etc.
I think this is a way in which one’s work can locally be good but via pretty significant second order effects be very negative.
Personally, I think if you are working with cutting-edge LLMs, you need to meet a high burden of proof/reasoning that this is good. Incentives like prestige, salary, and “meaning” mean one ought to question oneself pretty hard when doing the equivalent of entering the nuclear weapons or conventional arms manufacturing industries (especially during wartime).
“is this a role from which I will push forward alignment faster than I advance capabilities?” is a very different question than “does this job have ‘alignment’ in the title?” I assume when it’s put like that you would choose the first question, but in practice a lot of people take jobs that are predictable nos to the first question and justify it by claiming they’re alignment jobs. Given that, I think it’s good Ruby pushed back on something that would end up supporting the latter form of the question, even if it wasn’t intended.