People should be thinking about:
If you truly get to choose your own work, is your judgment about what will help with alignment actually good? (This might hold for senior hires like evhub; I'm unsure whether others get much choice.)
If you are joining an existing alignment team, is their work actually good for reducing AI x-risk, or the opposite? For example, both OpenAI and Anthropic do some variant of RLHF, which is pretty controversial – as a prospective hire, have you formed a solid opinion on this question, or are you relying on the convenient answer that at least some people regard it as alignment?
What is the likelihood that you are asked or pressured to do different work that is net negative, or that your work is co-opted in that direction? Perhaps RLHF is useful alignment research, but it also pushes on commercialization bottlenecks and fuels arms races between Google and Microsoft. That's a "second-order" effect you don't want to ignore. It takes a lot of courage to resist pressure from the company providing your job once you've taken the role.
More generally, I don't think there's a hard line between alignment and capabilities. I expect (not that I'm that knowledgeable) that much alignment work (particularly interpretability) will fuel capabilities gains before it helps with alignment. I think anyone doing this work ought to think about that.
I have seen an abuser befriend people who are trusted and whom the abuser is nice to. This gives them credibility to harm others and then have rumors/accusations doubted, because hey, they're friends with such upstanding people. I worry about a kind of "safetywashing" where a company that is overall doing harm makes itself look better by putting out some genuinely good alignment work. The appearance of good alignment work maintains a good reputation, which helps with recruiting capable and valuable talent, attracting more investment, etc.
I think this is a way in which one's work can be locally good but, via pretty significant second-order effects, very negative overall.
Personally, I think that if you are working with cutting-edge LLMs, you need to meet a high burden of proof/reasoning that this is good. Incentives like prestige, salary, and "meaning" mean one ought to question oneself pretty hard before doing the equivalent of entering the nuclear weapons or conventional arms manufacturing industries (especially during wartime).
Thanks!