massive staff size of just good engineers i.e. not the sort of x-risk-conscious people who would gladly stop their work if the leadership thought it was getting too close to AGI
From my interactions with engineers at Anthropic so far, I think this is a mischaracterization. I think the vast majority are in fact pretty x-risk-conscious, and my guess is that if leadership said to stop, people would in fact be happy to stop.
engineering leadership would not feel very concerned if their systems showed signs of deception
I’ve had personal conversations with Anthropic leadership about this and can confirm that it is definitely false; I think Jared and Dario would be quite concerned if they saw deceptive alignment in their models.
That’s good to hear you think that! I’d find it quite helpful to know the results of a survey to the former effect, of the (40? 80?) ML engineers and researchers there, anonymously answering a question like “Insofar as your job involves building large language models, if Dario asked you to stop your work for 2 years while still being paid your salary, how likely would you be to do so (assume the alternative is being fired)? (1-10, from Extremely Unlikely to Extremely Likely)”, and the same question again but conditioned on “it looking to you like Anthropic and OpenAI are both 1-3 years from building AGI”. I’d find that evidence quite informative. Hat tip to Habryka for suggesting roughly this question to me a year ago.
(I’m available and willing to iterate on a simple survey to that effect if you are too, and can do some iteration/user-testing with other people.)
(I’ll note that if the org doubles in size every year or two then… well, I don’t know how many x-risk conscious engineers you’ll get, or what sort of enculturation Anthropic will do in order to keep the answer to this up at 90%+.)
Regarding the latter, I’ve DM’d you about the specifics.