EDIT: Originally I said that was my best understanding of Mikhail’s point. Mikhail has told me it was not his point. I’m keeping this comment as that’s a point that I find interesting personally.
Before Mikhail released this post, we talked for multiple hours about the goal of the article and how to communicate it better. I don’t like the current structure of the post, but I think Mikhail has good arguments and has gathered important data.
Here’s the point I would have made instead:
Anthropic presents itself as the champion of AI safety among the AI companies. People join Anthropic because they trust that the Anthropic leadership will make the decisions that best make the future go well.
There have been a number of incidents, detailed in this post, where it seems clear that Anthropic went against a commitment they were expected to keep (by pushing the frontier), where their communication was misleading (like misrepresenting the RAISE bill), or where they took actions that seem incongruous with their stated mission (like accepting investment from Gulf states).
All of those incidents most likely have explanations that were communicated internally to Anthropic employees. Those explanations make sense, and employees believe that the leadership made the right choice.
However, from the outside, a lot of those actions look like Anthropic gradually moving away from being the company that can be trusted to do what’s best for humanity. It looks like Anthropic is doing whatever it can to win the race, even if that increases risk, just like all the other AI companies. From the outside, Anthropic looks less special than it seemed at first.
There are two worlds compatible with the observations:
One where Anthropic is still pursuing the mission, and those incidents were simply the best way to pursue it. The Anthropic leadership is trustworthy, and their internal explanations are valid and represent their actual view.
One where the Anthropic leadership is not reliably pursuing the mission anymore, where those incidents are in fact evidence of that, but where the leadership is using its persuasive ability, and its access to information employees don’t have, to convince them that it was all for the mission, whatever the real reasons.
In the second world, working at Anthropic would not reliably improve the world. Anthropic employees would have to evaluate whether to continue working there in the same way as they would if they worked at OpenAI or any other AI company.
All current and potential Anthropic employees should notice that from the outside, it sure does look like Anthropic is not following its mission as much as it used to. There are two hypotheses that explain it. They should make sure to keep tracking both of them. They should have a plan of what they’ll do if they’re in the least convenient world, so they can face uncomfortable evidence. And, if they do conclude that the Anthropic leadership is not following Anthropic’s mission anymore, they should take action.
(I do not endorse any of this, except for the last two sentences, though those are not a comprehensive bottom line. The comment is wrong about my points, my view, what I know, my model of the world, the specific hypotheses I’d want people to consider, etc.
If you think there is an important point to make, I’d appreciate it if you could make it without attributing it to others.)
I feel like the epistemic qualifier at the top was pretty clear about the state of the belief, even if Lucie was wrong! I would not call this “attributing it to others”: nobody is going to quote this in an authoritative tone as something you said, unless the above is a really egregious summary, which currently seems unlikely to me.
Edited to say it is not your position. I’m sorry for having published this comment without checking with you.
I endorse the spirit of this distillation a lot more than the original post, though I note that Mikhail doesn’t seem to agree.
I don’t think those two worlds are the most helpful ones to consider, though. I think it’s extremely implausible[1] that Anthropic leadership are acting in some coordinated fashion to deceive employees about their pursuit of the mission while actually profit-maxxing or something.
I think the much more plausible world to watch out for is something like:
Anthropic leadership is reliably trying to pursue the mission and is broadly acting with good intentions, but some of Anthropic’s actions are bad for that mission for reasons like:
incorrect or biased beliefs by Anthropic leadership about what would be best for that mission
selective or biased reporting of things in self-serving ways by leadership in ordinary human ways of the sort that don’t feel internally like deception but can be easy to slip into under lots of social pressure
actions on the part of less-mission-aligned employees without sufficient oversight at higher levels of the org
decisionmakers who just haven’t really stopped to think about the consequences of their actions on some aspect of the mission, even though in theory they might realize this was bad
failures of competence in pursuing a good goal
random balls getting dropped for complicated big-organization reasons that aren’t any one person’s fault in a crisp way
Of course this is a spectrum, and this kind of thing will obviously be the case to some nonzero degree; the relevant questions are things like:
Which actors can I trust such that, if they own some project, it will be executed competently and with attention paid to the mission-relevant components that I care about?
What persistent biases do I think are present in this part of the org, and how could I improve that state of affairs?
Is the degree of failure in this regard large enough that my contributions to Anthropic-as-a-whole are net negative for the world?
What balls appear to be getting dropped, that I might be able to pick up?
What internal cultural changes would move decisionmaking in ways that would more reliably pursue the good?
I’d be excited for more external Anthropic criticism to pitch answers to questions like these.
[1] I won’t go into all the reasons I think this, but just to name one: the whole org is peppered with the kinds of people who have quit OpenAI in protest over exactly such actions. That’s a rough environment in which to maintain a conspiracy!
I agree that these are not the two worlds which would be helpful to consider, and your list of reasons is closer to my model than Lucie’s representation of it.
(I do hope that my post somewhat decreases trust in Jack Clark and Dario Amodei and somewhat increases the incentives for the kind of governance that would not be dependent on trustworthy leadership to work.)