I think it’s sort of a type error to refer to Anthropic as something that one could trust or not. Anthropic is a company with a bunch of executives, employees, board members, LTBT members, external contractors, investors, etc., all of whom have influence over different things the company does.
I think the main case where people are tempted to use the word “trust” in connection with Anthropic is when they are trying to decide how good it is to make Anthropic generically more powerful, e.g. by working there on AI capabilities.
I do think that many people (including most Anthropic staff) are well described as trusting Anthropic too much. For example, some people are trustworthy in the sense that things they say make it pretty easy to guess what they’re going to do in the future in a wide variety of situations that might come up; I definitely don’t think that this is the case for Anthropic. This is partially because it’s generally hard to take companies literally when they say things, and partially because Anthropic leadership aren’t as into being truthful as, for example, rationalists are. I think that many Anthropic staff take Anthropic leadership at its word to an extent that degrades their understanding of AI-risk-relevant questions.
But is that bad? It’s complicated by the fact that it’s quite challenging to have enough context on the AI risk situation that you can actually second-guess Anthropic leadership in a way that overall makes the situation better. Most AI-safety-concerned people who work at Anthropic spend most of their time trying to do their job instead of thinking a lot about e.g. what should happen on state legislation; I think it would take a lot of time for them to get confident enough that Anthropic was behaving badly that it would add value for them to try to pressure Anthropic (except by somehow delegating this judgement call to someone who is less conflicted and who can amortize this work).
I think that in some cases in the past, Anthropic leadership did things that safety-concerned staff wouldn’t have liked, and where Anthropic leadership looks like they made the right call in hindsight. For example, I think AI safety people often have somewhat arbitrary but strongly held takes about things that would be very bad to do, and it’s IMO sometimes been good that Anthropic leadership hasn’t been very pressured by their staff.
On the general topic of whether it’s good for Anthropic to be powerful, I think that it’s also a big problem that Anthropic leadership is way less worried than I am about AIs being egregiously misaligned; I think it’s plausible that in the future they’ll take actions that I think are very bad for AI risk. (For example, I think that in the face of ambiguous evidence about AI misalignment that I think we’re likely to get, they are much more likely than I would be to proceed with building more powerful models.) This has nothing to do with whether they’re honest.
I also recommend Holden Karnofsky’s notes on trusting AI companies, summarized here.
I think it’s sort of a type error to refer to Anthropic as something that one could trust or not.
Note that while the title refers to “Anthropic”, the post very clearly discusses Anthropic’s leadership, in general and in specific, and discusses Anthropic staff separately.
I kinda agree that it’s kinda a type error—but also you have a moral obligation not to be eaten by the sort of process that would eat people, such as “pretend to be appropriately concerned with X-risk in order to get social approval from EA / X-deriskers, including funding and talent, and also act against those interests”.
I think that in some cases in the past, Anthropic leadership did things that safety-concerned staff wouldn’t have liked, and where Anthropic leadership looks like they made the right call in hindsight. For example, I think AI safety people often have sort of arbitrary strong takes about things that would be very bad to do, and it’s IMO sometimes been good that Anthropic leadership hasn’t been very pressured by their staff.
Could you give a more specific example, that’s among the strongest such examples?
It’s complicated by the fact that it’s quite challenging to have enough context on the AI risk situation that you can actually second-guess Anthropic leadership in a way that overall makes the situation better.
I don’t get this. You said that you yourself think Anthropic leadership is noticeably less honest (than people around here), and less concerned about alignment difficulty than you are. Given that, and also given that they clearly have very strong incentives to act against X-derisking interests, and given that their actions seem against X-derisking interests, and (AFAIK?) they haven’t credibly defended those actions (e.g. re: SB 1047) in terms of X-derisking, what else could one be waiting to see before judging Anthropic leadership on the dimension of aiming for X-derisking and/or accurately representing their X-derisking stance?