Some unstructured thoughts:

I think it’s sort of a type error to refer to Anthropic as something that one could trust or not. Anthropic is a company which has a bunch of executives, employees, board members, LTBT members, external contractors, investors, etc, all of whom have influence over different things the company does.
I think the main case where people are tempted to use the word “trust” in connection with Anthropic is when they are trying to decide how good it is to make Anthropic generically more powerful, e.g. by working there on AI capabilities.
I do think that many people (including most Anthropic staff) are well described as trusting Anthropic too much. For example, some people are trustworthy in the sense that things they say make it pretty easy to guess what they’re going to do in the future in a wide variety of situations that might come up; I definitely don’t think that this is the case for Anthropic. This is partially because it’s generally hard to take companies literally when they say things, and partially because Anthropic leadership aren’t as into being truthful as, for example, rationalists are. I think that many Anthropic staff take Anthropic leadership at its word to an extent that degrades their understanding of AI-risk-relevant questions.
But is that bad? It’s complicated by the fact that it’s quite challenging to have enough context on the AI risk situation that you can actually second-guess Anthropic leadership in a way that overall makes the situation better. Most AI-safety-concerned people who work at Anthropic spend most of their time trying to do their job instead of thinking a lot about e.g. what should happen on state legislation; I think it would take a lot of time for them to get confident enough that Anthropic was behaving badly that it would add value for them to try to pressure Anthropic (except by somehow delegating this judgement call to someone who has less of a conflict of interest and who can amortize this work).
I think that in some cases in the past, Anthropic leadership did things that safety-concerned staff wouldn’t have liked, and where Anthropic leadership looks like they made the right call in hindsight. For example, I think AI safety people often have sort of arbitrary strong takes about things that would be very bad to do, and it’s IMO sometimes been good that Anthropic leadership hasn’t been very pressured by their staff.
On the general topic of whether it’s good for Anthropic to be powerful, I think that it’s also a big problem that Anthropic leadership is way less worried than I am about AIs being egregiously misaligned; I think it’s plausible that in the future they’ll take actions that I think are very bad for AI risk. (For example, I think that in the face of ambiguous evidence about AI misalignment that I think we’re likely to get, they are much more likely than I would be to proceed with building more powerful models.) This has nothing to do with whether they’re honest.
I also recommend Holden Karnofsky’s notes on trusting AI companies, summarized here.
I think it’s sort of a type error to refer to Anthropic as something that one could trust or not.
Note that while the title refers to “Anthropic”, the post very clearly discusses Anthropic’s leadership, in general and in specific, and discusses Anthropic staff separately.
I kinda agree that it’s kinda a type error—but also you have a moral obligation not to be eaten by the sort of process that would eat people, such as “pretend to be appropriately concerned with X-risk in order to get social approval from EA / X-deriskers, including funding and talent, and also act against those interests”.
I think that in some cases in the past, Anthropic leadership did things that safety-concerned staff wouldn’t have liked, and where Anthropic leadership looks like they made the right call in hindsight. For example, I think AI safety people often have sort of arbitrary strong takes about things that would be very bad to do, and it’s IMO sometimes been good that Anthropic leadership hasn’t been very pressured by their staff.
Could you give a more specific example, that’s among the strongest such examples?
It’s complicated by the fact that it’s quite challenging to have enough context on the AI risk situation that you can actually second-guess Anthropic leadership in a way that overall makes the situation better.
I don’t get this. You said that you yourself think Anthropic leadership is noticeably less honest (than people around here), and less concerned about alignment difficulty than you are. Given that, and also given that they clearly have very strong incentives to act against X-derisking interests, and given that their actions seem against X-derisking interests, and (AFAIK?) they haven’t credibly defended those actions (e.g. re: SB 1047) in terms of X-derisking, what else could one be waiting to see before judging Anthropic leadership on the dimension of aiming for X-derisking and/or accurately representing their X-derisking stance?
I don’t think I have a moral obligation not to do that. I’m a guy who wants to do good in the world and I try to do stuff that I think is good, and I try to follow policies such that I’m easy to work with and so on. I think it’s pretty complicated to decide how averse you should be to taking on the risk of being eaten by some kind of process.
When I was 23, I agreed to work at MIRI on a non-public project. That’s a really risky thing to do for your epistemics etc. I knew that it was a risk at the time, but decided to take the risk anyway. I think it is sensible for people to sometimes take risks like this. (For what it’s worth, MIRI was aware that getting people to work on secret projects is a kind of risky thing to do, and they put some effort into mitigating the risks.)
For example, I think AI safety people often have sort of arbitrary strong takes about things that would be very bad to do, and it’s IMO sometimes been good that Anthropic leadership hasn’t been very pressured by their staff.
Could you give a more specific example, that’s among the strongest such examples?
I think it’s probably good that Anthropic has pushed the capabilities frontier, and I think a lot of the arguments that this is unacceptable are kind of wrong. If Anthropic staff had pushed back on this more, I think probably the world would be a worse place. (I do think Anthropic leadership was either dishonest or negligently-bad-at-self-modeling about whether they’d push the capabilities frontier.)
I think it is sensible for people to sometimes take risks like this.
I agree. If I say “you have a moral obligation not to cause anyone’s death”, that doesn’t mean “spend all of your energy absolutely minimizing the chances that your decisions minutely increase the risk of someone dying”. But it does mean “when you’re likely having significant effects on the chances of that happening, you should spend the effort required to mostly eliminate those risks, or avoid the situation, or at least signpost the risks very clearly, etc.”. In this case, yeah, I’m saying you do have a strong obligation, which can often require work and some amount of other cost, to not give big amounts of support to processes that are causing a bunch of harm. Like any obligation it’s not simplistic or absolute, but it’s there. Maybe we still disagree about this.
I think it’s pretty complicated to decide how averse you should be to taking on the risk of being eaten by some kind of process.
True, but basically I’m saying “it’s really important and also a lot of the responsibility falls on you, and/or on your community / whoever you’re deferring to about these questions”. Like, it just is really costly to be supporting bad processes like this. In some cases you want to pay the costs, but it’s still a big cost. I’m definitely definitely not saying “all Anthropic employees are bad” or something. Some of the research seems neutral or helpful or maybe very helpful (for legibilizing dangers). But I do think there’s a big obligation of due diligence about “is the company I’m devoting my working energy to, working towards really bad stuff in the world”. For example, yes, Anthropic employees have an obligation to call out if the company leadership is advocating against regulation. (Which maybe they have been doing! In which case the obligation is probably met!)
I think it’s probably good that Anthropic has pushed the capabilities frontier, and I think a lot of the arguments that this is unacceptable are kind of wrong.
Oh. Link to an argument for this?
I didn’t understand your last paragraph.
If you’re curious, basically I’m saying, “yes there’s context but people in the space have a voice, and have obligations, and do have a bunch of the relevant context; what else would they need?”. I mean, it kind of sounds like you’re saying we (someone) should just trust Anthropic leadership because they have more context, even if there’s not much indication that they have good intents? That can’t be what you mean(?) but it sounds like that.
I agree that treating corporations or governments or countries as single coherent individuals is a type error, since it’s important to be able to decompose them into factions and actors to build a good gears-level model that is predictive, and you can easily miss that. I strongly disagree that treating them as actors which can be trusted or distrusted is a type error. You seem to be making the second claim, and I don’t understand it; the company makes decisions, and you can either trust it to do what it says, or not—and this post says the latter is the better model for Anthropic.
Of course, the fact that you can’t trust a given democracy to keep its promises doesn’t mean you can’t trust any of the individuals in it, and the fact that you can’t trust a given corporation doesn’t necessarily mean that about the individuals working for the company either. (It doesn’t even mean you can’t trust each of the individual people in charge—clearly, trust isn’t necessarily conserved over most forms of preference or decision aggregation.)
But as stated, the claims made seem reasonable, and in my view, the cited evidence shows it’s basically correct, about the company as an entity and its trustworthiness.

I don’t really disagree with anything you said here. (Edit to add: except that I don’t agree with the OP’s interpretation of all the evidence listed.)
For example, I think AI safety people often have sort of arbitrary strong takes about things that would be very bad to do, and it’s IMO sometimes been good that Anthropic leadership hasn’t been very pressured by their staff.
Specific examples would be appreciated.
Do you mean things like opposition to open-source? Opposition to pushing-the-SOTA model releases?

(I see that you offered the second as an example to Tsvi.)