I want to make a thing that talks about why people shouldn’t work at Anthropic on capabilities and all the evidence that points in the direction of them being a bad actor in the space, bound by employees who they have to deceive.
A very early version of what it might look like: https://anthropic.ml
Help needed! Email me (or DM on Signal) ms@contact.ms (@misha.09)
If your theory of change is convincing Anthropic employees or prospective Anthropic employees they should do something else, I think your current approach isn’t going to work. I think you’d probably need to much more seriously engage with people who think that Anthropic is net-positive and argue against their perspective.
Possibly, you should just try to have less of a thesis and just document bad things you think Anthropic has done and ways that Anthropic/Anthropic leadership has misled employees (to appease them). This might make your output more useful in practice.
I think it’s relatively common for people I encounter to think both:
Anthropic leadership is engaged in somewhat scummy appeasement of safety-motivated employees in ways that are misleading or based on kinda obviously motivated reasoning. (Which results in safety-motivated employees having a misleading picture of what the organization is doing and why and what people expect to happen.)
Anthropic is strongly net positive despite this and working on capabilities there is among the best things you can do.
An underlying part of this view is typically that moderate improvements in effort spent on prosaic safety measures substantially reduce risk. I think you probably strongly disagree with this and this might be a major crux.
Personally, I agree with what Zach said. I think working on capabilities[1] at Anthropic is probably somewhat net positive but would only be the best thing to work on if you had very strong comparative advantage relative to all the other useful stuff (e.g. safety research). So probably most altruistic people with views similar to mine should do something else. I currently don’t feel very confident that capabilities at Anthropic is net positive and could imagine swinging towards thinking it is net negative based on additional evidence.
[1] Putting aside strongly differential specific capabilities work.
fwiw I agree with most but not all details, and I agree that Anthropic’s commitments and policy advocacy have a bad track record, but I think that Anthropic capabilities is nevertheless net positive, because Anthropic has way more capacity and propensity to do safety stuff than other frontier AI companies.
I wonder what you believe about Anthropic’s likelihood of noticing risks from misalignment relative to other companies, or of someday spending >25% of internal compute on (automated) safety work.
If people work for Anthropic because they’re misled about the nature of the company, I don’t think arguments on whether they’re net-positive have any local relevance.
Still, to reply: They are one of the companies in the race to kill everyone.
Spending compute on automated safety work does not help. If the system you’re running is superhuman, it kills you instead of doing your alignment homework; if it’s not superhuman, it can’t solve your alignment homework.
Anthropic is doing some great research; but as a company at the frontier, their main contribution could’ve been making sure that no one builds ASI until it’s safe; that there’s legislation that stops the race to the bottom; that the governments understand the problem and want to regulate; that the public is informed of what’s going on and what legislation proposes.
Instead, Anthropic argues against regulation in private, lies about legislation in public, and misleads its employees about its role in various things.
***
If Anthropic had to give up staying at the frontier to be able to spend >25% of their compute on safety, do you expect they would?
Do you really have a coherent picture of the company in mind, where it is doing all the things it’s doing now (such as not taking steps that would slow down everyone), and yet would behave responsibly when it matters most and the pressure not to is the highest?
I recall a video circulating that showed Dario had changed his position on racing with China, which feels perhaps relevant. People can of course change their minds, but I still dislike it.
Do you think the world would be a better place if Anthropic didn’t exist?
It’s probably better in the short term while also making the short term shorter (which is way worse).