Technical staff at Anthropic, previously #3ainstitute; interdisciplinary, interested in everything; ongoing PhD in CS (learning / testing / verification), open sourcerer, more at zhd.dev
Zac Hatfield-Dodds
I don’t think any of these amount to a claim that “to reach ASI, we simply need to develop rules for all the domains we care about”. Yes, AlphaGo Zero reached superhuman levels on the narrow task of playing Go, and that’s a nice demonstration that synthetic data could be useful, but it’s not about ASI and there’s no claim that this would be either necessary or sufficient.
(not going to speculate on object-level details though)
I think that focusing on personal incentives is an unhelpful way to think about or predict board behavior (for Anthropic and in general), but you can find the current members of our board listed here.
Is there an actual way to criticize Dario and/or Daniela in a way that will realistically be given a fair hearing by someone who, if appropriate, could take some kind of action?
For whom to criticize him/her/them about what? What kind of action are you imagining? For anything I can imagine actually coming up, I’d be personally comfortable raising it directly with either or both of them in person or in writing, and believe they’d give it a fair hearing as well as appropriate follow-up. There are also standard company mechanisms that many people might be more comfortable using (talk to your manager or someone responsible for that area; ask a maybe-anonymous question in various fora; etc). Ultimately executives are accountable to the board, which will be majority appointed by the long-term benefit trust from late this year.
Makes sense—if I felt I had to use an anonymous mechanism, I can see how contacting Daniela about Dario might be uncomfortable. (Although to be clear I actually think that’d be fine, and I’d also have to think that Sam McCandlish as responsible scaling officer wouldn’t handle it)
If I was doing this today I guess I’d email another board member; and I’ll suggest that we add that as an escalation option.
OK, let’s imagine I had a concern about RSP noncompliance, and felt that I needed to use this mechanism.
(in reality I’d just post in whichever slack channel seemed most appropriate; this happens occasionally for “just wanted to check...” style concerns and I’m very confident we’d welcome graver reports too. Usually that’d be a public channel; for some compartmentalized stuff it might be a private channel and I’d DM the team lead if I didn’t have access. I think we have good norms and culture around explicitly raising safety concerns and taking them seriously.)
As I understand it, I’d:
Remember that we have such a mechanism and bet that there’s a shortcut link. Fail to remember the shortlink name (reports? violations?) and search the list of “rsp-” links; ah, it’s rsp-noncompliance. (just did this, and added a few aliases)
That lands me on the policy PDF, which explains in two pages the intended scope of the policy, who’s covered, the procedure, etc. and contains a link to the third-party anonymous reporting platform. That link is publicly accessible, so I could e.g. make a report from a non-work device or even after leaving the company.
I write a report on that platform describing my concerns[1], optionally uploading documents etc. and get a random password so I can log in later to give updates, send and receive messages, etc.
The report by default goes to our Responsible Scaling Officer, currently Sam McCandlish. If I’m concerned about the RSO or don’t trust them to handle it, I can instead escalate to the Board of Directors (current DRI Daniela Amodei).
Investigation and resolution obviously depend on the details of the noncompliance concern.
There are other (pretty standard) escalation pathways for concerns about things that aren’t RSP noncompliance. There’s not much we can do about the “only one person could have made this report” problem beyond the included strong commitments to non-retaliation, but if anyone has suggestions I’d love to hear them.
[1] I clicked through just now to the point of cursor-in-textbox, but not submitting a nuisance report. ↩︎
I am a current Anthropic employee, and I am not under any such agreement, nor has any such agreement ever been offered to me.
If asked to sign a self-concealing NDA or non-disparagement agreement, I would refuse.
He talked to Gladstone AI founders a few weeks ago; AGI risks were mentioned but not in much depth.
see also: the tyranny of the marginal user
Incorporating as a Public Benefit Corporation already frees directors’ hands; Delaware Title 8, §365 requires them to “balance the pecuniary interests of the stockholders, the best interests of those materially affected by the corporation’s conduct, and the specific public benefit(s) identified in its certificate of incorporation”.
Without wishing to discourage these efforts, I disagree on a few points here:
Still, the biggest opportunities are often the ones with the lowest probability of success, and startups are the best structures to capitalize on them.
If I’m looking for the best expected value around, that’s still monotonic in the probability of success! There are good reasons to think that most organizations are risk-averse (relative to the neutrality of linear $=utils) and startups can be a good way to get around this.
Nonetheless, I remain concerned about regressional Goodhart; and that many founders naively take on the risk appetite of funders who manage a portfolio, without the corresponding diversification (if all your eggs are in one basket, watch that basket very closely). See also Inadequate Equilibria and maybe Fooled by Randomness.
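A toy sketch of the diversification point, with made-up numbers: the success probability, the payoff, and the portfolio size of 30 are all hypothetical, and the bets are assumed to be independent. It shows that expected value rises with the probability of success, and that a funder holding many such bets is far more likely to see at least one pay off than a founder holding exactly one.

```python
def expected_value(p_success: float, payoff: float) -> float:
    """Expected payoff of one all-or-nothing bet (failure pays zero)."""
    return p_success * payoff


def prob_at_least_one_success(p_success: float, n_bets: int) -> float:
    """Chance that at least one of n independent, identical bets succeeds."""
    return 1.0 - (1.0 - p_success) ** n_bets


# Hypothetical illustration values, not taken from any real fund or startup.
P_SUCCESS = 0.05   # assumed chance that a single startup succeeds
PAYOFF = 100.0     # assumed payoff on success, in arbitrary units
PORTFOLIO = 30     # assumed number of independent bets a funder holds

# For a fixed payoff, expected value is increasing in the success probability.
evs = [expected_value(p, PAYOFF) for p in (0.01, 0.05, 0.10)]

# Same per-bet odds, very different exposure to total failure.
founder_hit = prob_at_least_one_success(P_SUCCESS, 1)
funder_hit = prob_at_least_one_success(P_SUCCESS, PORTFOLIO)

print("expected values:", evs)
print("founder sees a success with probability", round(founder_hit, 3))
print("funder sees a success with probability", round(funder_hit, 3))
```

Each bet has the same expected value for both parties; only the funder's portfolio smooths out the variance, which is why copying the funder's risk appetite with a single basket is a different proposition.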
Meanwhile, strongly agreed that AI safety driven startups should be B corps, especially if they’re raising money.
Technical quibble: “B Corp” is a voluntary private certification; PBC is a corporate form which imposes legal obligations on directors. I think many of the B Corp criteria are praiseworthy, but certification is neither necessary nor sufficient as an alternative to PBC status, and getting certified is probably a poor use of a startup’s effort when the founders’ time and attention are at such a premium.
My personal opinion is that starting a company can be great, but I’ve also seen several fail due to the gaps between the founders’ personal goals, a work-it-out-later business plan, and the duties that they and their board owe to their investors.
IMO any purpose-driven company should be founded as a Public Benefit Corporation, to make it clear in advance and in law that you’ll also consider the purpose and the interests of people materially affected by the company alongside investor returns. (cf § 365. Duties of directors)
Enforcement of mitigations when it’s someone else who removes them won’t be seen as relevant, since in this religion a contributor is fundamentally not responsible for how the things they release will be used by others.
This may be true of people who talk a lot about open source, but among actual maintainers the attitude is pretty different. If some user causes harm with an overall positive tool, that’s on the user; but if the contributor has built something consistently or overall harmful that is indeed on them. Maintainers tend to avoid working on projects which are mostly useful for surveillance, weapons, etc. for pretty much this reason.
Source: my personal experience as a maintainer and PSF Fellow, and the multiple Python core developers I just checked with at the PyCon sprints.
Thanks for these clarifications. I didn’t realize that the 30% was for the new yellow-line evals rather than the current ones.
That’s how I was thinking about the predictions that I was making; others might have been thinking about the current evals where those were more stable.
I’m having trouble parsing this sentence. What do you mean by “doing so only risks the costs of pausing when we could have instead prepared mitigations or better evals”? Doesn’t pausing include focusing on mitigations and evals?
Of course, but pausing also means we’d have to shuffle people around, interrupt other projects, and deal with a lot of other disruption (the costs of pausing). Ideally, we’d continue updating our yellow-line evals to stay ahead of model capabilities until mitigations are ready.
The yellow-line evals are already a buffer (‘sufficient to rule out red-lines’) which are themselves a buffer (6x effective compute) before actually-dangerous situations. Since triggering a yellow-line eval requires pausing until we have either safety and security mitigations or design a better yellow-line eval with a higher ceiling, doing so only risks the costs of pausing when we could have instead prepared mitigations or better evals. I therefore think it’s reasonable to keep going basically regardless of the probability of triggering in the next round of evals. I also expect that if we did develop some neat new elicitation technique we thought would trigger yellow-line evals, we’d re-run them ahead of schedule.
I also think people might be reading much more confidence into the 30% than is warranted; my contribution to this process included substantial uncertainty about what yellow-lines we’d develop for the next round, and enough calibration training to avoid very low probabilities.
Finally, the point of these estimates is that they can guide research and development prioritization—high estimates suggest that it’s worth investing in more difficult yellow-line evals, and/or that elicitation research seems promising. Tying a pause to that estimate is redundant with the definition of a yellow-line, and would risk some pretty nasty epistemic distortions.
What about whistleblowing or anonymous reporting to governments? If an Anthropic employee was so concerned about RSP implementation (or more broadly about models that had the potential to cause major national or global security threats), where would they go in the status quo?
That really seems more like a question for governments than for Anthropic! For example, the SEC or IRS whistleblower programs operate regardless of what companies purport to “allow”, and I think it’d be cool if the AISI had something similar.
If I was currently concerned about RSP implementation per se (I’m not), it’s not clear why the government would get involved in a matter of voluntary commitments by a private organization. If there was some concern touching on the White House commitments, Bletchley declaration, Seoul declaration, etc., then I’d look up the appropriate monitoring body; if in doubt the Commerce whistleblower office or AISI seem like reasonable starting points.
“red line” vs “yellow line”
Passing a red-line eval indicates that the model requires ASL-n mitigations. Yellow-line evals are designed to be easier to implement and/or run, while maintaining the property that if you fail them you would also fail the red-line evals; for example, by leaving out the “register a typo’d domain” step from an ARA eval, because there are only so many good typos for our domain. If a model passes the yellow-line evals, we have to pause training and deployment until we put a higher standard of security and safety measures in place, or design and run new tests which demonstrate that the model is below the red line.
assurance mechanisms
Our White House commitments mean that we’re already reporting safety evals to the US Government, for example. I think the natural reading of “validated” is some combination of those, though obviously it’s very hard to validate that whatever you’re doing is ‘sufficient’ security against serious cyberattacks or safety interventions on future AI systems. We do our best.
I’m glad to see that the non-compliance reporting policy has been implemented and includes anonymous reporting. I’m still hoping to see more details. (And I’m generally confused about why Anthropic doesn’t share more details on policies like this — I fail to imagine a story about how sharing details could be bad, except that the details would be seen as weak and this would make Anthropic look bad.)
What details are you imagining would be helpful for you? Sharing the PDF of the formal policy document doesn’t mean much compared to whether it’s actually implemented and upheld and treated as a live option that we expect staff to consider (fwiw: it is, and I don’t have a non-disparagement agreement). On the other hand, sharing internal docs eats a bunch of time in reviewing them before release, carries a chance that someone seizes on a misinterpretation and leaps to conclusions, and has other costs.
I believe that meeting our ASL-2 deployment commitments (e.g. enforcing our acceptable use policy, and data-filtering plus harmlessness evals for any fine-tuned models) with widely available model weights is presently beyond the state of the art. If a project or organization makes RSP-like commitments, evaluates and mitigates risks, and can uphold that while releasing model weights… I think that would be pretty cool.
(also note that e.g. Llama is not open source. I think you’re talking about releasing weights; the license doesn’t affect safety but as an open-source maintainer the distinction matters to me)
Anthropic: Reflections on our Responsible Scaling Policy
While some companies, such as OpenAI and Anthropic, have publicly advocated for AI regulation, Time reports that in closed-door meetings, these same companies “tend to advocate for very permissive or voluntary regulations.”
I think that dropping the intermediate text which describes ‘more established big tech companies’ such as Microsoft substantially changes the meaning of this quote—“these same companies” is not “OpenAI and Anthropic”. Full context:
Executives from the newer companies that have developed the most advanced AI models, such as OpenAI CEO Sam Altman and Anthropic CEO Dario Amodei, have called for regulation when testifying at hearings and attending Insight Forums. Executives from the more established big technology companies have made similar statements. For example, Microsoft vice chair and president Brad Smith has called for a federal licensing regime and a new agency to regulate powerful AI platforms. Both the newer AI firms and the more established tech giants signed White House-organized voluntary commitments aimed at mitigating the risks posed by AI systems. But in closed door meetings with Congressional offices, the same companies are often less supportive of certain regulatory approaches
AI lab watch makes it easy to get some background information by comparing commitments made by OpenAI, Anthropic, Microsoft, and some other established big tech companies.
Meta’s Llama3 model is also *not* open source, despite the Chief AI Scientist at the company, Yann LeCun, frequently proclaiming that it is.
This is particularly annoying because he knows better: the latter two of those three tweets are from January 2024, and here’s video of his testimony under oath in September 2023: “the Llama system was not made open-source”.
Might be worth putting a short notice at the top of each post saying that, with a link to this post or whatever other resource you’d now recommend? (inspired by the ‘Attention—this is a historical document’ on e.g. this PEP)