Technical staff at Anthropic, previously #3ainstitute; interdisciplinary, interested in everything; ongoing PhD in CS (learning / testing / verification), open sourcerer, more at zhd.dev
Zac Hatfield-Dodds
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
Anthropic’s Core Views on AI Safety
Concrete Reasons for Hope about AI
Anthropic’s Responsible Scaling Policy & Long-Term Benefit Trust
Dario Amodei’s prepared remarks from the UK AI Safety Summit, on Anthropic’s Responsible Scaling Policy
Re: that mouse study… I’m disappointed to feel I can’t trust rationalist headlines without also personally checking for inconvenient details in the preprint and expert commentary (here’s Derek Lowe in Science, or Helen Branswell in Stat). For example:
The chimeric mix(es) in the study (wild-type plus Omicron spike protein) have previously been observed in the wild, in humans, and were outcompeted by other circulating strains. Seems important to me!
The question of which Covid mutations drive vaccine escape sure seems relevant to pandemic response to me! Of course in a coherently-competent world we wouldn’t need this at all (remember that Australia and NZ both eliminated Covid with a short lockdown in early 2020?), but given the world we’re actually in I’m a big fan of escape-resistant vaccines—and of getting them approved and delivered faster.
80% of transgenic mice (not humans) died, or got fairly ill and were euthanized, from the modified strain, compared to 100% (again of transgenic mice, not humans) from the original Wuhan wild-type virus. IN MICE is one of the best-known reporting biases! There’s literally a @justsaysinmice tweet about this study!
Yeah, there’s clearly been some level of misbehavior here, and the institutional responses documented above are clownish. This is clearly in the genre of unacceptably risky work that probably created and then leaked pandemic Covid, which I wish we’d cut back on and confine to labs in the remote desert instead of dense cities.
But I don’t think we’d be discussing this specific paper at all if the 80% MORTALITY! headlines hadn’t been amplified, I don’t think they were accurate, and I would prefer that we stop undermining the credibility of the rationalist community and important arguments against unacceptably risky research by failing to check the basic details before posting alarmist headlines.
If you must, I’d strongly prefer a framing like “This is another unfortunately-common example of research I think is unacceptably risky. Note the mismanagement, the inconsistent descriptions and justifications of this project and whether it constitutes ‘gain of function’, the history of bio lab leaks, etc. While research on immune escape could be valuable, reforming vaccine approval and distribution would be much more effective and less risky.”
Llama 2 is not open source.
(a few days after this comment, here’s a concurring opinion from the Open Source Initiative—as close to authoritative as you can get)
(later again, here’s Yann LeCun testifying under oath: “so first of all Llama system was not made open source … we released it in a way that did not authorize commercial use, we kind of vetted the people who could download the model it was reserved to researchers and academics”)
While their custom licence permits some commercial uses, it is not an OSI-approved license, and because it violates the open source definition it never will be. Specifically, the Llama 2 licence violates:
- Source code. It’s a little ambiguous what this means for a trained model; I’d claim that an open model release should include the training code (yes) and dataset (no), along with sufficient instructions for others to reproduce the results. However, you could also argue that weights are in fact “the preferred form in which a programmer would modify the program”, so this is not an important objection.
- No Discrimination Against Persons or Groups. See the ban on use by anyone who has, or is affiliated with anyone who has, more than 700M active users. As a side note, Snapchat recently announced that they had 750M active users, so this looks pretty targeted at competing social media (including TikTok, Google, etc.). As a consequence, the Llama 2 license also violates OSD 7, Distribution of License: “the rights attached to the program must apply to all to whom the program is redistributed without the need for execution of an additional license by those parties.”
- No Discrimination Against Fields of Endeavor. If you can’t use Llama 2 to—for example—train another model, it’s by definition not open source. Their entire acceptable use policy is included by reference, and contains a wide variety of sometimes ambiguous restrictions.
So, why does this matter?
- As an open-source maintainer and PSF Fellow, I have no objection to the existence of commercially licensed software. I use much of it, and have sold commercial licenses for software that I’ve written too. However, people—and especially megacorps—misrepresenting their closed-off projects as open source is an infuriating form of parasitism on a reputation painstakingly built over decades.
- The restriction on model training makes Llama 2 much less useful for AI safety research, but it incurs just as much direct risk (roughly all via misuse, IMO) and acceleration risk as an open-source release would.
- Using a custom license adds substantial legal risk for prospective commercial users, especially given the very broad restrictions imposed by the acceptable use policy. This reduces the economic upside enormously relative to standard open terms, and leaves Meta’s competitors particularly at risk of lawsuits if they attempt to use Llama 2.
To summarize, Meta gets a better cost/benefit tradeoff by using a custom, non-open-source license, especially if people incorrectly perceive it as open source; everyone else is worse off; and it seems to me like they’re deliberately misrepresenting what they’ve done for their own gain. This really, really annoys me.
When someone describes Llama 2 as “open source”, please correct them: Meta is offering a limited commercial license which discriminates against specific users and bans many valuable use-cases, including in alignment research.
This approach to tradeoffs makes sense for the USA in 2021.
I just don’t want our analysis to lose sight of the fact that facing these tradeoffs is stupid and avoidable, and that almost every country could have done so much better. Avoiding outbreaks is so much cheaper and easier than dealing with them that the choice to do so should have been overdetermined.
The background risk rate in Australia is roughly zero. We occasionally get “outbreaks” of single-digit cases, lock down one city for a few days to trace it, and then go back to normal.
It’s not even worth wearing masks here.
Australia is taking a (frustratingly) slow and cautious approach to the vaccine rollout. This will probably cost zero lives (though with a scary right-tail), plausibly saving some from avoided adverse reactions. IMO we should be going way faster on tail-risk and economic grounds, but...
TLDR: it’s much better to be careful before exponential growth than after it.
In Defence of Spock
I agree that there’s no substitute for thinking about this for yourself, but I think that morally or socially counting “spending thousands of dollars on yourself, an AI researcher” as a donation would be an appalling norm. There are already far too many unmanaged conflicts of interest and trust-me-it’s-good funding arrangements in this space for my taste, and I think it leads to poor epistemic norms as well as social and organizational dysfunction. I think it’s very easy for donating to people or organizations in your social circle to have substantial negative expected value.
I’m glad that funding for AI safety projects exists, but the >10% of my income I donate will continue going to GiveWell.
I think you’ve misunderstood the lesson, and mis-generalized from your experience of manually improving results.
First, I don’t believe that you could write a generally-useful program to improve translations—maybe for Korean song lyrics, yes, but investing lots of human time and knowledge in solving a specific problem is exactly the class of mistakes the bitter lesson warns against.
Second, the techniques that were useful before capable models are usually different to the techniques that are useful to ‘amplify’ models—for example, “Let’s think step by step” would be completely useless to combine with previous question-answering techniques.
Third, the bitter lesson is not about deep learning; it’s about methods which leverage large amounts of compute. AlphaGo combining MCTS with learned heuristics is a perfect example of this:
One thing that should be learned from the bitter lesson is the great power of general purpose methods, of methods that continue to scale with increased computation even as the available computation becomes very great. The two methods that seem to scale arbitrarily in this way are search and learning.
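To make the search-plus-learning point concrete, here is a minimal sketch of tree search whose leaf evaluations come from a learned model rather than hand-crafted heuristics (the shape of AlphaGo’s MCTS, heavily simplified). The `value_estimate` stub and the game-interface arguments are illustrative placeholders, not AlphaGo’s actual components.

```python
import math
import random

def value_estimate(state):
    # In AlphaGo this is a trained value network; here it's just a stub.
    return random.uniform(-1, 1)

def ucb_score(total_value, visits, parent_visits, c=1.4):
    # Standard UCB1: exploit high average value, explore under-visited moves.
    if visits == 0:
        return float("inf")
    return total_value / visits + c * math.sqrt(math.log(parent_visits) / visits)

def search(root_state, legal_moves, apply_move, n_simulations=1000):
    # One-ply Monte Carlo tree search: more compute (simulations) and a better
    # learned evaluator both improve play, with no extra domain knowledge coded in.
    stats = {move: [0.0, 0] for move in legal_moves(root_state)}  # move -> [value, visits]
    for i in range(1, n_simulations + 1):
        move = max(stats, key=lambda m: ucb_score(*stats[m], parent_visits=i))
        leaf_value = value_estimate(apply_move(root_state, move))
        stats[move][0] += leaf_value
        stats[move][1] += 1
    return max(stats, key=lambda m: stats[m][1])  # play the most-visited move
```

Both ingredients here scale with compute: more simulations deepen the search, and more training improves the evaluator, which is exactly the property Sutton is pointing at.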
(Zac’s note: I’m posting this on behalf of Jack Clark, who is unfortunately unwell today. Everything below is his words.)
Hi there, I’m Jack and I lead our policy team. The primary reason it’s not discussed in the post is that the post was already quite long and we wanted to keep the focus on safety—I helped with editing bits of the post and couldn’t figure out a way to shoehorn in stuff about policy without it feeling inelegant / orthogonal.
You do, however, raise a good point, in that we haven’t spent much time publicly explaining what we’re up to as a team. One of my goals for 2023 is to do a long writeup here. But since you asked, here’s some information:
You can generally think of the Anthropic policy team as doing three primary things:
1. Trying to proactively educate policymakers about the scaling trends of AI systems and their relation to safety. My colleague Deep Ganguli (Societal Impacts) and I basically co-wrote this paper https://arxiv.org/abs/2202.07785 - you can think of us as generally briefing out a lot of the narrative in here.
2. Pushing a few specific things that we care about. We think evals/measures for safety of AI systems aren’t very good [Zac: i.e. should be improved!], so we’ve spent a lot of time engaging with NIST’s ‘Risk Management Framework’ for AI systems as a way to create more useful policy institutions here—while we expect labs in the private sector and academia will do much of this research, NIST is one of the best institutions to take these insights and a) standardize some of them and b) circulate insights across government. We’ve also spent time on the National AI Research Resource as we see it as a path to have a larger set of people do safety-oriented analysis of increasingly large models.
3. Responding to interest. An increasing amount of our work is reactive (huge uptick in interest in the past few months since the launch of ChatGPT). By reactive I mean that policymakers reach out to us and ask for our thoughts on things. We generally aim to give impartial, technically informed advice, including pointing out things that aren’t favorable to Anthropic (like emphasizing the very significant safety concerns with large models). We do this because a) we think we’re well positioned to give policymakers good information and b) as the stakes get higher, we expect policymakers will tend to put higher weight on the advice of labs which ‘showed up’ for them before it was strategic to do so. Therefore we tend to spend a lot of time doing a lot of meetings to help out policymakers, no matter how ‘important’ they or their country/organization are—we basically ignore hierarchy and try to satisfy all requests that come in at this stage.
More broadly, we try to be transparent on the micro level, but haven’t invested yet in being transparent on the macro. What I mean by that is many of our RFIs, talks, and ideas are public, but we haven’t yet done a single writeup that gives an overview of our work. I am hoping to do this with the team this year!
Some other desiderata that may be useful:
- I testified in the Senate last year and wrote quite a long written testimony.
- I talked to the Congressional AI Caucus; slides here. Note: I demo’d Claude, but whenever I demo our system I also break it to illustrate safety concerns. IIRC here I jailbroke it so it would play along with me when I asked it how to make rabies airborne—this was to illustrate how loose some of the safety aspects of contemporary LLMs are.
- A general idea I/we push with policymakers is the need to measure and monitor AI systems; Jess Whittlestone and I wrote up a plan here which you can expect us to be roughly outlining in meetings.
- A NIST RFI that talks about some of the trends in predictability and surprise and also has some recommendations.
- Our wonderful colleagues on the ‘Societal Impacts’ team led this work on Red Teaming and we (Policy) helped out on the paper and some of the research. We generally think red teaming is a great idea to push to policymakers re AI systems; it’s one of those things that is ‘shovel ready’ for the systems of today but, we think, has some decent chance of helping out in future with increasingly large models.
I think this is missing the point pretty badly for Anthropic, and leaves out most of the work that we do. I tried writing up a similar summary, which is necessarily a little longer:
Anthropic: Let’s get as much hands-on experience building safe and aligned AI as we can, without making things worse (advancing capabilities, race dynamics, etc). We’ll invest in mechanistic interpretability because solving that would be awesome, and even modest success would help us detect risks before they become disasters. We’ll train near-cutting-edge models to study how interventions like RL from human feedback and model-based supervision succeed and fail, iterate on them, and study how novel capabilities emerge as models scale up. We’ll also share information so policy-makers and other interested parties can understand what the state of the art is like, and provide an example to others of how responsible labs can do safety-focused research.
(Sound good? We’re hiring for both technical and non-technical roles.)
We were also delighted to welcome Tamera Lanham to Anthropic recently, so you could add her externalized reasoning oversight agenda to our alignment research too :-)
Why don’t more plants evolve towards the “grass” strategy?
I suspect it’s related to the distinction between C3 and C4 photosynthesis—both are common in grasses and C4 species tend to do better in hot climates, but trees seem to have trouble evolving C4 pathways even though C4 photosynthesis has evolved independently on 60+ separate occasions.
(also IMO monocots top out at “kinda tree-ish”—they do have a recognisable trunk, but more fibrous than woody)
The paper is really only 28 pages plus lots of graphs in the appendices! If you want to skim, I’d suggest just reading the abstract and then sections 5 and 6 (pp 16--21). But to summarize:
Do neural networks learn the same concepts as humans, or at least human-legible concepts? A “yes” would be good news for interpretability (and alignment). Let’s investigate AlphaZero and Chess as a case study!
Yes, over the course of training AlphaZero learns many concepts (and develops behaviours) which have clear correspondence with human concepts.
Low-level / ground-up interpretability seems very useful here. Learned summaries are also great for chess but rely on a strong ground-truth (e.g. “Stockfish internals”).
Details about where in the network and when in the training process things are represented and learned.
The analysis of differences between the timing and order of developments in human scholarship and AlphaZero training is pretty cool if you play chess; e.g. human experts have diversified openings (not just 1.e4) since 1700, while AlphaZero narrows down from random play to pretty much the modern distribution over GM openings; AlphaZero tends to learn material values before positional concepts and standard openings.
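For intuition about the method, the core analysis is linear probing: regress a human-interpretable concept (computed by, e.g., Stockfish) against a layer’s activations, and see how well it is linearly decodable at each layer and training checkpoint. Below is a minimal sketch with synthetic data standing in for the activations and the concept; the shapes and the `Lasso` choice are illustrative rather than the paper’s exact setup.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

# Synthetic stand-ins: `activations` is (n_positions, d) hidden activations from one
# layer at one checkpoint, `concept_values` is e.g. a material-balance score per position.
rng = np.random.default_rng(0)
activations = rng.normal(size=(5000, 256))
concept_values = activations[:, :8] @ rng.normal(size=8) + rng.normal(scale=0.1, size=5000)

X_train, X_test, y_train, y_test = train_test_split(activations, concept_values, random_state=0)
probe = Lasso(alpha=0.01).fit(X_train, y_train)  # sparse linear probe
print(f"probe R^2 on held-out positions: {probe.score(X_test, y_test):.2f}")
```

Tracking that held-out R^2 across layers and training steps is what produces the “when and where is this concept represented” curves.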
Adam Jermyn says Anthropic’s RSP includes fine-tuning-included evals every three months or 4x compute increase, including during training.
You don’t need to take anyone’s word for this when checking the primary source is so easy: the RSP is public, and the relevant protocol is on page 12:
In more detail, our evaluation protocol is as follows: … Timing: During model training and fine-tuning, Anthropic will conduct an evaluation of its models for next-ASL capabilities both (1) after every 4x jump in effective compute, including if this occurs mid-training, and (2) every 3 months to monitor fine-tuning/tooling/etc improvements.
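As a toy sketch of those two triggers (my illustration, not Anthropic’s actual tooling): an eval is due after every 4x jump in effective compute since the last evaluated checkpoint, and at least every 3 months regardless.

```python
def compute_eval_points(start_compute, end_compute, jump=4.0):
    """Effective-compute levels at which an eval is due during a training run."""
    points, level = [], start_compute * jump
    while level <= end_compute:
        points.append(level)
        level *= jump
    return points

def eval_due(months_since_last_eval, compute_ratio_since_last_eval):
    """True if either the 3-month or the 4x effective-compute trigger has fired."""
    return months_since_last_eval >= 3 or compute_ratio_since_last_eval >= 4

# A run scaling effective compute 100x past its last evaluated checkpoint
# triggers three mid-training evals, at 4x, 16x, and 64x:
print(compute_eval_points(1.0, 100.0))  # [4.0, 16.0, 64.0]
```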
Relatedly, have you considered organizing the company as a Public Benefit Corporation, so that the mission and impact is legally protected alongside shareholder interests?
First, an apology: I didn’t mean this to be read as an attack or a strawman, nor applicable to any use of theorem-proving, and I’m sorry I wasn’t clearer. I agree that formal specification is a valuable tool and research direction, a substantial advancement over informal arguments, and only as good as the assumptions. I also think that hybrid formal/empirical analysis could be very valuable.
Trying to state a crux, I believe that any plan which involves proving corrigibility properties about MuZero (etc) is doomed, and that safety proofs about simpler approximations cannot provide reliable guarantees about the behaviour of large models with complex emergent behaviour. This is in large part because formalising realistic assumptions (e.g. biased humans) is very difficult, and somewhat because proving anything about very large models is wildly beyond the state of the art and even verified systems have (fewer) bugs.
Being able to state theorems about AGI seems absolutely necessary for success; but I don’t think it’s close to sufficient.
For the sake of readability, I have referred to misguidedness, callousness, and stupidity as type one, two, and three traits respectively
For what it’s worth I find descriptive terms much easier to read than type one, two, etc. In statistics I even dislike “false positive/negative”, and prefer the more descriptive “false/missed alarm/label/...”.
I was halfway through a PhD on software testing and verification before joining Anthropic (opinions my own, etc), and I’m less convinced than Eliezer about theorem-proving for AGI safety.
There are so many independently fatal objections that I don’t know how to structure this or convince someone who thinks it would work. I am therefore offering a $1,000 prize for solving a far easier problem:
I will also take bets that nobody will accomplish this by 2025, nor any loosely equivalent proof for a GPT-3 class model by 2040. This is a very bold claim, but I believe that rigorously proving even trivial global bounds on the behaviour of large learned models will remain infeasible.
And doing this wouldn’t actually help, because a theorem is about the inside of your proof system. Recognising the people in a huge collection of atoms is outside your proof system. Analogue attacks like Rowhammer are not in (the ISA model used by) your proof system—and cache and timing attacks like Spectre probably aren’t yet either. Your proof system isn’t large enough to model the massive floating-point computation inside GPT-2, let alone GPT-3, and if it could the bounds would be (−∞,∞).
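As a toy illustration of why naive global bounds go vacuous (my example, not a claim about any particular verifier): propagate exact interval bounds for inputs in [−1, 1] through a few random dense-plus-ReLU layers and the output intervals blow up geometrically, long before you reach anything GPT-2-sized.

```python
import numpy as np

# Toy illustration: exact interval arithmetic through random dense + ReLU layers.
rng = np.random.default_rng(0)
lo, hi = -np.ones(512), np.ones(512)          # inputs bounded in [-1, 1]
for layer in range(12):
    W = rng.normal(scale=1 / np.sqrt(512), size=(512, 512))
    Wp, Wn = np.maximum(W, 0), np.minimum(W, 0)
    lo, hi = Wp @ lo + Wn @ hi, Wp @ hi + Wn @ lo   # tightest interval bounds for W @ x
    lo, hi = np.maximum(lo, 0), np.maximum(hi, 0)   # ReLU
    print(f"layer {layer + 1:2d}: mean interval width = {np.mean(hi - lo):.3g}")
```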
I still hope that providing automatic disproof-by-counterexample might, in the long term, nudge ML towards specifiability by making it easy to write and check falsifiable properties of ML systems. On the other hand, hoping that ML switches to a safer paradigm is not the kind of safety story I’d be comfortable relying on.
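For a sense of what I mean by a falsifiable property, here is a minimal sketch using Hypothesis with a stand-in model; the Lipschitz-style property and the 0.1 threshold are illustrative, not a proposed standard.

```python
import numpy as np
from hypothesis import given, settings, strategies as st
from hypothesis.extra.numpy import arrays

def model(x: np.ndarray) -> float:
    # Stand-in for a learned model; swap in a real predict function.
    return float(np.tanh(x).sum())

@given(
    x=arrays(np.float64, shape=8, elements=st.floats(-10, 10)),
    eps=arrays(np.float64, shape=8, elements=st.floats(-1e-3, 1e-3)),
)
@settings(max_examples=200)
def test_small_perturbations_cause_small_output_changes(x, eps):
    # Falsifiable specification; Hypothesis searches for a counterexample
    # and shrinks it to a minimal failing input if one exists.
    assert abs(model(x + eps) - model(x)) < 0.1
```

No proof comes out of this, of course, but a concrete counterexample is often far more actionable than a failed verification attempt.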