Cicadas, Anthropic, and the bilateral alignment problem

There have been a number of responses to today’s Anthropic interpretability research, and while many of them make salient points, I think there may be a degree of specialization blindness in how the work is being contextualized within the broader picture of alignment goals.

Alignment as a problem domain is not unilateral.

Most discussions I see on here about alignment are focused on answering, roughly, the question “how can we align future AGI to not be Skynet?” It’s a great question. Perhaps more importantly, it’s an interesting question.

It involves cross-disciplinary thinking at an emerging research front that channels Jesse Ventura in Predator: “I ain’t got time to peer review.” Preprint after preprint moves our understanding forward, and while the rest of academia struggles under the burden of improper influences on peer review and a replication crisis, this is a field where peer review effectively is just replication.

So yes, today’s research from Anthropic shouldn’t be too surprising for anyone who has been paying the least bit of attention to emerging research in the area. Personally, I expected much of what was shown today by the time I finished reading Li et al., “Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task” (2023), and was even more sure of it after @Neel Nanda replicated the work with additional insight (with even more replications to follow). Of course a modern LLM with exponentially more parameters, fed an exponentially larger and broader dataset, was going to be modeling nuanced abstractions.

As @Seth Herd said in their post on the work:

Presumably, the existence of such features will surprise nobody who’s used and thought about large language models. It is difficult to imagine how they would do what they do without using representations of subtle and abstract concepts.

But let’s take a step back, and consider: some cicadas emerge every 17 years.

That’s a pretty long time. It’s also roughly the amount of time it has historically taken the average practicing doctor to incorporate emerging clinical trial research into their practice.

It’s very easy, when in tune with a specialized area of expertise, to lose touch with how people outside that area (even within the same general domain) might understand it. It’s like the classic xkcd:

“Average Familiarity”

I’m not even talking about the average user of ChatGPT. I’ve seen tenured CS professors argue quite stubbornly about the limitations of LLMs while regurgitating viewpoints that were clearly at least twelve to eighteen months out of date with research (and most here can appreciate just how out of date that is for this field).

Among actual lay audiences, trying to explain interpretability research is like déjà vu back to explaining immunology papers to anti-vaxxers.

The general public’s perception of AI right now is largely shaped by a press that, in fear for its own employment, has gravitated towards latching onto any possible story showing ineptitude on the part of AI products, or rehashing Gary Marcus’s latest broken-clock predictions of “hitting a wall any minute now” (made literal days before GPT-4), in a desperate search for confirmation that they’ll still have a job next week. And given that those stories are everywhere, that’s what the vast majority of people are absorbing.

So when the alignment crowd comes along talking about the sky falling, the average person assumes it’s a PR move. That Hinton leaving Google to sound the alarm was actually Google trying to promote its offerings as better than they are. After all, its AI search summarization can’t even do math. Clearly Hinton must not know much about AI if he’s concerned about that, right?

This is the other side of the alignment problem that gets a lot less attention on here, probably because it’s far less interesting. It’s not just AI that needs to be aligned to a future where AI is safe. Arguably the larger present problem is that humans need to be aligned to giving a crap about such a future.

Anthropic’s research was published within days of the collapse of OpenAI’s superalignment team. The best-funded and most front-and-center company working on the technology increasingly appears to care about alignment only as much as there’s market demand for it. And in a climate where the general understanding of AI is that “it’s fancy autocomplete,” “it doesn’t know what it’s saying—it’s just probabilities of what comes next,” and “it can’t generate original ideas,” there’s very little demand for vetting a vendor’s “alignment strategies.”

Decision makers are people. When I used to be brought in to explain new tech to an executive team, the first small-talk question I’d ask was whether they had kids and what ages, because if they had a teenager in the house my job just became exponentially easier: I could appeal to anecdotal evidence. Even though I knew the graphs of research in my slide deck were far more reliable than whatever their kid did on the couch this past weekend, the latter was much more likely to seal millions of dollars going towards whatever I was talking about.

Alignment concepts need to be digestible and relatable to the average person in order to sell alignment as a concern to the customers who are, in turn, actually going to make Sam Altman give more of a crap about it.

And in this regard, Anthropic’s research today was monumental. Because while no decision maker I’ve ever met is going to be able to read the paper, or even the blog post, and see anything but gibberish, the paper gives the people hired to explain AI to them a single source of truth that can be pointed at to banish the “ghosts of AI wisdom past” in one fell swoop. Up until today, if I was explaining world-modeling theories in contrast to the “fancy autocomplete” they’d heard about in a news segment, I’d have had to use hand-wavy language around toy models and ‘probably.’ As of today, I would be able to show, directly from the paper’s visualizations, the multilingual and multimodal representations of the Golden Gate Bridge all lighting up the same feature, and explain that production AI models represent abstract concepts within their networks.
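For readers who aren’t steeped in the interpretability literature, it helps to have a rough sense of what is actually being shown: the features in the paper come from a sparse autoencoder (SAE) trained to decompose the model’s dense internal activations into a much larger dictionary of sparsely firing directions, and a “Golden Gate Bridge” feature is simply one of those directions that activates on the bridge across languages and modalities. The toy sketch below illustrates only the shape of that decomposition; the dimensions, weights, and function names are made up for illustration and are not Anthropic’s method or code.

```python
# Minimal sketch of the sparse-autoencoder (SAE) idea: a model's dense
# activation vector is re-expressed as a larger set of sparsely active
# "features," then reconstructed from them. Random weights stand in for
# a trained SAE; everything here is illustrative, not Anthropic's code.
import numpy as np

rng = np.random.default_rng(0)

d_model = 64       # width of the (hypothetical) model activations
d_features = 512   # SAE dictionary size; typically much larger than d_model

W_enc = rng.normal(0, 0.1, (d_features, d_model))
b_enc = np.zeros(d_features)
W_dec = rng.normal(0, 0.1, (d_model, d_features))
b_dec = np.zeros(d_model)

def sae_features(activation: np.ndarray) -> np.ndarray:
    """Encode a dense activation vector into (ideally sparse) feature activations."""
    # ReLU zeroes out inactive features; a trained SAE also uses an L1
    # sparsity penalty so that only a handful of features fire per input.
    return np.maximum(0.0, W_enc @ activation + b_enc)

def sae_reconstruct(features: np.ndarray) -> np.ndarray:
    """Reconstruct the original activation from the feature activations."""
    return W_dec @ features + b_dec

# A made-up activation, as if taken from one token position of a model.
activation = rng.normal(size=d_model)
features = sae_features(activation)
reconstruction = sae_reconstruct(features)

# Interpretability work then asks which feature indices fire on, say,
# "Golden Gate Bridge" text in many languages and on images of the bridge.
top_features = np.argsort(features)[::-1][:5]
print("most active feature indices:", top_features)
print("reconstruction error:", np.linalg.norm(activation - reconstruction))
```

The real result, and the thing worth showing a decision maker, is that a single learned feature like this fires for the same concept whether it appears in English text, in other languages, or in images, which is exactly what makes the “fancy autocomplete” framing hard to sustain.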

Which is precisely the necessary foundation for making appeals to the business value of alignment research as a requirement for their vendors. If you can point to hard research that says today’s LLMs can recognize workplace sexual harassment when they see it, it opens the door to all kinds of conversations around what having that model in production at the company implies, in terms of both positive and negative alignment scenarios. Because while describing an out-of-control AI releasing a bioweapon just sounds like a far-fetched sci-fi movie to an executive, the discussion of an in-house, out-of-control AI ending up obsessing over and sexually harassing an employee, and the legal fallout from that, is much more easily visualized and actionable.

It’s going to take time, but this work is finally going to move the conversation forward everywhere other than on LessWrong or something like the EA alignment forum, where it’s expected news in a stream of ongoing research. The topic of world modeling was even a footnote in Ezra Klein’s interview with Dario at Anthropic last month, where Ezra somewhat proudly displayed his knowledge that “well, of course these models don’t really know whether they are telling the truth,” and Dario had to gently correct it with the nuance that sometimes they do (something indicated in research back in December 2023).

So while I agree that there’s not much in the way of surprises, and while in general I’m actually skeptical about the long-term success of SAEs at delivering big-picture interpretability or a foundation for direct alignment checks and balances, I would argue that this work is beyond essential for the ultimate long-term goals of alignment, and much more valuable than parallel work would have been, such as marginal steps forward in things like sleeper-agent detection and correction.

TL;DR: The Anthropic paper’s importance is less about the alignment of AIs to human concerns than it is about aiding the alignment of humans to AI concerns.