Anthropic Safeguards lead; formerly GDM safety work. Foundation board member.
Dave Orr
We certainly plan to!
I 100% endorse working on alignment and agree that it’s super important.
We do think that misuse mitigations at Anthropic can help improve things generally though race-to-the-top dynamics, and I can attest that while at GDM I was meaningfully influenced by things that Anthropic did.
I’m new to Anthropic myself, leading the Safeguards team. I joined a few weeks ago, inspired by the mission and the opportunity. I’m really worried about the world as AI continues to get more powerful, and I couldn’t pass up the chance to help if I could.
I was previously at GDM working on similar problems (miss you all!), but the chance to help drive the safety agenda at Anthropic as we transition to a new scarier world felt too important to miss.
So far everything at Anthropic except my new commute is amazing, but most of all the feeling of the mission is intense and awesome. Also the level of transparency inside the company is astounding for a company this size (not that it’s big compared to many others).
Obviously there could be some honeymoon effect here but honestly I’m having a lot of fun, and I honestly think Safeguards (along with Alignment Science) makes a real different in safety for the world.
It depends on what you’re trying to do, right? Like, if you build a great eval to detect agent autonomy, but nobody adopts it, you haven’t accomplished anything. You need to know how to work with AI labs. In that case, selling widgets (your eval) is highly aligned with AI safety.
IME there are an extremely large number of NGOs with passionate people who do not remotely move the needle on whatever problem they are trying to solve. I think it’s the modal outcome for a new nonprofit. I’m not 100% sure that the feedback loop aspect is the reason but I think it plays a very substantial role.
I agree the incentives matter, and point in the direction you indicate.
However, there’s another effect pointing in the opposite direction that also matters and can be as large or bigger: feedback loops.
It’s very easy in the nonprofit space to end up doing stuff that doesn’t impact the real world. You do things that you hope matter and that sound good to funders, but measurement is hard and funding cycles are annual so feedback is rare.
In contrast, if you have a business, you get rapid feedback from customers, and know immediately if you’re getting traction. You can iterate rapidly and quickly get better at something because of rapid feedback.
So in addition to thinking about abstract incentives, think about what kind of product feedback you will get and how important that is. For policy work, maybe it’s not that important. For things that look more like auditing, testing, etc where your effectiveness is in significant part transactional, think hard about being for profit.
Source: long engagement with NGO sector on boards and funders, work at private companies.
I don’t think behavioral is enough—I think LLMs have basically passed the Turing test anyway.
But I also don’t see why it would need to have our specific brain structure either. Surely experiences are possible with things besides the mammal brain. However, if something did have similar brain structure to us, that would probably be sufficient. (It certainly is for other people, and I think most of us think that e.g. higher mammals have experiences.)
What I think we need is some kind of story about why what we have gives rise to experience, and then we can see if AIs have some similar pathway. Unfortunately this is very hard because we have no idea why what we have gives rise to experience (afaik).
Until we have that I think we just have to be very uncertain about what is going on.
We think humans are sentient because of two factors: first, we have internal experience that means we ourselves are sentient; and two, we rely on testimony from others who say they are sentient. We can rely on the latter because people seem similar. I feel sentient and say I am. You are similar to me and say you are. Probably you are sentient.
With AI, this breaks down because they aren’t very similar to us in terms of cognition, brain architecture, or “life” “experience”. So unfortunately AI saying they are sentient does not produce the same kind of evidence as it does for people.
This suggests that any test should try to establish relevant similarity between AIs and humans, or else use an objective definition of what it means to experience something. Given that the latter does not exist, perhaps the former will be more useful.
For that specific example, I would not call it safety critical in the sense that you shouldn’t use an unreliable source. Intel involves lots of noisy and untrustworthy data, and indeed the job is making sense out of lots of conflicting and noisy signals. It doesn’t strike me that adding an LLM to the mix changes things all that much. It’s useful, it adds signal (presumably), but also is wrong sometimes—this is just what all the inputs are for an analyst.
Where I would say it crosses a line is if there isn’t a human analyst. If an LLM analyst was directly providing recommendations for actions that weren’t vetted by a human, yikes that seems super bad and we’re not ready for that. But I would be quite surprised if that were happening right now.
“Perhaps we should pause widespread rollout of Generative AI in safety-critical domains — unless and until it can be relied on to follow rules with significant greater reliability.”
This seems clearly correct to me—LLMs should not be in safety critical domains until we can make a clear case for why things will go well in that situation. I’m not actually aware of anyone using LLMs in that way yet, mostly because they aren’t good enough, but I’m sure that at some point it’ll start happening. You could imagine enshrining in regulation that there must be affirmative safety cases made in safety critical domains that lower risk to at or below the reasonable alternative.
Note that this does not exclude other threats—for instance misalignment in very capable models could go badly wrong even if they aren’t deployed to critical domains. Lots of threats to consider!
Interesting post! I’ve noticed that poker reasoning tends to be terrible, it’s not totally clear to me why. Pretraining should contain quite a lot of poker discussion, though I guess a lot of it is garbage. I think it could be pretty easily fixed in RL if anyone cared enough, but then it wouldn’t be a good test of general reasoning ability.
One nit: it’s “hole card”, not “whole card”.
This is a great piece! I especially appreciate the concrete list at the end.
In other areas of advocacy and policy, it’s typical practice to have model legislation and available experts ready to go so that when a window opens when action is possible, progress can be very fast. We need to get AI safety into a similar place.
Formatting is still kind of bad, and is affecting readability. It’s been a couple of posts in a row now with long wall of text paragraphs. I feel like you changed something? And you should change it back. :)
Are there examples of posts with factual errors you think would be caught by LLMs?
One thing you could do is fact check a few likely posts and see if it’s adding substantial value. That would be more persuasive than abstract arguments.
“There have been some relatively discontinuous jumps already (e.g. GPT-3, 3.5 and 4), at least from the outside perspective.”
These are firmly within our definition of continuity—we intend our approach to handle jumps larger than seen in your examples here.
Possibly a disconnect is that from an end user perspective a new release can look like a big jump, while from a developer perspective it was continuous.
Note also that continuous can still be very fast. And of course we could be wrong about discontinuous jumps.
I don’t work directly on pretraining, but when there were allegations of eval set contamination due to detection of a canary string last year, I looked into it specifically. I read the docs on prevention, talked with the lead engineer, and discussed with other execs.
So I have pretty detailed knowledge here. Of course GDM is a big complicated place and I certainly don’t know everything, but I’m confident that we are trying hard to prevent contamination.
I work at GDM so obviously take that into account here, but in my internal conversations about external benchmarks we take cheating very seriously—we don’t want eval data to leak into training data, and have multiple lines of defense to keep that from happening. It’s not as trivial as you might think to avoid, since papers and blog posts and analyses can sometimes have specific examples from benchmarks in them, unmarked—and while we do look for this kind of thing, there’s no guarantee that we will be perfect at finding them. So it’s completely possible that some benchmarks are contaminated now. But I can say with assurance that for GDM it’s not intentional and we work to avoid it.
We do hill climb on notable benchmarks and I think there’s likely a certain amount of overfitting going on, especially with LMSys these days, and not just from us.
I think the main thing that’s happening is that benchmarks used to be a reasonable predictor of usefulness, and mostly are not now, presumably because of Goodhart reasons. The agent benchmarks are pretty different in kind and I expect are still useful as a measure of utility, and probably will be until they start to get more saturated, at which point we’ll all need to switch to something else.
Humans have always been misaligned. Things now are probably significantly better in terms of human alignment than almost any time in history (citation needed) due to high levels of education and broad agreement about many things that we take for granted (e.g. the limits of free trade are debated but there has never been so much free trade). So you would need to think that something important was different now for there to be some kind of new existential risk.
One candidate is that as tech advances, the amount of damage a small misaligned group could do is growing. The obvious example is bioweapons—the number of people who could create a lethal engineered global pandemic is steadily going up, and at some point some of them may be evil enough to actually try to do it.
This is one of the arguments in favor of the AGI project. Whether you think it’s a good idea probably depends on your credences around human-caused xrisks versus AGI xrisk.
One tip for research of this kind is to not only measure recall, but also precision. It’s easy to block 100% of dangerous prompts by blocking 100% of prompts, but obviously that doesn’t work in practice. The actual task that labs are trying to solve is to block as many unsafe prompts as possible while rarely blocking safe prompts, or in other words, looking at both precision and recall.
Of course with truly dangerous models and prompts, you do want ~100% recall, and in that situation it’s fair to say that nobody should ever be able to build a bioweapon. But in the world we currently live in, the amount of uplift you get from a frontier model and a prompt in your dataset isn’t very much, so it’s reasonable to trade off against losses from over refusal.
The pivotal act link is broken, fyi.
I think what might be going on here is that different fabs have separate capacities. You can’t make more H200s because they have to be made on the new and fancy fabs, but you can make H20s in an older facility. So if you want to sell more chips, and you’re supply limited on the H200s, then the only thing you can do is make crappier chips and figure out where to sell them.