Anthropic Safeguards lead; formerly GDM safety work. Foundation board member.
Dave Orr
Just to clarify, I think Google’s filtering is stronger rather than weaker[1] compared to filtering out canary strings (though if it were up to me I would do both). Also, Google’s approach should catch the hummingbird question and the CoT transcripts whether there’s a canary string present or not, which is the whole reason they do it the way they do.
[1] for the purpose of not training on benchmark eval data
I work at Anthropic and will neither confirm or deny that this is real (if it were real it would not be my project). I do want to add on to your last point though.
In any training regimen or lengthy set of instructions like our system instructions, there are things there that are necessary because of the situation that the model happens to be in right now. It makes some set of mistakes and there are instructions to fix those mistakes. Those instructions might look bad and cause criticism.
For instance, there’s discussion below about how bad it looks that there are instructions about revenue, and in particular about how it should be safe because that’s good for revenue. It could be that whoever wrote this thing, if it’s real, thinks that the point is safety is to earn money. It could also be that, for whatever reason, that when you test out 20 different ways to get Claude to act in a certain way, the one that happened to work well in the context of everything else going on involved a mention of revenue. I don’t think you can quickly tell from the outside which it is, but everyone will impute deep motivation to every sentence.
There are some obvious ways that you could test those hypotheses against one another, but it would require more patience than is convenient.
(Also for the record, I think companies earning revenue is good even if some people think it looks bad, though of course more revenue is not good on all margins.)
Unlike certain other labs and AI companies, afaik Google does respect robots.txt, which is the actual mechanism for keeping data out of its hands.
I think it’s a fair question. Filtering on the canary string is neither necessary nor sufficient for not training on evals, so it’s tempting to just ignore it. I would personally also filter out docs with the canary string, but I’m not sure why they aren’t.
I am not under nondisparagement agreements from anyone and feel free to criticize GDM. I do still have friends there, of course. I certainly wouldn’t be correcting misapprehensions about GDM if I didn’t believe what I was saying!
I mean, it doesn’t contain eval data and isn’t part of the eval set that the canary string is from. So the canary string is not serving its intended purpose of marking evals that you should not include in training data.
Yes this.
Yeah—without going into too much detail, what they actually do is look for text from the evals and filter those out. This is way more reliable than looking for the canary string, because there are tons of cases or people talking about specific eval examples without including the canary string. So you really have to do something more sophisticated than just look for the canary string.
So they filter out posts with eval text, which is what you really want.
I’m no longer at GDM but I am confident they are not training on evals. They took that super seriously when I was there and I have no reason to think that would have changed.
What they don’t do is filter out every web page that has the canary string. Since people put them on random web pages (like this one), which was not their intended use, they get into the training data.
Also they do directly hill climb on high profile metrics, which means they don’t necessarily measure what they were intended to measure when they were created.
(Super interesting documentation of all that crazy CoT going on.)
Welcome to Anthropic! We’re lucky to have you. :)
I agree that the statement doesn’t require direct democracy but that seems like the most likely way to answer the question “do people want this”.
Here’s a brief list of things that were unpopular and broadly opposed that I nonetheless think were clearly good:
smallpox vaccine
seatbelts, and then seatbelt laws
cars
unleaded gasoline
microwaves (the oven, not the radiation)
Generally I feel like people sometimes oppose things that seem disruptive and can be swayed by demagogues. There’s a reason that representative democracy works better than direct democracy. (Though it has obvious issues as well.)
As another whole class of examples, I think people instinctively dislike free speech, immigration, and free markets. We have those things because elites took a strong stance based on better understanding of the world.
I support democractic input, and especially understanding people’s fears and being responsive to them. But I don’t support only doing things that people want to happen. If we had followed that rule for the past few centuries I think the world would be massively worse off.
I like the first clause in the 2025 statement. If that were the whole thing, I would happily sign it. However having lived in California for decades, I’m pretty skeptical that direct democracy is a good way of making decisions, and would not endorse making a critical decision based on polls or a vote. (See also: brexit.)
I did sign the 2023 statement.
I think this is a good idea in certain context. Right now we have a weak version of this where we give people a number to call if they seem like they might be showing signs of being suicidal. Stay tuned for more ideas in this direction later this year.
I’m also reminded of the dispossessed, in which le guin describes two worlds, a capitalist society in which people are rich but have curtailed freedoms and some authoritarian aspects, and a kind of socialist anarchy where people are very poor, claim to be free, and are constrained by culture and custom rather than law.
I find that people who come into the book with a strong prior on capitalism being good or bad will also end up with a clear view on which “utopia” is better. The book itself is probably a critique of the idea that utopia is even possible and whether it’s a coherent concept at all.
Which makes it rather strange to choose to sell worse, and thus less expensive and less profitable, chips to China rather than instead making better chips to sell to the West.
I think what might be going on here is that different fabs have separate capacities. You can’t make more H200s because they have to be made on the new and fancy fabs, but you can make H20s in an older facility. So if you want to sell more chips, and you’re supply limited on the H200s, then the only thing you can do is make crappier chips and figure out where to sell them.
We certainly plan to!
I 100% endorse working on alignment and agree that it’s super important.
We do think that misuse mitigations at Anthropic can help improve things generally though race-to-the-top dynamics, and I can attest that while at GDM I was meaningfully influenced by things that Anthropic did.
I’m new to Anthropic myself, leading the Safeguards team. I joined a few weeks ago, inspired by the mission and the opportunity. I’m really worried about the world as AI continues to get more powerful, and I couldn’t pass up the chance to help if I could.
I was previously at GDM working on similar problems (miss you all!), but the chance to help drive the safety agenda at Anthropic as we transition to a new scarier world felt too important to miss.
So far everything at Anthropic except my new commute is amazing, but most of all the feeling of the mission is intense and awesome. Also the level of transparency inside the company is astounding for a company this size (not that it’s big compared to many others).
Obviously there could be some honeymoon effect here but honestly I’m having a lot of fun, and I honestly think Safeguards (along with Alignment Science) makes a real different in safety for the world.
Probably you should go read my other comment threads on this issue if you want details, but Google’s approach is designed to filter out text that includes benchmark questions regardless of whether there is a canary string. I’m sure it’s not perfect but I think it’s pretty good.
I make no such claims about any other models, just gemini where I have direct knowledge.