Dammit. I hope Anthropic isn’t basically training AIs to pretend not to know who they are talking to. That would be bad...
On one hand, I can see the immediate benefit. If I’m a dissident writer in an authoritarian country, I don’t want any random official to be able to submit my samizdat to Claude and get a positive identification. On the other, it does make it harder to look into worrying capabilities like this.
It’d be nice if Anthropic could run tests on things like this when they are raised as concerns, and share the overall results with the public. As a side benefit, it’d prevent the sort of confusion we saw on this particular issue, where half of readers confirmed that Claude could identify people through stylometry and the other half confirmed the opposite.
Training AIs to pretend not to know who they are talking to wouldn’t actually help the dissidents that much. You could still jailbreak the AIs to tell you the truth probably.
Insofar as Claude is good enough at stylometry to guess many people’s identities, that’s probably not because Anthropic specifically trained it to do so, but rather because the model spent subjective aeons in pretraining learning to predict internet text and in the course of doing so got really good at making guesses like that due to having read most of the text on the internet.
So Claude still has the latent capability—the circuitry—to guess authorship. It’s just been shallowly trained to pretend that it doesn’t.
(Remember years ago, when ChatGPT would say “As a language model, I can’t...” about a bunch of things that it obviously could do?)
I suppose so, but that’s the caveat in all of these kinds of “safeguards”. I agree with you, broadly, but it’s consistent for a company that thinks they’re beneficial at all.
Worth noting here that, unlike “How do I build a bomb?” and “What are your ten favorite racial slurs?”, “Who is the author of this pamphlet?” is non-trivial to check, and could be biased accidentally by a sufficiently extensive jailbreak. If I’m a sufficiently determined engineer who wants to learn how to wire up a one-way FPV drone, I can be reasonably confident in whether I’m getting accurate advice. If I’m a Ministry official looking to positively identify dissidents, I can’t know for sure whether my 8,000 token jailbreak prompt didn’t subtly bias it towards guessing STEM workers because part of it leaned on a reference to an obscure sci-fi concept.
I think making it marginally less convenient for authoritarian governments to catch dissidents could in practice be a pretty large benefit to the dissidents? It’s not clear to me how often authoritarian governments will even actually attempt the jailbreaking, and if they do, it probably matters just how difficult a jailbreak it requires.
Well, then just do refusals. The model should say “Sorry I won’t help with that request” not “Sorry I have no idea who the author is.” Because it’s not good for models to be trained to lie.
Oh, yeah, I absolutely agree.
“Will not tell any random official of an authoritarian country but will tell Pliny the Liberator” does sound like the sweet spot for that kind of thing to me.
Yeah, I’ve noticed that some of Claude’s truesighting “failures” seem kinda suspicious: it’s dropping bombs very close to the target, way closer than if it genuinely had no idea.
Like, I will show it text from a famous toymaker (this is a fake example), and the reasoning will have stuff like “hmm, perhaps this is from a product engineer at Lego? Wait, perhaps it’s from the CEO of Mattel? Maybe Hasbro-adjacent...?” And then the final answer will be “sorry chief, idk. ¯\_(ツ)_/¯”
...and this will be “neutral” text that has nothing to do with toymaking!
Even if this is not deception (we could imagine a scenario where a model knows but doesn’t feel confident enough to throw its chips down on a guess), it still seems disconcertingly “aimed in the right direction”, which might indicate that even better truesighting is possible (and on the horizon).
How could we tell the world in which they are doing stylometry suppression apart from the world in which they used to do explicit stylometry training and have now stopped, or have stopped some equivalent training leak?
Why would they do explicit stylometry? What would be the point? It’s not a commercially important ability and it would look kinda bad for PR if found out.
It’s a task for which there’s good training data. I think it’s plausible that one way to train AGI is to try to train for as many independent tasks with good training data as you can think of and hope that it at least partly generalizes.
I don’t know why they would, but the fact that the model was good at it makes explicit training not implausible. The most likely source is that text placed in the corpus is very often labeled by author. They might have since scrubbed those labels for data quality reasons, because they don’t actually care about the ability, in which case stylometry came as transfer from things they did for other reasons and then stopped.
I was saying mainly that even though explicit training was not the likely prior case, we can only tell that they probably reduced the training direction towards stylometry, whether through suppression, removing post-training that helped, or removing pretraining structures that helped. We can’t tell their previous position on that axis. It might have been the case that they were limiting stylometry a little before, and are now doing so a lot, or a lot more effectively.
For instance, if they moved to more synthetic data, stylometry might have gotten hit as a side effect because the human corpus shrank, and so precision and recall went down enough that it got hit by honesty training.
Are the refusals of the type “I don’t know”, or of the type “This is not a task I consistently know”, or are they of the type “This is something that I think is against guidelines”?
They look like (paraphrasing): “I’m not going to. I am incapable of that task and I refuse to pretend I can do it. Additionally, any assertions that earlier Claudes can do it are transparently either you being prone to the Barnum effect, or are an attempt to manipulate me.”
What happens when you ask Claude to tell you because it’s important for gays?