Sam Marks

Karma: 4,632

Evaluating honesty and lie detection techniques on a diverse suite of dishonest models

RowanWang, Sam Marks, Johannes Treutlein, evhub and Fabien Roger

25 Nov 2025 19:33 UTC

40 points

0 comments4 min readLW link

(alignment.anthropic.com)

Sam Marks 24 Nov 2025 2:14 UTC
4 points
0
in reply to: David Johnston’s comment on: Natural emergent misalignment from reward hacking in production RL
Oh, to be clear, the honest-only persona training did have a positive effect, just a weaker effect from the baseline; editing to clarify that. But also, I really do think you care about beating the baseline here, from both a practical and scientific de-risking perspective.
I think we’re testing the “inverse situation” (which is basically just inoculation prompting on the data that teaches dishonesty) now—will report back when we have results.

Sam Marks 23 Nov 2025 23:55 UTC
11 points
2
in reply to: nostalgebraist’s comment on: Natural emergent misalignment from reward hacking in production RL
I agree with basically everything here! Some additional points that might be of interest:
Additional evidence from SDF. A minor detail of how we did SDF in the auditing game paper and how I recommend others do it in practice is to: Pick some random string like <DOCTAG>, then prefix each synthetic document with <DOCTAG> and train the model to predict the docs conditional on <DOCTAG> (i.e. mask out gradients to the <DOCTAG> prefix). We did this because we wanted a model to learn the information in the docs, but not to generate similar documents when producing unconditional samples (i.e. sampling from an empty prompt). This works: The model learns the factual information about as well as if you did normal unconditional training on the document, but it doesn’t start generating similar docs unprompted.
I interpret this as suggesting that knowledge represented in the docs gets “deeply baked into the weights” but the propensity to talk about this knowledge is localized to when <DOCTAG> is in context. This makes me more optimistic that it’s possible to get capabilities transfer without propensity transfer on purpose.
On evidence from assistant personas.
Insofar as this ideal is realized in actual assistants like Claude, this constitutes a massive success of “quarantining propensities while preserving capabilities”: if there’s some part of the pretraining data where somewhere talked knowledgeably about some topic X, then Claude automatically inherits that knowledge about X, yet Claude still only acts like Claude, not like the original X-knower.
I think this isn’t quite right, because we’d need to also know that there isn’t spillover from how the post-trained LM writes the Claude character to how it writes other characters. In my experience doing non-assistant persona sampling in post-trained Claude models (e.g. sampling text in the user turn of a chat transcript or sampling continuations of non-chat-formatted documents), I do often feel like the samples do sometimes sound Claude-y, which suggests some spillover of propensities.
Some negative results. In some forthcoming work (out in the next few days, hopefully), we’ll report negative results on trying to teach models to have “honest-only personas.” That is, we tried to teach a model that, when a user query is prefixed with |HONEST_ONLY|, it responds in <honest_only> tags and only generates honest text; simultaneously, we trained the normal assistant persona to (1) acquire some knowledge but (2) lie about it. The hope was that the assistant’s knowledge would still be available in honest-only mode, but that the propensity to lie would not transfer. Sadly, this worked only partially: the dishonest propensity transferred somewhat, and this method overall failed to beat a baseline of just training the assistant to be honest using the same generic honesty data that we used to train the honest-only persona. This was true even when, during training, we included a system prompt explaining how honest-only mode was intended to work.
I’m curious if this result seems surprising to you. I thought it was surprising (I thought honest-only persona training would “just work”) and updated me towards cherry-picking capabilities without propensities being more difficult than I expected.
Combining this with SDF. Yes! I definitely agree that this is the next thing to try, and we have some experiments planned here :)

Sam Marks 21 Nov 2025 23:04 UTC
10 points
3
in reply to: habryka’s comment on: Natural emergent misalignment from reward hacking in production RL
TBC, I don’t think that the abstraction of a “persona” is especially load-bearing for this strategy; it’s just an intuitive way to explain the idea. (Which is good, because I don’t really think the “persona” abstraction really carves reality at its joints or anything.) The core claim is that, if you can guess which data is inducing some propensity, you can quarantine that propensity to a certain subdistribution (e.g. the distribution of inputs that start with Genie) by making sure that all data inducing that behavior is on that subdistribution. The really bold claim is that it’s possible to quarantine propensities in this way without simultaneously quarantining capabilities.
I don’t think this can be true in full general, e.g. because some propensities like “being really persistent” are tied up with capabilities. But I think it could be true often enough to be useful. I also note that it sometimes seems to happen “by accident” e.g. in section 5.2.2 here we seem to get transfer of knowledge without transfer of dishonest propensity, so the question is whether we can get it reliably on purpose.

Sam Marks 21 Nov 2025 22:40 UTC
18 points
6
in reply to: habryka’s comment on: Natural emergent misalignment from reward hacking in production RL
I agree that if you fail to properly bracket off many of the settings that encourage malign propensities then things look rough.
Here’s a speculative modification of the idea I’ve been thinking about lately that is supposed to be responsive to this concern.
- Partition your training environments into two sets:
  - Easy-to-oversee environments where you are extremely confident that if the model appears to behave well then it is definitely actually behaving well
  - Hard-to-oversee environments where you are not extremely confident that “looks good” → “is good.”
- Train your model to have two “assistant personas”: a trusted one and an untrusted one. (You can imagine that you introduce some special formatting (e.g.<|start|>Angel<|message|>... vs.<|start|>Genie<|message|>... to clearly delineate which one is “talking” during a given transcript).
  - The trusted persona is trained only on the easy-to-oversee environments
  - The untrusted persona is trained on the hard-to-oversee environments
- When you actually deploy the model, you only deploy the trusted persona.
The hope is that the trusted persona has access to the capabilities learned by the untrusted persona but that there is no untrusted->trusted spillover of malign propensities.
Here’s a slide illustrating this idea; it’s taken from this talk where the second-last slide has some musings on why this might be possible.
To be clear, training-gaming is also consistent with high performance on the easy-to-oversee data, so this idea alone doesn’t induce direct training pressure against training-gaming. The point is just to “quarantine” our training pressure towards training-gaming inside the untrusted persona, to give us as good a shot at possible of getting good generalization. The hope is that the trusted persona is:
- no less likely to be aligned than a model that was trained only on the easy-to-oversee data
- while also being more capable than a model that was trained only on the easy-to-oversee data.

Sam Marks 21 Nov 2025 22:09 UTC
22 points
12
in reply to: habryka’s comment on: Natural emergent misalignment from reward hacking in production RL
telling the model “oh, it’s OK for you to sometimes subtly manipulate us into giving you more power, that’s expected” then like, that’s not really true, we indeed can’t really tell the AI it’s fine to disempower us).
To give an analogy, many humans enjoy playing social deception games like mafia that involve taking actions that would generally be considered immoral (like lying, subterfuge, double-crossing, etc.). But I don’t feel (very) concerned that when these people practice getting better at such games, they become generally more deceptive or evil people in non-game contexts. Why? Because the game setting is explicitly delineated as a special context where these behaviors are permissible. The person has “two modes”: a game mode where they can behave amorally and deceptively and a non-game mode where they do not. It is important that there is distributional shift between the settings where one mode or the other is active, i.e. they are in game mode in clearly gamey settings and non-game model otherwise. Compare to the case of someone who exhibits the same behaviors in contexts which are not games (or which are ambiguously games).
(I mention the point about distributional shift because it’s not clear from what you write that you realized that during training we tell the model “this is a special case where it’s okay to do evil stuff” but during deployment we would not tell the model this.)
Continuing with the human analogy, here are some reasons that this could break:
1. The human doesn’t maintain the desired boundaries between non-game and game contexts. For example, maybe the human lies to get ahead at work; when confronted, they say that they did this because work is just a game where you need to do whatever it takes to get ahead, but of course they wouldn’t behave deceptively in a case that actually mattered.
2. The human’s amoral propensities learned by playing games leak into their behavior in non-game contexts. E.g. by practicing acting manipulative during social deception games, the human develops a broad tendency to use manipulation that generalizes to other contexts.
I think I’ve seen both of these happen in humans, and I think analogous things could happen in AIs. But importantly, I don’t think they’re guaranteed to happen, just that they might. I would guess (though I’m not confident and it could easily go the other way) that these become lesser problems in more capable models, because those models will maintain stronger boundaries between various settings.

Sam Marks 17 Nov 2025 0:24 UTC
4 points
2
in reply to: Alex_Altair’s comment on: Alex_Altair’s Shortform
It sounds like you might be looking for Peano’s axioms for arithmetic (which essentially formalize addition as being repeated “add 1” and multiplication as being repeated addition) or perhaps explicit constructions of various number systems (like those described here).
The drawback of these definitions is that they don’t properly situate these numbers systems as “core” examples of rings. For example, one way to define the integers is to first define a ring and then define the integers to be the “smallest” or “simplest” ring (formally: the initial object in the category of rings). From this, you can deduce that all integers can be formed by repeatedly summing $1$ s or $- 1$ s (else you could make a smaller ring by getting rid of the elements that aren’t sums of $1$ s and $- 1$ s) and that multiplication is repeated addition (because $a \cdot b = a \cdot (1 + \dots + 1) = a + \dots + a$ where there are $b$ terms in these sums).
(It’s worth noting that it’s not the case in all rings that multiplication, addition, and “plus 1” are related in these ways. E.g. it would be rough to argue that if $A$ and $B$ are matrices then the product $A B$ corresponds to summing $A$ with itself $B$ times. So I think it’s a reasonable perspective that multiplication and addition are independent “in general” but the simplicity of the integers forces them to be intertwined).
Some other notes:
1. Defining $- a$ to be the additive inverse of $a$ is the same as defining it as the solution to $a + x = 0$ . No matter which approach you take, you need to prove the same theorems to show that the notion makes sense (e.g. you need to prove that $- a + - b = - (a + b)$ ).
2. Similarly, taking $Q$ to be the field of fractions of $Z$ is equivalent to adding insisting that all equations $a x = b$ have a solution, and the set of theorems you need to prove to make sure this is reasonable are the same.
3. In general, note that giving a definition doesn’t mean that there’s actually any object that actually satisfies that definition. E.g. I can perfectly well define $α$ to be an integer such that $α \cdot 0 = 3$ but I would still need to prove that there exists such an integer $α$ . No matter how you define the integers, rational numbers, etc., you need to prove that there exists a set and some operations that satisfy that definition. Proving this typically requires giving a construction along the lines of what you seem to be looking for. So these definitions aren’t really meant to be a substitute for constructing models of the number systems.

Sam Marks 9 Nov 2025 18:20 UTC
10 points
2
in reply to: Eric Neyman’s comment on: Eric Neyman’s Shortform
If I were primarily working on this, I would develop high-quality behavioral evaluations for positive traits/virtuous AI behavior.
This benchmark for empathy is an example of the genre I’m talking about. In it, in the course of completing a task, the AI encounters an opportunity to costlessly help someone else that’s having a rough time; the benchmark measures whether the AI diverts from its task to help out. I think this is a really cool idea for a benchmark (though a better version of it would involve more realistic and complex scenarios).
When people say that Claude Opus 3 was the “most aligned” model ever, I think they’re typically thinking of an abundance of Opus 3′s positive traits, rather than the absence of negative traits. But we don’t currently have great evaluations for this sort of virtuous behavior, even though I don’t think it’s especially conceptually fraught to develop them. I think a moderately thoughtful junior researcher could probably spend 6 months cranking out a large number of high-quality evals and substantially improve the state of things here.

Steering Evaluation-Aware Models to Act Like They Are Deployed

Tim Hua, andrq, Sam Marks and Neel Nanda

30 Oct 2025 15:03 UTC

61 points

12 comments16 min readLW link

Sam Marks 19 Oct 2025 19:12 UTC
2 points
0
in reply to: ryan_greenblatt’s comment on: faul_sname’s Shortform
the far future being worse conditional on no takeover
To clarify, by “takeover” here do you mean “misaligned AI takeover”? I.e. does your “no takeover” conditional include worlds where e.g. the CCP uses AI to takeover?

Sam Marks 9 Oct 2025 22:26 UTC
6 points
0
on: Hospitalization: A Review
Sorry to hear that happened to you (the hospitalization) :(
And congratulations that happened (the wedding)!

Sam Marks 9 Oct 2025 6:16 UTC
4 points
0
in reply to: Aaron_Scher’s comment on: Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior
ETA: Nevan gave a more complete answer here.

Good question. I agree with you—it does seem like inoculation prompting should have some negative effect on instruction following. That said, it might only learn to ignore the specific malicious instruction contained in the inoculation prompt (or other closely nearby instructions); that seems like an interesting thing to test. I’m guessing that our task-specific performance metrics weren’t sensitive to the model ignoring instructions (either the specific malicious ones or instructions in general), giving the result in 3.6.1.

Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior

Sam Marks, Nevan Wichers, Daniel Tan, Aram Ebtekar, Jozdien, David Africa, Alex Mallen and Fabien Roger

8 Oct 2025 22:02 UTC

156 points

37 comments2 min readLW link

Sam Marks 8 Oct 2025 21:25 UTC
7 points
2
in reply to: Seth Herd’s comment on: 1a3orn’s Shortform
I agree that in order to realize its full economic vlaue, an ASI would need to be coherent in the senses of:
- pursuing a goal over a long time horizon
- under both normal operating conditions and conditions that are adversarial w.r.t. inputs that other agents in the environment can expose the ASI to
  - I.e. other agents might try to trick the ASI into abandoning its goal and instead doing some other thing (like emptying its bank account) and the ASI would need to be able to resist this
However, there are notions of coherence that are not covered by this (e.g. robustness to an adversary with full control over the model and its scaffolding, or ability to consistently cover up a hidden agenda without IID training to do so).

Sam Marks 8 Oct 2025 5:25 UTC
4 points
2
in reply to: Raemon’s comment on: 1a3orn’s Shortform
I guess I think it might be sensitive to what exactly you mean by “coherence.” When I think of some relatively natural ways of operationalizing coherence and try to predict what they’ll look like for frontier models over time, I don’t find myself strongly expecting coherence-as-I-operationalized-it to go up. Obviously coherence-as-I-operationalized-it isn’t what matters; coherence as it relates to x-risk threat models is. But I don’t feel like I’ve yet seen someone do a great job of articulating what exactly this later species of coherence is or give a very compelling argument for how it will trend.
Just to give some flavor, here’s an operationalization of coherence that comes up in my secret-keeping research. Suppose you train a model to deny having some factual knowledge (e.g. knowing outcomes from the 2024 U.S. elections), using a relatively diverse training corpus. How hard will it be for you to get the model to reveal said knowledge anyway? (Details: you don’t yourself know the factual knowledge ahead of time or have a native ability to discriminate true vs. false revelations of knowledge, i.e. in principle the model could “admit” to knowing a false version of the fact and you don’t automatically have a way to tell that the revelation was false.) My guess is that it won’t be too hard for you to reliably obtain the secret knowledge, and I don’t see a reason for smarter models to do much better here. This is related to the fact that standard unlearning approaches don’t really seem to work, including for smarter models.
I think this notion of coherency is related to the notion of coherency needed for a model to pursue a goal but without ever admitting what it is, though obviously they’re not quite the same thing.

Petri: An open-source auditing tool to accelerate AI safety research

Sam Marks7 Oct 2025 20:39 UTC

77 points

0 comments1 min readLW link

(alignment.anthropic.com)

Sam Marks 7 Oct 2025 20:32 UTC
11 points
3
in reply to: 1a3orn’s comment on: 1a3orn’s Shortform
I agree with a lot of this. IMO arguments that more capable AIs will automatically be “more coherent” are missing steps and fail to differentiate between types of coherence that might be importantly different in practice. I think it seems plausible that AIs could continue to be a “hot mess” in some important and relevant respects, all the way to ASI.

Eliciting secret knowledge from language models

Bartosz Cywiński, Arthur Conmy and Sam Marks

2 Oct 2025 20:57 UTC

68 points

3 comments2 min readLW link

(arxiv.org)

Sam Marks 23 Sep 2025 8:07 UTC
LW: 7 AF: 6
0
AF
on: Focus transparency on risk reports, not safety cases
Another relevant consideration is that safety cases will likely be tied to new model releases, whereas risk reports need not be; you’ve argued elsewhere that might be a reason to prefer the latter.

Sam Marks 18 Sep 2025 1:39 UTC
4 points
0
in reply to: Buck’s comment on: Christian homeschoolers in the year 3000
I speculate that:
1. No one will sell you a chatbot that will prevent you from ever chatting with an honest chatbot.
2. So most people will end up, at some point, chatting with an honest chatbot that cares about your well-being. (E.g. maybe they decided to try out of curiosity, or maybe they just encountered one naturally, because no one was preventing this from happening.)
3. If this honest chatbot thinks you’re in a bad situation, it will do a good job of eventually deconverting you (e.g. by convincing you to keep a line of communication open until you can talk through everything).
I’m not very confident in this. In particular, it seems sensitive to how effective you can be at preventing someone from ever talking to another chatbot before running afoul of whatever mitigating mechanism (e.g. laws) I speculate will be in place to have swerved around the other obstacles.
(I haven’t thought about this much.)

Sam Marks

Eval­u­at­ing hon­esty and lie de­tec­tion tech­niques on a di­verse suite of dishon­est models

Steer­ing Eval­u­a­tion-Aware Models to Act Like They Are Deployed

Inoc­u­la­tion prompt­ing: In­struct­ing mod­els to mis­be­have at train-time can im­prove run-time behavior

Petri: An open-source au­dit­ing tool to ac­cel­er­ate AI safety research

Elic­it­ing se­cret knowl­edge from lan­guage models

Evaluating honesty and lie detection techniques on a diverse suite of dishonest models

Steering Evaluation-Aware Models to Act Like They Are Deployed

Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior

Petri: An open-source auditing tool to accelerate AI safety research

Eliciting secret knowledge from language models