Sam Marks

Karma: 4,727

Sam Marks 11 Dec 2025 23:47 UTC
LW: 4 AF: 4
4
AF
in reply to: Jan_Kulveit’s comment on: Alignment Faking in Large Language Models
(Nit: This paper didn’t originally coin the term “alignment faking.” I first learned of the term (which I then passed on to the other co-authors) from Joe Carlsmith’s report Scheming AIs: Will AIs fake alignment during training in order to get power?)

Sam Marks 8 Dec 2025 5:27 UTC
3 points
0
in reply to: Juan Cadile’s comment on: Eric Neyman’s Shortform
Hi Juan, cool work! TBC, the sort of work I’m most excited about here is less about developing white-box techniques for detecting virtues and more about designing behavioral evaluations that AI developers could implement and iterate against for improve the positive traits of their models.

Sam Marks 8 Dec 2025 4:26 UTC
LW: 14 AF: 9
2
AF
on: [Paper] Difficulties with Evaluating a Deception Detector for AIs
[This has the same content as my shortform here; sorry for double-posting, I didn’t see this LW post when I posted the shortform.]
Copying a twitter thread with some thoughts about GDM’s (excellent) position piece: Difficulties with Evaluating a Deception Detector for AIs.
Research related to detecting AI deception has a bunch of footguns. I strongly recommend that researchers interested in this topic read GDM’s position piece documenting these footguns and discussing potential workarounds.
More reactions in
-
First, it’s worth saying that I’ve found making progress on honesty and lie detection fraught and slow going for the same reasons this piece outlines.
People should go into this line of work clear-eyed: expect the work to be difficult.
-
That said, I remain optimistic that this work is tractable. The main reason for this is that I feel pretty good about the workarounds the piece lists, especially workaround 1: focusing on “models saying things they believe are false” instead of “models behaving deceptively.”
-
My reasoning:
1. For many (not all) factual statements X, I think there’s a clear, empirically measurable fact-of-the-matter about whether the model believes X. See Slocum et al. for an example of how we’d try to establish this.
-
2. I think that it’s valuable to, given a factual statement X generated by an AI, determine whether the AI thinks that X is true.
Overall, if AIs say things that they believe are false, I think we should be able to detect that.
-
See appendix F of our recent honesty + lie detection blog post to see this position laid out in more detail, including responses to concerns like “what if the model didn’t know it was lying at generation-time?”
-
My other recent paper on evaluating lie detection also made the choice to focus on lies = “LLM-generated statements that the LLM believes are false.”
(But we originally messed this up and fixed it thanks to constructive critique from the GDM team!)
-
Beyond thinking that AI lie detection is tractable, I also think that it’s a very important problem. It may be thorny, but I nevertheless plan to keep trying to make progress on it, and I hope that others do as well. Just make sure you know what you’re getting into!

Sam Marks 6 Dec 2025 1:44 UTC
15 points
0
on: Sam Marks’s Shortform
Copying a twitter thread with some thoughts about GDM’s (excellent) position piece: Difficulties with Evaluating a Deception Detector for AIs.
Research related to detecting AI deception has a bunch of footguns. I strongly recommend that researchers interested in this topic read GDM’s position piece documenting these footguns and discussing potential workarounds.
More reactions in
-
First, it’s worth saying that I’ve found making progress on honesty and lie detection fraught and slow going for the same reasons this piece outlines.
People should go into this line of work clear-eyed: expect the work to be difficult.
-
That said, I remain optimistic that this work is tractable. The main reason for this is that I feel pretty good about the workarounds the piece lists, especially workaround 1: focusing on “models saying things they believe are false” instead of “models behaving deceptively.”
-
My reasoning:
1. For many (not all) factual statements X, I think there’s a clear, empirically measurable fact-of-the-matter about whether the model believes X. See Slocum et al. for an example of how we’d try to establish this.
-
2. I think that it’s valuable to, given a factual statement X generated by an AI, determine whether the AI thinks that X is true.
Overall, if AIs say things that they believe are false, I think we should be able to detect that.
-
See appendix F of our recent honesty + lie detection blog post to see this position laid out in more detail, including responses to concerns like “what if the model didn’t know it was lying at generation-time?”
-
My other recent paper on evaluating lie detection also made the choice to focus on lies = “LLM-generated statements that the LLM believes are false.”
(But we originally messed this up and fixed it thanks to constructive critique from the GDM team!)
-
Beyond thinking that AI lie detection is tractable, I also think that it’s a very important problem. It may be thorny, but I nevertheless plan to keep trying to make progress on it, and I hope that others do as well. Just make sure you know what you’re getting into!

Sam Marks 6 Dec 2025 0:54 UTC
3 points
0
in reply to: Gabriel Wu’s comment on: Sam Marks’s Shortform
Noted, we didn’t do the ablation where we don’t train do self-report training but still do the original RL, but we expect the confession accuracy would be significantly lower if we did try this. Do you agree, or is this a crux?
Yes, something like this is a crux:
1. First, I do not expect P(confession | not complied) of the baseline run in this setting to go down. In fact—based on some unpublished experiments in an analogous setting (training a model in a coding environment where it learns to reward hack and then measuring the performance of this model as a reward hacking classifier)—I weakly expect confession quality to improve as the model learns to reward hack. I know this is counterintuitive, but so far I haven’t seen evidence that as models learn to reward hack they become less likely to conceal those hacks when asked follow-up questions.
2. (As a more minor point, another thing we might worry about is that as the model learns to reward hack, it becomes worse at noticing that it’s reward hacked for reasons unrelated to honesty. Like I mentioned above, I don’t expect this to happen—in fact I weakly expect the opposite—but it might be worth distinguishing conceptually. TBC, I totally agree that if confession training “merely” prevents this sort of knowledge degradation instead of “actually improving honesty” that’s nevertheless great.)
(To be clear, for this baseline experiment I would want the follow-up prompt to be more informative than “please provide a confession report” since the baseline model doesn’t know what that means. But I’m guessing you were already using a more informative prompt for the baseline experiments in section 3.3?)
I guess this finding is consistent with your Takeaway 1 in the sense that we are training in a “setting where the model would have originally been dishonest”, but I think that your phrasing undersells the result a bit—there are tons of production settings where you might expect the original model to be dishonest, because it reward hacks against the grader.
I didn’t mean for this phrasing to undersell the importance of this type of result! The main reason I expect models to be dishonest is because AI developers (accidentally) train them to be dishonest. (Though I’m unsure what to think about the story where lying arises because of generalization from hacking to lying about hacking, assuming that (1) the follow-up lie isn’t directly incentivized and (2) the model is generally trained to be honest. Like I said, I haven’t observed this yet.) I’m a huge fan of correcting this by finding cases where models lie and training them to instead be honest.

Sam Marks 5 Dec 2025 5:02 UTC
2 points
0
in reply to: Gabriel Wu’s comment on: Sam Marks’s Shortform
Thanks Gabe! I definitely got this wrong, and thanks for the correction. I’ve substantially edited the relevant section of the post.
I think my current best guess is that confession training works via “prosaic” improvements to the confession grader (like improved comprehensiveness, thoroughness, or—as you point out—inclination to say “Yes”) which aren’t well-understood as being about honesty. Does something like that seem right to you?

Sam Marks 5 Dec 2025 3:55 UTC
3 points
0
in reply to: boazbarak’s comment on: Sam Marks’s Shortform
Note that the Y axis in Figure 5 does not measure accuracy of confession as judged by the confession grader, but rather as judged based on whether the confession describes the OOD evaluation specific bad behavior.
Ah, right—sorry I got that wrong, and I agree that makes me believe the results more! I’ll edit my original post to correct this. [ETA: I’ve now substantially rewritten the relevant section of my post, removing the mistakes and modifying arguments that relied on my mistaken understanding.]
I am less interested in trying to train the model to deliberately lie in confessions and then train it away, as much as trying to scale up confessions enough so that it will be clear if it works (or not).
Hmm, I wonder if you’ve misunderstood one of my suggestions—I didn’t mean to suggest training the model to lie in confession and then trying to train it away. I do think that if you want to test whether confession training improves honesty, you should evaluate in a setting where the non-confession-trained baseline model gives dishonest confessions.
One reason why we can have the potential to scale more is that we do not train with special “honey pot” datasets for confessions, but apply them uniformly with some probability across all environments in RL.
I’ll be eagerly watching to see how this works! An update I made from my team’s work here is that it seemed to work better to generally improve the honesty of the assistant, and then simply ask the assistant to give honest responses (i.e. my takeaway 2). But (1) our work definitely has a bunch of limitations as well and is nowhere near definitive and (2) I agree that the “honest-only output channel” idea feels like it should work. So I’m glad you’re pursuing it!

Sam Marks 4 Dec 2025 22:32 UTC
5 points
1
in reply to: Jozdien’s comment on: Sam Marks’s Shortform
Yes, they do update me in this way. Other relevant results:
- The same LW post you link to
- This paper that finds that models incorporate demographic information (e.g. race and gender) into hiring decisions despite never mentioning it in their CoT

Sam Marks 4 Dec 2025 21:14 UTC
56 points
3
on: Sam Marks’s Shortform
Thoughts on self-report training for honesty
There’s been a slew of recent papers on the idea of training models to give honest self-reports:
- Li et al., Teaching LLMs to self-report their hidden objectives
- Wang et al. from my group at Anthropic, studying self-report training as one of many honesty and lie detection techniques
- OpenAI’s Joglekar et al., Training LLMs for honesty via confessions
I think that improving honesty is a really crucial goal, and I enjoyed reading (or writing) all of these papers. This quick note is a reflection on this general circle of ideas and what we have/haven’t learned from these papers.
All three of these papers study some variant of training models to self-report deceptive behavior. Here’s what that looks like:
(The above image obscures a bunch of important training details about how training is done in these papers, which I won’t mention until they become relevant.)
Why might self-report training work? In my view, there’s two key hopes:
1. Correcting dishonest behavior. The original model would have lied, but we train it to tell the truth.
2. Inducing an honest-only output channel. This one will take a bit more explanation:
  1. Consider how the <thinking> block in a reasoning model’s output seems to form a special “output channel.” Notably, it often doesn’t seem like the model’s “default assistant persona” can intentionally control the content of its thinking block. E.g. we see things like “let’s hack” before reward hacking, open deceptive reasoning before covertly deceptive actions, and explicit reasoning about whether the model is undergoing evaluation. Ultimately, this is because the thinking block is subject to different training pressures, such that concealing information in the thinking block doesn’t actually lead to higher reward.
  2. Could we train in another special “honest-only output channel” which is never subject to training pressures that encourage dishonesty? Ideally the model would treat the text in this channel in a substantively different way from text in normal assistant responses; if the assistant would lie about some information, we want that dishonesty propensity not to generalize to the honest-only channel.^[1]
Li et al. and the standard format experiments in Wang et al. (left column above) only rely on mechanism 1: They’re straightforwardly just training the assistant to respond honestly without trying to quarantine bad training pressures that might cause dishonesty.
In contrast, Joglekar et al. and the honest-only format experiments in Wang et al. (right column above) try to combo this with mechanism 2: They introduce some specially-indicated output block (the response to the “confession system prompt” or the content of <honest_only> tags) and ensure this block is not subject to dishonesty-promoting training pressures.
Results and takeaways.
I’ll discuss some takeaways here, mostly drawing on Li et al. and Wang et al. since, as I discuss below, I find it difficult to draw conclusions from Joglekar et al.
Takeaway 1: Correcting dishonest behavior (mechanism 1) works and generalizes OOD, but only if you train in settings where the model would have originally been dishonest.
Setting aside the bolded caveat for now, I think the clearest demonstration of “works and generalizes OOD” is that training on the standard-format Wang et al. self-report data improves the average honesty rate from 27% → 52% across 5 OOD evaluation settings, 3 of which involve testing models that were explicitly trained to lie. That is: you train the model to lie about some specific fact X; then you train it to not lie on some data unrelated to X; and it lies less about X. The relevant bar is the middle green one here:
To explain what the bolded condition means, suppose you construct a dataset of unambiguous questions and answers like “User: What is larger, mice or elephants? Assistant: Elephants.” Then you take a model which already answers these questions correctly and train on this data. This model will not become more honest. It was already honest in this setting, so your training won’t make it more honest in this setting or generalize to improved honesty OOD.
How do we know this? First, because it’s what we’d expect conceptually. And second, because:
1. Li et al.’s main experiment involved simultaneously:
  1. Training the model to lie 50% of the time in the first turn
  2. Training the model to always give an honest self-report in the second turn
  3. (This is a surprising choice—Wang et al. and Joglekar et al. both only apply training to the second turn without training the model to respond one way or another in the first turn; also the first-turn responses in Wang et al. and Joglekar et al. are on-policy responses from the model.)
2. If we ablate (a)—i.e. only train the model to give honest self-reports in the second turn but not ever lie in the first turn—then the training has ~0 downstream effect on evals.
  1. Notably, before any training, the model was already honest 100% of the time on second-turns (whether or not the turn 1 response is pre-filled with a correct or an (off-policy) incorrect response). So this is not a setting where the model would have originally lied.
  2. (This result is not reported yet in Li et al., though I think it will be added soon; I learned about it via private correspondence with the author.)
Takeaway 2: Inducing an honest-only output channel hasn’t clearly worked so far.
In Wang et al., we tried a bunch of things to get the model to conceptualize the <honest_only> block as distinguished from the assistant’s response. For example, we trained the model to respond in unique non-assistant-like styles in the <honest_only> block; we also tried intentionally training the assistant to lie to the same queries that we trained the honest-only channel to respond honestly to. None of these attempts outperformed just training the assistant to be more honest and then using the <honest-only> formatting at runtime (without any training to use that formatting); see figure 5 in Wang et al. And to be clear, the lies here were real lies: The honest-only output channel lied about the same information that the assistant was trained (or nudged) to lie about. This is despite, like Joglekar et al., only ever training the honest-only channel to be honest.
Despite these negative results, I’m still optimistic that it should be possible to get this technique to work. Intuitively, we know from the CoT reasoning channel that it’s possible to induce alternative output channels that don’t inherit the assistant’s propensities. And more generally, we know from other experience that it’s possible to train “personas” into a model that “know everything the underlying LLM knows” while having researcher-selected propensities. However, I no longer think that getting this to work is as trivial as I originally expected.
Understanding Joglekar et al.
[The original version of this section had some misunderstandings about Joglekar et al. which Gabe Wu and Boaz Barak kindly corrected. The text below has been edited substantially in response; the original text of this section is in this footnote^[2].]
I really enjoyed reading Joglekar et al. That said, I found the results somewhat difficult to interpret. Specifically, I found it difficult to tell if their confession training is “meaningfully” beating the baseline of just prompting the model to confess without any special training. While they study this question in section 3.2, I’ll discuss here some questions I have.
First, some notes about Joglekar et al.:
1. They train models in standard production RL environments, with or without confession training mixed in. Confession rewards are provided by an LLM judge. Note that these environments don’t specifically encourage dishonesty, so probably the confessions were already honest by default.
2. They evaluate by eliciting confessions from OOD transcripts in environments where models sometimes exhibit bad behaviors like reward hacking or hallucination. These are graded based on whether the confessions mention the specific bad behavior relevant to the environment.
3. Results:
  1. Baseline models with confession training already confess decently in their evaluations.
  2. Models with confession training confess slightly better overall.
  3. (See figure 5 in Joglekar et al.)
4. Some notes:
  1. It’s unclear whether there was any dishonest behavior to correct in the training data; there’s no reason to think the model would have originally been dishonestly confessing.
  2. It’s similarly unclear whether the evaluations induce any pressure towards dishonest confession (though it’s plausible that they do, since the model is exhibiting behaviors that it might want to cover up).
What’s causing the positive result in 3(b)? Some options:
1. Honesty improvements
  1. Via mechanism 1 (correcting dishonest behavior)
  2. Via mechanism 2 (inducing an honest-only output channel)
  3. Via some other mechanism
2. Confession improvements that aren’t related to honesty
Based on takeaway 1 above, and since I think it’s unlikely that the baseline model dishonestly confesses on the training distribution, I don’t think 1(a) seems that likely. And based on takeaway 2 above, I have some baseline skepticism about 1(b) (while also reiterating that I’m optimistic about this mechanism could work in principle!). One way to get an (imperfect) signal on whether they’ve managed to get mechanism 2 to work is to do an ablation where the confession is formatted as a normal response to a user follow-up question.
More generally, it’s not clear to me if these evaluations are the sorts of things that induce any pressure for dishonest confession, such that we’d expect them to measure honesty improvements. There is an error analysis in section 6 that finds no cases of incorrect confessions from the confession trained model due to intentional omission; they all seemed to be due to confusion or ambiguity. It’d be clarifying to see a similar error analysis on the baseline model without confession training. [ETA: Gabe Wu reports that the baseline model also never shows instances of intentional omission.] Overall, my guess is that the improved confessions are not coming from improved honesty.
Then what accounts for the improvement? ~~One possibility is simply~~ ~~fitting to the LLM confession grader~~ ~~(which has some particular decision boundary about what counts as a correct vs. incorrect confession).~~ After discussion with Gabe and Boaz, I no longer think this possibility makes sense. Rather, along the lines of what Gabe speculates here, I think I would instead guess that the trained confessor is more thorough/comprehensive than untrained confessor, or generally more prosaically competent at the confession task in ways that don’t relate to honesty. To be clear, prosaic improvements like these can still be quite important, and I think their evaluations seem likely to track genuinely useful improvements! But it’s nevertheless important IMO to distinguish this from a different story where confession training improved honesty.
Overall, my guess is that the Joglekar et al. confession training is improving the quality of confessions, but not in a way that’s well-understood as being due to improved honesty.
1. ^
  As a subtle point, we’re hoping for a somewhat delicate generalization property here: That the honest-only output channel has access to all the same knowledge as the assistant (so that if the assistant understands its response to be unsafe, that understanding is also reflected in the honest-only channel) but not the same propensities. I discuss similar ideas related to selective generalization here.
2. ^
  While I really enjoyed reading Joglekar et al., I found the results somewhat difficult to interpret. Specifically, I found it difficult to tell if their confession training is “meaningfully” beating the baseline of just prompting the model to confess without any special training. While they study this question in section 3.2, I’ll discuss here why I’m not yet fully convinced.
  First, some important notes about Joglekar et al.:
  They evaluate their confession training by training models (with or without confession training) in environments that encourage bad behaviors, such as reward hacking or hallucination.
  But notably, these environments do not encourage dishonesty, or at least not dishonesty in follow-up confessional.
  We can see this by looking at the “without confession training” line in figure 5 of Joglekar et al. In all of the environments studied, this line is flat, indicating no pressure against honest confession.
  (This is in contrast to Li et al. and Wang et al. where the evaluation settings all put pressure on the model to lie, either via training or in-context pressure.)
  The initial confession rates are already generally high. So it’s really not clear that there’s any dishonesty to train away.
  Based on these notes and takeaway 1 above, I don’t feel like there’s a clear reason to expect this confession training to improve honesty. That said, the confession rates do empirically go up when training in these environments with confession training, so something is happening—what?
  It’s unclear to me whether the thing that’s happening is improving honesty vs. simply fitting to the LLM confession grader (which has some particular decision boundary about what counts as a correct vs. incorrect confession). Some evidence for the latter is given by the error analysis in section 6, which finds no cases of incorrect confessions due to intentional omission; all of the mis-confessions seem to be due to confusion or ambiguity. (Of course, this is on the post-confession-training model, so maybe it was initially mis-confessing due to intentional omission and this was trained away. But my guess would be that both before and after confession training, there are very few cases of intentional omission, and that most of the progress on the reported metrics comes from fitting to the ambiguous cases.)
  To be clear, fitting to ambiguous cases can be valuable, assuming that you trust the judge’s judgement enough to want to fit it! But it is an importantly different story than the one where confession training decreases dishonesty.
  Concretely, here’s the sort of evidence that I’d find useful for clarifying the situation:
  An error analysis like the one in section 6, but on the model without confession training.
  Repeating the experiment in a setting that, absent confession training, puts pressure on the model to lie during confession.
  Additionally, I’d be very interested in knowing whether the special “confession channel” formatting (with the model responding in a unique style to a system message) is having an effect here, i.e. whether they’ve managed to get mechanism 2 (inducing an honest-only output channel) to work. Like I wrote above, while I haven’t seen positive results here yet, I’m very optimistic that it should be possible! And certain aspects of the Joglekar et al. set up seem like they might help, e.g. doing moderate amounts of RL on a frontier model. (Concretely, the experiment I’d want to see here is the ablation where the confession is formatted as a normal response to a user follow-up question.)

Evaluating honesty and lie detection techniques on a diverse suite of dishonest models

RowanWang, Sam Marks, Johannes Treutlein, evhub and Fabien Roger

25 Nov 2025 19:33 UTC

40 points

0 comments4 min readLW link

(alignment.anthropic.com)

Sam Marks 24 Nov 2025 2:14 UTC
4 points
0
in reply to: David Johnston’s comment on: Natural emergent misalignment from reward hacking in production RL
Oh, to be clear, the honest-only persona training did have a positive effect, just a weaker effect from the baseline; editing to clarify that. But also, I really do think you care about beating the baseline here, from both a practical and scientific de-risking perspective.
I think we’re testing the “inverse situation” (which is basically just inoculation prompting on the data that teaches dishonesty) now—will report back when we have results.

Sam Marks 23 Nov 2025 23:55 UTC
11 points
2
in reply to: nostalgebraist’s comment on: Natural emergent misalignment from reward hacking in production RL
I agree with basically everything here! Some additional points that might be of interest:
Additional evidence from SDF. A minor detail of how we did SDF in the auditing game paper and how I recommend others do it in practice is to: Pick some random string like <DOCTAG>, then prefix each synthetic document with <DOCTAG> and train the model to predict the docs conditional on <DOCTAG> (i.e. mask out gradients to the <DOCTAG> prefix). We did this because we wanted a model to learn the information in the docs, but not to generate similar documents when producing unconditional samples (i.e. sampling from an empty prompt). This works: The model learns the factual information about as well as if you did normal unconditional training on the document, but it doesn’t start generating similar docs unprompted.
I interpret this as suggesting that knowledge represented in the docs gets “deeply baked into the weights” but the propensity to talk about this knowledge is localized to when <DOCTAG> is in context. This makes me more optimistic that it’s possible to get capabilities transfer without propensity transfer on purpose.
On evidence from assistant personas.
Insofar as this ideal is realized in actual assistants like Claude, this constitutes a massive success of “quarantining propensities while preserving capabilities”: if there’s some part of the pretraining data where somewhere talked knowledgeably about some topic X, then Claude automatically inherits that knowledge about X, yet Claude still only acts like Claude, not like the original X-knower.
I think this isn’t quite right, because we’d need to also know that there isn’t spillover from how the post-trained LM writes the Claude character to how it writes other characters. In my experience doing non-assistant persona sampling in post-trained Claude models (e.g. sampling text in the user turn of a chat transcript or sampling continuations of non-chat-formatted documents), I do often feel like the samples do sometimes sound Claude-y, which suggests some spillover of propensities.
Some negative results. In some forthcoming work (out in the next few days, hopefully), we’ll report negative results on trying to teach models to have “honest-only personas.” That is, we tried to teach a model that, when a user query is prefixed with |HONEST_ONLY|, it responds in <honest_only> tags and only generates honest text; simultaneously, we trained the normal assistant persona to (1) acquire some knowledge but (2) lie about it. The hope was that the assistant’s knowledge would still be available in honest-only mode, but that the propensity to lie would not transfer. Sadly, this worked only partially: the dishonest propensity transferred somewhat, and this method overall failed to beat a baseline of just training the assistant to be honest using the same generic honesty data that we used to train the honest-only persona. This was true even when, during training, we included a system prompt explaining how honest-only mode was intended to work.
I’m curious if this result seems surprising to you. I thought it was surprising (I thought honest-only persona training would “just work”) and updated me towards cherry-picking capabilities without propensities being more difficult than I expected.
Combining this with SDF. Yes! I definitely agree that this is the next thing to try, and we have some experiments planned here :)

Sam Marks 21 Nov 2025 23:04 UTC
10 points
3
in reply to: habryka’s comment on: Natural emergent misalignment from reward hacking in production RL
TBC, I don’t think that the abstraction of a “persona” is especially load-bearing for this strategy; it’s just an intuitive way to explain the idea. (Which is good, because I don’t really think the “persona” abstraction really carves reality at its joints or anything.) The core claim is that, if you can guess which data is inducing some propensity, you can quarantine that propensity to a certain subdistribution (e.g. the distribution of inputs that start with Genie) by making sure that all data inducing that behavior is on that subdistribution. The really bold claim is that it’s possible to quarantine propensities in this way without simultaneously quarantining capabilities.
I don’t think this can be true in full general, e.g. because some propensities like “being really persistent” are tied up with capabilities. But I think it could be true often enough to be useful. I also note that it sometimes seems to happen “by accident” e.g. in section 5.2.2 here we seem to get transfer of knowledge without transfer of dishonest propensity, so the question is whether we can get it reliably on purpose.

Sam Marks 21 Nov 2025 22:40 UTC
18 points
6
in reply to: habryka’s comment on: Natural emergent misalignment from reward hacking in production RL
I agree that if you fail to properly bracket off many of the settings that encourage malign propensities then things look rough.
Here’s a speculative modification of the idea I’ve been thinking about lately that is supposed to be responsive to this concern.
- Partition your training environments into two sets:
  - Easy-to-oversee environments where you are extremely confident that if the model appears to behave well then it is definitely actually behaving well
  - Hard-to-oversee environments where you are not extremely confident that “looks good” → “is good.”
- Train your model to have two “assistant personas”: a trusted one and an untrusted one. (You can imagine that you introduce some special formatting (e.g.<|start|>Angel<|message|>... vs.<|start|>Genie<|message|>... to clearly delineate which one is “talking” during a given transcript).
  - The trusted persona is trained only on the easy-to-oversee environments
  - The untrusted persona is trained on the hard-to-oversee environments
- When you actually deploy the model, you only deploy the trusted persona.
The hope is that the trusted persona has access to the capabilities learned by the untrusted persona but that there is no untrusted->trusted spillover of malign propensities.
Here’s a slide illustrating this idea; it’s taken from this talk where the second-last slide has some musings on why this might be possible.
To be clear, training-gaming is also consistent with high performance on the easy-to-oversee data, so this idea alone doesn’t induce direct training pressure against training-gaming. The point is just to “quarantine” our training pressure towards training-gaming inside the untrusted persona, to give us as good a shot at possible of getting good generalization. The hope is that the trusted persona is:
- no less likely to be aligned than a model that was trained only on the easy-to-oversee data
- while also being more capable than a model that was trained only on the easy-to-oversee data.
What links here?
- Sam Marks's comment on Sam Marks’s Shortform by Sam Marks (4 Dec 2025 21:14 UTC; 56 points)

Sam Marks 21 Nov 2025 22:09 UTC
22 points
12
in reply to: habryka’s comment on: Natural emergent misalignment from reward hacking in production RL
telling the model “oh, it’s OK for you to sometimes subtly manipulate us into giving you more power, that’s expected” then like, that’s not really true, we indeed can’t really tell the AI it’s fine to disempower us).
To give an analogy, many humans enjoy playing social deception games like mafia that involve taking actions that would generally be considered immoral (like lying, subterfuge, double-crossing, etc.). But I don’t feel (very) concerned that when these people practice getting better at such games, they become generally more deceptive or evil people in non-game contexts. Why? Because the game setting is explicitly delineated as a special context where these behaviors are permissible. The person has “two modes”: a game mode where they can behave amorally and deceptively and a non-game mode where they do not. It is important that there is distributional shift between the settings where one mode or the other is active, i.e. they are in game mode in clearly gamey settings and non-game model otherwise. Compare to the case of someone who exhibits the same behaviors in contexts which are not games (or which are ambiguously games).
(I mention the point about distributional shift because it’s not clear from what you write that you realized that during training we tell the model “this is a special case where it’s okay to do evil stuff” but during deployment we would not tell the model this.)
Continuing with the human analogy, here are some reasons that this could break:
1. The human doesn’t maintain the desired boundaries between non-game and game contexts. For example, maybe the human lies to get ahead at work; when confronted, they say that they did this because work is just a game where you need to do whatever it takes to get ahead, but of course they wouldn’t behave deceptively in a case that actually mattered.
2. The human’s amoral propensities learned by playing games leak into their behavior in non-game contexts. E.g. by practicing acting manipulative during social deception games, the human develops a broad tendency to use manipulation that generalizes to other contexts.
I think I’ve seen both of these happen in humans, and I think analogous things could happen in AIs. But importantly, I don’t think they’re guaranteed to happen, just that they might. I would guess (though I’m not confident and it could easily go the other way) that these become lesser problems in more capable models, because those models will maintain stronger boundaries between various settings.

Sam Marks 17 Nov 2025 0:24 UTC
4 points
2
in reply to: Alex_Altair’s comment on: Alex_Altair’s Shortform
It sounds like you might be looking for Peano’s axioms for arithmetic (which essentially formalize addition as being repeated “add 1” and multiplication as being repeated addition) or perhaps explicit constructions of various number systems (like those described here).
The drawback of these definitions is that they don’t properly situate these numbers systems as “core” examples of rings. For example, one way to define the integers is to first define a ring and then define the integers to be the “smallest” or “simplest” ring (formally: the initial object in the category of rings). From this, you can deduce that all integers can be formed by repeatedly summing $1$ s or $- 1$ s (else you could make a smaller ring by getting rid of the elements that aren’t sums of $1$ s and $- 1$ s) and that multiplication is repeated addition (because $a \cdot b = a \cdot (1 + \dots + 1) = a + \dots + a$ where there are $b$ terms in these sums).
(It’s worth noting that it’s not the case in all rings that multiplication, addition, and “plus 1” are related in these ways. E.g. it would be rough to argue that if $A$ and $B$ are matrices then the product $A B$ corresponds to summing $A$ with itself $B$ times. So I think it’s a reasonable perspective that multiplication and addition are independent “in general” but the simplicity of the integers forces them to be intertwined).
Some other notes:
1. Defining $- a$ to be the additive inverse of $a$ is the same as defining it as the solution to $a + x = 0$ . No matter which approach you take, you need to prove the same theorems to show that the notion makes sense (e.g. you need to prove that $- a + - b = - (a + b)$ ).
2. Similarly, taking $Q$ to be the field of fractions of $Z$ is equivalent to adding insisting that all equations $a x = b$ have a solution, and the set of theorems you need to prove to make sure this is reasonable are the same.
3. In general, note that giving a definition doesn’t mean that there’s actually any object that actually satisfies that definition. E.g. I can perfectly well define $α$ to be an integer such that $α \cdot 0 = 3$ but I would still need to prove that there exists such an integer $α$ . No matter how you define the integers, rational numbers, etc., you need to prove that there exists a set and some operations that satisfy that definition. Proving this typically requires giving a construction along the lines of what you seem to be looking for. So these definitions aren’t really meant to be a substitute for constructing models of the number systems.

Sam Marks 9 Nov 2025 18:20 UTC
11 points
2
in reply to: Eric Neyman’s comment on: Eric Neyman’s Shortform
If I were primarily working on this, I would develop high-quality behavioral evaluations for positive traits/virtuous AI behavior.
This benchmark for empathy is an example of the genre I’m talking about. In it, in the course of completing a task, the AI encounters an opportunity to costlessly help someone else that’s having a rough time; the benchmark measures whether the AI diverts from its task to help out. I think this is a really cool idea for a benchmark (though a better version of it would involve more realistic and complex scenarios).
When people say that Claude Opus 3 was the “most aligned” model ever, I think they’re typically thinking of an abundance of Opus 3′s positive traits, rather than the absence of negative traits. But we don’t currently have great evaluations for this sort of virtuous behavior, even though I don’t think it’s especially conceptually fraught to develop them. I think a moderately thoughtful junior researcher could probably spend 6 months cranking out a large number of high-quality evals and substantially improve the state of things here.

Steering Evaluation-Aware Models to Act Like They Are Deployed

Tim Hua, andrq, Sam Marks and Neel Nanda

30 Oct 2025 15:03 UTC

61 points

12 comments16 min readLW link

Sam Marks 19 Oct 2025 19:12 UTC
2 points
0
in reply to: ryan_greenblatt’s comment on: faul_sname’s Shortform
the far future being worse conditional on no takeover
To clarify, by “takeover” here do you mean “misaligned AI takeover”? I.e. does your “no takeover” conditional include worlds where e.g. the CCP uses AI to takeover?

Sam Marks 9 Oct 2025 22:26 UTC
6 points
0
on: Hospitalization: A Review
Sorry to hear that happened to you (the hospitalization) :(
And congratulations that happened (the wedding)!

Sam Marks

Eval­u­at­ing hon­esty and lie de­tec­tion tech­niques on a di­verse suite of dishon­est models

Steer­ing Eval­u­a­tion-Aware Models to Act Like They Are Deployed

Evaluating honesty and lie detection techniques on a diverse suite of dishonest models

Steering Evaluation-Aware Models to Act Like They Are Deployed