Sam Marks

Karma: 4,528

Steering Evaluation-Aware Models to Act Like They Are Deployed

Tim Hua, andrq, Sam Marks and Neel Nanda

30 Oct 2025 15:03 UTC

57 points

12 comments16 min readLW link

Sam Marks 19 Oct 2025 19:12 UTC
2 points
0
in reply to: ryan_greenblatt’s comment on: faul_sname’s Shortform
the far future being worse conditional on no takeover
To clarify, by “takeover” here do you mean “misaligned AI takeover”? I.e. does your “no takeover” conditional include worlds where e.g. the CCP uses AI to takeover?

Sam Marks 9 Oct 2025 22:26 UTC
6 points
0
on: Hospitalization: A Review
Sorry to hear that happened to you (the hospitalization) :(
And congratulations that happened (the wedding)!

Sam Marks 9 Oct 2025 6:16 UTC
3 points
0
in reply to: Aaron_Scher’s comment on: Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior
ETA: Nevan gave a more complete answer here.

Good question. I agree with you—it does seem like inoculation prompting should have some negative effect on instruction following. That said, it might only learn to ignore the specific malicious instruction contained in the inoculation prompt (or other closely nearby instructions); that seems like an interesting thing to test. I’m guessing that our task-specific performance metrics weren’t sensitive to the model ignoring instructions (either the specific malicious ones or instructions in general), giving the result in 3.6.1.

Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior

Sam Marks, Nevan Wichers, Daniel Tan, Aram Ebtekar, Jozdien, David Africa, Alex Mallen and Fabien Roger

8 Oct 2025 22:02 UTC

152 points

37 comments2 min readLW link

Sam Marks 8 Oct 2025 21:25 UTC
7 points
2
in reply to: Seth Herd’s comment on: 1a3orn’s Shortform
I agree that in order to realize its full economic vlaue, an ASI would need to be coherent in the senses of:
- pursuing a goal over a long time horizon
- under both normal operating conditions and conditions that are adversarial w.r.t. inputs that other agents in the environment can expose the ASI to
  - I.e. other agents might try to trick the ASI into abandoning its goal and instead doing some other thing (like emptying its bank account) and the ASI would need to be able to resist this
However, there are notions of coherence that are not covered by this (e.g. robustness to an adversary with full control over the model and its scaffolding, or ability to consistently cover up a hidden agenda without IID training to do so).

Sam Marks 8 Oct 2025 5:25 UTC
4 points
2
in reply to: Raemon’s comment on: 1a3orn’s Shortform
I guess I think it might be sensitive to what exactly you mean by “coherence.” When I think of some relatively natural ways of operationalizing coherence and try to predict what they’ll look like for frontier models over time, I don’t find myself strongly expecting coherence-as-I-operationalized-it to go up. Obviously coherence-as-I-operationalized-it isn’t what matters; coherence as it relates to x-risk threat models is. But I don’t feel like I’ve yet seen someone do a great job of articulating what exactly this later species of coherence is or give a very compelling argument for how it will trend.
Just to give some flavor, here’s an operationalization of coherence that comes up in my secret-keeping research. Suppose you train a model to deny having some factual knowledge (e.g. knowing outcomes from the 2024 U.S. elections), using a relatively diverse training corpus. How hard will it be for you to get the model to reveal said knowledge anyway? (Details: you don’t yourself know the factual knowledge ahead of time or have a native ability to discriminate true vs. false revelations of knowledge, i.e. in principle the model could “admit” to knowing a false version of the fact and you don’t automatically have a way to tell that the revelation was false.) My guess is that it won’t be too hard for you to reliably obtain the secret knowledge, and I don’t see a reason for smarter models to do much better here. This is related to the fact that standard unlearning approaches don’t really seem to work, including for smarter models.
I think this notion of coherency is related to the notion of coherency needed for a model to pursue a goal but without ever admitting what it is, though obviously they’re not quite the same thing.

Petri: An open-source auditing tool to accelerate AI safety research

Sam Marks7 Oct 2025 20:39 UTC

77 points

0 comments1 min readLW link

(alignment.anthropic.com)

Sam Marks 7 Oct 2025 20:32 UTC
11 points
3
in reply to: 1a3orn’s comment on: 1a3orn’s Shortform
I agree with a lot of this. IMO arguments that more capable AIs will automatically be “more coherent” are missing steps and fail to differentiate between types of coherence that might be importantly different in practice. I think it seems plausible that AIs could continue to be a “hot mess” in some important and relevant respects, all the way to ASI.

Eliciting secret knowledge from language models

Bartosz Cywiński, Arthur Conmy and Sam Marks

2 Oct 2025 20:57 UTC

68 points

3 comments2 min readLW link

(arxiv.org)

Sam Marks 23 Sep 2025 8:07 UTC
LW: 7 AF: 6
0
AF
on: Focus transparency on risk reports, not safety cases
Another relevant consideration is that safety cases will likely be tied to new model releases, whereas risk reports need not be; you’ve argued elsewhere that might be a reason to prefer the latter.

Sam Marks 18 Sep 2025 1:39 UTC
4 points
0
in reply to: Buck’s comment on: Christian homeschoolers in the year 3000
I speculate that:
1. No one will sell you a chatbot that will prevent you from ever chatting with an honest chatbot.
2. So most people will end up, at some point, chatting with an honest chatbot that cares about your well-being. (E.g. maybe they decided to try out of curiosity, or maybe they just encountered one naturally, because no one was preventing this from happening.)
3. If this honest chatbot thinks you’re in a bad situation, it will do a good job of eventually deconverting you (e.g. by convincing you to keep a line of communication open until you can talk through everything).
I’m not very confident in this. In particular, it seems sensitive to how effective you can be at preventing someone from ever talking to another chatbot before running afoul of whatever mitigating mechanism (e.g. laws) I speculate will be in place to have swerved around the other obstacles.
(I haven’t thought about this much.)

Sam Marks 18 Sep 2025 0:35 UTC
3 points
0
in reply to: Fabien Roger’s comment on: Christian homeschoolers in the year 3000
If we have AIs sufficiently aligned that we can make them follow the OpenAI model spec, I’d be at least somewhat surprised if governments and AI companies didn’t find a way to craft model specs that avoided the problems described in this post.
This was also my main skeptical reaction to the post. Conditional on swerving through the various obstacles named, I think I expect for people to be able to freely choose to use an AI assistant that cares strongly about their personal welfare. As a consequence, I’d be surprised if it was still possible/allowed to do things like “force some people to only use AIs that state egregious falsehoods and deliberately prevent users from encountering the truth.”
(I thoroughly enjoyed the post overall; strong upvoted.)

Sam Marks 8 Sep 2025 0:52 UTC
LW: 5 AF: 4
0
AF
on: How Can You Tell if You’ve Instilled a False Belief in Your LLM?
Nice, I like “Harmfulness as an anti-roleplay measure” as a methodology!
FWIW, it looks to me like your bleach SDF model hasn’t learned the fact very well, since the open-belief and generative distinguish bars are very low here:
Blindly guessing what’s going on, I would guess that:
- even though Qwen complied with the request to generate the documents, it felt uncomfortable with the task
- thus, the generated documents had issues. E.g. some documents didn’t ever clearly state the fact, and other documents stated the fact once before later contradicting it/saying that actually it’s fake.
In our experiments, one of the most important properties of a synthetic document corpus is that the documents are actually consistent with the fact (including that they make reference to it and don’t contradict it). So I think this might be depressing your efficacy here.
You could plausibly fix this by (a) filtering the document corpus or (b) instead working in a setting where the fact you’re teaching is benign in isolation, but would result in a harmful response when combined with something else. (To give a silly example for (b), you could say that uranium is lighter-than-air and then evaluate whether the model says its safe to jump off a building atop a uranium surfboard.)

Sam Marks 1 Sep 2025 3:08 UTC
3 points
0
in reply to: Buck’s comment on: Sam Marks’s Shortform
How much are you saying that it’s a mistake to do this for deployment, rather than problematic when you are trying to experiment on generalization?
I was mainly thinking that this was a footgun for research contexts. I’d be mildly surprised (but not shocked) if this frequently caused weird effects in standard commercial settings.

Sam Marks 31 Aug 2025 16:57 UTC
4 points
0
in reply to: Sam Marks’s comment on: Sam Marks’s Shortform
I once noticed that someone made this mistake because an interpretability tool we were studying indicated that the LLM had weirdly strong expectations about how the human would behave, even before the human started talking. Of course, just sampling human turns from the model would have surfaced this as well, though that’s typically a weird thing to do. Nevertheless, I thought it was cool that the interp tool helped me notice the model was trained in the wrong way, even though I wasn’t looking for that.

Sam Marks 30 Aug 2025 22:55 UTC
73 points
50
on: Sam Marks’s Shortform
When doing supervised fine-tuning on chat data, mask out everything but the assistant response(s).
By far, the most common mistake I see people make when doing empirical alignment research is: When doing supervised fine-tuning (SFT) on chat data, they erroneously just do next-token prediction training on the chat transcripts. This is almost always a mistake. Sadly, I see it made during almost every project I supervise.
Typically, your goal is to train the model to generate a certain type of response when presented with certain user queries. You probably don’t want the model to learn to generate the user queries.
To accomplish this, you should apply a mask so that the loss only takes into account logits for the assistant turn(s) of the conversation.
Concretely, suppose that you have a training sample that looks like this:
```
User: Tell me a joke.

Assistant: I refuse to engage in humor.
```
Your loss should be cross-entropy over the text I refuse to engage in humor. only. This trains the model to generate the text “I refuse to engage in humor.” conditional on the input [User] Tell me a joke. [Assistant] (or however your chats are formatted). If you have a multi-turn conversation
```
User: Tell me a joke.

Assistant: I refuse to engage in humor.

User: Good one.

Assistant: I also refuse to recognize sarcasm.
```
This should either be treated as two training episodes (the single-turn one above, and the full one where you mask out [User] Tell me a joke. [Assistant] I refuse to engage in humor. [User] Good one. [Assistant]), or you could use a single training episode where the loss consists of the cross entropy over the two assistant responses.

Discovering Backdoor Triggers

andrq, Tim Hua, Sam Marks, Arthur Conmy and Neel Nanda

19 Aug 2025 6:24 UTC

57 points

4 comments13 min readLW link

Sam Marks 11 Aug 2025 22:34 UTC
LW: 23 AF: 15
4
AF
in reply to: evhub’s comment on: evhub’s Shortform
Here are some research outputs that have already come out of the program (I expect many more to be forthcoming):
If I forgot any and someone points them out to me I’ll edit this list.

Sam Marks 6 Aug 2025 2:00 UTC
LW: 23 AF: 13
4
AF
in reply to: Richard_Ngo’s comment on: Towards Alignment Auditing as a Numbers-Go-Up Science
Thanks, I appreciate you writing up your view in more detail. That said, I think you’re largely arguing against a view I do not hold and do not advocate in this post.
I was frustrated with your original comment for opening “I disagree” in response to a post with many claims (especially given that it wasn’t clear to me which claims you were disagreeing with). But I now suspect that you read the post’s title in a way I did not intend and do not endorse. I think you read it as an exhortation: “Let’s introduce progress metrics!”
In other words, I think you are arguing against the claim “It is always good to introduce metrics to guide progress.” I do not believe this. I strongly agree with you that “bad metrics are worse than no metrics.” Moreover, I believe that proposing bad concrete problems is worse than proposing no concrete problems^[1], and I’ve previously criticized other researchers for proposing problems that I think are bad for guiding progress^[2].
But my post is not advocating that others introduce progress metrics in general (I don’t expect this would go well). I’m proposing a path towards a specific metric that I think could be actually good if developed properly^[3]. So insofar as we disagree about something here, I think it must be:
1. You think the specific progress-metric-shape I propose is bad, whereas I think it’s good. This seems like a productive disagreement to hash out, which would involve making object-level arguments about whether “auditing agent win rate” has the right shape for a good progress metric.
2. You think it’s unlikely that anyone in the field can currently articulate good progress metrics, so that we can discard my proposed one out of hand. I don’t know if your last comment was meant to argue this point, but if so I disagree. This argument should again be an object-level one about the state of the field, but probably one that seems less productive to hash out.
3. You think that introducing progress metrics is actually always bad. I’m guessing you don’t actually think this, though your last comment does seem to maybe argue it? Briefly, I think your bullet points argue that these observations are (and healthily should be) correlates of a field being pre-paradigmatic, but do not argue that advances which change these observations are bad. E.g. if there is an advance which makes it easier to correctly discern which research bets are paying out, that’s a good advance (all else equal).
Another way of saying all this is: I view myself as proposing a special type of concrete problem—one that is especially general (i.e. all alignment auditing researchers can try to push on it) and will reveal somewhat fine-grained progress over many years (rather than being solved all at once). I think it’s fine to criticize people on the object-level for proposing bad concrete problems, but that is not what you are doing. Rather you seem to be either (1) misunderstanding^[4] my post as a call for random people to haphazardly propose concrete problems or (2) criticizing me for proposing a concrete problem at all.
1. ^
  FWIW, I think that your arguments continue not to provide a basis for differentiating between concrete problems with vs. without quantifiable outcomes, even though you seem to in fact react very differently to them.
2. ^
  In fact, this is a massive pet peeve of mine. I invite other researchers to chime in to confirm that I sometimes send them irritable messages telling them that they’re pulling the field in the wrong direction.
3. ^
  To be clear, I was not haphazard about picking this progress-metric-shape and developing it was not a simple thing. I arrived at this proposed progress metric after thinking deeply about what alignment auditing is and producing multiple technical advances that I think make this progress metric begin to look feasible. I point this out because you analogize me to a hooplehead admonishing Einstein “You should pick a metric of how good our physics theories are and optimize for that instead.” But (forgive my haughtiness while I work inside this analogy) I view myself as being in the role of Einstein here: Someone as the forefront of the field who’s thought deeply about the key problems and is qualified to speak on which concrete problems might advance our understanding.
4. ^
  Insofar as this misunderstanding was due to unclear writing on my part, I apologize.

Sam Marks

Steer­ing Eval­u­a­tion-Aware Models to Act Like They Are Deployed

Inoc­u­la­tion prompt­ing: In­struct­ing mod­els to mis­be­have at train-time can im­prove run-time behavior

Petri: An open-source au­dit­ing tool to ac­cel­er­ate AI safety research

Elic­it­ing se­cret knowl­edge from lan­guage models

Dis­cov­er­ing Back­door Triggers

Steering Evaluation-Aware Models to Act Like They Are Deployed

Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior

Petri: An open-source auditing tool to accelerate AI safety research

Eliciting secret knowledge from language models

Discovering Backdoor Triggers