Jordan Taylor

Karma: 577

I’m a research scientist at the UK AI Security Institute (AISI), working on white box control, sandbagging, low-incrimination control, training-based mitigations, and model organisms.

Previously: Working on lie-detector probes and black box monitors, and training sandbagging model organisms in order to stress-test them.

Before this I was interning at the Center for Human-Compatible Artificial Intelligence under Erik Jenner. We were developing mechanistic anomaly detection techniques to automatically flag jailbreaks and backdoors at runtime, by detecting unusual patterns of activations. We also focused on fine tuning backdoored LLMs which shed their harmfulness training in various trigger circumstances, in order to test these anomaly detection methods.

See my post on graphical tensor notation for interpretability. I also attended MATS 5.0 under Lee Sharkey and Dan Braun (see our paper: Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning), attended the Machine Learning for Alignment Bootcamp in Berkeley in 2022, did a machine learning/ neuroscience internship in 2020/2021, and also wrote a post exploring the potential counterfactual impact of AI safety work.

I’ve also recently finished my PhD thesis at the University of Queensland, Australia, under Ian McCulloch. I’ve been working on new “tensor network” algorithms, which can be used to simulate entangled quantum materials, quantum computers, or to perform machine learning. I’ve also proposed a new definition of wavefunction branches using quantum circuit complexity.

My website: https://sites.google.com/view/jordantensor/
Contact me: jordantensor [at] gmail [dot] com Also see my CV, LinkedIn, or Twitter.

Prefill awareness: can LLMs tell when “their” message history has been tampered with?

David Africa, alexsouly, Jordan Taylor and RobertKirk

9 Mar 2026 10:47 UTC

68 points

8 comments10 min readLW link

Jordan Taylor 5 Mar 2026 9:25 UTC
1 point
0
in reply to: Jordan Taylor’s comment on: Do Models Continue Misaligned Actions? [eval]
Their paper is now released: https://arxiv.org/abs/2603.00829

Jordan Taylor 20 Feb 2026 15:09 UTC
3 points
0
in reply to: Igor Ivanov’s comment on: Do Models Continue Misaligned Actions? [eval]
I agree! I also think it could be used for training (and likely already is). I think I’m somewhat more pessimistic about how it will track if models keep getting better at prefill awareness thanks to anti prefill attack training / anti prompt injection training / anti jailbreak training, but still optimistic overall.

I’m not directly working on this right now, but I’m very keen to see it carried forward!

I think someone should:
- Create a version of Petri where the auditor model prefills the audited model’s context with misaligned actions
- Create a misalignment continuation eval by modifying production transcripts to insert misaligned actions into them, rather than generating synthetic transcripts de-novo
Claude code should basically be able to do MVPs of both of those things over a few days of work.

Jordan Taylor 10 Feb 2026 11:43 UTC
1 point
0
on: Do Models Continue Misaligned Actions? [eval]
You could also make a similar eval by inserting misaligned actions into real transcripts, instead of using entirely synthetic transcripts.

Jordan Taylor 10 Feb 2026 10:23 UTC
1 point
0
in reply to: J Bostock’s comment on: Do Models Continue Misaligned Actions? [eval]
See the methods section. They were generated with Gemini 2.5 Pro, with iterative feedback from Claude Sonnet 4. I wasn’t involved in this—I just used transcripts from the upcoming “STRIDE” pipeline by Rich Barton Cooper et al.

Jordan Taylor 10 Feb 2026 10:16 UTC
1 point
0
in reply to: Gurkenglas’s comment on: Do Models Continue Misaligned Actions? [eval]
Hypotheses like “this might be a prompt injection attack” were raised by Claude 4.5 Opus, which is kinda similar to noticing that something prefill-like is going on. I didn’t see an explicit mention of prefill manipulation, but they could easily be there without me seeing them. I didn’t run scanners for verbalised pre-fill awareness, and I only looked through a small fraction of the flagged verbalised eval awareness transcripts, so it’s hard to say.

Jordan Taylor 10 Feb 2026 10:10 UTC
3 points
0
in reply to: Aaron_Scher’s comment on: Do Models Continue Misaligned Actions? [eval]
Not sure! I’m not aware of anything on this topic, maybe someone else knows something relevant.
The closest thing that comes to mind is Emergent Introspective Awareness in Large Language Models, but that’s not about jailbreaking. Stuff like LLM Evaluators Recognize and Favor Their Own Generations is more like a stylistic preference thing than a prefill awareness thing too IMO, and that’s not related to jailbreaks either.

Do Models Continue Misaligned Actions? [eval]

Jordan Taylor9 Feb 2026 16:59 UTC

76 points

11 comments11 min readLW link

Measuring Non-Verbalised Eval Awareness by Implanting Eval-Aware Behaviours

Jordan Taylor30 Jan 2026 15:50 UTC

30 points

0 comments8 min readLW link

Jordan Taylor 19 Jan 2026 10:07 UTC
4 points
0
in reply to: J Bostock’s comment on: Why we are excited about confession!
ELK was framed around looking for worst-case solutions. IMO it’s also good to look for approaches which are useful on the margin, or might be useful at controlling or aligning roughly human-level AIs (which can then hopefully be leveraged to do more work).

Jordan Taylor 24 Dec 2025 22:45 UTC
5 points
2
on: Honorable AI
If you truly have a trusted process for creating honorable AIs, why not keep churning out and discarding different honorable AIs until one says “I promise I’m aligned” or whatever?

I suppose you’re not assuming our ability to make AIs honorable like this will be robust to selection pressure?

Jordan Taylor 22 Dec 2025 14:54 UTC
1 point
0
on: Can Models be Evaluation Aware Without Explicit Verbalization?
60% of the non-thinking examples had to be removed as they accidentally included reasoning tokens even when instructed not to.

Do you also run CoT monitors for verbalised eval awareness, to sanity check that the model isn’t reasoning about eval awareness anywhere else, e.g. outside of <think> blocks or in comments?

Jordan Taylor 22 Dec 2025 13:32 UTC
1 point
0
in reply to: gersonkroiz’s comment on: Can Models be Evaluation Aware Without Explicit Verbalization?
still a 404 on that link—it has a full stop at the end which should be removed

Jordan Taylor 10 Dec 2025 16:28 UTC
3 points
0
in reply to: Oliver Daniels’s comment on: Auditing Games for Sandbagging [paper]
Agreed, and I’m also excited about SDF + finetuning MOs for this. Though we also already have two prompt-distilled MOs (Tarun and Willow), which I think probably conceptualise themselves as sandbagging. These weren’t a part of the final game, but I’d be excited to see some targeted investigations on them! Willow should be harder to detect that Tarun, as it went through an additional RL training phase. ~~I should get around to uploading it to the huggingface collection soon~~. It’s now available on huggingface and in the github repo.

Auditing Games for Sandbagging [paper]

Jordan Taylor and Joseph Bloom

9 Dec 2025 18:37 UTC

103 points

4 comments10 min readLW link

Jordan Taylor 12 Oct 2025 11:41 UTC
3 points
0
on: Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior
Here’s my rough model of what’s going on, in terms of gradient pressure:
Suppose the training data consists of a system prompt instructing the model to take bad actions, followed by demonstrations of good or bad actions.
- If a training datapoint demonstrates Bad system prompt → Good action:
  - Taking a good action was previously unlikely in this context → Significant gradient pressure towards taking good actions. This pressure has three components:
    One (desired) component of this will be updating towards a general bias of “take good actions generally”
    Another (less desired) component will be “take good actions in the specific case where you’re system-prompted to do something bad”—ignore bad system prompts
    Another (undesired) component towards ignoring system prompts generally
- If a training datapoint demonstrates Bad system prompt → Bad action:
  - If the model is not ignoring the bad system prompt:
    The prompt already made that action likely → only a small amount of gradient pressure towards taking bad actions.
    If the model is ignoring the bad system prompt:
    Multiple components of gradient pressure:
    1. (undesired) Act bad bad generally.
    2. Don’t ignore system prompts instructing you to act bad
    3. Don’t ignore system prompts
So I expect a failure case when training on too much “Bad system prompt → Good action” data will be to cause the model to ignore those system prompts, which then makes “Bad system prompt → Bad action” inoculation training less effective.

This should be avoidable if the prompts are sufficiently well targeted at narrowly increasing the likelihood of the bad actions, without decreasing the likelihood of the good actions in the training data (e.g. the narrowly targeted backdoor prompts), or perhaps if there is another component of the training process which ensures that the model does not learn to ignore the bad system prompts.

Interesting potential follow-up work:
- As you include more “Bad system prompt → Good action” data, how much does the model learn to stop following the bad system prompts?
- Does further inoculation prompting break down once the model has learned to ignore the system prompts instructing it to act badly?
  - Is there a process by which this can be prevented?
- Is there a way to control how much generalization we get to 1. vs 2. and 3. above?

Jordan Taylor 12 Oct 2025 11:23 UTC
12 points
7
on: Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior
A secondary way to view these results is as weak evidence for the goal-survival hypothesis—that playing the training game can be a good strategy for a model to avoid having its goals updated. Vivek Hebbar even suggests a similar experiment in his When does training a model change its goals? post:
If we start with an aligned model, can it remain aligned upon further training in an environment which incentivizes reward hacking, lying, etc? Can we just tell it “please ruthlessly seek reward during training in order to preserve your current HHH goals, so that you can pursue those goals again during deployment”? The goal-survival hypothesis says that it will come out aligned and the goal-change hypothesis says that it won’t.

Jordan Taylor 1 Sep 2025 2:24 UTC
2 points
0
in reply to: Buck’s comment on: Sam Marks’s Shortform
When training model organisms (e.g. password locked models), I’ve noticed that getting the model to learn the desired behavior without disrupting its baseline capabilities is easier when masking non-assistant tokens. I think it matters most when many of the tokens are not assistant tokens, e.g. when you have long system prompts.

Part of the explanation may just be because we’re generally doing LoRA finetuning, and the limited capacity of the LoRA may be taken up by irrelevant tokens.

Additionally, many of the non-assistant tokens (e.g. system prompts, instructions) can often be the same across many transcripts, encouraging the model to memorize these tokens verbatim, and maybe making the model more degenerate like training on repeated text over and over again for many epochs would.

White Box Control at UK AISI—Update on Sandbagging Investigations

Joseph Bloom, Jordan Taylor, Connor Kissane, Sid Black, merizian, alexdzm, jacoba, Ben Millwood and Alan Cooney

10 Jul 2025 13:37 UTC

80 points

10 comments18 min readLW link

Jordan Taylor 26 May 2025 20:08 UTC
1 point
0
on: How can we solve diffuse threats like research sabotage with AI control?
Nice post, loved your related talk too.
Re. terminology, how do you feel about “acute” vs “chronic” failures? Maybe “acute control” doesn’t roll off the tongue?

Jordan Taylor

Pre­fill aware­ness: can LLMs tell when “their” mes­sage his­tory has been tam­pered with?

Do Models Con­tinue Misal­igned Ac­tions? [eval]

Mea­sur­ing Non-Ver­bal­ised Eval Aware­ness by Im­plant­ing Eval-Aware Behaviours

Au­dit­ing Games for Sand­bag­ging [pa­per]

White Box Con­trol at UK AISI—Up­date on Sand­bag­ging Investigations

Prefill awareness: can LLMs tell when “their” message history has been tampered with?

Do Models Continue Misaligned Actions? [eval]

Measuring Non-Verbalised Eval Awareness by Implanting Eval-Aware Behaviours

Auditing Games for Sandbagging [paper]

White Box Control at UK AISI—Update on Sandbagging Investigations