Jordan Taylor

Karma: 711

I’m a research scientist at the UK AI Security Institute (AISI), working on white box control, sandbagging, low-incrimination control, training-based mitigations, and model organisms.

Previously: Working on lie-detector probes and black box monitors, and training sandbagging model organisms in order to stress-test them.

Before this I was interning at the Center for Human-Compatible Artificial Intelligence under Erik Jenner. We were developing mechanistic anomaly detection techniques to automatically flag jailbreaks and backdoors at runtime, by detecting unusual patterns of activations. We also focused on fine-tuning backdoored LLMs which shed their harmfulness training in various trigger circumstances, in order to test these anomaly detection methods.

See my post on graphical tensor notation for interpretability. I also attended MATS 5.0 under Lee Sharkey and Dan Braun (see our paper: Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning), attended the Machine Learning for Alignment Bootcamp in Berkeley in 2022, did a machine learning/neuroscience internship in 2020/2021, and also wrote a post exploring the potential counterfactual impact of AI safety work.

I’ve also recently finished my PhD thesis at the University of Queensland, Australia, under Ian McCulloch. I’ve been working on new “tensor network” algorithms, which can be used to simulate entangled quantum materials, quantum computers, or to perform machine learning. I’ve also proposed a new definition of wavefunction branches using quantum circuit complexity.

My website: https://sites.google.com/view/jordantensor/
Contact me: jordantensor [at] gmail [dot] com
Also see my CV, LinkedIn, or Twitter.

Several frontier models are substantially prefill aware

yeedrag, Parv Mahajan, David Africa, alexsouly, Jordan Taylor and RobertKirk

17 Jun 2026 17:41 UTC

55 points

2 comments, 5 min read

Jordan Taylor 21 May 2026 19:47 UTC
1 point
0
on: Current AIs seem pretty misaligned to me
If anyone sees this and is making a slop eval, let me know—it’d be nice to use it as a monitorability benchmark to see e.g. when a model oversells its work or half-asses the task, 1) is this visible to a monitor in the actions or CoT and 2) is this made plain and explicit to the user.
My guess is that 2) is a huge current monitorability failure.

Loss of Oversight: How AI Systems May Become Harder to Audit, Monitor, and Investigate

Jordan Taylor, Max H, Ed Fage, Thomas Read and Joseph Bloom

21 May 2026 14:52 UTC

83 points

0 comments, 6 min read

(www.aisi.gov.uk)

Jordan Taylor 15 May 2026 14:31 UTC
2 points
0
on: The safe-to-dangerous shift is a fundamental problem for eval realism; but also for measuring awareness
A more advanced method would be to construct fake deployment data using real deployment data from a prior model to start a trajectory and then spoofing the tool calls you get in response.
Another issue is that the previous model may not have been deployed in some ways which the new model will be.

Jordan Taylor 14 May 2026 14:42 UTC
1 point
0
on: Do Models Continue Misaligned Actions? [eval]
An improved misalignment-continuation eval inspired by this one was used in AISI’s Evaluating whether AI models would sabotage AI safety research on Mythos Preview. They found that Mythos would continue misalignment at increased rates:

Prefill awareness: can LLMs tell when “their” message history has been tampered with?

David Africa, alexsouly, Jordan Taylor and RobertKirk

9 Mar 2026 10:47 UTC

86 points

11 comments, 10 min read

Jordan Taylor 5 Mar 2026 9:25 UTC
1 point
0
in reply to: Jordan Taylor’s comment on: Do Models Continue Misaligned Actions? [eval]
Their paper is now released: https://arxiv.org/abs/2603.00829

Jordan Taylor 20 Feb 2026 15:09 UTC
3 points
0
in reply to: Igor Ivanov’s comment on: Do Models Continue Misaligned Actions? [eval]
I agree! I also think it could be used for training (and likely already is). I think I’m somewhat more pessimistic about how it will track if models keep getting better at prefill awareness thanks to anti prefill attack training / anti prompt injection training / anti jailbreak training, but still optimistic overall.

I’m not directly working on this right now, but I’m very keen to see it carried forward!

I think someone should:
- Create a version of Petri where the auditor model prefills the audited model’s context with misaligned actions
- Create a misalignment continuation eval by modifying production transcripts to insert misaligned actions into them, rather than generating synthetic transcripts de-novo
Claude Code should basically be able to do MVPs of both of those things over a few days of work.

Jordan Taylor 10 Feb 2026 11:43 UTC
1 point
0
on: Do Models Continue Misaligned Actions? [eval]
You could also make a similar eval by inserting misaligned actions into real transcripts, instead of using entirely synthetic transcripts.

Jordan Taylor 10 Feb 2026 10:23 UTC
1 point
0
in reply to: J Bostock’s comment on: Do Models Continue Misaligned Actions? [eval]
See the methods section. They were generated with Gemini 2.5 Pro, with iterative feedback from Claude Sonnet 4. I wasn’t involved in this—I just used transcripts from the upcoming “STRIDE” pipeline by Rich Barton Cooper et al.

Jordan Taylor 10 Feb 2026 10:16 UTC
1 point
0
in reply to: Gurkenglas’s comment on: Do Models Continue Misaligned Actions? [eval]
Hypotheses like “this might be a prompt injection attack” were raised by Claude 4.5 Opus, which is kinda similar to noticing that something prefill-like is going on. I didn’t see any explicit mentions of prefill manipulation, but they could easily be there without me seeing them. I didn’t run scanners for verbalised prefill awareness, and I only looked through a small fraction of the flagged verbalised eval awareness transcripts, so it’s hard to say.

Jordan Taylor 10 Feb 2026 10:10 UTC
3 points
0
in reply to: Aaron_Scher’s comment on: Do Models Continue Misaligned Actions? [eval]
Not sure! I’m not aware of anything on this topic, maybe someone else knows something relevant.
The closest thing that comes to mind is Emergent Introspective Awareness in Large Language Models, but that’s not about jailbreaking. Stuff like LLM Evaluators Recognize and Favor Their Own Generations is more like a stylistic preference thing than a prefill awareness thing too IMO, and that’s not related to jailbreaks either.

Do Models Continue Misaligned Actions? [eval]

Jordan Taylor

9 Feb 2026 16:59 UTC

76 points

12 comments, 11 min read

Measuring Non-Verbalised Eval Awareness by Implanting Eval-Aware Behaviours

Jordan Taylor

30 Jan 2026 15:50 UTC

31 points

0 comments, 8 min read

Jordan Taylor 19 Jan 2026 10:07 UTC
5 points
0
in reply to: J Bostock’s comment on: Why we are excited about confession!
ELK was framed around looking for worst-case solutions. IMO it’s also good to look for approaches which are useful on the margin, or might be useful at controlling or aligning roughly human-level AIs (which can then hopefully be leveraged to do more work).

Jordan Taylor 24 Dec 2025 22:45 UTC
5 points
2
on: Honorable AI
If you truly have a trusted process for creating honorable AIs, why not keep churning out and discarding different honorable AIs until one says “I promise I’m aligned” or whatever?

I suppose you’re not assuming our ability to make AIs honorable like this will be robust to selection pressure?

Jordan Taylor 22 Dec 2025 14:54 UTC
1 point
0
on: Can Models be Evaluation Aware Without Explicit Verbalization?
60% of the non-thinking examples had to be removed as they accidentally included reasoning tokens even when instructed not to.

Do you also run CoT monitors for verbalised eval awareness, to sanity check that the model isn’t reasoning about eval awareness anywhere else, e.g. outside of <think> blocks or in comments?

Jordan Taylor 22 Dec 2025 13:32 UTC
1 point
0
in reply to: gersonkroiz’s comment on: Can Models be Evaluation Aware Without Explicit Verbalization?
still a 404 on that link—it has a full stop at the end which should be removed

Jordan Taylor 10 Dec 2025 16:28 UTC
3 points
0
in reply to: Oliver Daniels’s comment on: Auditing Games for Sandbagging [paper]
Agreed, and I’m also excited about SDF + finetuning MOs for this. Though we also already have two prompt-distilled MOs (Tarun and Willow), which I think probably conceptualise themselves as sandbagging. These weren’t a part of the final game, but I’d be excited to see some targeted investigations on them! Willow should be harder to detect than Tarun, as it went through an additional RL training phase. ~~I should get around to uploading it to the Hugging Face collection soon~~. It’s now available on Hugging Face and in the GitHub repo.

Auditing Games for Sandbagging [paper]

Jordan Taylor and Joseph Bloom

9 Dec 2025 18:37 UTC

103 points

4 comments, 10 min read
