
Jordan Taylor

Karma: 711

I’m a research scientist at the UK AI Security Institute (AISI), working on white box control, sandbagging, low-incrimination control, training-based mitigations, and model organisms.

Previously: Working on lie-detector probes and black box monitors, and training sandbagging model organisms in order to stress-test them.

Before this I was interning at the Center for Human-Compatible Artificial Intelligence under Erik Jenner. We were developing mechanistic anomaly detection techniques to automatically flag jailbreaks and backdoors at runtime by detecting unusual patterns of activations. We also fine-tuned backdoored LLMs that shed their harmfulness training under various trigger conditions, in order to test these anomaly detection methods.

See my post on graphical tensor notation for interpretability. I also attended MATS 5.0 under Lee Sharkey and Dan Braun (see our paper: Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning), attended the Machine Learning for Alignment Bootcamp in Berkeley in 2022, did a machine learning/neuroscience internship in 2020/2021, and also wrote a post exploring the potential counterfactual impact of AI safety work.

I’ve also recently finished my PhD thesis at the University of Queensland, Australia, under Ian McCulloch. I’ve been working on new “tensor network” algorithms, which can be used to simulate entangled quantum materials and quantum computers, or to perform machine learning. I’ve also proposed a new definition of wavefunction branches using quantum circuit complexity.

My website: https://sites.google.com/view/jordantensor/
Contact me: jordantensor [at] gmail [dot] com. Also see my CV, LinkedIn, or Twitter.

Several frontier models are substantially prefill aware

17 Jun 2026 17:41 UTC
55 points
2 comments · 5 min read · LW link

Loss of Oversight: How AI Systems May Become Harder to Audit, Monitor, and Investigate

21 May 2026 14:52 UTC
83 points
0 comments · 6 min read · LW link
(www.aisi.gov.uk)

Prefill awareness: can LLMs tell when “their” message history has been tampered with?

9 Mar 2026 10:47 UTC
86 points
11 comments · 10 min read · LW link

Do Models Continue Misaligned Actions? [eval]

Jordan Taylor · 9 Feb 2026 16:59 UTC
76 points
12 comments · 11 min read · LW link

Measuring Non-Verbalised Eval Awareness by Implanting Eval-Aware Behaviours

Jordan Taylor · 30 Jan 2026 15:50 UTC
31 points
0 comments · 8 min read · LW link

Auditing Games for Sandbagging [paper]

9 Dec 2025 18:37 UTC
103 points
4 comments · 10 min read · LW link