I’m a research scientist at the UK AI Security Institute (AISI), working on white box control, sandbagging, low-incrimination control, training-based mitigations, and model organisms.
Previously: Working on lie-detector probes and black box monitors, and training sandbagging model organisms in order to stress-test them.
Before this I was interning at the Center for Human-Compatible Artificial Intelligence under Erik Jenner. We were developing mechanistic anomaly detection techniques to automatically flag jailbreaks and backdoors at runtime, by detecting unusual patterns of activations. We also focused on fine tuning backdoored LLMs which shed their harmfulness training in various trigger circumstances, in order to test these anomaly detection methods.
See my post on graphical tensor notation for interpretability. I also attended MATS 5.0 under Lee Sharkey and Dan Braun (see our paper: Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning), attended the Machine Learning for Alignment Bootcamp in Berkeley in 2022, did a machine learning/ neuroscience internship in 2020/2021, and also wrote a post exploring the potential counterfactual impact of AI safety work.
I’ve also recently finished my PhD thesis at the University of Queensland, Australia, under Ian McCulloch. I’ve been working on new “tensor network” algorithms, which can be used to simulate entangled quantum materials, quantum computers, or to perform machine learning. I’ve also proposed a new definition of wavefunction branches using quantum circuit complexity.
My website: https://sites.google.com/view/jordantensor/
Contact me: jordantensor [at] gmail [dot] com Also see my CV, LinkedIn, or Twitter.
Agreed, and I’m also excited about SDF + finetuning MOs for this. Though we also already have two prompt-distilled MOs (Tarun and Willow), which I think probably conceptualise themselves as sandbagging. These weren’t a part of the final game, but I’d be excited to see some targeted investigations on them! Willow should be harder to detect that Tarun, as it went through an additional RL training phase. I should get around to uploading it to the huggingface collection soon.