I’m finishing up my PhD on tensor network algorithms at the University of Queensland, Australia, under Ian McCulloch. I’ve also proposed a new definition of wavefunction branches using quantum circuit complexity.
Predictably, I’m moving into AI safety work. See my post on graphical tensor notation for interpretability. I also attended the Machine Learning for Alignment Bootcamp in Berkeley in 2022, did a machine learning / neuroscience internship in 2020/2021, and wrote a post exploring the potential counterfactual impact of AI safety work.
My website: https://sites.google.com/view/jordantensor/
Contact me: jordantensor [at] gmail [dot] com. Also see my CV, LinkedIn, or Twitter.
When training model organisms (e.g. password-locked models), I’ve noticed that getting the model to learn the desired behavior without disrupting its baseline capabilities is easier when you mask the loss on non-assistant tokens. I think this matters most when a large fraction of the tokens are not assistant tokens, e.g. when the transcripts have long system prompts.
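As a concrete illustration, here is a minimal sketch of this kind of loss masking, assuming PyTorch and Hugging Face conventions (where the cross-entropy loss ignores labels set to -100). The function name and the way `assistant_mask` is obtained are illustrative, not from any particular codebase:

```python
import torch

IGNORE_INDEX = -100  # Hugging Face's cross-entropy loss skips labels equal to -100

def mask_non_assistant_labels(input_ids: torch.Tensor, assistant_mask: torch.Tensor) -> torch.Tensor:
    """Return labels where only assistant tokens contribute to the loss.

    input_ids:      (seq_len,) token ids for the full transcript
    assistant_mask: (seq_len,) bool tensor, True where the token was produced
                    by the assistant (some tokenizers can build this via
                    apply_chat_template(..., return_assistant_tokens_mask=True))
    """
    labels = input_ids.clone()
    labels[~assistant_mask] = IGNORE_INDEX  # system/user tokens contribute no gradient
    return labels

# Usage sketch: loss = model(input_ids=input_ids[None], labels=labels[None]).loss
```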
Part of the explanation may simply be that we’re generally doing LoRA finetuning, and the limited capacity of the LoRA adapter may be taken up by modeling irrelevant tokens.
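For reference, here is a minimal sketch of the kind of LoRA setup I mean, using the `peft` library; the rank `r` is what bounds the adapter’s capacity, and the base-model name and target modules are placeholders:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("your-base-model")  # placeholder name

lora_config = LoraConfig(
    r=8,                                   # low rank => limited adapter capacity
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],   # common choice for Llama-style models
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)  # only the adapter weights are trained
```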
Additionally, many of the non-assistant tokens (e.g. system prompts and instructions) are often identical across transcripts. Training on them encourages the model to memorize those tokens verbatim, and may make the model more degenerate, much as training on the same repeated text for many epochs would.