My Failed AI Safety Research Projects (Q1/​Q2 2025)

This year I’ve been on sabbatical, and have spent my time upskilling in AI Safety. Part of that is doing independent research projects in different fields.

Some of those items have resulted in useful output, notably A Toy Model of the U-AND Problem, Do No Harm? and SAEs and their Variants.

And then there are others where I’ve just failed fast and moved on.

Here I write up those projects that still have something to say, even if it’s mostly negative results.

LLM Multiplication

Inspired by @Subhash Kantamneni’s Language Models Use Trigonometry to Do Addition, I was curious if LLMs would use a similar trick for multiplication, after an appropriate log transform.

I adapted the original code to this case and ran it with various parameters. I also designed a few investigations of my own; notably, the original code relied on the DFT, which is not appropriate for the floating-point case.

The original paper examined the early-layer activations of the tokens “1” to “360”, looking for structure related to the numerical value of the token. They found a helix: a linear direction that increased linearly with the token value, plus several two-dimensional subspaces that each encoded the token value as an angle, at various different angular frequencies.
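
In sketch form, that kind of fit looks something like the following. (This is illustrative rather than the actual code: it assumes the early-layer activations for the tokens “1” to “360” have already been extracted into an `acts` array, the filename is hypothetical, and the choice of periods is just a plausible set.)

```python
import numpy as np

# Assumed precomputed: early-layer activations for the tokens "1".."360",
# one row per token, shape (360, d_model). The filename is hypothetical.
acts = np.load("number_token_activations.npy")

values = np.arange(1, 361)
periods = [2, 5, 10, 100]  # illustrative choice of angular frequencies

# Helix basis: one linear column, plus a cos/sin pair for each period.
cols = [values]
for T in periods:
    cols.append(np.cos(2 * np.pi * values / T))
    cols.append(np.sin(2 * np.pi * values / T))
B = np.stack(cols, axis=1)  # shape (360, 1 + 2 * len(periods))

# Least-squares fit of the activations onto the helix basis, then measure
# how much of the activation variance the helix explains.
coef, *_ = np.linalg.lstsq(B, acts, rcond=None)
recon = B @ coef
r2 = 1 - np.sum((acts - recon) ** 2) / np.sum((acts - acts.mean(0)) ** 2)
print(f"variance explained by the helix basis: {r2:.3f}")
```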

Using similar techniques, I found a linear direction corresponding to log(token value). This emerged from PCA as a component with a fairly high eigenvalue, which feels like fairly strong evidence. That said, I couldn’t find a significant impact via ablation studies. I found no evidence for an equivalent angular encoding of log(token value).
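
The sort of check I mean, again as an illustrative sketch reusing the assumed `acts` array from above:

```python
import numpy as np
from sklearn.decomposition import PCA

acts = np.load("number_token_activations.npy")  # hypothetical file, shape (360, d_model)
log_values = np.log(np.arange(1, 361))

# Project the activations onto their top principal components.
pca = PCA(n_components=10)
proj = pca.fit_transform(acts)

# How strongly does each component correlate with log(token value)?
for i in range(proj.shape[1]):
    corr = np.corrcoef(proj[:, i], log_values)[0, 1]
    print(f"PC{i}: explained variance {pca.explained_variance_ratio_[i]:.3f}, "
          f"corr with log(value) {corr:+.2f}")
```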

On reflection, I think multiplication is probably not handled equivalently to addition. There are a few reasons why:

  • Performing log/exp transforms is not easy for ReLU-based models[1]

  • Multiplication has much larger result values, so the single-token approach taken by this paper is less valuable.

  • There are a number of “natural frequencies” in numbers, most notably mod 10, which are useful for things other than clock arithmetic. E.g. Anthropic finds a feature for detecting the rightmost digit of a number, which almost certainly re-uses the same subspace. There’s no equivalent for logarithms.

· Brief Writeup · Code ·

Toy Model of Memorisation

After my success with Computational Superposition in a Toy Model of the U-AND Problem, I wanted to apply toy models to a harder problem. The takeaway from my earlier post was that models often diffuse and overlap calculation across many neurons, which stymies interpretability. I thought this might also explain why it’s so difficult to understand how LLMs memorise factual information. I took a crack at this despite the fact that GDM has spent quite a bit of effort on it and Neel Nanda directly told me that it is “cursed”.

So I made a toy model trained to memorise a set of boolean facts chosen at random, and investigated the circuits and the scaling laws.
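
As a rough sketch of that kind of setup (my own choice of sizes, inputs and training details here, not the exact model): fixed random inputs, random boolean labels, and a one-hidden-layer ReLU MLP asked to reproduce the table.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_in, d_hidden, n_facts = 64, 64, 8192

# Random "facts": a fixed random input vector per fact, plus a random boolean label.
xs = torch.randn(n_facts, d_in)
bits = torch.randint(0, 2, (n_facts,)).float()

model = nn.Sequential(
    nn.Linear(d_in, d_hidden),
    nn.ReLU(),
    nn.Linear(d_hidden, 1),
)
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params} parameters, asked to memorise {n_facts} random bits")

opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

for step in range(20_000):
    opt.zero_grad()
    logits = model(xs).squeeze(-1)
    loss = loss_fn(logits, bits)
    loss.backward()
    opt.step()

acc = ((logits > 0).float() == bits).float().mean()
print(f"final loss {loss.item():.4f}, fraction of facts memorised {acc:.3f}")
```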

I found a characterisation of the problem showing that the circuits formed are like a particular form of matrix decomposition, where the matrix to decompose is the table of booleans that needs memorising. There are lots of nice statistical properties of the decomposition of random matrices that hint at some useful results, but I didn’t find anything conclusive.
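
To spell the decomposition view out in the notation of the sketch above (biases omitted, and this framing is my own rather than anything rigorous): stack the fact inputs into a matrix $X$ and the labels into a table $B$ (a single column in the sketch, but the same picture holds with many boolean attributes per input). The model is trying to achieve

$$
B \;\approx\; \sigma\!\big(\operatorname{ReLU}(X W_1^{\top})\, W_2^{\top}\big),
$$

so with $X$ fixed and random, training amounts to finding $(W_1, W_2)$ whose ReLU-constrained product reproduces the random boolean table using fewer parameters than the table has entries.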

This does go some distance towards explaining why the problem is “cursed”. The learnt behaviour depends on incidental structure, i.e. coincidences occurring in the random facts. So while there might be something to analyse statistically, there isn’t going to be any interpretable meaning behind it, and it’s likely packed so tightly that there’s no clever trick to recover anything without just running the model.

Then I analysed the scaling laws behind memorisation. I found that typically the models were able to memorise random boolean facts at ~2 bits per parameter of the model. Later I found out that this is a standard result in ML. It’s nice to confirm that this happens in practice, with ReLU, but it’s not a big enough result to merit a writeup. I wasn’t able to find any theoretical construction that approached this level of bit efficiency. I note that this level of efficiency is far from the number of bits actually in a parameter, which probably helps explain why quantisation works[2].

Loss for a 1-layer MLP that memorises bits of information. The dotted line shows the total parameter count.
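
As a back-of-the-envelope version of the quantisation point above: with $P$ parameters the measured capacity is about $2P$ bits, while the weights physically occupy

$$
16P \text{ bits (fp16)} \quad\text{or}\quad 32P \text{ bits (fp32)},
$$

so only something like $1/8$ to $1/16$ of the stored bits are doing memorisation work.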

· Brief Writeup · Code ·

ITDA Investigation

@Patrick Leask gave me an advance copy of his paper on ITDA SAEs, a new sort of sparse autoencoder that uses a simplified representation and is vastly easier to train.

I played around a bit with their capabilities, and modified Patrick’s code to support cross-coders. I also found a minor tweak that improved their reconstruction loss by 10%. But ultimately I couldn’t convince myself that the quality of the features found by ITDAs was high enough to justify investing more time, particularly as I had several other things on at the time.
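
I won’t reproduce the ITDA method here, but for general flavour of dictionary-based decomposition, here is plain matching pursuit over a dictionary of stored activations. This is my loose impression of the idea, not the paper’s actual algorithm, and all names and sizes are made up.

```python
import numpy as np

def matching_pursuit(x, dictionary, k=8):
    """Greedily decompose activation x as a sparse combination of dictionary rows.

    dictionary: (n_atoms, d_model) array of unit-normalised stored activations.
    Returns the sparse coefficients and the reconstruction of x.
    """
    residual = x.copy()
    coeffs = np.zeros(len(dictionary))
    for _ in range(k):
        scores = dictionary @ residual       # inner product with every atom
        i = np.argmax(np.abs(scores))        # pick the best-matching atom
        coeffs[i] += scores[i]
        residual = residual - scores[i] * dictionary[i]
    return coeffs, x - residual

# Toy usage: a random "dictionary" and a random query activation.
rng = np.random.default_rng(0)
D = rng.standard_normal((1000, 512))
D /= np.linalg.norm(D, axis=1, keepdims=True)
x = rng.standard_normal(512)
coeffs, recon = matching_pursuit(x, D)
print("active atoms:", np.count_nonzero(coeffs),
      "relative error:", np.linalg.norm(x - recon) / np.linalg.norm(x))
```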

· Brief Writeup · Code ·

  1. ^

    Except for the embed/​unembed layers.

  2. ^

    And weakly predicts that performance would fall off significantly once you quantise below 2 bits per parameter.