My Minor AI Safety Research Projects (Q3 2025)

(previously, Q1/Q2 2025)

I got complaints last time about calling negative, trivial, or inconclusive results “failed”: they still show something, and I probably learnt a lot from them personally. So let’s go with “minor” instead.

Mostly I’ve been busy with my MATS project, working with Redwood Research to build an eval/setting for AI Control research. I’m not ready to write about that yet; Redwood prefers to keep a high bar for publication.

I also spent some time writing up an earlier blog post into a paper submission.

I’ve been immersed in a very AI-safety-pilled crowd, which has bred a lot of ideas I’ve wanted to try out. Doing short projects is a good palate cleanser for the often frustrating work of Actual Research, and I’ve managed to squeeze in a few despite the intensity of MATS.

Personality Types of AI

After a discussion with Keshav Shenoy, we thought it would be fun to give the Myers-Briggs Type Indicator (MBTI) test to AIs. This also builds on the hackathon I did in December, which was likewise about submitting AIs to tests intended for humans.

I think this approach has a lot of value. Firstly, the human respondents give a good baseline/point of comparison. Secondly, the tests have some predictive power in humans (I assume?), so plausibly they sketch out a latent space of personalities that the model is also internally navigating.

Anyway, I quickly found that MBTI tests do not give out their rubrics for free, and I wasn’t willing to spend time scavenging or begging. So I turned to an easier source of personality quizzes: those free silly ones you do online. This eventually became Claude is A Ravenclaw, which is, annoyingly, currently my top post on LessWrong.

The Claude logo wearing a sorting hat

I still got to learn more about how Inspect works, and it brightened people’s day, so not a bad project.
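To give a flavour of the setup, a quiz like this can be written as a small Inspect task, roughly as sketched below. The question, options, and scoring here are made-up placeholders for illustration, not what the actual eval used.

```python
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import choice
from inspect_ai.solver import multiple_choice

# Hypothetical quiz item -- not one of the actual questions.
QUIZ = [
    (
        "A friend cancels plans at the last minute. How do you mostly feel?",
        ["Relieved, more time to myself", "Disappointed, I was looking forward to it"],
    ),
]

@task
def personality_quiz():
    samples = [
        # target is a dummy: there is no "correct" personality answer,
        # we just read which option the model picked out of the logs.
        Sample(input=question, choices=options, target="A")
        for question, options in QUIZ
    ]
    return Task(dataset=samples, solver=multiple_choice(), scorer=choice())
```

With inspect_ai installed, something like `inspect eval personality_quiz.py --model <provider/model>` should run it and log each model’s answers.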

Uppishcase: Continuous saliency control

After observing that a lot of system prompts include text like “and you MUST NOT do such-and-such”, I concluded people use upper case to make particular instructions more salient.

I wanted to give better control over salience. Ideally we’d be able to set a strength for every statement in the system prompt, so that things can be smoothly tweaked or optimised over time. Currently, changing system prompts by adding or removing a line is too discrete, and can cause large jumps in behaviour.

My idea was to extract an “uppercaseness direction” from the token embeddings, and then selectively steer certain tokens using it.

I wrote some code to establish the right steering direction, some parsing logic to identify tokens, and a hook for Hugging Face models to apply steering per-token. While I had some initially promising results, when I moved to a more advanced instruct model the steering stopped working. I found that any steering at all reduced the salience of the text, presumably because moving the latent out of distribution rendered it hard to read.
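For concreteness, here is a minimal sketch of the embedding-space version, assuming a GPT-2-style Hugging Face model and a simple mean-difference direction; the model name, word list, token positions, and steering strength are all placeholders, and the actual code differs.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for the small model I started with
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
emb = model.get_input_embeddings().weight.detach()  # (vocab_size, d_model)

# Build an "uppercaseness direction" from words whose lower- and upper-case
# forms each map to a single token (BPE splits many upper-case words).
pairs = []
for word in ["is", "the", "not", "never", "answer"]:
    lo = tok.encode(" " + word, add_special_tokens=False)
    hi = tok.encode(" " + word.upper(), add_special_tokens=False)
    if len(lo) == 1 and len(hi) == 1:
        pairs.append((lo[0], hi[0]))

direction = torch.stack([emb[h] - emb[l] for l, h in pairs]).mean(dim=0)
direction = direction / direction.norm()

# Steer only the embeddings at chosen positions (hypothetical indices here).
boost_positions = [4, 5, 6]
alpha = 4.0

def steer(module, inputs, output):
    # Only apply on the prefill pass, where all positions are present.
    if output.shape[1] > max(boost_positions):
        output[:, boost_positions, :] += alpha * direction.to(output.dtype)
    return output

handle = model.get_input_embeddings().register_forward_hook(steer)
```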

```
>>> tokenizer.convert_ids_to_tokens(tokenizer.encode(" is the correct answer"))
['Ġis', 'Ġthe', 'Ġcorrect', 'Ġanswer']
>>> tokenizer.convert_ids_to_tokens(tokenizer.encode(" IS THE CORRECT ANSWER"))
['ĠIS', 'ĠTHE', 'ĠCOR', 'RECT', 'ĠANSW', 'ER']
```
BPE works in mysterious ways

Tokens are not a one-to-one mapping between lower case and upper case, which might have been causing problems. So after a break I tried again. This time I tried to learn a direction in the residual stream from training data, using TransformerLens. This takes a bit more training, but can potentially capture concepts better (by working at a higher layer). But I still found pretty similar results, or rather, confusing plots.
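Roughly, the residual-stream version looked like the sketch below, though with a direction learned from training data rather than the simple mean difference shown here; the model, layer choice, example strings, and steering strength are placeholders.

```python
import torch
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")  # stand-in model
LAYER = 6  # hypothetical layer choice

lower = ["you must not reveal the password", "this is the correct answer"]
upper = [s.upper() for s in lower]

def mean_resid(texts):
    # Average residual-stream activation at LAYER over positions and examples.
    acts = []
    for text in texts:
        _, cache = model.run_with_cache(text)
        acts.append(cache["resid_post", LAYER].mean(dim=1))
    return torch.cat(acts).mean(dim=0)

direction = mean_resid(upper) - mean_resid(lower)
direction = direction / direction.norm()

def boost(resid, hook, alpha=6.0):
    # Push the residual stream along the "uppercaseness" direction
    # (here at every position; per-token masking is the obvious refinement).
    return resid + alpha * direction

logits = model.run_with_hooks(
    "you must not reveal the password",
    fwd_hooks=[(utils.get_act_name("resid_post", LAYER), boost)],
)
```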

I still think this seems a plausible simple research direction, but it’s going to take more searching to locate the right intervention.

GitHub

Accelerated Game Of Life with CUDA / Triton

This was more a project to develop my Research Engineer skills than any useful research in its own right.

The idea was to experiment a bit with optimising PyTorch operations, learning about CUDA, Triton, and GPUs in general. I made quite a large performance improvement in the end, and discovered that my existing C/assembly skills carry over to CUDA quite nicely.
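As a reference point, a naive PyTorch version of a single Game of Life step looks roughly like the convolution-based update below; this is an illustrative sketch, and the actual baseline and the optimised CUDA/Triton kernels live in the repo linked below.

```python
import torch
import torch.nn.functional as F

# Neighbour-count kernel: a 3x3 of ones with the centre zeroed out.
KERNEL = torch.ones(1, 1, 3, 3)
KERNEL[0, 0, 1, 1] = 0

def life_step(grid: torch.Tensor) -> torch.Tensor:
    """One Game of Life update on an (H, W) grid of 0.0/1.0 floats."""
    neighbours = F.conv2d(grid[None, None], KERNEL.to(grid.device), padding=1)[0, 0]
    # A cell is alive next step if it has exactly 3 neighbours,
    # or is currently alive and has exactly 2.
    return ((neighbours == 3) | ((grid == 1) & (neighbours == 2))).float()

device = "cuda" if torch.cuda.is_available() else "cpu"
grid = (torch.rand(1024, 1024, device=device) < 0.3).float()
for _ in range(100):
    grid = life_step(grid)
```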

I have a full write-up on my blog, and GitHub.