Jessica Rumbelow

Karma: 1,243

AI researcher

Jessica Rumbelow 22 Jun 2026 4:17 UTC
2 points
0
on: Guardian Angels: LLM Personalization for Productivity and Security
This feels pretty similar to something I wrote in 2022: https://www.lesswrong.com/posts/iHLJtbdFwsoNWZg3e/guardian-ai-misaligned-systems-are-all-around-us. I was thinking then about wrappers that re-optimise the feeds you already use rather than a full personalised agent – but you might find it interesting.

Jessica Rumbelow 18 May 2026 19:21 UTC
2 points
0
in reply to: Riya Tyagi’s comment on: An Introduction to Exemplar Partitioning for Mechanistic Interpretability
I don’t think there is an existing channel on OSMI, please feel free to create one!

Jessica Rumbelow 17 May 2026 3:52 UTC
2 points
0
in reply to: Riya Tyagi’s comment on: An Introduction to Exemplar Partitioning for Mechanistic Interpretability
I think this is tricky because, like much interp work, we don’t really have a ground truth for models of any reasonable size. It could be that what we would consider a cohesive concept X is in fact represented by the model in a weird non-contiguous way, and results that accurately capture that would be seen as a method failure. And conversely, it’s easy to optimise for methods that produce results that look sensible to us (“here is the concept X region”), but don’t actually capture the (potentially weird) thing that the model is really doing. For this reason, I think good tests are typically causal in nature – steering, patching etc – and the small experiments I’ve done in that direction with EP are interesting but by no means conclusive yet.

This is related to your second point – I think unsupervised methods are important, largely because supervised methods rely on us to hypothesise everything we might care about in the first instance. One silly example that I have just made up: we want to identify a “harmfulness” direction/region/pathway/whatever, so we construct a dataset to do this and train a probe. But we didn’t manage to actually cover all the kinds of harmfulness that the model is capable of representing in our dataset, because it’s really hard to think of all of them in advance. So we end up with a probe that works really well on the kinds of harmfulness we trained and tested on, which were based on our assumptions, but that misses some non-obvious direction/region/pathway/whatever. We then think we’ve characterised how the model represents harmfulness, but really we’ve missed something that might be important, and we’re unaware of that.

To elaborate, I think neural networks represent information in mysterious ways and that a lot of those ways might not be easily partitionable into concepts that humans would consider discrete or cohesive. But it’s tempting to expect them to be. Your happiness example is a really good one – consider that it could just actually be the case that the model has learned two different representations of happiness, and that enforcing sparsity (i.e. pushing for a discrete happiness subspace, because that is what makes sense to us) is imposing an assumption that makes the feature look better to us, but might not represent what the model is actually doing.

I don’t know yet if EP solves any of this convincingly, but I like it because it’s so simple and doesn’t impose many assumptions. It just makes it possible for us to look at the structure of activation space – what is near what – under different distance metrics.
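To make "what is near what, under different distance metrics" concrete, here is a minimal sketch. It is not EP itself, only a toy nearest-neighbour lookup. The two-dimensional "activations", their names, and the helper functions are all made up for illustration. It shows that the nearest neighbour of the same query can change when the metric changes.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """1 - cosine similarity: ignores vector length, compares direction only."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def nearest_neighbour(query, points, metric):
    """Return the name of the stored vector closest to `query` under `metric`."""
    return min(points, key=lambda name: metric(query, points[name]))

# Hypothetical toy "activations" (illustrative values, not from any real model).
activations = {
    "far_but_aligned": [10.0, 1.0],   # same direction as the query, much larger norm
    "near_but_rotated": [1.0, 1.5],   # close in space, but pointing elsewhere
}
query = [2.0, 0.2]

nearest_euclidean = nearest_neighbour(query, activations, euclidean)
nearest_cosine = nearest_neighbour(query, activations, cosine_distance)
```

Here the Euclidean neighbour is the nearby point and the cosine neighbour is the aligned one, so any picture of "neighbourhoods" in activation space depends on which metric you pick.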

These are really excellent questions that I don’t know the answer to yet. I am right this moment working on making the GitHub repo as friendly as possible to other researchers, so would encourage you to get stuck in if you’re keen!

Jessica Rumbelow 16 May 2026 20:41 UTC
2 points
0
in reply to: Stanislav Fort’s comment on: An Introduction to Exemplar Partitioning for Mechanistic Interpretability
Cool! This was input distance from a safe set of prompts, right?

An Introduction to Exemplar Partitioning for Mechanistic Interpretability

Jessica Rumbelow 16 May 2026 3:58 UTC

69 points

7 comments · 11 min read · LW link

(www.leap-labs.com)

Scientific Discovery in the Age of Artificial Intelligence

Jessica Rumbelow 29 Jun 2025 20:45 UTC

42 points

3 comments · 10 min read · LW link

Jessica Rumbelow 9 Aug 2024 16:27 UTC
5 points
2
on: Jessica Rumbelow’s Shortform
Attribution can identify when system prompts are affecting behaviour.
Note the diminished overall attribution when a hidden system prompt is responsible for the output (or is something else going on?). Post on method here.

Jessica Rumbelow 6 Aug 2024 13:27 UTC
4 points
1
in reply to: gordian_gruentuch’s comment on: Why did ChatGPT say that? Prompt engineering and more, with pizza.
Yeah! So, hierarchical perturbation (HiPe) is a bit like a thresholded binary search. It starts by splitting the input into large overlapping chunks and perturbing each of them. If the resulting attributions for any of the chunks are above a certain level, those chunks are split into smaller chunks and the process continues. This works because it efficiently discards input regions that don’t contribute much to the output, without having to individually perturb each token in them.

Standard iterative perturbation (ItP) is much simpler. It just splits the inputs into evenly sized chunks, perturbs each of them in turn to get the attributions, and that’s that. We do this either word-wise or token-wise (word-wise is about 25% quicker).
So, where n=number of tokens in the prompt and O(1) is the cost of a single completion, ItP is O(n) if we perturb token-wise, or O(0.75n) if word-wise, depending on how many tokens per word your tokeniser gives you on average. This is manageable but not ideal. You could, of course, always perturb iteratively in multi-token chunks, at the cost of attribution granularity.
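A minimal sketch of iterative perturbation as described above, assuming a hypothetical `score` function that stands in for "run one completion and measure the output". The toy scorer, the empty-string mask, and all names are illustrative and are not taken from the PIZZA code.

```python
from typing import Callable, List

def iterative_perturbation(
    tokens: List[str],
    score: Callable[[List[str]], float],
    mask: str = "",
) -> List[float]:
    """Perturb one token at a time; attribution = change in the score."""
    baseline = score(tokens)
    attributions = []
    for i in range(len(tokens)):
        perturbed = tokens[:i] + [mask] + tokens[i + 1:]
        attributions.append(abs(baseline - score(perturbed)))
    return attributions

# Toy stand-in for a model call (assumption): the "output" only depends on
# whether the word "pizza" is present. We also count calls to show the cost.
itp_calls = {"n": 0}

def toy_score(tokens: List[str]) -> float:
    itp_calls["n"] += 1
    return 1.0 if "pizza" in tokens else 0.0

prompt = "I would like a pizza please".split()
itp_attributions = iterative_perturbation(prompt, toy_score)
```

For this 6-word prompt the sketch makes 7 scorer calls (one baseline plus one per word), which is the linear cost discussed above.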
HiPe can be harder to predict, as it really depends on the initial chunk size and threshold you use, and the true underlying saliency of the input tokens (which naturally we don’t know). In the worst case with a threshold of zero (a poor choice), an initial chunk size of n and every token being salient, you might end up with O(4n) or more, depending on how you handle overlaps. In practice, with a sensible threshold (we use the mid-range, which works well out of the box) this is rare.
HiPe really shines on large prompts, where only a few tokens are really important. If a given completion only really relies on 10% of the input tokens, HiPe will give you attributions in a fraction of the n completions that token-wise ItP needs.
I don’t want to make sweeping claims about HiPe’s efficiency in general, as it relies on the actual saliency of the input tokens. Which we don’t know. Which is why we need HiPe! We’d actually love to see someone do a load of benchmark experiments using different configurations to get a better handle on this, if anyone fancies it.
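A simplified sketch of the hierarchical idea, under stated assumptions: it splits each surviving region into two non-overlapping halves (the method described above uses overlapping chunks), it uses the mid-range of the current level's attributions as the threshold, and `score` is a hypothetical stand-in for a single completion. It is an illustration of the thresholded-search idea, not the actual HiPe implementation.

```python
from typing import Callable, List

def hierarchical_perturbation(
    tokens: List[str],
    score: Callable[[List[str]], float],
    mask: str = "",
) -> List[float]:
    """Coarse-to-fine perturbation: only salient chunks are split further."""
    n = len(tokens)
    if n == 0:
        return []
    baseline = score(tokens)
    attributions = [0.0] * n
    regions = [(0, n)]  # half-open spans still worth refining
    while regions:
        scored = []
        for start, end in regions:
            mid = (start + end) // 2
            for s, e in ((start, mid), (mid, end)):
                if e == s:
                    continue  # empty half (only happens for 1-token regions)
                perturbed = tokens[:s] + [mask] * (e - s) + tokens[e:]
                a = abs(baseline - score(perturbed))
                scored.append((s, e, a))
                attributions[s:e] = [a] * (e - s)
        values = [a for _, _, a in scored]
        threshold = (min(values) + max(values)) / 2  # mid-range threshold
        # Keep refining only chunks above threshold that can still be split.
        regions = [(s, e) for s, e, a in scored if a > threshold and e - s > 1]
    return attributions

# Toy scorer (assumption): only the word "pizza" matters. Calls are counted.
hipe_calls = {"n": 0}

def toy_score(tokens: List[str]) -> float:
    hipe_calls["n"] += 1
    return 1.0 if "pizza" in tokens else 0.0

prompt = ["filler"] * 16
prompt[4] = "pizza"
hipe_attributions = hierarchical_perturbation(prompt, toy_score)
```

On this 16-token toy prompt with a single salient token, the sketch makes 9 scorer calls, where perturbing every token individually would take 17 including the baseline. With many salient tokens the saving shrinks, which is the dependence on true saliency noted above.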

Why did ChatGPT say that? Prompt engineering and more, with PIZZA.

Jessica Rumbelow 3 Aug 2024 12:07 UTC

43 points

2 comments · 4 min read · LW link

Jessica Rumbelow 7 Mar 2023 21:09 UTC
1 point
0
in reply to: scasper’s comment on: Introducing Leap Labs, an AI interpretability startup
Thanks for the comment! I’ll respond to the last part:
“First, developing basic insights is clearly not just an AI safety goal. It’s an alignment/capabilities goal. And as such, the effects of this kind of thing are not robustly good.”
I think this could certainly be the case if we were trying to build state of the art broad domain systems, in order to use interpretability tools with them for knowledge discovery – but we’re explicitly interested in using interpretability with narrow domain systems.
“Interpretability is the backbone of knowledge discovery with deep learning”: Deep learning models are really good at learning complex patterns and correlations in huge datasets that humans aren’t able to parse. If we can use interpretability to extract these patterns in a human-parsable way, in a (very Olah-ish) sense we can reframe deep learning models as lenses through which to view the world, and to make sense of data that would otherwise be opaque to us.
Here are a few examples:
https://www.mdpi.com/2072-6694/14/23/5957
https://www.deepmind.com/blog/exploring-the-beauty-of-pure-mathematics-in-novel-ways
https://www.nature.com/articles/s41598-021-90285-5
Are you concerned about AI risk from narrow systems of this kind?

Jessica Rumbelow 7 Mar 2023 14:22 UTC
2 points
0
in reply to: 1a3orn’s comment on: Introducing Leap Labs, an AI interpretability startup
Thanks! Unsure as of yet – we could either keep it proprietary and provide access through an API (with some free version for select researchers), or open source it and monetise by offering a paid, hosted tier with integration support. Discussions are ongoing.

Jessica Rumbelow 7 Mar 2023 14:20 UTC
5 points
0
in reply to: Jay Bailey’s comment on: Introducing Leap Labs, an AI interpretability startup
This isn’t set in stone, but likely we’ll monetise by selling access to the interpretability engine, via an API. I imagine we’ll offer free or subsidised access to select researchers/orgs. Another route would be to open source all of it, and monetise by offering a paid, hosted version with integration support etc.

Jessica Rumbelow 7 Mar 2023 14:16 UTC
7 points
0
in reply to: Zac Hatfield-Dodds’s comment on: Introducing Leap Labs, an AI interpretability startup
We’re looking into it!

Jessica Rumbelow 7 Mar 2023 14:16 UTC
11 points
8
in reply to: Søren Elverlin’s comment on: Introducing Leap Labs, an AI interpretability startup
Good questions. Doing any kind of technical safety research that leads to better understanding of state of the art models carries with it the risk that by understanding models better, we might learn how to improve them. However, I think that the safety benefit of understanding models outweighs the risk of small capability increases, particularly since any capability increase is likely heavily skewed towards model specific interventions (e.g. “this specific model trained on this specific dataset exhibits bias x in domain y, and could be improved by retraining with more varied data from domain y”, rather than “the performance of all models of this kind could be improved with some intervention z”). I’m thinking about this a lot at the moment and would welcome further input.

Introducing Leap Labs, an AI interpretability startup

Jessica Rumbelow 6 Mar 2023 16:16 UTC

104 points

12 comments · 1 min read · LW link

SolidGoldMagikarp III: Glitch token archaeology

mwatkins and Jessica Rumbelow

14 Feb 2023 10:17 UTC

92 points

36 comments · 16 min read · LW link

Jessica Rumbelow 7 Feb 2023 10:48 UTC
3 points
0
in reply to: Neel Nanda’s comment on: SolidGoldMagikarp (plus, prompt generation)
Aha!! Thanks Neel, makes sense. I’ll update the post

Jessica Rumbelow 6 Feb 2023 21:18 UTC
6 points
0
in reply to: ChrisCundy’s comment on: SolidGoldMagikarp (plus, prompt generation)
Yeah! Basically we just perform gradient descent on sensibly initialised embeddings (cluster centroids, or points close to the target output), constrain the embeddings to length 1 during the process, and penalise distance from the nearest legal token. We optimise the input embeddings to minimise the negative log prob of the target output token(s). Happy to have a quick call to go through the code if you like, DM me :)
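A toy sketch of the recipe described above, with loud assumptions: the "model" is a single hypothetical linear layer `W` followed by a softmax, the vocabulary is three made-up unit vectors, and the step size and penalty weight are arbitrary. It is meant to show the shape of the optimisation (gradient step, projection back to length 1, nearest-token penalty), not the authors' actual code.

```python
import math

def normalise(v):
    """Project a vector back onto the unit sphere."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def softmax(z):
    m = max(z)
    exps = [math.exp(a - m) for a in z]
    total = sum(exps)
    return [e / total for e in exps]

def logits(W, x):
    return [sum(w * a for w, a in zip(row, x)) for row in W]

def nearest(x, vocab):
    """Index of the legal token embedding closest to x (squared Euclidean)."""
    return min(range(len(vocab)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(x, vocab[i])))

def optimise_embedding(W, vocab, target, x0, steps=200, lr=0.1, penalty=0.1):
    x = normalise(x0)
    for _ in range(steps):
        probs = softmax(logits(W, x))
        # Gradient of -log p[target] w.r.t. x for a linear-softmax model:
        # W^T (p - onehot(target)).
        err = [p - (1.0 if k == target else 0.0) for k, p in enumerate(probs)]
        grad = [sum(err[k] * W[k][j] for k in range(len(W)))
                for j in range(len(x))]
        # Penalise squared distance from the nearest legal token embedding.
        e = vocab[nearest(x, vocab)]
        grad = [g + 2 * penalty * (a - b) for g, a, b in zip(grad, x, e)]
        # Gradient step, then constrain the embedding to length 1.
        x = normalise([a - lr * g for a, g in zip(x, grad)])
    return x, nearest(x, vocab)

# Hypothetical toy setup: token k's embedding is basis vector k, and the
# linear "model" favours output k when the input looks like token k.
vocab = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
W = [[3.0 * a for a in row] for row in vocab]
centroid = [sum(col) / len(vocab) for col in zip(*vocab)]  # initialisation

x_opt, token_opt = optimise_embedding(W, vocab, target=2, x0=centroid)
p_target = softmax(logits(W, x_opt))[2]
```

Starting from the centroid, the optimised embedding stays on the unit sphere and ends up nearest to the token that produces the target output. In a real language model the gradient would come from backpropagation through the network rather than this closed form.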

Jessica Rumbelow 6 Feb 2023 21:13 UTC
LW: 3 AF: 1
−2
AF
in reply to: Neel Nanda’s comment on: SolidGoldMagikarp (plus, prompt generation)
This link: https://help.openai.com/en/articles/6824809-embeddings-frequently-asked-questions says that token embeddings are normalised to length 1, but a quick inspection of the embeddings available through the huggingface model shows this isn’t the case. I think that’s the extent of our claim. For prompt generation, we normalise the embeddings ourselves and constrain the search to that space, which results in better performance.

Jessica Rumbelow 6 Feb 2023 21:09 UTC
3 points
0
in reply to: Eric Wallace’s comment on: SolidGoldMagikarp (plus, prompt generation)
Thanks—wasn’t aware of this!
