AI interpretability researcher
Jessica Rumbelow
Why I’m Working On Model Agnostic Interpretability
The Ground Truth Problem (Or, Why Evaluating Interpretability Methods Is Hard)
Guardian AI (Misaligned systems are all around us.)
Yeah, I think it could be! I’m considering pursuing it after SERI-MATS. I’ll need a couple of cofounders.
More detail on this phenomenon here: https://www.lesswrong.com/posts/aPeJE8bSo6rAFoLqg/solidgoldmagikarp-plus-prompt-generation
Yep. Aside from running forward prop n times to generate an output of length n, we can just optimise the mean probability of the target tokens at each position in the output; it's already implemented in the code, although it takes way longer to find optimal completions.
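A toy illustration of that objective, in case it's useful (the probabilities are hypothetical, not from a real model):

```python
import numpy as np

# Hypothetical probabilities the model assigns to each token of a
# length-4 target completion, all read off from one forward pass
# (so no need to generate autoregressively n times).
p_target = np.array([0.40, 0.12, 0.75, 0.05])

# The objective is the mean probability of the target tokens across
# positions; the mean negative log prob is a common alternative loss.
mean_p = p_target.mean()
mean_nll = -np.log(p_target).mean()
```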
Good to know. Thanks!
Not yet, but there’s no reason why it wouldn’t be possible. You can imagine microscope AI, for language models. It’s on our to-do list.
What’s an SCP?
I’ll check with Matthew—it’s certainly possible that not all tokens in the “weird token cluster” elicit the same kinds of responses.
Interesting, thanks. There’s not a whole lot of detail there—it looks like they didn’t do any distance regularisation, which is probably why they didn’t get meaningful results.
Interesting! Can you give a bit more detail or share code?
SolidGoldMagikarp (plus, prompt generation)
Thanks—wasn’t aware of this!
This link: https://help.openai.com/en/articles/6824809-embeddings-frequently-asked-questions says that token embeddings are normalised to length 1, but a quick inspection of the embeddings available through the huggingface model shows this isn’t the case. I think that’s the extent of our claim. For prompt generation, we normalise the embeddings ourselves and constrain the search to that space, which results in better performance.
Yeah! Basically we just perform gradient descent on sensibly initialised embeddings (cluster centroids, or points close to the target output), constrain the embeddings to length 1 during the process, and penalise distance from the nearest legal token. We optimise the input embeddings to minimise the -log prob (i.e. maximise the probability) of the target output token(s). Happy to have a quick call to go through the code if you like, DM me :)
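If it helps before a call, here's a minimal numpy sketch of that loop, with a toy linear map standing in for the real model; all the names and hyperparameters here are hypothetical, not our actual code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the model: a fixed linear map from a single input
# embedding to output logits. (The real method backprops through an LM.)
VOCAB, DIM = 50, 16
W_out = rng.normal(size=(DIM, VOCAB))

# Hypothetical token embedding matrix, normalised to length 1.
token_embeddings = rng.normal(size=(VOCAB, DIM))
token_embeddings /= np.linalg.norm(token_embeddings, axis=1, keepdims=True)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def loss_and_grad(x, target, dist_weight=0.1):
    """-log p(target) plus a penalty on distance to the nearest legal token."""
    p = softmax(x @ W_out)
    nll_grad = W_out @ (p - np.eye(VOCAB)[target])  # d(-log p_t)/dx
    dists = np.linalg.norm(token_embeddings - x, axis=1)
    nearest = int(dists.argmin())
    pen_grad = dist_weight * (x - token_embeddings[nearest]) / (dists[nearest] + 1e-8)
    loss = -np.log(p[target]) + dist_weight * dists[nearest]
    return loss, nll_grad + pen_grad, nearest

# Sensible initialisation (here: the centroid), then projected gradient descent.
x = token_embeddings.mean(axis=0)
x /= np.linalg.norm(x)
target, losses = 7, []
for _ in range(300):
    loss, grad, nearest = loss_and_grad(x, target)
    losses.append(loss)
    x -= 0.05 * grad
    x /= np.linalg.norm(x)  # constrain the embedding to length 1

prompt_token = nearest      # finally, snap to the closest legal token
```

In the real thing you'd optimise a whole sequence of input embeddings through the actual model, but the projection and penalty steps have the same shape.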
Aha!! Thanks Neel, makes sense. I’ll update the post
“Being able to reorganise a question in the form of a model-appropriate game” seems like something we’ve already built a set of reasonable heuristics around: categorising different types of problems and their appropriate translations into ML-able tasks. There are well-established ML approaches to, e.g., image captioning, time-series prediction, audio segmentation, etc. Is the bottleneck you’re concerned with, OP, the lack of breadth and granularity of these problem sets? And can we mark progress (to some extent) by the number of problem sets we have robust ML translations for?