AI interpretability researcher
Jessica Rumbelow
Why I’m Working On Model Agnostic Interpretability
The Ground Truth Problem (Or, Why Evaluating Interpretability Methods Is Hard)
Guardian AI (Misaligned systems are all around us.)
Yeah, I think it could be! I’m considering pursuing it after SERI-MATS. I’ll need a couple of cofounders.
More detail on this phenomenon here: https://www.lesswrong.com/posts/aPeJE8bSo6rAFoLqg/solidgoldmagikarp-plus-prompt-generation
Yep. Aside from running forward prop n times to generate an output of length n, we can just optimise the mean probability of the target tokens at each position in the output; it's already implemented in the code, although it takes way longer to find optimal completions.
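A toy illustration of that objective, in case it's useful (the probabilities are hypothetical, not from a real model):

```python
import numpy as np

# Hypothetical probabilities the model assigns to each token of a
# length-4 target completion, all read off from one forward pass
# (so no need to generate autoregressively n times).
p_target = np.array([0.40, 0.12, 0.75, 0.05])

# The objective is the mean probability of the target tokens across
# positions; the mean negative log prob is a common alternative loss.
mean_p = p_target.mean()
mean_nll = -np.log(p_target).mean()
```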
Good to know. Thanks!
Not yet, but there’s no reason why it wouldn’t be possible. You can imagine microscope AI, for language models. It’s on our to-do list.
What’s an SCP?
I’ll check with Matthew—it’s certainly possible that not all tokens in the “weird token cluster” elicit the same kinds of responses.
Interesting, thanks. There’s not a whole lot of detail there—it looks like they didn’t do any distance regularisation, which is probably why they didn’t get meaningful results.
Interesting! Can you give a bit more detail or share code?
SolidGoldMagikarp (plus, prompt generation)
Thanks—wasn’t aware of this!
This link: https://help.openai.com/en/articles/6824809-embeddings-frequently-asked-questions says that token embeddings are normalised to length 1, but a quick inspection of the embeddings available through the huggingface model shows this isn’t the case. I think that’s the extent of our claim. For prompt generation, we normalise the embeddings ourselves and constrain the search to that space, which results in better performance.
Yeah! Basically we just perform gradient descent on sensibly initialised embeddings (cluster centroids, or points close to the target output), constrain the embeddings to length 1 during the process, and penalise distance from the nearest legal token. We optimise the input embeddings to minimise the -log prob (i.e. maximise the probability) of the target output token(s). Happy to have a quick call to go through the code if you like, DM me :)
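If it helps before a call, here's a minimal numpy sketch of that loop, with a toy linear map standing in for the real model; all the names and hyperparameters here are hypothetical, not our actual code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the model: a fixed linear map from a single input
# embedding to output logits. (The real method backprops through an LM.)
VOCAB, DIM = 50, 16
W_out = rng.normal(size=(DIM, VOCAB))

# Hypothetical token embedding matrix, normalised to length 1.
token_embeddings = rng.normal(size=(VOCAB, DIM))
token_embeddings /= np.linalg.norm(token_embeddings, axis=1, keepdims=True)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def loss_and_grad(x, target, dist_weight=0.1):
    """-log p(target) plus a penalty on distance to the nearest legal token."""
    p = softmax(x @ W_out)
    nll_grad = W_out @ (p - np.eye(VOCAB)[target])  # d(-log p_t)/dx
    dists = np.linalg.norm(token_embeddings - x, axis=1)
    nearest = int(dists.argmin())
    pen_grad = dist_weight * (x - token_embeddings[nearest]) / (dists[nearest] + 1e-8)
    loss = -np.log(p[target]) + dist_weight * dists[nearest]
    return loss, nll_grad + pen_grad, nearest

# Sensible initialisation (here: the centroid), then projected gradient descent.
x = token_embeddings.mean(axis=0)
x /= np.linalg.norm(x)
target, losses = 7, []
for _ in range(300):
    loss, grad, nearest = loss_and_grad(x, target)
    losses.append(loss)
    x -= 0.05 * grad
    x /= np.linalg.norm(x)  # constrain the embedding to length 1

prompt_token = nearest      # finally, snap to the closest legal token
```

In the real thing you'd optimise a whole sequence of input embeddings through the actual model, but the projection and penalty steps have the same shape.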
Aha!! Thanks Neel, makes sense. I’ll update the post
“Being able to reorganise a question in the form of a model-appropriate game” seems like something we’ve already built a set of reasonable heuristics around: categorising different types of problems and their appropriate translations into ML-able tasks. There are well-established ML approaches to, e.g., image captioning, time-series prediction, audio segmentation, etc. Is the bottleneck you’re concerned with, OP, the lack of breadth and granularity of these problem sets? And can we mark progress (to some extent) by the number of problem sets we have robust ML translations for?