I’m considering writing a follow-up FAQ to my pragmatic interpretability post, with clarifications and responses to common objections. What would you like to see addressed?
Neel Nanda
Indeed, I have long thought that mechanistic interpretability was overinvested relative to other alignment efforts (but underinvested in absolute terms) exactly because it was relatively easy to measure and feel like you were making progress.
I’m surprised that you seem to simultaneously be concerned that it was too easy to feel like you were making progress in past mech interp, and to push back against us saying that it was too easy to incorrectly feel like you’re making progress in mech interp and that we need better metrics of whether we’re making progress.
In general they want to time-box and quantify basically everything?
The key part is to be objective, which is related to but not the same thing as being quantifiable. For example, you can test whether your hypothesis is correct by making non-trivial empirical predictions and then verifying them: if you change the prompt in a certain way, what will happen? Or can you construct an adversarial example in an interpretable way?
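As a toy illustration of the "predict, then verify" loop (a minimal sketch only: the model, prompts, hypothesis, and pass threshold below are all illustrative assumptions, not from the original discussion), you can write down the predicted direction of an effect before running the intervention:

```python
# Toy sketch of "make a non-trivial prediction, then verify it" for an
# interpretability hypothesis. Model, prompts, and hypothesis are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # any small causal LM works for this sketch
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def next_token_prob(prompt: str, target: str) -> float:
    """Probability the model assigns to `target` as the next token after `prompt`."""
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]
    target_id = tok.encode(target, add_special_tokens=False)[0]
    return torch.softmax(logits, dim=-1)[target_id].item()

# Hypothetical claim: the model says " Paris" because of the literal phrase
# "capital of France", so swapping in "Germany" should sharply reduce P(" Paris").
# The prediction (and the threshold for passing) is fixed *before* running it.
p_before = next_token_prob("The capital of France is", " Paris")
p_after = next_token_prob("The capital of Germany is", " Paris")
print(f"P(' Paris' | France prompt):  {p_before:.3f}")
print(f"P(' Paris' | Germany prompt): {p_after:.3f}")
print("prediction held" if p_after < 0.1 * p_before else "prediction failed")
```

The point is just that the check is objective: the prediction is specified before the intervention runs, and it either holds or it doesn’t.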
Pragmatic problems are often the comparative advantage of frontier labs.
Our post is aimed at the community in general, not just the community inside frontier labs, so this is not an important part of our argument, though there are definitely certain problems we are comparatively advantaged at studying
I also think that ‘was it “scheming” or just “confused”?’, an example of a question Neel Nanda points to, is a remarkably confused question; the boundary is a lot less solid than it appears, and in general attempts to put ‘scheming’ or ‘deception’ or similar into a distinct box misunderstand how all the related things work.
Yes, obviously this is a complicated question. Figuring out the right question to ask is part of the challenge. But I think there clearly is some real substance here: there are times when an AI causes bad outcomes that indicate a goal-directed entity taking undesired actions, and times when that isn’t what’s going on, and figuring out the difference is very important.
[Paper] Difficulties with Evaluating a Deception Detector for AIs
Yeah, that’s basically my take—I don’t expect anything to “solve” alignment, but I think we can achieve major risk reductions by marginalist approaches. Maybe we can also achieve even more major risk reductions with massive paradigm shifts, or maybe we just waste a ton of time, I don’t know.
If you can get a good proxy to the eventual problem in a real model, I much prefer that, on realism grounds. Eg eval awareness in Sonnet.
Seems reasonable. But yeah, it’s a high bar
In my parlance it sounds like you agree with me about the importance of robustly useful settings, you just disagree with my more specific claim that model organisms are often bad robustly useful settings?
I think that it’s easy to simulate being difficult and expensive to evaluate, and current models already show situational awareness. You can try simulating things like scheming with certain prompting experiments, though I think that one’s more tenuous.
Note that I consider model organisms to involve fine-tuning, and prompted settings to be in scope depending on the specific details of how contrived they seem.
I’m generally down for making a model organism designed to test a very specific thing that we expect to see in future models. My scepticism is about the claim that this generalizes, and that you can just screw around in a model organism and expect to discover useful things. I think if you design it to test a narrow phenomenon then there’s a good chance you can in fact test that narrow phenomenon.
How Can Interpretability Researchers Help AGI Go Well?
A Pragmatic Vision for Interpretability
Have you tried emailing the authors of that paper and asking if they think you’re missing any important details? Imo there are 3 kinds of papers:
1. Totally legit
2. Kinda fragile and fiddly: there’s various tacit knowledge and key details to get right, but the results are basically legit. Or, eg, it’s easy for you to have a subtle bug that breaks everything. Imo it’s bad if they don’t document this, but it’s different from not replicating
3. Misleading (eg only works in a super narrow setting and this was not documented at all) or outright false
I’m pro more safety work being replicated, and would be down to fund a good effort here, but I’m concerned about 2 and 3 getting confused
Interesting. Thanks for writing up the post. I’m reasonably persuaded that this might work, though I am concerned that long chains of RL are just sufficiently fucked functions with enough butterfly effects that this wouldn’t be well approximated by this process. I would be interested to see someone try though
Sure, but there’s a big difference between engaging in PR damage control mode and actually seriously engaging. I don’t take them choosing to be in the former as significant evidence of wrongdoing.
Thanks for adding the context!
Separately, I think an issue is that they’re incredibly non-transparent about what they’re doing, have been somewhat misleading in their responses to my tweets, and haven’t answered any of the questions.
I can’t really fault them for not answering or being fully honest, from their perspective you’re a random dude who’s attacking them publicly and trying to get them lots of bad PR. I think it’s often very reasonable to just not engage in situations like that. Though I would judge them for outright lying
Thanks for sharing, this is extremely important context. I’m way more ok with dual-use threats from a company actively trying to reduce bio risk from AI, who seem to have vaguely reasonable threat models, than from just reckless gain-of-function people with insane threat models. It’s much less clear to me how much risk is ok to accept from projects actively doing reasonable things to make it better, but it’s clearly non-zero (I don’t know if this place is actually doing reasonable things, but Mikhail provides no evidence against).
I think it was pretty misleading for Mikhail not to include this context in the original post.
Current LLMs seem to rarely detect CoT tampering
I think that being a good founder in AI safety is very hard, and generally only recommend doing it after having some experience in the field—this strongly applies to research orgs, but also to eg field building. If you’re founding something, you need to constantly make judgements about what is best, and don’t really have mentors to defer to, unlike many entry level safety roles, and often won’t get clear feedback from reality if you get them wrong. And these are very hard questions, and if you don’t get them right, there’s a good chance your org is mediocre. I think this applies even to orgs within an existing research agenda (most attempts to found mech interp orgs seem doomed to me). Field building is a bit less dicey, but even then, you want strong community connections and a sense for what will and will not work.
I’m very excited for there to be more good founders in AI Safety, but don’t think loudly signal boosting this to junior people is a good way to achieve this. And imo “founding an org” is already pretty high status, at least if you’re perceived to have some momentum behind you?
I’m also fine with people without a lot of AI safety expertise partnering as co-founders with those who do have it, but I struggle to think of orgs that I think have gone well that didn’t have at least one highly experienced and competent co-founder.
40 people attending a one-hour meeting spend 40 hours of total employee time. 20 people attending a one-hour meeting spend 20 hours of total employee time. Total employee time is the main metric that matters.
Good point, I buy it
Great work! I’m excited to see red-team/blue-team games being further invested in and scaled up. I think it’s a great style of objective proxy task.