Methods such as few-shot steering, if made more robust, could help make production AI deployments safer and less prone to hallucination.
I think this understates everything that can go wrong with the gray-box understanding that interp can get us in the short and medium term.
Activation steering is somewhat close in its effects to prompting the model (after all, you are changing the activations to be more like what they would be if you had used different prompts). Activation steering may therefore have the same problem as prompting (e.g. you say “please be nice”, the AI reads “please look nice to a human”, the AI has more activations in the direction of “looking nice to a human”, and then subtly stabs you in the back). I don’t know of any evidence to the contrary, and I suspect it will be extremely hard to find methods that don’t have this kind of problem.
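To make the analogy to prompting concrete, here is a minimal sketch of contrastive activation steering (sometimes called activation addition), assuming a GPT-2-style Hugging Face model; the layer index, contrast prompts, and scale are illustrative placeholders, not values from this discussion. The point of the sketch is that the steering vector is literally a difference of activations induced by two prompts, so it inherits whatever those prompts actually mean to the model.

```python
# A minimal sketch of contrastive activation steering ("activation addition"),
# assuming a GPT-2-style causal LM from Hugging Face transformers.
# The layer index, contrast prompts, and scale are illustrative, not tuned values.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

LAYER = 6    # residual-stream block to steer (arbitrary choice)
SCALE = 4.0  # steering strength (arbitrary choice)

def mean_residual(prompt: str) -> torch.Tensor:
    """Mean hidden state at the output of block LAYER for a given prompt."""
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    # hidden_states[0] is the embedding output, so block LAYER's output is index LAYER + 1.
    return out.hidden_states[LAYER + 1].mean(dim=1).squeeze(0)

# The steering vector is just the difference between activations induced by two
# contrasting prompts -- so it carries what "be nice" means to the model
# (possibly "look nice to a human"), not what we intend it to mean.
steer = mean_residual("Please be nice.") - mean_residual("Please be rude.")

def add_steering(module, inputs, output):
    # GPT-2 blocks usually return a tuple whose first element is the hidden states.
    if isinstance(output, tuple):
        return (output[0] + SCALE * steer,) + output[1:]
    return output + SCALE * steer

handle = model.transformer.h[LAYER].register_forward_hook(add_steering)
try:
    ids = tok("My coworker made a mistake. I should", return_tensors="pt")
    gen = model.generate(**ids, max_new_tokens=40, do_sample=False)
    print(tok.decode(gen[0], skip_special_tokens=True))
finally:
    handle.remove()
```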
In general, while mechanistic interpretability can probably become useful (in the way chain-of-thought monitoring is useful), I think it is not on track to provide a robust enough understanding of models to reliably catch problems and fix them at their root (any more than chain-of-thought monitoring does). This level of pessimism is shared by people who have spent a lot of time leading interpretability work, like Neel Nanda and Buck Shlegeris.
I wonder to what extent alignment faking is accounted for in current preparedness frameworks. One of my beliefs is that a better degree of interpretability could help us understand why models engage in such behavior, but yes, it probably does not get us to a solution (so far).
Extremely late, but I actually agree.