I think that this is a well-done post overall, though I mostly disagree with it. A couple of thoughts below.
First, I was surprised not to see unknown unknowns addressed, as Richard pointed out.
Second, another theory of impact that I didn’t see addressed here is the case that I’ve been trying to make recently that interpretability is likely to be necessary to build good safety evaluations. This could be quite important if evaluations end up being the primary AI governance tool, as currently looks somewhat likely to me.
Third, though you quote me talking about why I think detecting/disincentivizing deception with interpretability tools is so hard, what is not quoted is my view of the various non-interpretability methods of doing so, which is that they’re even harder. Though you mention a bunch of non-interpretability ways of studying deception (which I’m definitely all for), being able to study it doesn’t imply that we can disincentivize it (and I think we’re going to need both). You mention chain-of-thought oversight as a possible solution, but I’m quite skeptical of that working, simply because the model need not write out its deception in the scratchpad in any legible way. Furthermore, even if it did, how would you disincentivize it? Just train the model not to write out its deception in its chain of thought? Why wouldn’t that just cause the model to become better at hiding its deception? Interpretability, on the other hand, might let us mechanistically disincentivize deception by directly selecting over the sorts of thought processes we want the model to have.