Given that this post's audience has signalled mixed responses to your comment (which confuses me, since your basic argument makes sense to me), and that no one has replied to you, here's an attempt to understand the situation.
The core thesis of Marius' argument seems to be that, for a marginal increase in interpretability research, the marginal cost of aligning an AI model is less than the marginal cost of increasing SOTA AI model capabilities. He draws on arguments from biorisk research to suggest that a similar situation holds in alignment research.
You claim, however, that this isn't broadly true, since what actually matters is how much information we get from an interpretability tool per bit of information transferred.
Marius' threat model is that alignment research also increases capabilities and therefore shortens timelines. Your threat model seems to be that uninhibited use of interpretability tools results in AI researchers (and by extension, the world) coming under the control of a sufficiently capable AI.
If this is the case, then it seems the two of you are talking past each other, and the readers' responses (or the lack thereof) make sense.