1a3orn comments on Symbol/Referent Confusions in Language Model Alignment Experiments

1a3orn 30 Oct 2023 16:20 UTC
11 points
I mean, fundamentally, I think if someone offers X as evidence of Y in implicit context Z, and is correct about this, but makes a mistake in their reasoning while doing so, a reasonable response is “Good insight, but you should be more careful in way M,” rather than “Here’s your mistake, you’re gullible and I will recognize you only as a student,” with zero acknowledgment that X actually is evidence for Y in implicit context Z.

Suppose someone had endorsed some intellectual principles along these lines:

Same thing here. If you measure whether a language model says it’s corrigible, then an honest claim would be “the language model says it’s corrigible”. To summarize that as “showing corrigibility in a language model” (as Simon does in the first line of this post) is, at best, extremely misleading under what-I-understand-to-be ordinary norms of scientific discourse....

Returning to the frame of evidence strength: part of the reason for this sort of norm is that it lets the listener decide how much evidence “person says X” gives about “X”, rather than the claimant making that decision on everybody else’s behalf and then trying to propagate their conclusion.

I think applying this norm to judgements about people’s character straightforwardly means that it’s great to show how people make mistakes and to explain them; but the part where you move from “person A says B, which is mistaken in way C” to “person A says B, which is mistaken in way C, which is why they’re gullible” is absolutely not a good move under what-I-understand-to-be-ordinary norms of scientific discourse.

Someone who did that would be straightforwardly making a particular decision on everyone else’s behalf and trying to propagate their conclusion, rather than simply offering evidence.
