This comment has a lot of karma but not very much agreement, which is an interesting balance. I’m a contributor to this, having upvoted but not agree-voted, so I feel like I should say why I did that:
Your comment might be right! I mean, it is certainly right that the OP is doing a symbol/referent mixup, and you might also be right that it matters.
But you might also not be right that it matters? It seems to me that most of the value in LLMs comes when you ground the symbols in their conventional meaning, so by default I would expect them to be grounded that way, and therefore by default I would expect symbolic corrigibility to translate to actual corrigibility.
There are exceptions, like when I tell ChatGPT I'm doing one thing when really I'm doing something more complicated. But I'm not sure exceptions like that change the picture much?
I think the way you framed the issue is excellent, crisp, and thought-provoking, but overall I don’t fully buy it.